Working with Harmonized Variables

This exercise shows you how to work effectively with the versioned and harmonized variables of SOEP. Please note that the new SOEP versioning and harmonizing concept has only been available since SOEP-Core v34 and only applies to the original SOEP-Core data in long format.

Create an exercise path with four subfolders:



  • H:/material/exercises/do
  • H:/material/exercises/output
  • H:/material/exercises/temp
  • H:/material/exercises/log

These are used to store your scripts, log files, datasets, and temporary datasets. Open an empty do-file and define the paths you created as globals:

* Set the main path and the input, do-file, log, output, and temp paths
global AVZ 	"H:\material\exercises"
global MY_IN_PATH "\\hume\rdc-gen\consolidated\soep-long\soep.v34"
global MY_DO_FILES "$AVZ\do\"
global MY_LOG_OUT "$AVZ\log\"
global MY_OUT_DATA "$AVZ\output\"
global MY_OUT_TEMP "$AVZ\temp\"

The global “AVZ” defines the main path. The main path is subdivided using the globals “MY_DO_FILES”, “MY_LOG_OUT”, “MY_OUT_DATA”, and “MY_OUT_TEMP”. The global “MY_IN_PATH” contains the path to the data you ordered.
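The folder setup described above can also be scripted. The following is a minimal sketch, assuming the globals above are already defined and that the main folder in “AVZ” exists; the log file name harmonized_variables.log is a hypothetical choice, not part of the exercise.

```stata
* Sketch: create the four subfolders if they are missing, then open a log.
* Assumes the globals AVZ and MY_LOG_OUT are defined as shown above
* and that the folder stored in AVZ already exists.
foreach dir in do output temp log {
    * capture suppresses the error raised when the folder already exists
    capture mkdir "${AVZ}/`dir'"
}

* Close any log that is still open, then start a new plain-text log.
* The file name is a hypothetical example.
capture log close
log using "${MY_LOG_OUT}harmonized_variables.log", replace text
```

Forward slashes are used inside the loop because Stata accepts them on Windows and because a backslash placed directly before a macro would prevent the macro from being expanded.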

1.) Differences in Response Options

Variables are versioned and harmonized because the response options have changed over time.


The variable plb0038_v1 was asked as a simple yes/no question between 1992 and 2004. Since 2005 the response options have been extended. The individual questionnaires of 2004 and 2005 show these differences. Versioning the variable plb0038 makes this difference visible to data users when they tabulate the data. The variable label also shows the period during which each version of the question was asked.

use "$MY_IN_PATH\pl.dta"
tab plb0038_v1
tab plb0038_v2

The variable plb0038_v1 is recoded during the harmonization process and written into a new variable plb0038_h together with plb0038_v2. The harmonized version of the variable should cover the survey period from 1992 to 2014 and be usable across all of these years.

tab plb0038_h
tabstat plb0038_v1 plb0038_v2 plb0038_h, by(syear)
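To see which survey years each version actually covers, it can help to cross-tabulate the year with the valid answers only. This is a sketch under the assumption that pl.dta is loaded and that negative values are missing codes (for example, for years in which the question was not asked); please check the value labels before relying on that threshold.

```stata
* Sketch: inspect labels and year coverage of the versions.
* Assumption: negative values are missing codes, valid answers are >= 0.
describe plb0038_v1 plb0038_v2 plb0038_h

* Cross-tabulate survey year against valid answers only
tab syear plb0038_v1 if plb0038_v1 >= 0
tab syear plb0038_v2 if plb0038_v2 >= 0
tab syear plb0038_h  if plb0038_h  >= 0
```

If negative codes are present, the yearly means reported by tabstat include them, so the cross-tabulations are usually easier to interpret.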

2.) Differences in Coding of Response Options

Variables are versioned and harmonized because the coding of the response options has changed over time. Since the values of certain response options can change, it is not possible to easily integrate the various wave-specific variables into a single variable in long format. The variable must be appropriately harmonized to be usable.


In the years from 1994 to 2004, the question about occupational change was asked in the individual questionnaire as a categorical question with six response options. The order of the response options changed in 2005.

tab plb0284_v1
tab plb0284_v2

In addition to the different order of the response options, the coding order also changed. The data were stored in the wave-specific “raw” datasets with different codings and written into the variables plb0284_v1 and plb0284_v2. In order to use the variable for all survey years, it is necessary to harmonize the different versions. During harmonization, the variable plb0284_v1 is recoded (recode (1=1)(2=2)(3=3)(4=6)(5=4)(6=5)) and then written together with plb0284_v2 into the new variable plb0284_h.

tab plb0284_h
tabstat plb0284_v1 plb0284_v2 plb0284_h, by(syear)
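The recoding step can be reproduced on a small toy dataset. The sketch below uses the mapping quoted above; the data, the variable names v1, v2, v1_rec, and h, and the use of system missing values for years without a version are all made up for illustration and are not the SOEP implementation.

```stata
* Toy data: v1 holds the old coding (2004), v2 the new coding (2005).
clear
input syear v1 v2
2004 1 .
2004 4 .
2004 5 .
2004 6 .
2005 . 6
2005 . 4
end

* Apply the mapping from the text to the old version
recode v1 (1=1) (2=2) (3=3) (4=6) (5=4) (6=5), generate(v1_rec)

* Harmonized toy variable: recoded old version where available,
* otherwise the new version
generate h = cond(!missing(v1_rec), v1_rec, v2)

list, sepby(syear)
```

In the toy output, the old codes 4, 5, and 6 appear in h as 6, 4, and 5, while the 2005 values are carried over unchanged.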

3.) Content Differences in the Questions

Variables are versioned because the questions were asked differently in different years, but the content belongs together. If the content or wording of the question changes, the wave-specific variables cannot easily be integrated into a long variable.


In the individual questionnaire from 2001, respondents were asked whether they had ever received an inheritance. In 2017 this question was asked in a modified form: the individual questionnaire asked about an inheritance received in the last 15 years. The questions are similar but cover different time periods. Therefore, the variable is not harmonized but made available as version variables. Data users have to decide for themselves whether to treat the versions as comparable.

tab plc0375_v1 
tab plc0375_v2
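If you decide to analyze both versions, it may be useful to keep track of which version a valid answer comes from. This is a sketch, assuming pl.dta is loaded and that negative values are missing codes; the variable name inherit_version is hypothetical.

```stata
* Sketch: flag which version of the inheritance question was answered.
* Assumption: negative values are missing codes, valid answers are >= 0.
* inherit_version is a hypothetical helper variable.
generate byte inherit_version = .
replace inherit_version = 1 if plc0375_v1 >= 0 & !missing(plc0375_v1)
replace inherit_version = 2 if plc0375_v2 >= 0 & !missing(plc0375_v2)

* Show in which survey years each version has valid answers
tab syear inherit_version
```

The flag does not make the two questions comparable; it only keeps the difference in reference periods visible in later analyses.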

4.) Change of Question Type

Variables are versioned and harmonized because the questions were asked differently in different years, for example as a question with multiple response options and later as a question with a single response option. The possibility of multiple answers in certain years makes it difficult to integrate the wave-specific variables into a single variable in long format.


A comparison of the two questions on a possible scholarship in the individual questionnaires for 2011 and 2012 suggests that there should be no differences in the variables. Nevertheless, the two questions appear to have been asked in methodologically different ways and were stored differently in the raw datasets. This results in several version variables.

tab plg0015_v1
tab plg0015_v2
tab plg0015_v3
tab plg0015_v4

As you can see, from 2007 to 2011 the question was asked as a categorical question with three response options, so respondents could give only one answer. Since 2012 the options have been asked as separate binary items, so a respondent may have given more than one answer. The harmonized variable plg0015_h integrates the binary items plg0015_v2, plg0015_v3, and plg0015_v4, using the coding of plg0015_v1 as the framework. In addition, the harmonization proposal codes the problematic multiple answers with the value 4.

tabstat plg0015_v1 plg0015_v2 plg0015_v3 plg0015_v4 plg0015_h, by(syear)
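The logic of folding several binary items into one categorical variable can be illustrated with toy data. The sketch below assumes binary items coded 0/1, which may differ from the coding in pl.dta; the names item1 to item3, n_yes, and h are hypothetical, and this is not the SOEP generation code.

```stata
* Toy data: three binary items, 1 = option ticked, 0 = not ticked.
clear
input item1 item2 item3
1 0 0
0 1 0
0 0 1
1 1 0
0 0 0
end

* Count how many options each respondent ticked
egen n_yes = rowtotal(item1 item2 item3)

* Single answers keep the category number of the ticked item;
* multiple answers are collected in the extra category 4
generate h = .
replace h = 1 if item1 == 1 & n_yes == 1
replace h = 2 if item2 == 1 & n_yes == 1
replace h = 3 if item3 == 1 & n_yes == 1
replace h = 4 if n_yes > 1

list
```

Respondents who ticked nothing keep a missing value in h in this sketch; how such cases are treated in the real data depends on the missing codes used there.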

5.) Euro Harmonization

Variables are versioned and harmonized because they are metric and were asked as DM amounts before the introduction of the euro. For the long version of the variable, metric variables based on different currencies in different years are harmonized as euro amounts.

Most of the variables harmonized in the long datasets are amounts of money. Before the introduction of the euro, such information was collected in DM.


Euro harmonization involves converting DM amounts at the official exchange rate (1 euro = 1.95583 DM) so that the harmonized version of the variable represents euro amounts.

list pid syear plc0013_v1 plc0013_h if pid==7006001 & syear==2001
tabstat plc0013_v1 plc0013_v2 plc0013_h, by(syear)
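The conversion itself is a single division. The following sketch applies the fixed rate of 1.95583 DM per euro to made-up amounts; the variable names amount_dm and amount_eur and the scalar dm_per_euro are hypothetical and only illustrate the arithmetic.

```stata
* Toy data: amounts reported in DM (values are made up)
clear
input int syear double amount_dm
2000 1000
2001 1955.83
end

* Fixed conversion rate: 1 euro = 1.95583 DM
scalar dm_per_euro = 1.95583

* Convert DM to euro by dividing by the rate
generate double amount_eur = amount_dm / dm_per_euro
format amount_eur %9.2f

list
```

With these toy values, 1000 DM corresponds to about 511.29 euro and 1955.83 DM to 1000.00 euro.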