Generating a Longitudinal Dataset

This example focuses on generating a dataset to analyze the determinants of health satisfaction. You can either use the syntax generator on paneldata.org or write a syntax file yourself. Variable names can also be looked up on paneldata.org.

In the previous examples, you created an exercise path with four subfolders as well as corresponding globals in the Stata do-file. You can use the same folders and globals for this exercise.

1. Generate an unbalanced panel dataset for the years 2006 to 2008, using paneldata.org if you wish. The dataset should contain all respondents in private households.

The data set should contain the following variables of interest:

In addition, the dataset should include the following additional information for analysis from 2006 to 2008:

If you need detailed instructions on how the script generator works in paneldata.org, you can find them in the chapter Syntax Generator on paneldata.org.

If you would like to assemble your dataset yourself, you can do this with the datasets you have assembled. From the previous exercise with tracking data, you may already have an idea where to get most of the variables.

Since we want an unbalanced panel, the $netto variables for the years 2006 to 2008 are also needed. In addition, the analysis must restrict sample membership ($pop), as we are only interested in respondents in private households.

Tip

If a dataset is assembled from variables in several different datasets, it is worth sorting each individual dataset by the individual identifier before saving it. This makes it easier to merge the datasets afterwards.

1.1. Create a Master File

Use ppfad as the source file together with the required variables that you may have already found in Paneldata.org or identified from the variable label in the dataset. Note that only variables from the years to be analyzed should be used.

use hhnr persnr sex gebjahr psample xhhnr xnetto xpop yhhnr ynetto ypop whhnr wnetto wpop using "${MY_PATH_IN}ppfad.dta"

Since we want an unbalanced dataset, i.e., individuals who completed an individual questionnaire in at least one of the three years, you must restrict the variable $netto (survey status). We also only want to analyze private households, so a further restriction on the variable $pop (sample membership) is needed.

keep if ( (xnetto >= 10 & xnetto < 20) | (ynetto >= 10 & ynetto < 20) | (wnetto >= 10 & wnetto < 20) )


* * * PRIVATE VS ALL HOUSEHOLDS * * *

keep if ( (xpop == 1 | xpop == 2) | (ypop == 1 | ypop == 2) | (wpop == 1 | wpop == 2) )
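
The two restrictions follow the same pattern: a person is kept if the condition holds in at least one wave. As a minimal sketch on made-up toy data (the values below are hypothetical and not taken from SOEP), the survey-status filter can also be written with Stata's inrange() function, which is equivalent to ">= 10 & < 20" for integer codes:

```stata
* Toy data: hypothetical survey-status codes for three persons
clear
input persnr wnetto xnetto ynetto
1 10 10 10
2 -2 -2 12
3 -2 30 -2
end

* Keep a person if the status lies between 10 and 19 in at least one wave
keep if inrange(wnetto, 10, 19) | inrange(xnetto, 10, 19) | inrange(ynetto, 10, 19)

* Person 3 never has a status in that range and is dropped
list
```

Persons 1 and 2 remain because each has at least one wave with a status in the required range.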

Then we sort the dataset by persnr (the individual identifier) and save it.

sort persnr
save "${MY_PATH_OUT}ppfad.dta", replace
clear

What is still missing are the cross-sectional weighting factors and the variables of interest for the analysis. To add the weighting factors to the dataset, open the person-level weighting dataset phrf, sort it, and save it again.

use persnr wphrf xphrf yphrf using "${MY_PATH_IN}phrf.dta"
sort persnr
save "${MY_PATH_OUT}phrf.dta", replace
clear

Now we come to the content variables. To avoid clicking through all of the datasets in the data release, we recommend searching paneldata.org for the label of the variable of interest.

Use the filter to narrow your search. Select our main study SOEP-Core, the search type “variable”, the analysis unit “p” or “h” and the corresponding year. Once you have clicked on the year of interest, a variable history is displayed. You can use this to see which years the variable was collected and what the variable is called.

Example: Variable Label “Satisfaction Health”

../_images/satisfaction_health.PNG

Example: Variable Label “currently smoking yes/no”

../_images/currently_smoke.PNG

Example: Variable Label “current employment status”

../_images/employment_status.PNG

Example: Variable Label “monthly net household income”

../_images/household_income.PNG

To merge the data, you can either use the script generator on paneldata.org or write the syntax manually into a do-file.

We now have all the information we need to create a master file. As mentioned in the tip above, it is recommended to save the datasets sorted by persnr (the individual identifier) before merging.

* * * Persons 2006 * * *
use persnr wp0101 wp9301 using "${MY_PATH_IN}wp.dta"
sort persnr
save "${MY_PATH_OUT}wp.dta", replace
clear

* * * Persons 2007 * * *
use persnr xp0101 using "${MY_PATH_IN}xp.dta"
sort persnr
save "${MY_PATH_OUT}xp.dta", replace
clear

* * * Persons 2008 * * *
use persnr yp0101 yp10601 using "${MY_PATH_IN}yp.dta"
sort persnr
save "${MY_PATH_OUT}yp.dta", replace
clear

With the help of a unique identifier, either the household identifier ($hhnr) or the individual identifier (persnr), you can now merge entire datasets or individual variables to ppfad. Which identifier to use depends on the unit of analysis. Since we are working at the individual level, our identifier is persnr.

We load the dataset ppfad and merge our datasets or variables to ppfad.

* load the restricted ppfad file saved above
use "${MY_PATH_OUT}ppfad.dta", clear

merge 1:1 persnr using "${MY_PATH_OUT}phrf.dta", keep(master match) nogen


* merge data from $p.dta 
merge 1:1 persnr using "${MY_PATH_IN}/wp.dta", keepus(wp0101 wp9301)  keep(master match) nogen // health & smoking
merge 1:1 persnr using "${MY_PATH_IN}/xp.dta", keepus(xp0101)         keep(master match) nogen // health
merge 1:1 persnr using "${MY_PATH_IN}/yp.dta", keepus(yp0101 yp10601) keep(master match) nogen // health & smoking

* merge data from $pgen.dta 
local y = 6
foreach wave in w x y {
	merge 1:1 persnr using "${MY_PATH_IN}/`wave'pgen.dta", keepus(emplst0`y') nogen keep(master match)
	local y = `y' + 1
}

* merge data from $hgen.dta 
local y = 6
foreach wave in w x y {
	merge m:1 `wave'hhnr using "${MY_PATH_IN}/`wave'hgen.dta", keepus(hinc0`y') nogen keep(master match) 
	local y = `y' + 1
}
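
The difference between the person-level merges (1:1) and the household-level merges (m:1) can be illustrated with a small sketch. The data below are made up for illustration and only borrow the variable names persnr, hhnr, and hinc; they are not SOEP values.

```stata
* Toy household file: one row per household (hypothetical incomes)
clear
input hhnr hinc
1 2500
2 1800
end
tempfile hh
save `hh'

* Toy person file: several persons can live in the same household
clear
input persnr hhnr
101 1
102 1
201 2
301 3
end

* m:1 because many persons map to one household row.
* keep(master match) keeps person 301 although household 3 has no income row.
merge m:1 hhnr using `hh', keepusing(hinc) keep(master match) nogen
list, sepby(hhnr)
```

Both persons in household 1 receive the same income value, while person 301 stays in the data with a missing hinc. This is the behavior the keep(master match) option is used for in the do-file above.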

2. Recode missing values to Stata system missings!

After the master file has been created with all required information, the missing values, which SOEP codes as negative values from -1 to -8, must be recoded to Stata missing values. This step is important before converting a wide-format dataset to long format.

********************************************************************************
*** Task 2) ***
* Recode missing values to Stata system missings!
********************************************************************************

	mvdecode _all, mv(-1=. \ -2=.t \ -3=.x \ -5=.y \ -8=.z)
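
To see why this step matters, consider a minimal sketch on made-up toy data (the values are hypothetical). Without recoding, negative codes would enter means and differences as if they were real answers. After mvdecode, Stata treats them as missing and excludes them from estimation.

```stata
* Toy data with two of the negative missing codes
clear
input persnr health2006 health2007
1  7 -1
2 -2  5
3  3  9
end

* Recode -1 to system missing and -2 to the extended missing value .t
mvdecode health2006 health2007, mv(-1=. \ -2=.t)
list

* No negative codes should remain among the valid answers
assert health2006 >= 0 if !missing(health2006)
assert health2007 >= 0 if !missing(health2007)

* Missing values are ignored here, so each mean is based on two valid answers
summarize health2006 health2007
```

Using extended missing values such as .t keeps the reason for the missing answer visible in the data while still excluding it from the analysis.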

3. The dataset is in “wide” format, i.e., additional years are displayed as additional variables (columns). For many analyses, it makes sense to convert datasets into “long” format, in which additional years are displayed as additional rows. If the dataset covers three years, as in this example, there are three rows for each person. Convert the dataset to long format using the Stata command reshape!

Since these are cross-sectional variables, each variable name contains a wave abbreviation that makes it unique. Conversely, this means that the variables must be renamed before the reshape command, so that the names differ only by year.

Before renaming the original variables (e.g., from the $p datasets), check whether the question and the answer categories were the same in all years (you can also look up the exact wording of the question in the corresponding questionnaire). If they changed, the variables may have to be recoded.

*Check if original variables have changed over time
	tab1 wp0101 xp0101 yp0101
	tab1 wp9301 yp10601
	/*additionally check questionnaires for exact wording*/

How you rename the variables is largely up to you. However, you should ensure that the name remains consistent over time and that the variable names differ only by year (variable name + four-digit year suffix, e.g., zufr2006, zufr2007, zufr2008). You can rename the variables either manually, line by line, or, for advanced users, with a loop.

Example of manual renaming:

*rename time-variant variables
*with examples how to use loops (but can also be done "manually")
	rename wp9301 smoke2006
	rename yp10601 smoke2008
	rename wp0101 health2006
	rename xp0101 health2007
	rename yp0101 health2008
	...

Example of a loop:

	foreach x in 6 7 8 {
		rename hinc0`x' hinc200`x'
		rename emplst0`x' emplst200`x'
	}

	local y = 2006
	foreach w in w x y {
		rename `w'hhnr hhnrakt`y'
		rename `w'netto netto`y'
		rename `w'pop pop`y'
		rename `w'phrf phrf`y'
		local y = `y' + 1
	}

3.1. The reshape command

Now that we have made all relevant preparations, you can start to convert the dataset. If you want to convert a dataset, you can do this in both directions:

../_images/aufgabe_3_reshape.PNG

In our case, we reshape from wide to long. This means that a name must be assigned to the new variable that holds the year of the survey (option j). The variable is then generated automatically. Currently, each person occupies one row in Stata.

persnr  hhnr  wave  sex  smoke2006  smoke2008
12345   123   x     m    yes        yes
54321   211   x     m    no         no

*reshape dataset to long-format
	reshape long health smoke emplst hinc netto pop hhnrakt phrf, i(persnr) j(year)
	bys persnr: gen waves=_N		/*additional information: count number of waves per person*/
	tab waves

After the reshape command, you have one row per year for each person:

persnr  hhnr  wave  year  sex  smoke
12345   123   x     2006  m    yes
12345   123   y     2007  m    .
12345   123   z     2008  m    yes
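
The conversion can be tried out on a small scale first. The following is a minimal sketch on made-up toy data (hypothetical values, with variable names following the renaming scheme above). Note that smoke has no 2007 column, just as in the real data, so reshape fills the 2007 row with a missing value.

```stata
* Toy wide data: one row per person, one column per variable and year
clear
input persnr sex health2006 health2007 health2008 smoke2006 smoke2008
12345 1 7 6 8 1 1
54321 1 5 5 4 2 2
end

* Stubs are the variable names without the year suffix.
* i() identifies the person, j() names the new year variable.
reshape long health smoke, i(persnr) j(year)

* Three rows per person. Time-constant variables such as sex are repeated.
list, sepby(persnr)
```

Time-constant variables that are not listed as stubs are simply copied to every row of the person.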

4. Perform analyses based on the data. Try to answer the following questions:

a. Has men’s and women’s average satisfaction with health changed over the three years?

Satisfaction with health was measured on a scale from 0 to 10, with a value of 10 representing the highest possible level of satisfaction. To compare the average satisfaction with health between women and men, you should display the mean value for both sexes. The mean value is displayed weighted here.

*a) Has men's and women's average satisfaction with health changed
*   over the three years?

	  mean health [pw=phrf], over(sex year)
../_images/mean_health.PNG

The output shows the average values for men and women for all three years. The first three values show men’s average satisfaction with health between 2006 and 2008, while the last three values show women’s average satisfaction with health.

b. What is the proportion of people for whom health satisfaction has increased from 2006 to 2007?

To answer this question, compute the difference in health satisfaction between 2006 and 2007. Make sure that the difference is calculated only within the same persnr (individual identifier) and only between consecutive years.

*b) What is the proportion of people for whom health satisfaction has increased
*   from 2006 to 2007?
	sort persnr year
	gen diff=health-health[_n-1] if persnr==persnr[_n-1] & year==year[_n-1]+1
	tab diff if year==2007				/*unweighted*/
../_images/compare_health_unweighted.PNG

Since you have previously added the SOEP weighting factors to the dataset for your analysis, you should use the weighting for a representative analysis.

	tab diff if year==2007 [aw=phrf]	/*weighted*/
../_images/compare_health_weighted.PNG

Values below 0 indicate a deterioration in health satisfaction. The value 0 means unchanged health satisfaction, and all values above 0 indicate an improvement. A value of 10 corresponds to a jump from the lowest rating (0) to the highest (10). Persons without a valid answer in 2006 have a missing difference and do not appear in the table.
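
The logic of the year-to-year difference can be checked on a small scale. The sketch below uses made-up toy data (hypothetical values) and computes the difference within each person using a by-group instead of the explicit persnr comparison. This is an alternative formulation, not the code used in the exercise.

```stata
* Toy long data: person 2 has no valid answer in 2007
clear
input persnr year health
1 2006 6
1 2007 8
1 2008 8
2 2006 7
2 2007 .
2 2008 5
end

* Within each person, sorted by year: current value minus previous year's value.
* The if condition makes sure the previous row really is the preceding year.
bysort persnr (year): gen diff = health - health[_n-1] if year == year[_n-1] + 1

list, sepby(persnr)
```

The first row of each person has no predecessor, so diff is missing there. For person 2, diff is missing in 2007 and 2008 because one of the two values involved is missing.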

c. In what direction and how much has satisfaction with health changed from 2006 to 2008 among people who quit smoking after 2006?

The procedure is similar to the previous question, except that the element “smoke yes/no” is added.

*c) In what direction and how much has satisfaction with 
*   health changed from 2006 to 2008 among people who quit smoking after 2006?

	gen diff2=health-health[_n-2] if persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	gen quit=.
	replace quit=0 if smoke==1 & smoke[_n-2]==1 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=1 if smoke==2 & smoke[_n-2]==1 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=2 if smoke==2 & smoke[_n-2]==2 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=3 if smoke==1 & smoke[_n-2]==2 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	label define quit 0 "smoker" 1 "quit" 2 "non-smoker" 3 "begin"
	label values quit quit
	tabstat diff2, by(quit)
../_images/smoke_vs_health.PNG

To obtain a weighted mean value, specify the analysis weight after the variable to be summarized.

	tabstat diff2 [aw=phrf], by(quit)	/*weighted*/
../_images/smoke_vs_health_weight.PNG

This output shows the mean of diff2 (the change in health satisfaction from 2006 to 2008) for each category of the quit variable that we generated beforehand. With a mean of -0.24 (weighted -0.35), the largest change in health satisfaction is seen among people who quit smoking after 2006. For example, if people who smoked in 2006 reported an average satisfaction value of 8, their average in 2008, after they stopped smoking, would be 7.76. This suggests that a person's perceived state of health deteriorates after they stop smoking. Now we have to test whether this assumption is correct.

d. Does quitting smoking make your health worse? To what extent could the result of the analysis “stop smoking” be distorted?

To examine the connection between health satisfaction and stopping smoking, one can use a t-test or, more specifically, the one-sample t-test. It checks whether the mean value of a sample deviates significantly from a known expected value (specified in the null hypothesis).

*d) Does quitting smoking make your health worse? To what extent can the 
*   result of the analysis "Stop smoking" be distorted?
	
	* Notes: So far we have not tested whether the difference is statistically significant
		ttest diff2==0 if quit==1 		
../_images/ttest.PNG

H0 hypothesis: Stopping smoking has no effect on health satisfaction.

For this test we use a 95% confidence level, i.e., a significance level of 0.05. What we want to check now is whether the H0 hypothesis can be rejected or not. The output of the test first shows the mean of diff2 for persons with the value 1 (quit smoking) on the variable quit. The last line of the output shows the p-values. If the p-value falls below 0.05, one can speak of a statistically significant result. In our example, the null hypothesis can be rejected because the p-value is less than 0.05. So quitting smoking is associated with a statistically significant change in a person’s perceived health.
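
What the one-sample t-test computes can be made explicit. The test statistic is the sample mean minus the expected value under H0 (here 0), divided by the standard error, which is the standard deviation divided by the square root of the number of observations. The following minimal sketch uses made-up toy values for diff2 (hypothetical, not SOEP results) and compares the statistic returned by ttest with a calculation by hand. The scalar names t_ttest and t_hand are chosen for illustration only.

```stata
* Toy data: hypothetical changes in health satisfaction
clear
input diff2
-1
0
-2
1
-1
0
-1
-2
end

* One-sample t-test against the expected value 0
ttest diff2 == 0
scalar t_ttest = r(t)

* Same statistic by hand: (mean - 0) / (sd / sqrt(N))
quietly summarize diff2
scalar t_hand = (r(mean) - 0) / (r(sd) / sqrt(r(N)))

display "t from ttest: " t_ttest
display "t by hand:    " t_hand
```

Both values should agree. Note that the test only shows whether the mean change differs from zero. It does not by itself show that quitting is the cause of the change.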

Last change: Nov 12, 2019