Generating a longitudinal Data Set

This example generates a data set for analyzing the determinants of health satisfaction. You can either use the syntax generator on paneldata.org or write a syntax file yourself. You can also search for variable names on paneldata.org.

In the previous examples you created an exercise path with four subfolders, as well as the corresponding globals in the Stata do-file. You can use the same folders and globals for this exercise.

1. Generate an unbalanced panel data set for the years 2006 to 2008, using paneldata.org if you wish. The data set should contain all respondents in private households:

The data set should contain the following variables of interest:

In addition, the data set should include the following additional information for analysis from 2006 to 2008:

If you need detailed instructions on how the script generator works in paneldata.org, you can find them in the chapter Syntax Generator on paneldata.org.

If you would like to assemble your data set yourself, you can do this with the data sets supplied to you. From the previous exercise with tracking data, you may already have an idea where to find most of the variables.

Since we want an unbalanced panel data set, the $netto variables for the years 2006 to 2008 must also be used. In addition, the analysis must be restricted by population membership, as we are only interested in respondents in private households.

Tip

If a data set is assembled from variables of several different data sets, it is worth sorting each data set by person number before saving it, so that the data sets can be merged more easily afterwards.

1.1. Create a Master File

Use ppfad as the source file together with the required variables, which you may have already looked up on paneldata.org or identified from the variable labels of the data set. Note that only variables of the years to be analyzed should be used.


use hhnr persnr sex gebjahr psample xhhnr xnetto xpop yhhnr ynetto ypop whhnr wnetto wpop using "${MY_PATH_IN}ppfad.dta"

Since we want an unbalanced data set, i.e. all persons who completed a personal questionnaire at least once within the three years, you must restrict the variable $netto (survey status). We also want to analyze private households only, so we need a further restriction on the variable $pop (sample membership).


keep if ( (xnetto >= 10 & xnetto < 20) | (ynetto >= 10 & ynetto < 20) | (wnetto >= 10 & wnetto < 20) )


* * * PRIVATE VS ALL HOUSEHOLDS * * *

keep if ( (xpop == 1 | xpop == 2) | (ypop == 1 | ypop == 2) | (wpop == 1 | wpop == 2) )
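
The survey-status restriction above can be checked on a small example. The following is only a sketch on invented toy data, not SOEP data: the variable n_int is a name made up for this illustration, and the use of inrange() assumes that the netto codes are integers, so that "10 to 19" is equivalent to ">= 10 and < 20". Run it separately from the exercise do-file, because it clears the data in memory.

```stata
* toy data standing in for ppfad (all values invented for illustration)
clear
input persnr wnetto xnetto ynetto
1 10 10 10
2 -2 10 30
3 30 -2 -2
end

* number of waves in which netto lies between 10 and 19
* (n_int is a hypothetical helper variable)
gen byte n_int = inrange(wnetto, 10, 19) + inrange(xnetto, 10, 19) + inrange(ynetto, 10, 19)
tab n_int

* unbalanced panel: keep everyone with at least one such wave
keep if n_int >= 1
list persnr n_int
```

Tabulating n_int before the keep shows how many persons would remain in a balanced panel (n_int equal to 3) compared with the unbalanced one.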

Then we sort the data set by persnr (person number) and save it.


sort persnr
save "${MY_PATH_OUT}ppfad.dta", replace
clear

What is still missing are the cross-sectional weighting factors and the content variables of interest. To add the weighting factors to the data set, open the person-level weighting data set phrf, sort it and save it again.


use persnr wphrf xphrf yphrf using "${MY_PATH_IN}phrf.dta"
sort persnr
save "${MY_PATH_OUT}phrf.dta", replace
clear

Now we come to the content variables. To avoid clicking through all of the delivered data sets, we recommend entering the label of the variable of interest in the search on paneldata.org.

Use the filter to narrow your search. Select our main study SOEP-Core, the search type “variable”, the analysis unit “p” or “h” and the corresponding year. Once you have clicked on the year of interest, a variable history is displayed. You can use this to see in which years the variable was collected and what the variable is called.

Example: Variable Label „Satisfaction Health“

../_images/satisfaction_health.PNG

Example: Variable Label „currently smoking yes/no“

../_images/currently_smoke.PNG

Example: Variable Label „current employment status“

../_images/employment_status.PNG

Example: Variable Label „monthly net household income“

../_images/household_income.PNG

To merge the data you can either use the script generator on paneldata.org or write the syntax manually into a do-file.

We now have all the information we need to create a master file. As mentioned in the tip above, it is recommended to save the data sets sorted by persnr (person number) before merging.

* * * Persons 2006 * * *
use persnr wp0101 wp9301 using "${MY_PATH_IN}wp.dta"
sort persnr
save "${MY_PATH_OUT}wp.dta", replace
clear

* * * Persons 2007 * * *
use persnr xp0101 using "${MY_PATH_IN}xp.dta"
sort persnr
save "${MY_PATH_OUT}xp.dta", replace
clear

* * * Persons 2008 * * *
use persnr yp0101 yp10601 using "${MY_PATH_IN}yp.dta"
sort persnr
save "${MY_PATH_OUT}yp.dta", replace
clear

With the help of a unique identifier, which is either the household number ($hhnr) or the person number (persnr), you can now merge all data sets or individual variables to ppfad. Which identifier to use depends on the unit of analysis. Since we are working at the person level, our identifier is persnr (person ID).

We load the dataset ppfad and merge our datasets or variables to ppfad.

* load the master file saved above
use "${MY_PATH_OUT}ppfad.dta", clear

* merge the person-level weights
merge 1:1 persnr using "${MY_PATH_OUT}phrf.dta", keep(master match) nogen


* merge data from $p.dta 
merge 1:1 persnr using "${MY_PATH_IN}/wp.dta", keepus(wp0101 wp9301)  keep(master match) nogen // health & smoking
merge 1:1 persnr using "${MY_PATH_IN}/xp.dta", keepus(xp0101) 		  keep(master match) nogen // health
merge 1:1 persnr using "${MY_PATH_IN}/yp.dta", keepus(yp0101 yp10601) keep(master match) nogen // health & smoking

* merge data from $pgen.dta 
local y = 6
foreach wave in w x y {
	merge 1:1 persnr using "${MY_PATH_IN}/`wave'pgen.dta", keepus(emplst0`y') nogen keep(master match)
	local y = `y' + 1
}

* merge data from $hgen.dta 
local y = 6
foreach wave in w x y {
	merge m:1 `wave'hhnr using "${MY_PATH_IN}/`wave'hgen.dta", keepus(hinc0`y') nogen keep(master match) 
	local y = `y' + 1
}
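
The merge logic can be tried out on toy data before it is applied to the SOEP files. The sketch below is only an illustration under stated assumptions: the two small data sets, their values and the name _m_phrf are invented. It uses the generate() option instead of nogen so that the merge result can be inspected, and isid to confirm that persnr uniquely identifies the rows. Run it separately, because it clears the data in memory.

```stata
* toy master file (invented values)
clear
input persnr sex
1 1
2 2
3 1
end
tempfile master
save `master'

* toy weights file (invented values); person 4 is not in the master file
clear
input persnr wphrf
1 1200.5
2 980
4 1500
end
tempfile weights
save `weights'

use `master', clear
isid persnr                      // stops with an error if persnr is not unique
merge 1:1 persnr using `weights', keep(master match) generate(_m_phrf)
tab _m_phrf                      // 1 = master only, 3 = matched
list
```

With keep(master match), person 4 from the weights file is dropped and person 3 stays in the data with a missing weight.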

2. Recode missing values as Stata system missings!

After the master file has been created with all required information, the SOEP missing-value codes, which range from -1 to -8, must be recoded into Stata missing values. This step is important before converting the data set from wide to long format.

********************************************************************************
*** Task 2) ***
* Recode missing values as Stata system missings!
********************************************************************************

	mvdecode _all, mv(-1=. \ -2=.t \ -3=.x \ -5=.y \ -8=.z)
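
The effect of mvdecode can be seen on a small example. This is a sketch on invented toy data; only the codes -1 and -2 are used here, and the variable names follow the health2006 naming that is introduced later in this exercise. Run it separately, because it clears the data in memory.

```stata
* toy data with SOEP-style negative missing codes (invented values)
clear
input persnr health2006 health2007
1 7 -1
2 -2 5
3 10 8
end

* -1 becomes the system missing ., -2 becomes the extended missing .t
mvdecode health2006 health2007, mv(-1=. \ -2=.t)

tab health2007, missing
count if missing(health2006)     // missing() is true for . and for .t
```

Using extended missing values such as .t keeps the original reason for the missing value visible while still excluding it from calculations.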

3. The data set is in wide format, i.e. additional years are stored as additional variables (columns). For many analyses it makes sense to convert the data set to long format, in which additional years are stored as additional rows. If the data set covers three years, as in this example, there are three rows for each person. Convert the data set to long format using the Stata command reshape.

Since these are cross-sectional variables, each variable name carries a wave abbreviation (here w, x or y) that makes it unique. The reshape command, however, expects a common name followed by the year, so the variables must be renamed before reshaping.

Before renaming the original variables (e.g. from the $p data sets), check whether the question and the answer categories were the same in all years (you can also look up the exact wording of the question in the corresponding questionnaire). If they changed, the variables may have to be recoded.

*Check whether the original variables have changed over time
	tab1 wp0101 xp0101 yp0101
	tab1 wp9301 yp10601
	/*additionally check questionnaires for exact wording*/

How you rename the variables is largely up to you. However, you should ensure that the name remains consistent over time and that the variables differ only by year (variable name + four-digit year suffix, e.g. zufr2006, zufr2007, zufr2008). You can rename the variables either manually, line by line, or, for advanced users, with a loop.

Example of manual renaming:

*rename time-variant variables
*with examples how to use loops (but can also be done "manually")
	rename wp9301 smoke2006
	rename yp10601 smoke2008
	rename wp0101 health2006
	rename xp0101 health2007
	rename yp0101 health2008
	...

Example of a loop:

	foreach  x in 6 7 8 {
		rename hinc0`x' hinc200`x'
		rename emplst0`x' emplst200`x'
		}

		
	local y=2006
	foreach w in w x y {
		rename `w'hhnr hhnrakt`y'
		rename `w'netto netto`y'
		rename `w'pop pop`y'
		rename `w'phrf phrf`y'
		local y=`y'+1
		}
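
To see what the wave loop does, it can be run on a toy data set first. The sketch below uses invented values and only two of the variable groups; the local macro year and the use of local ++year as a counter are choices made for this illustration. Run it separately, because it clears the data in memory.

```stata
* toy wide data with wave prefixes w, x, y (invented values)
clear
input persnr whhnr xhhnr yhhnr wnetto xnetto ynetto
1 11 11 11 10 10 10
2 22 22 23 10 -2 10
end

* map wave letters to years: w -> 2006, x -> 2007, y -> 2008
local year = 2006
foreach w in w x y {
	display "wave `w' -> year `year'"
	rename `w'hhnr  hhnrakt`year'
	rename `w'netto netto`year'
	local ++year
}

* confirm that the renamed variables exist
confirm variable hhnrakt2006 hhnrakt2007 hhnrakt2008 netto2006 netto2007 netto2008
describe, simple
```

The confirm command stops with an error if any of the expected names is missing, which makes renaming mistakes visible before the reshape step.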

3.1. The reshape command

Now that we have made all relevant preparations, you can start to convert the data set. If you want to convert a data set, you can do this in both directions:

../_images/aufgabe_3_reshape.PNG

In our case we reshape from wide to long. This means that a new variable name must be specified for the survey year (j); the variable itself is then generated automatically. Currently, each person occupies one row in Stata.

persnr  hhnr  wave  sex  smoke2006  smoke2008
12345   123   x     m    yes        yes
54321   211   x     m    no         no

*reshape dataset to long-format
	reshape long health smoke emplst hinc netto pop hhnrakt phrf, i(persnr) j(year)
	bys persnr: gen waves=_N		/*additional information: count number of waves per person*/
	tab waves

After the reshape command you have one line per year for each person:

persnr  hhnr  wave  year  sex  smoke
12345   123   w     2006  m    yes
12345   123   x     2007  m    .
12345   123   y     2008  m    yes
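
The conversion can be reproduced on a toy data set that mirrors the tables above. This is a sketch with invented values; it assumes, as the exercise code does, that reshape long fills a variable that is missing for one year (here smoke2007) with missing values. Because reshape long creates one row per person and year, a row count with _N returns the number of years in the data, so the sketch counts valid health answers instead; the name n_valid is made up for this illustration.

```stata
* toy wide data (invented values); smoke was not collected in 2007
clear
input persnr sex smoke2006 smoke2008 health2006 health2007 health2008
12345 1 1 1 7 6 8
54321 1 2 2 9 . 10
end

reshape long smoke health, i(persnr) j(year)
list, sepby(persnr)

* number of years with a valid health answer per person
bysort persnr: egen n_valid = count(health)
tab n_valid
```

Typing reshape wide afterwards, without further arguments, converts the data back to the wide layout.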

4. Perform analyses based on the data. Try to answer the following questions:

a. Has the average health satisfaction of men and women changed over the three years?

Satisfaction with health was measured on a scale from 0 to 10, with a value of 10 representing an extraordinarily high level of satisfaction. To compare the average health satisfaction of women and men, display the mean for both sexes. The mean is weighted here.

*a) Has the average health satisfaction of men and women changed 
*   over the three years?

	  mean health [pw=phrf], over(sex year)
../_images/mean_health.PNG

The output shows the average values for men and women for all three years. The first three values show men's average health satisfaction from 2006 to 2008, while the last three values show women's average health satisfaction.

b. What is the proportion of people for whom health satisfaction has increased from 2006 to 2007?

To answer this question, compute the difference between 2006 and 2007. Make sure that the difference is calculated only within the same persnr (person ID) and only between one year and the year that directly follows it.

*b) What is the proportion of people for whom health satisfaction has increased 
*   from 2006 to 2007?
	sort persnr year
	gen diff=health-health[_n-1] if persnr==persnr[_n-1] & year==year[_n-1]+1
	tab diff if year==2007				/*unweighted*/
../_images/compare_health_unweighted.PNG

Since you have previously added the SOEP weighting factors to your analysis data set, you should use the weighting for a representative analysis.

	tab diff if year==2007 [aw=phrf]	/*weighted*/
../_images/compare_health_weighted.PNG

Values below 0 indicate a deterioration in health satisfaction, the value 0 means that health satisfaction stayed the same, and values above 0 indicate an improvement. With a value of 10, it can be assumed that these people were interviewed for the first time in 2007 or 2008.
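
The proportion asked for in question b can also be read from a single indicator variable. The following sketch works on invented toy data; the variable improved is a name made up for this illustration, and xtset with the D. operator is used here as an alternative to the [_n-1] subscripts, not as the method of the exercise. The condition !missing(diff) matters because Stata treats missing values as larger than any number, so diff > 0 alone would also be true for missing differences.

```stata
* toy long data (invented values); person 4 has no 2006 interview
clear
input persnr year health phrf
1 2006 7 1000
1 2007 8 1000
2 2006 9 1500
2 2007 9 1500
3 2006 5 500
3 2007 3 500
4 2007 6 800
end

xtset persnr year
gen diff = D.health              // health minus health in the previous year

* 1 = improved, 0 = unchanged or worse, missing if no difference available
gen byte improved = diff > 0 if !missing(diff)

tab improved if year == 2007                 // unweighted
tab improved if year == 2007 [aw=phrf]       // weighted
```

The share in the row improved = 1 is the proportion of people whose health satisfaction increased.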

c. In what direction and by how much did the health satisfaction of people who quit smoking after 2006 change from 2006 to 2008?

The procedure is similar to the previous question, except that the element “smoke yes/no” is added.

*c) In what direction and how much has satisfaction with the health of 
*   people who quit smoking after 2006 changed from 2006 to 2008?

	gen diff2=health-health[_n-2] if persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	gen quit=.
	replace quit=0 if smoke==1 & smoke[_n-2]==1 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=1 if smoke==2 & smoke[_n-2]==1 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=2 if smoke==2 & smoke[_n-2]==2 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	replace quit=3 if smoke==1 & smoke[_n-2]==2 & persnr==persnr[_n-2] & year==year[_n-2]+2 & year==2008
	label define quit 0 "smoker" 1 "quit" 2 "non-smoker" 3 "begin"
	label values quit quit
	tabstat diff2, by(quit)
../_images/smoke_vs_health.PNG

To obtain a weighted mean, specify the analysis weight after the name of the generated variable.

	tabstat diff2 [aw=phrf], by(quit)	/*weighted*/
../_images/smoke_vs_health_weight.PNG

The output shows the mean change in health satisfaction (diff2) for each category of the variable quit that we generated beforehand. With a mean of -0.24 (weighted: -0.35), the biggest change in health satisfaction is seen among people who quit smoking after 2006. For example, a person who smoked in 2006 and reported a satisfaction value of 8 would, on average, report a value of 7.76 in 2008 after quitting. This suggests that a person's perceived state of health deteriorates when he or she stops smoking. Now we have to test whether this assumption is correct.
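
The coding of quit can be written more compactly with time-series operators. The sketch below is an alternative under stated assumptions, not the code of the exercise: it runs on invented toy data, assumes the smoke coding implied above (1 = yes, 2 = no), and uses xtset with L2. in place of the [_n-2] subscripts; the label name quit_lbl is made up for this illustration.

```stata
* toy long data (invented values); smoke was not collected in 2007
clear
input persnr year smoke health
1 2006 1 8
1 2007 . 8
1 2008 2 7
2 2006 1 6
2 2007 . 6
2 2008 1 6
3 2006 2 9
3 2007 . 9
3 2008 2 9
end

xtset persnr year

* change in health satisfaction between 2006 and 2008
gen diff2 = health - L2.health if year == 2008

* smoking status in 2008 compared with 2006
gen byte quit = .
replace quit = 0 if smoke == 1 & L2.smoke == 1 & year == 2008
replace quit = 1 if smoke == 2 & L2.smoke == 1 & year == 2008
replace quit = 2 if smoke == 2 & L2.smoke == 2 & year == 2008
replace quit = 3 if smoke == 1 & L2.smoke == 2 & year == 2008
label define quit_lbl 0 "smoker" 1 "quit" 2 "non-smoker" 3 "begin"
label values quit quit_lbl

tabstat diff2, by(quit) statistics(mean n)
```

Because L2. refers to the value two years earlier within the same person, the explicit persnr and year comparisons are no longer needed.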

d. Does quitting smoking make your health worse? To what extent can the result of the analysis “Stop smoking” be distorted?

To examine the connection between health satisfaction and stopping smoking, you can use the ttest command, more specifically the one-sample t test. It checks whether the mean of a sample deviates significantly from a known expected value (specified in the null hypothesis).

*d) Does quitting smoking make your health worse? To what extent can the 
*   result of the analysis "Stop smoking" be distorted?
	
	* Notes: So far we have not tested whether the difference is statistically significant
		ttest diff2==0 if quit==1 		
../_images/ttest.PNG

H0 Hypothesis: If one stops smoking it has no effect on health.

For this test we use a 95% confidence level. What we want to check now is whether the H0 hypothesis can be rejected or not. If you look at the output of the test, you first see the mean of diff2 for value 1 (quit smoking) of the variable quit. The last line of the output shows the p-values. If the p-value falls below 0.05, one can speak of a statistically significant result. In our example, the null hypothesis can be rejected because the p-value is less than 0.05. So quitting smoking has a statistically significant impact on a person's perceived health.
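
The decision rule described above can also be applied to the stored results of ttest. The following sketch uses invented toy data, so its numbers say nothing about the SOEP result; it relies on r(p), the two-sided p-value that ttest leaves behind, and on r(mu_1), the sample mean. Run it separately, because it clears the data in memory.

```stata
* toy data (invented values): change in health satisfaction and quit status
clear
input diff2 quit
-1 1
0 1
-2 1
1 1
-1 1
0 1
-1 1
2 0
0 0
end

* one-sample t test: H0 says the mean change among quitters is 0
ttest diff2 == 0 if quit == 1

display "mean change:       " %6.3f r(mu_1)
display "two-sided p-value: " %6.4f r(p)

if r(p) < 0.05 {
	display "H0 is rejected at the 5% level"
}
else {
	display "H0 cannot be rejected at the 5% level"
}
```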