Generating a Cross-Sectional Dataset

This example builds a dataset for analyzing the determinants of health satisfaction in 2008. You can either use the syntax generator on Paneldata.org or write your own syntax file. Look up the variable names on Paneldata.org, or use the variables given below directly.

1. Generate a cross-sectional dataset for the year 2008, which should contain all persons with the following characteristics:

  • Respondents in 2008 "ynetto"

  • Lived in a private household in 2008 "ypop"

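The dataset should contain the following variables of interest (variable names as used in the syntax below):

  • Sex "sex"

  • Year of birth "gebjahr"

  • Satisfaction with health "yp0101"

  • Smoking "yp10601"

  • Employment status "emplst08"

  • Monthly household net income "hinc08"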

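In addition, the dataset should contain the following information for a 2008 cross-sectional analysis (Paneldata.org adds these variables automatically):

  • Person identifier "persnr"

  • Household identifiers "hhnr" and "yhhnr"

  • Sample membership "psample"

  • Weighting factor for 2008 "yphrf"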

Create an exercise path with four subfolders:

Example:

  • H:/material/exercises/do

  • H:/material/exercises/output

  • H:/material/exercises/temp

  • H:/material/exercises/log

These are used to store commands, log files, datasets, and temporary datasets. Open an empty do file and define your created paths with globals:

***********************************************
* Set the paths to the working directories
***********************************************
global AVZ 	"H:\material\exercises"
global MY_IN_PATH "\\hume\rdc-prod\complete\soep-core\soep.v33.2\stata_en\"
global MY_DO_FILES "$AVZ\do\"
global MY_LOG_OUT "$AVZ\log\"
global MY_OUT_DATA "$AVZ\output\"
global MY_OUT_TEMP "$AVZ\temp\"

The global “AVZ” defines the main path. The global “MY_IN_PATH” contains the path to your data, while “MY_DO_FILES”, “MY_LOG_OUT”, “MY_OUT_DATA”, and “MY_OUT_TEMP” point to the subfolders of the main path.

Use ppfad as the source file and load the required variables. Keep all cases with a completed interview. In addition, your dataset should only contain respondents who were able to answer the questions of interest. For example, you can use the net code (ynetto) to identify and remove children from your dataset.

* * * PFAD * * *

use hhnr persnr sex gebjahr psample yhhnr ynetto ypop using "${MY_IN_PATH}ppfad.dta"


* * * BALANCED VS UNBALANCED * * *

keep if ( (ynetto >= 10 & ynetto < 20) )


* * * PRIVATE VS ALL HOUSEHOLDS * * *

keep if ( (ypop == 1 | ypop == 2) )


* * * SORT PFAD * * *

sort persnr
save "${MY_OUT_TEMP}ppfad.dta", replace
clear

Attention

Please note that since version 34 (v34), PPFAD can be found in the subdirectory “Raw” of the data distribution file. The following exercises are done with version 33.1 (v33.1), where the tracking file was named PPFAD.

Save the modified data temporarily. Now link your dataset with the weights of the SOEP and save your dataset as a master file.

* * * HRF * * *

use "${MY_IN_PATH}phrf.dta"
sort persnr
save "${MY_OUT_TEMP}hrf.dta", replace
clear


* * * CREATE MASTER * * *

use "${MY_OUT_TEMP}ppfad.dta"
merge 1:1 persnr using "${MY_OUT_TEMP}hrf.dta"
drop if _merge == 2
drop _merge
sort persnr
save "${MY_OUT_TEMP}master.dta", replace
clear
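
The role of _merge in the master file step can be tried out on a small example. The following is only a sketch with invented toy data (not SOEP data); the variable names mirror the exercise, and the temporary file names track and weights are hypothetical.

```stata
* Toy tracking data, invented for illustration (not SOEP data)
clear
input persnr ynetto
101 10
102 10
103 12
end
tempfile track
save `track'

* Toy weights: person 102 has no weight, person 104 is not in the tracking data
clear
input persnr yphrf
101 1500.5
103 2300
104 1800
end
tempfile weights
save `weights'

use `track', clear
merge 1:1 persnr using `weights'

* _merge: 1 = master only, 2 = using only, 3 = matched in both files
tabulate _merge

* Keep everyone from the master data and drop person 104
drop if _merge == 2
drop _merge
list
```

Run the sketch as a whole do-file, because the temporary files only exist while the do-file is running. Person 102 stays in the data with a missing weight, which is the same behavior as in the master file above.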

Now prepare the content variables. Load the variables you need from the respective datasets and save each extract temporarily.

* * * READ DATA * * *

use hinc08 yhhnr using "${MY_IN_PATH}yhgen.dta"
sort yhhnr
save "${MY_OUT_TEMP}yhgen.dta", replace
clear


use yp10601 yhhnr yp0101 persnr using "${MY_IN_PATH}yp.dta"
sort persnr
save "${MY_OUT_TEMP}yp.dta", replace
clear


use emplst08 yhhnr persnr using "${MY_IN_PATH}ypgen.dta"
sort persnr
save "${MY_OUT_TEMP}ypgen.dta", replace
clear

Link the datasets you have created to your master file and save the result for analysis.

* * * MERGE DATA * * *

use   "${MY_OUT_TEMP}master.dta"

* household-level data: many persons per household
sort yhhnr
merge m:1 yhhnr using "${MY_OUT_TEMP}yhgen.dta"
drop if _merge == 2
drop _merge

* person-level data: one record per person
sort persnr
merge 1:1 persnr using "${MY_OUT_TEMP}yp.dta"
drop if _merge == 2
drop _merge

sort persnr
merge 1:1 persnr using "${MY_OUT_TEMP}ypgen.dta"
drop if _merge == 2
drop _merge


* * * DONE * * *

save "${MY_OUT_DATA}my_dataset.dta", replace
desc

You have successfully created a cross-sectional dataset for the year 2008.
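
Household-level information such as hinc08 is attached to several persons at once, so this link is a many-to-one merge. A minimal sketch with invented toy data (not SOEP data) shows the idea; the temporary file name hh is hypothetical.

```stata
* Toy household data, invented for illustration (not SOEP data)
clear
input yhhnr hinc08
10 2500
20 4200
end
tempfile hh
save `hh'

* Toy person data: persons 101 and 102 live in the same household
clear
input persnr yhhnr
101 10
102 10
103 20
end

* m:1 = many persons in the master data, one household record in the using data
merge m:1 yhhnr using `hh'
drop if _merge == 2
drop _merge
list
```

Both members of household 10 receive the same household income, which is the intended result of a many-to-one link.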

2. Recode the missing codes into Stata system missing values

In SOEP, missing information is coded with the negative values -1 to -8, each of which describes a specific reason for the missing value. To learn more about missing codes, see the section Missing Conventions. For substantive analyses, it is often unnecessary to differentiate between the missing codes, so you should know how to convert them into Stata missing values:

use "${MY_OUT_DATA}my_dataset.dta", clear


********************************************************************************
*** Exercise 2) ***
* Recode the SOEP missing codes into Stata missing values
********************************************************************************

* mvdecode changes numeric values to missing values (mvencode does the reverse)
	mvdecode _all, mv(-1=. \ -2=.t \ -3=.x \ -5=.y \ -8=.z)

Open the dataset for your analysis and summarize all missing codes.
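
To see what the conversion does, it can help to try mvdecode on a few invented values first. The following sketch uses toy data (not SOEP data) and only three of the missing codes.

```stata
* Toy data, invented for illustration (not SOEP data)
clear
input persnr yp0101
1 7
2 -1
3 -2
4 10
5 -5
end

* Convert the negative codes into Stata missing values
mvdecode yp0101, mv(-1=. \ -2=.t \ -5=.y)

* Extended missing values (.t, .y) remain distinguishable from .
tabulate yp0101, missing

* Statistical commands now ignore the former missing codes
summarize yp0101
```

Without the conversion, the negative codes would enter the calculation as if they were valid answers and pull the mean downwards.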

3. How does average health satisfaction differ by the following characteristics?

a) Gender

Satisfaction was measured on a scale of 0 to 10. To compare the average satisfaction with health of women and men, display the mean value for each gender.

	*unweighted*
	tabstat yp0101, by(sex)

Since you have previously added the SOEP weighting factors to the dataset for your analysis, you should use the weighting for a representative analysis.

	*weighted* 
	tabstat yp0101 [aw=yphrf], by(sex)		
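
The effect of the weights can be checked by hand on a tiny example. This sketch uses two invented observations (not SOEP data): the weighted mean is the sum of weight times value, divided by the sum of the weights.

```stata
* Toy data, invented for illustration (not SOEP data)
clear
input yp0101 yphrf
4 1000
8 3000
end

* Unweighted mean: (4 + 8) / 2 = 6
summarize yp0101

* Weighted mean: (4*1000 + 8*3000) / (1000 + 3000) = 7
summarize yp0101 [aw=yphrf]
```

The second person represents three times as many people in the population, so the weighted mean moves towards that person's value.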

b) Employment status

Proceed in the same way to compare health satisfaction by employment status:

*b) by job status:
	*unweighted*
	tabstat yp0101, by(emplst08)

Again, apply the SOEP weighting factors for a representative analysis.

	*weighted*
	tabstat yp0101 [aw=yphrf], by(emplst08)

c) Age

Since the dataset contains no age variable, generate one from the year of birth (gebjahr). The resulting age is metric and should be grouped into categories for this analysis. Define the categories for your age variable and assign suitable labels.

*c) by age in 2008 (<30, 30-64, 65+)
	
	gen age=2008-gebjahr
	gen age_3=age
	recode age_3 (17/29=1) (30/64=2) (65/120=3)
	label define age_3 1 "17-29" 2 "30-64" 3 "65+"
	label values age_3 age_3
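
As a compact alternative, recode can create the new variable and its value labels in a single step. The sketch below uses invented birth years (not SOEP data) and assumes that missing codes have already been converted, since min and max in recode refer to the smallest and largest nonmissing value.

```stata
* Toy data, invented for illustration (not SOEP data)
clear
input gebjahr
1991
1979
1978
1944
1943
end

gen age = 2008 - gebjahr

* gen() creates age_3 and leaves age unchanged; the labels are attached directly
recode age (min/29 = 1 "17-29") (30/64 = 2 "30-64") (65/max = 3 "65+"), gen(age_3)

* Cross-check the category boundaries
tabulate age age_3
```

The cross-tabulation is a quick way to confirm that ages 29 and 30, and 64 and 65, fall into different categories.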

Compare mean health satisfaction across the age groups, both unweighted and weighted.

	*unweighted*
	tabstat yp0101, by(age_3)
	*weighted*
	tabstat yp0101 [aw=yphrf], by(age_3) 

d) Income

As with age, generate a categorized version of household net income:

*d) by monthly household net income (<2000, 2000-3999, 4000+ Euro)
	gen hinc08_3 = hinc08
	* max = largest nonmissing value, so very high incomes are included
	recode hinc08_3 (0/1999=1) (2000/3999=2) (4000/max=3)
	label define hinc08_3 1 "<2000 Euro" 2 "2000-<4000 Euro" 3 "4000+ Euro"
	label values hinc08_3 hinc08_3

Display the mean values in weighted and unweighted form:

	*unweighted*
	tabstat yp0101, by(hinc08_3)
	*weighted*
	tabstat yp0101 [aw=yphrf], by(hinc08_3)

e) Smoking

Since the smoking variable is already categorical (yes/no), it needs no further preparation. Display average satisfaction with health for smokers and non-smokers in weighted and unweighted form:

*e) by smoking yes/no

	*unweighted*
	tabstat yp0101, by(yp10601)
	*weighted*
	tabstat yp0101 [aw=yphrf], by(yp10601)  

Last change: Nov 12, 2019