Generating a Cross-Sectional Dataset

This example generates a dataset for analyzing the determinants of health satisfaction in 2008. You can either use the paneldata.org syntax generator or write your own syntax file. The variable names can be looked up on paneldata.org or taken directly from the lists below.

1. Generate a cross-sectional dataset for the year 2008, which should contain all persons with the following characteristics:

Respondents in 2008 "ynetto"
Lived in a private household in 2008 "ypop"

The dataset should contain the following variables of interest.

satisfaction with health "yp0101"
smoking currently yes/no "yp10601"
current employment status "emplst08"
monthly household net income "hinc08"

In addition, the dataset should contain the following additional information for a 2008 cross-sectional analysis (these variables are automatically generated by paneldata.org):

current cross-section weighting factor "yphrf"
personal number "pid"
original household number "cid"
current household number "hid_2008"
sample affiliation "psample"
gender "sex"
year of birth "gebjahr"

Create an exercise path with four subfolders:

Example:

H:/material/exercises/do
H:/material/exercises/output
H:/material/exercises/temp
H:/material/exercises/log

These are used to store commands, log files, datasets, and temporary datasets. Open an empty do file and define your created paths with globals:

***********************************************
* Set paths to the input data and the working directory
***********************************************
global MY_PATH_IN   "\\hume\rdc-prod\distribution\soep-core\soep.v37\eu\Stata\raw\"
global MY_PATH_OUT  "H:\Exercise\"
global MY_FILE_OUT  "${MY_PATH_OUT}cross-sectional-exercise.dta"
global MY_LOG_FILE  "${MY_PATH_OUT}cross-sectional-exercise.log"
capture log close
log using "${MY_LOG_FILE}", text replace

The global "MY_PATH_IN" contains the path to your data, and the global "MY_PATH_OUT" defines the folder to which all output is written. The globals "MY_FILE_OUT" and "MY_LOG_FILE" build on "MY_PATH_OUT" and name the final dataset and the log file.
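The four subfolders from the example above can also be created from within Stata. The following is a minimal sketch, assuming that the root folder H:/material/exercises already exists; the global name EX_ROOT is hypothetical and not part of the original syntax.

```stata
* Hypothetical root folder; adjust it to your own environment
global EX_ROOT "H:/material/exercises"

* mkdir fails if a folder already exists, so capture suppresses that error
foreach sub in do output temp log {
    capture mkdir "${EX_ROOT}/`sub'"
}
```

Forward slashes work in Stata paths on all platforms, which avoids problems with a backslash directly in front of a macro.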

Use ppfad as the source file and load the required variables. Keep all cases with completed interviews. In addition, your dataset should only contain respondents who could actually answer the question. For example, you can use the netto code (ynetto) to identify and remove children from your dataset.

* * * PFAD * * *

use ypop cid sex hid_2008 pid psample ynetto gebjahr using "${MY_PATH_IN}ppfad.dta", clear

* * * BALANCED VS UNBALANCED * * *

keep if ( (ynetto >= 10 & ynetto < 20) )

* * * PRIVATE VS ALL HOUSEHOLDS * * *

keep if ( (ypop == 1 | ypop == 2) )

* * * SORT PFAD * * *

sort pid
save "${MY_PATH_OUT}pfad.dta", replace
clear

Save the modified data temporarily. Now link your dataset with the weights of the SOEP and save your dataset as a master file.

* * * HRF * * *

use cid pid yphrf prgroup using "${MY_PATH_IN}phrf.dta" 
sort pid
save "${MY_PATH_OUT}hrf.dta", replace 
clear


* * * CREATE MASTER * * *

use "${MY_PATH_OUT}pfad.dta", clear
merge 1:1 pid cid using "${MY_PATH_OUT}hrf.dta", keep(master match) nogen
save "${MY_PATH_OUT}master.dta", replace
clear
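Before moving on, it can be useful to check how many respondents were actually matched with a weight. The following sketch repeats the merge from above but keeps the _merge variable for inspection instead of dropping it with nogen; it assumes the temporary files pfad.dta and hrf.dta created in the previous steps.

```stata
* Same merge as above, but keep _merge for inspection
use "${MY_PATH_OUT}pfad.dta", clear
merge 1:1 pid cid using "${MY_PATH_OUT}hrf.dta", keep(master match)

* 1 = only in the ppfad extract, 3 = matched with a weight
tabulate _merge

* Respondents without a cross-sectional weight
count if missing(yphrf)

drop _merge
```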

Now prepare the content variables. Load the variables you need from their respective datasets and temporarily save each extract.

* * * READ DATA * * *

use emplst08 pid using "${MY_PATH_IN}ypgen.dta", clear
save "${MY_PATH_OUT}ypgen.dta", replace

use yp0101 pid yp10601 using "${MY_PATH_IN}yp.dta", clear
save "${MY_PATH_OUT}yp.dta", replace

use hinc08 hid_2008 using "${MY_PATH_IN}yhgen.dta", clear
save "${MY_PATH_OUT}yhgen.dta", replace

Link the datasets you have created to your master file and save for analysis.

* * * MERGE DATA * * *

use   "${MY_PATH_OUT}master.dta", clear
merge 1:1 pid using "${MY_PATH_OUT}ypgen.dta", keep(master match) nogen
merge 1:1 pid using "${MY_PATH_OUT}yp.dta", keep(master match) nogen
merge m:1 hid_2008 using "${MY_PATH_OUT}yhgen.dta", keep(master match) nogen


* * * DONE * * *

label data "paneldata.org"
save "${MY_FILE_OUT}", replace
desc

log close

You have successfully created a cross-sectional dataset for the year 2008.
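A quick check of the result can catch merge problems early. The following is a minimal sketch, assuming the final dataset was saved under MY_FILE_OUT as above:

```stata
use "${MY_FILE_OUT}", clear

* Stops with an error if pid does not uniquely identify the observations
isid pid

* Number of observations and variables
describe, short

* Compact overview of the content variables (range, mean, number of values)
codebook yp0101 yp10601 emplst08 hinc08, compact
```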

2. Recode the SOEP missing codes into Stata missing values

In SOEP, missing values are coded in detail with the values -1 to -8. To learn more about missing codes, see the section Missing Conventions. For substantive analyses, it is not always necessary to differentiate between the missing codes, so you should be able to convert them:

use "${MY_FILE_OUT}", clear


********************************************************************************
*** Exercise 2) ***
* Recode missing codes into Stata missing values
********************************************************************************

* mvdecode = change numeric values to missing values (mvencode does the reverse)
* Note: this mapping covers -1, -2, -3, -5 and -8; codes -4, -6 and -7 are left unchanged
	mvdecode _all, mv(-1=. \ -2=.t \ -3=.x \ -5=.y \ -8=.z)

This opens the dataset for your analysis and converts the listed missing codes into Stata missing values.
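To see what the conversion did, the missing values can be inspected afterwards. The following is a minimal sketch, assuming the dataset from this step is in memory and mvdecode has already been run:

```stata
* Counts of system missings (.) and extended missings (.t, .x, ...) per variable
misstable summarize yp0101 yp10601 emplst08 hinc08

* Frequency table including the individual missing categories
tabulate yp0101, missing

* Negative codes that were not covered by the mvdecode mapping
count if inrange(yp0101, -8, -1)
```

Stata treats missing values as larger than any number, so the last command only counts negative codes that are still stored as numbers.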

3. How does average health satisfaction differ by the following characteristics?

a) Gender

Satisfaction was measured on a scale of 0 to 10. To compare average satisfaction with health between women and men, you should display the mean value for both genders.

	*unweighted*
	tabstat yp0101, by(sex)

Since you have previously added the SOEP weighting factors to the dataset for your analysis, you should use the weighting for a representative analysis.

	*weighted* 
	tabstat yp0101 [aw=yphrf], by(sex)		
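tabstat reports only the weighted means. If standard errors and confidence intervals are also of interest, one possible alternative is the mean command with probability weights. This is a sketch that treats yphrf as a simple probability weight and does not account for the clustering of the sample:

```stata
	*weighted, with standard errors and confidence intervals*
	mean yp0101 [pweight=yphrf], over(sex)
```

The point estimates should match those from tabstat with aweights; only the reported uncertainty is added.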

b) Employment status

Now proceed in a similar way when comparing satisfaction with health and employment status. Compare the mean values again:

*b) by job status:
	*unweighted*
	tabstat yp0101, by(emplst08)

As before, use the weighting factor for a representative analysis.

	*weighted*
	tabstat yp0101 [aw=yphrf], by(emplst08)

c) Age

Since you do not have a variable that represents age, you must generate a suitable age variable using the birth year variable. The year of birth is metric and should be categorized for analysis. Define categories for your age variable and assign suitable labels.

*c) by age in 2008 (<30, 30-64, 65+)
	
	gen age=2008-gebjahr
	gen age_3=age
	recode age_3 (17/29=1) (30/64=2) (65/120=3)
	label define age_3 1 "17-29" 2 "30-64" 3 "65+"
	label values age_3 age_3
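It is worth verifying that the recode assigned every age to the intended category. The following is a minimal check, assuming the variables age and age_3 were generated as above:

```stata
	* Minimum and maximum age within each category
	tabstat age, by(age_3) statistics(min max n)

	* Ages outside 17-120 keep their original value and would show up here
	count if !missing(age) & !inlist(age_3, 1, 2, 3)
```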

Create a mean value comparison with your age variable and health satisfaction in weighted and unweighted form.

	*unweighted*
	tabstat yp0101, by(age_3)

	*weighted*
	tabstat yp0101 [aw=yphrf], by(age_3) 

d) Income

As with age, generate a categorized version of household net income:

*d) by monthly household net income (<2000, 2000-3999, 4000+ Euro)
	gen hinc08_3 = hinc08
	recode hinc08_3 (0/1999=1) (2000/3999=2) (4000/99999=3)
	label define hinc08_3 1 "<2000 Euro" 2 "2000-<4000 Euro" 3 "4000+ Euro"
	label values hinc08_3 hinc08_3

Display the mean values in weighted and unweighted form:

	*unweighted*
	tabstat yp0101, by(hinc08_3)

	*weighted*
	tabstat yp0101 [aw=yphrf], by(hinc08_3)

e) Smoking

Since this variable is already categorical (yes/no), no adjustment is necessary. Display average satisfaction with health for smokers and non-smokers in weighted and unweighted form:

*e) by smoking yes/no

	*unweighted*
	tabstat yp0101, by(yp10601)

	*weighted*
	tabstat yp0101 [aw=yphrf], by(yp10601)  
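The five weighted comparisons can also be condensed into a single loop. This is a sketch that assumes age_3 and hinc08_3 have been generated as in parts c) and d):

```stata
	* Weighted mean and number of cases for each grouping variable
	foreach v of varlist sex emplst08 age_3 hinc08_3 yp10601 {
		tabstat yp0101 [aw=yphrf], by(`v') statistics(mean n)
	}
```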

Last change: Apr 07, 2026