Working with SOEP Regional Data¶

SOEP offers diverse possibilities for regional and spatial analysis. With the anonymized regional information on SOEP respondents’ (households’ and individuals’) place of residence, it is possible to link numerous regional indicators on the levels of the federal states (Bundesländer), spatial planning regions, districts, and postal codes with the data on the SOEP households. However, specific security provisions must be made due to the sensitivity of the data under data protection law. Accordingly, data users are not allowed to give any information in their analyses that could indicate, for instance, the city or district in which respondents reside. The data nevertheless provide valuable background information for regional analysis.

For more information and to access the data, see Regional Data

Assume that for your research project, you want to measure current (2016) urban-rural differences in the population. You are particularly interested in the differences in interest in politics and the different satisfaction variables provided by the SOEP. You also want to take into account demographic differences in gender and age. To be able to evaluate the potential of the data for your project, you first need an overview. For regional analysis, for example, the municipal size classes from the regional data are suitable.

Create an exercise path with four subfolders:

Example:

H:/material/exercises/do
H:/material/exercises/output
H:/material/exercises/temp
H:/material/exercises/log

These are used to store your script, log files, datasets, and temporary datasets. Open an empty do-file and define your paths with globals:

***********************************************
* Set relative paths to the working directory
***********************************************
global AVZ 	"H:\material\exercises"
global MY_IN_PATH "\\hume\rdc-prod\complete\soep-core\soep.v33.2\stata_en\"
global region "\\hume\soep-region\DATA\soep33_de\"
global MY_DO_FILES "$AVZ\do\"
global MY_LOG_OUT "$AVZ\log\"
global MY_OUT_DATA "$AVZ\output\"
global MY_OUT_TEMP "$AVZ\temp\"

The global “AVZ” defines the main path. The main paths are subdivided using the globals “MY_IN_PATH”, “MY_DO_FILES”, “MY_LOG_OUT”, “MY_OUT_DATA”, “MY_OUT_TEMP”. The global “MY_IN_PATH” contains the path to the data you ordered.

a) Prepare a dataset for cross-sectional analysis covering the survey year 2016 (wave bg).

To perform your analysis, you need different SOEP variables. The SOEP offers various options for a variable search:

Search the questionnaires for useful variables (for more information, see the section Variable Search with Questionnaires)
Find a suitable variable in the topic list on paneldata.org (for more information, see the section Topic Search with paneldata.org)
Search for a suitable variable using a search term in paneldata.org (for more information, see the section Variable Search with paneldata.org)
Use the documentation provided by the generated variables (for more information, see the section Documentation on Generated Data)

Your source file should contain the following variables:

Permanent Individual ID "persnr"
Original Household Number "hhnr"
Current Wave Household Number "bghhnr"
The Sex of the Person "sex"
Year of Birth "gebjahr"
Survey Status 2016 "bgnetto"
Sample Membership 2016 "bgpop"
Weighting Factor 2016 "bgphrf"
Satisfaction With Health "bgp0101"
Satisfaction With Sleep "bgp0102"
Satisfaction With Work "bgp0103"
Satisfaction With Housework "bgp0104"
Satisfaction With Household Income "bgp0105"
Satisfaction With Personal Income "bgp0106"
Satisfaction With Dwelling "bgp0107"
Satisfaction With Amount Of Leisure Time "bgp0108"
Satisfaction With Child Care "bgp0109"
Satisfaction With Family Life "bgp0110"
Satisfaction With Social Life "bgp0111"
Satisfaction with Democracy "bgp0112"
Political Interest "bgp143"
Current Sample Region "bgsampreg"
Federal State "bgbula"
Spatial Category by BBSR "bgregtyp"
Municipal Class Sizes “ggk”

Use the key variables from the ppath.dta dataset as your starting file.

use hhnr persnr bghhnr sex gebjahr bgnetto bgpop using ${MY_IN_PATH}\ppfad.dta, clear

Attention

Please note that since version 34 (v34), PPFAD can be found in the subdirectory “Raw” of the data distribution file. The following exercises are done with version 33.1 (v33.1), where the tracking file was named PPFAD.

Keep people who completed a questionnaire in 2016 and lived in a private household.

* Keep people who completed a questionnaire in 2016 and live in a private household
keep if bghhnr>0 & inrange(bgnetto, 10, 29) & inlist(bgpop, 1, 2)
keep hhnr persnr bghhnr sex gebjahr bgnetto bgpop
merge 1:1  persnr using ${MY_IN_PATH}\phrf.dta, keep(match master) keepusing (bgphrf) nogenerate
tempfile ppfad
save `ppfad'

Prepare the different datasets bgp, bghbrutto, regionl

* Prepare dataset bgp
use ${MY_IN_PATH}\bgp.dta, replace
keep persnr hhnr bghhnr bgp01* bgp143
tempfile bgp
save `bgp'

* Prepare dataset bghbrutto
use ${MY_IN_PATH}\bghbrutto.dta, replace
keep hhnr bghhnr bgsampreg bgbula bgregtyp
tempfile bghbrutto
save `bghbrutto'

* Prepare dataset regionl
use ${region}\regionl_v33.dta, replace
keep if syear==2016
keep syear hhnr hhnrakt ggk
rename hhnrakt bghhnr
tempfile regionl
save `regionl'

Merge all datasets.

* Merge all datasets
use `ppfad'
merge 1:1 persnr using `bgp', keep(match master) nogenerate
merge m:1 bghhnr hhnr using `regionl', keep(match master) nogenerate
merge m:1 bghhnr hhnr using `bghbrutto', keep(match master) nogenerate

Recode negative values as missings.

* Recode negative values into missings
mvdecode sex gebjahr bgp01* bgp143,mv(-5/-1)

Categorize the municipal class sizes from the SOEP regional dataset.

* Categorize community class size
gen ggk_cat=.
replace ggk_cat=-1 if ggk==-1
replace ggk_cat=1 if ggk==1 | ggk==2
replace ggk_cat=2 if ggk==3
replace ggk_cat=3 if ggk==4 | ggk==5
replace ggk_cat=4 if ggk>5 & ggk<=7

lab var ggk_cat "Community Size categorised"
lab def ggk_cat -1 "No information" 1 "<=5000" 2 "5001 - 20000" 3 "20001 - 100000" /// 
4 ">100000"
lab val ggk_cat ggk_cat

Generate an age variable.

* Generate age variable
gen alter= 2016-gebjahr if gebjahr > 0
gen alter_cat=1 if alter<=20
replace alter_cat=2 if alter>20 & alter<=30
replace alter_cat=3 if alter>30 & alter<=65
replace alter_cat=4 if alter>65 & alter<=120

lab var alter "age"
lab var alter_cat "age categorized"
lab def alter_cat 1 "<=20" 2 "21-30" 3 "31-65" 4 ">65"
lab val alter_cat alter_cat 

Categorize a federal states variable.

* Categorize federal states
gen bgbula_cat=.
* Schleswig-Holstein + Hamburg
replace bgbula_cat=1 if bgbula==1 | bgbula==2
* Lower Saxony + Bremen
replace bgbula_cat=2 if bgbula==3 | bgbula==4
* Mecklenburg Western Pomerania + Brandenburg
replace bgbula_cat=3 if bgbula==13 | bgbula==12
* Saarland + Rhineland Palatinate
replace bgbula_cat=4 if bgbula==7 | bgbula==10
* Northrhine-Westphalia
replace bgbula_cat=5 if bgbula==5
* Hesse
replace bgbula_cat=6 if bgbula==6
* Baden-Württemberg
replace bgbula_cat=7 if bgbula==8
* Bavaria
replace bgbula_cat=8 if bgbula==9
* Berlin
replace bgbula_cat=9 if bgbula==11
* Saxony
replace bgbula_cat=10 if bgbula==14
* Saxony-Anhalt
replace bgbula_cat=11 if bgbula==15
* Thuringia
replace bgbula_cat=12 if bgbula==16

lab var bgbula_cat "Federal states categorized"
lab def bgbula_cat 1 "Schleswig-Holstein/Hamburg" 2 "Lower Saxony/Bremen" 3 "Mecklenburg Western Pomerania/Brandenburg" /// 
4 "Saarland/Rhineland Palatinate" 5 "Northrhine-Westphalia" 6 "Hesse" /// 
7 "Baden-Wuerttenberg" 8 "Bavaria" 9 "Berlin" 10 "Saxony" 11 "Saxony-Anhalt" 12 "Thuringia"
lab val bgbula_cat bgbula_cat
drop bgbula
rename bgbula_cat bgbula

Put the variables in your preferred order and save your dataset.

* Order demography and identifiers first
order persnr hhnr bghhnr syear sex gebjahr alter alter_cat bgsampreg bgbula ggk /// 
ggk_cat bgregtyp  

save ${MY_OUT_DATA}\zeit_online.dta, replace

b) You want to get an initial overview of regional differences in satisfaction with various aspects of life. Use the variable bgsampreg and cross-stabilize the variable with all satisfaction variables to identify differences between East and West Germany, display the absolute and relative frequencies.

To save the tables, save them in a log file.

********************************************************************************
capture log close
log using "${MY_LOG_OUT}\satisfaction.log", replace

* Life satisfaction

local varlist bgp0101 bgp0102 bgp0103 bgp0104 bgp0105 bgp0106 bgp0107 bgp0108 /// 
bgp0109 bgp0110 bgp0111 bgp0112
foreach x of local varlist {
tab bgsampreg `x' [aw= bgphrf] , row
}

To view all tables, look at your generated log file.

c) Now take a closer look at satisfaction with various aspects of life with the help of SOEP regional data. Use the municipal size classes. Create a table showing satisfaction with different aspects of life and highlighting differences by sex, age, municipal size class, and federal state.

foreach x of local varlist {
* Tabulation of satisfaction by municipal size class and federal state
table `x' sex alter_cat, by(bgbula ggk_cat) contents(freq) column row stubwidth(20) cellwidth(8) csepwidth(2) nomissing
* Tabulation of satisfaction by municipal size class
table `x' sex alter_cat, by(ggk_cat) contents(freq) column row stubwidth(20) cellwidth(8) csepwidth(2) nomissing
* Tabulation of satisfaction by federal state
table `x' sex alter_cat, by(bgbula) contents(freq) column row stubwidth(20) cellwidth (8) csepwidth(2) nomissing 
}

To view all tables, look at your generated log file. As you can see, SOEP regional data can be used to analyze variables at the lowest regional levels.

d) Create a table that shows political interest differentiated by age, sex, and municipal size class in Bavaria

********************************************************************************
capture log close
log using "${MY_LOG_OUT}\political_interest.log", replace

* Political interest
* Tabulation of political interest by municipal size class for Bavaria
table bgp143 sex alter_cat if bgbula==8, by(ggk_cat) contents(freq) column row stubwidth(20) cellwidth (8) csepwidth(2) nomissing

As you have seen here, the SOEP offers a wide range of possibilities for regional analysis. It is possible to allocate a multitude of regional indicators at the level of federal states, regional planning regions, districts, and postal codes.

Last change: Jul 14, 2025