Data Structure of SOEP-Core

Principles of Data Analysis

Panel data have three dimensions. The first two, the units of observation (n) and a matrix of dependent and independent variables (y, x), are exactly the same as in a cross-sectional design. The third is time (t). Panel data can be stored in two formats, “wide” or “long”: in wide format the variable matrix is indexed by time, whereas in long format the units of observation are.

Regardless of the format, panel data covering several survey waves are often incomplete, either because individual units drop out (panel mortality) or because data from new panel members are only collected from a later point in time. In both cases the data are called “unbalanced” panel data. The classical panel data structure, in contrast, is “balanced”, i.e. every unit has as many observations of the dependent and independent variables as there are waves of data collection. Social science panel data typically have many units (large n) and, relative to that, few waves and therefore few points of measurement (small t).

With panel data, even descriptive analyses are often of particular interest: identifying changes in a variable over time, and separating interindividual from intraindividual change, can reveal important social facts, particularly when the sample is generalizable. For example, it matters to social scientists whether a constant 15 % share of people with an income below the at-risk-of-poverty threshold consists of the same persons over time, or whether moves into and out of poverty risk balance each other and only half of that population was permanently exposed to the risk.

The choice of more complex analysis methods for panel data depends first and foremost on the measurement level of the dependent and independent variables, but also on whether the variables are time-constant (such as gender or migration background) or time-varying. Statistical models for panel data range from structural equation models, various regression models, event history analysis, and sequence analysis to latent growth models and causal analyses using matching methods. A particular advantage of panel data is that the temporal order of changes can be modelled and that the problem of unobserved heterogeneity, which is common in the social sciences, can be substantially reduced, at least in comparison with cross-sectional data.
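
The distinction between balanced and unbalanced panels, and between a stable cross-sectional rate and stable individual membership, can be illustrated with a small sketch. The records, the function names, and the poverty flags below are invented for illustration and are not SOEP data or SOEP tooling.

```python
from collections import defaultdict

# Hypothetical long-format records: (person id, survey year, at risk of poverty?)
records = [
    (1, 2015, True), (1, 2016, True), (1, 2017, True),
    (2, 2015, False), (2, 2016, True), (2, 2017, False),
    (3, 2015, True), (3, 2016, False),   # person 3 drops out after 2016
    (4, 2016, False), (4, 2017, False),  # person 4 joins in 2016
]


def observed_years(rows):
    """Map each person to the set of years in which they were observed."""
    years = defaultdict(set)
    for pid, year, _ in rows:
        years[pid].add(year)
    return dict(years)


def is_balanced(rows):
    """A panel is balanced if every person is observed in every wave."""
    all_waves = {year for _, year, _ in rows}
    return all(seen == all_waves for seen in observed_years(rows).values())


def balanced_subset(rows):
    """Keep only persons who were observed in every wave."""
    all_waves = {year for _, year, _ in rows}
    keep = {pid for pid, seen in observed_years(rows).items() if seen == all_waves}
    return [row for row in rows if row[0] in keep]


def yearly_rate(rows):
    """Cross-sectional share of persons at risk of poverty in each year."""
    counts = defaultdict(lambda: [0, 0])
    for _, year, at_risk in rows:
        counts[year][0] += int(at_risk)
        counts[year][1] += 1
    return {year: hits / n for year, (hits, n) in sorted(counts.items())}


def always_at_risk(rows):
    """Persons who were at risk in every year in which they were observed."""
    flags = defaultdict(list)
    for pid, _, at_risk in rows:
        flags[pid].append(at_risk)
    return {pid for pid, values in flags.items() if all(values)}


balanced = balanced_subset(records)
print(is_balanced(records), is_balanced(balanced))
print(yearly_rate(balanced), always_at_risk(balanced))
```

The yearly rates describe the cross-section in each wave, while `always_at_risk` looks at change within persons; only the panel structure allows the second view.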

Cross-sectional data structure (CS)

Cross-sectional data observe many subjects at the same point in time. Each person is assigned one row and appears only once in such a data set. Merging cross-sectional SOEP data across waves yields a data set in wide format.

Row  ID  wave  sex  income
1    1   2015  m    1500
2    2   2015  m    1000
3    6   2015  f    2000
4    8   2015  m    5500

Data Structure in wide-format (wide)

The SOEP data are offered in different data structures. In wide format, a respondent’s repeated responses are stored in a single row, with each response in a separate column. Each column represents a variable. Four data sets are provided in wide format: ppath, phrf, hpath, and hhrf.

Row  ID  sex  income2015  income2016  income2017
1    1   m    1500        1500        2000
2    2   m    1000        1200        1200
3    6   f    2000        2000        2000
4    8   m    5500        6000        6500

Data Structure in long Format (long)

The long format is a compact and user-friendly data structure for longitudinal analysis. Each person has one row per survey year, so instead of separate data sets for the different waves there is a single data set that covers all survey waves. A person can therefore occur more than once: in long format, one row describes one person-year combination.

Row  ID  syear  sex  income
1    1   2010   f    1500
2    1   2011   f    1500
3    1   2012   f    2000
4    2   1999   m    1000
5    2   2000   m    1200
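
How the two formats relate can be sketched by reshaping the wide example table (IDs 1, 2, 6, 8) into person-year rows and back. This is a minimal illustration in plain Python; the function names and the column stub "income" are assumptions made for the example, and sex is treated as a time-constant identifier column.

```python
# Wide-format rows copied from the wide example table.
wide = [
    {"ID": 1, "sex": "m", "income2015": 1500, "income2016": 1500, "income2017": 2000},
    {"ID": 2, "sex": "m", "income2015": 1000, "income2016": 1200, "income2017": 1200},
    {"ID": 6, "sex": "f", "income2015": 2000, "income2016": 2000, "income2017": 2000},
    {"ID": 8, "sex": "m", "income2015": 5500, "income2016": 6000, "income2017": 6500},
]


def wide_to_long(rows, stub, id_cols):
    """Turn columns named <stub><year> into one row per person-year."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            suffix = col[len(stub):]
            if col.startswith(stub) and suffix.isdigit():
                record = {c: row[c] for c in id_cols}
                record["syear"] = int(suffix)
                record[stub] = value
                long_rows.append(record)
    return long_rows


def long_to_wide(rows, stub, id_cols):
    """Collect all person-year rows of a person into a single wide row."""
    wide_rows = {}
    for row in rows:
        key = tuple(row[c] for c in id_cols)
        record = wide_rows.setdefault(key, {c: row[c] for c in id_cols})
        record[f"{stub}{row['syear']}"] = row[stub]
    return list(wide_rows.values())


long_rows = wide_to_long(wide, "income", ["ID", "sex"])
for row in long_rows[:3]:
    print(row)
```

Four persons with three income columns each become twelve person-year rows, and `long_to_wide` restores the original table.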

Data Structure in spell format (spell)

Strictly speaking, spell data describe time periods with a defined start and end. When handling spell data, potential censoring has to be taken into account. A spell is censored if its beginning (left censored) or end (right censored) is not known precisely, either because information is missing or because the beginning or end lies outside the period of observation. A person may have only one spell over a given period: a man who is employed full-time throughout a ten-year period has just the one spell “full-time employed”, whereas in panel data the same person would have 10 observations, one per year. A person may also have many spells over a period, and spells can overlap, for example working part-time while receiving a disability pension. Spell data are useful for analysing how long units stay in a certain state and how they move into and out of that state.

Row  ID  spellnr  spelltype             begin  end   censored
1    1   1        Retired               1983   2007  left and right censored
2    1   2        Housewife/husband     1983   1984  left censored
3    1   3        Housewife/husband     1994   1994  uncensored
4    1   4        Housewife/husband     1998   1998  uncensored
5    2   1        Full-Time Employment  1984   1984  left censored
6    2   2        Full-Time Employment  1985   1985  uncensored
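
The relationship between spells and person-year observations, and the possibility of overlapping spells, can be sketched with the rows of the example table. The sketch assumes that begin and end are calendar years and that both are inclusive; the time unit of actual spell data sets may differ, and the function names are invented for this example.

```python
from itertools import combinations

# Spell rows copied from the example table: (ID, spellnr, spelltype, begin, end)
spells = [
    (1, 1, "Retired", 1983, 2007),
    (1, 2, "Housewife/husband", 1983, 1984),
    (1, 3, "Housewife/husband", 1994, 1994),
    (1, 4, "Housewife/husband", 1998, 1998),
    (2, 1, "Full-Time Employment", 1984, 1984),
    (2, 2, "Full-Time Employment", 1985, 1985),
]


def spell_to_person_years(spell):
    """Expand one spell into one (ID, year, spelltype) row per covered year."""
    pid, _, spelltype, begin, end = spell
    return [(pid, year, spelltype) for year in range(begin, end + 1)]


def overlapping_spells(rows):
    """Return (ID, spellnr, spellnr) for every pair of overlapping spells."""
    overlaps = []
    for a, b in combinations(rows, 2):
        same_person = a[0] == b[0]
        # Two closed intervals overlap if each starts before the other ends.
        if same_person and a[3] <= b[4] and b[3] <= a[4]:
            overlaps.append((a[0], a[1], b[1]))
    return overlaps


print(len(spell_to_person_years(spells[0])))
print(overlapping_spells(spells))
```

The single “Retired” spell of person 1 corresponds to 25 person-year rows, and it overlaps with each of that person’s three “Housewife/husband” spells.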

Data Distribution File

In the SOEP, each survey year is assigned to a data wave, which is abbreviated with a letter of the alphabet. Each data release is labelled with a “v” for version and a version number. The version number corresponds to the number of survey years since the beginning of the survey: the SOEP recently published version 34, the 34th since the survey began in 1984. Within a release, updates such as v34.1 may occur over time. When an update has been made, users are informed through various information channels and asked to order the data again. After ordering, the data are sent to you as a zip file.
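
The link between version number and survey year is simple arithmetic. The following sketch assumes that version n covers n survey years counted from 1984, so that updates such as v34.1 refer to the same survey years as v34; the function name is hypothetical.

```python
FIRST_SURVEY_YEAR = 1984


def latest_survey_year(version):
    """Most recent survey year covered by a release label such as 'v34' or 'v34.1'."""
    major = int(str(version).lstrip("v").split(".")[0])
    # Version 1 corresponds to 1984, so version n corresponds to 1984 + n - 1.
    return FIRST_SURVEY_YEAR + major - 1


print(latest_survey_year("v34"))
```

Under this assumption, v34 and v34.1 both cover the survey years up to 2017.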

../_images/SOEP_1.PNG

Within this zip file you will find various data sets, a “RAW” subdirectory and the “EU-SILC Clone” subdirectory.

../_images/SOEP.PNG

The data sets above the “RAW” subdirectory are a highly condensed and easy-to-analyze version of the SOEP data.

Note

SOEP strongly recommends that users use the data above the “RAW” subdirectory.

../_images/SOEP_2.PNG

The data in SOEP-Core are no longer provided only as wave-specific individual files but also pooled across all available years (in “long” format). In some cases, variables are harmonized to ensure that they are defined consistently over time. For example, income information collected up to 2001 is given in euros, and categories are adjusted where the questionnaire changed between versions. The longitudinal nature of the data is one of the biggest assets of the SOEP, which is why longitudinal data sets such as PL or HL are provided. With such a data set, longitudinal analyses can be carried out without great effort.

If you need more information about the long data structure visit the chapter Data Structure in long Format (long).

Core Data Sets

The data sets above the “RAW” subdirectory:

Tracking Data  Original Data  Survey Data  Generated Data  Spell Data
ppathl         pl             csamp        pgen            artkalen
hpathl         hl             design       hgen            biocouplm
pbrutto        biol           exit         bioage17        biocouply
hbrutto        jugendl                     bioagel         biomarsm
pbr_exit       plueckel                    kidlong         biomarsy
               abroad                      pequiv          einkalen
               vpl                         biobirth        lifespell
                                           bioedu          migspell
                                           bioimmig        pbiospe
                                           biojob          refugspell
                                           bioparen        sozkalen
                                           bioresid
                                           biosib
                                           biosoc
                                           biotwin
                                           camces
                                           cogdj
                                           cognit
                                           gripstr
                                           hconsum
                                           health
                                           hwealth
                                           interviewer
                                           mihinc
                                           pflege
                                           pkal
                                           pwealth
                                           timepref
                                           trust

Raw Data Sets

In the “RAW” directory you will find all wave-specific data sets that were used to generate the long data sets described above.

../_images/SOEP_4.PNG

Attention

Please note that the data sets above the RAW subdirectory are completely sufficient for your data analysis. The data sets used to generate the SOEP-Core data can be found in the RAW subdirectory. Detailed information about the raw data sets can be found here: Raw Data

../_images/SOEP_3.PNG

Within the “RAW” directory, each wave is identified by a letter of the alphabet: the first wave in 1984 is wave “A”, 1985 is wave “B”, and so on. To simplify the notation, the “$” sign stands for all waves of one group of data sets. For example, $H refers to all household-level data sets from AH to the current wave. For each year of SOEP data, there are separate data files for households ($H), for individual respondents ($P), and for children ($KIND), all based on interview information. These observations make up the “net” population: each of these files contains as many records as interviews were conducted. Additional data files with a limited number of variables, based on the “address log”, constitute the “gross” population of households and persons, i.e. all households and their members that were eligible for an interview in a given year. The data sets in the “RAW” directory are stored by wave and are the basis from which most of the long data sets described above are generated. In addition to these wave-specific data sets, the “RAW” directory contains data sets in cross-sectional format that have not yet been distributed in long format ($SCHOOL, $SCHOOL2, EV, EXIT, $PKALOST and PBR_HHCH).
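
The “$” notation can be expanded mechanically into wave-specific file names. The sketch below covers only the single-letter waves A to Z (1984 to 2009), because the naming of later waves is not described here; the function names are invented for illustration.

```python
import string

FIRST_WAVE_YEAR = 1984


def wave_letter(year):
    """Letter of the wave collected in `year` (1984 -> 'A', 1985 -> 'B', ...)."""
    index = year - FIRST_WAVE_YEAR
    if not 0 <= index < len(string.ascii_uppercase):
        # Waves after 'Z' are deliberately not handled in this sketch.
        raise ValueError(f"no single-letter wave for {year}")
    return string.ascii_uppercase[index]


def expand(pattern, first_year, last_year):
    """Replace '$' with each wave letter, e.g. '$H' -> ['AH', 'BH', ...]."""
    return [
        pattern.replace("$", wave_letter(year), 1)
        for year in range(first_year, last_year + 1)
    ]


print(expand("$H", 1984, 1986))
print(wave_letter(2000))
```

For example, `expand("$PBRUTTO", 1984, 1985)` gives `APBRUTTO` and `BPBRUTTO`.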

Tracking Data  Original Data  Survey Data  Generated Data
ppath          $p             phrf         $pgen
hpath          $p_mig         hhrf         $hgen
$pbrutto       $p_refugees    pbr_hhch     $kind
$hbrutto       $pausl                      $pequiv
               $pluecke                    $pkal
               $h                          $pkalost
               $h_refugees
               $post
               $jugend
               $school
               $school2
               ev
               $vp

EU-SILC-Clone

Currently, the official German EU-SILC is provided only as a cross-sectional dataset by the German Federal Statistical Office. A panel dataset will presumably be available from the year 2020 onwards (Bundesrat, 2016). As a consequence, Germany is excluded from cross-country studies exploiting the longitudinal dimension of EU-SILC. The aim of the EU-SILC clone is to provide an EU-SILC-like panel dataset for Germany from the year 2005 onwards, so that Germany can be included in cross-country studies using EU-SILC panel data. The EU-SILC-Clone is built on the Socio-Economic Panel (SOEP) and therefore includes all EU-SILC panel variables for which the required information is recorded in the SOEP.

../_images/SOEP_6.PNG
../_images/SOEP_7.PNG

The EU-SILC-Clone includes all four EU-SILC sub-datasets: the household register (D-File), the personal register (R-File), personal data (P-File), and household data (H-File). The clone datasets can be combined using the R-File, which includes both the current household ID and the person ID. ID numbers in the EU-SILC-Clone are unique and do not vary between the four datasets.

A complete documentation of the datasets can be found here: Documentation EU-SILC

Data Sets SOEP-Core

SOEP-Core contains a multitude of different datasets. To get an overview of the data, a somewhat simplified categorization helps:

Tracking Data and Survey Data files describe the development of the sample, so that users know which person or household was part of the interviewed sample in any given year. Original Data files contain the data from each year’s questionnaires without any changes except for very basic consistency checks. Generated Data make the data easier to use: they contain variables that are coded consistently across all waves and have common names, so that users can easily combine information across waves.

The SOEP also provides various data on respondents’ backgrounds, called biographical data. Conceptually, biographical data can be separated into data that do not change (such as information on parents’ education, or data from the Mother-Child Questionnaires) and data that may be updated through changes in a respondent’s life (such as new children in the birth biography, or a job change in the job history). Some of the changing data are stored as Spell Data. Each spell has a spell type, a beginning, an end, and a censoring status, which indicates whether a given employment or income spell is censored (left and/or right) or uncensored.

One of the biggest assets of the SOEP data is their longitudinal nature, i.e. repeated observations of the same unit (person or household) over time. That is why longitudinal data sets such as PL or HL are provided. Finally, some files cannot be easily categorized: some are one-time datasets, some provide information about the interviewers, and some about respondents outside of Germany.

Two datasets should be the building blocks of any analysis, because they make it very easy to define longitudinal populations: PPATHL and HPATHL. HPATHL includes all households that have been interviewed successfully at least once. Similarly, PPATHL contains all persons who have ever lived in a household that has participated in the SOEP, i.e. that is included in HPATHL, including non-respondents and children. Both files contain one record per household or person and survey year, respectively, including the survey status for that year. In addition to some time-invariant information (such as gender, year of birth, and migrant status), these files contain all identifiers needed to combine other files with PPATHL and HPATHL. Although they provide essential information, PPATHL and HPATHL alone are of little use for actual analyses. The most frequently used sources of additional information in SOEP-Core are the cross-sectional data files provided for each survey year (or “wave”) and the data sets in long format.

The SOEP data sets can be viewed based on their content classification (Tracking Data, Original Data, Survey Data, Generated Data and Spell Data), the data structure (cross-sectional (cs), wide, long, spell) and also from the respondent’s perspective. From the respondent’s perspective, data sets can contain gross or net information. In addition, some data sets provide information only at the household level and other data sets provide information at the individual level.

../_images/level.PNG

Gross information at the household or individual level is provided in the data sets hbrutto, hbrutt and pbrutto, pbrutt. Content collected with the household or individual questionnaires is original data and is stored in HL and PL. From these original data, which come from the many SOEP questionnaires, the SOEP team generates further data: user-friendly generated data sets such as pgen are created from the components of PL.

Data Processing

The following overview shows which data sets are based on which questionnaire. It also shows the stages of data processing: from the questionnaire to the wave-specific data sets to the prepared long data sets. Please note that not all data sets are based on questionnaires; many have been prepared lovingly and with great effort by our scientists and employees. The table therefore does not show the full range of available data sets.

../_images/Overview_Questionnaires_Datasets.PNG

In addition to the classic SOEP survey instruments, there are also a large number of sample-specific questionnaires whose information flows into other unlisted raw data sets (e.g. $pausl, $post, $pkalost etc.). The chapter Sample Specific Questionnaires explains why such special survey instruments exist, how they become raw data sets and in which long data sets these variables can be found.

Data Set Identifiers

Because SOEP data are collected at different observational levels, any analysis requires combining data sets by matching or merging, and merging requires identifiers. The central individual identifier is pid, which is fixed over time and across data sets. Since a person may move to a different household at any point in time, yearly household identifiers, called hid, are also necessary. The same information is stored in $hhnr, which makes matching easier depending on the data set used. Finally, each individual (respondents as well as children) can be traced back to an original household of the very first wave, either as a member or as a split-off. This household’s ID, which stays fixed no matter how often a person changes households, is called cid. In long data, the syear variable identifies the survey year of each observation. The SOEP provides additional identifiers in the various data sets to identify respondents and to offer further possibilities for merging. An excerpt of these additional identifiers is listed below:

Please note that these are not all identifier variables. The name of the identifier variable can change, depending on the data set used.

  • parid “Unchanging Personal ID of Partner (PID)”
  • pgpartnr “Person Number of Partner”
  • coupid “Couple Identifier”
  • intid “Interviewer Number”
  • intid1 “Nr of First Interviewer”
  • $hhnr “Current Wave HH Number (=HHNRAKT/HID)”
  • hhnrold “HH Number Previous Year With Person ID”
  • vpersnr “ID of Deceased Person”
  • bymnr “Person Number Mother”
  • byvnr “Person Number Father”
  • mnr “Person Number Mother”
  • fnr “Person Number Father”
  • kidpnr01-kidpnr15 “PERSNR 1st. Child” - “PERSNR 15th Child”
  • sibpnr1-sibpnr11 “Person ID, 1. sibling” - “Person ID, 11. sibling”
  • persnre “Never Changing Person ID Respondent” (mostly Mother)
  • pnrtwin “Person Number 2. Sibling”
  • pnrtrip “Person Number 3. Sibling”
  • pnrquad “Person Number 4. Sibling”
  • pnralt “Old Household And Person Number”
  • pnrneu “New Household And Person Number”
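
How the central identifiers work together when merging can be sketched with three tiny, made-up extracts. Only pid, hid, and syear are real identifier names here; the variables satisfaction and rent, all values, and the helper `left_merge` are hypothetical, and the helper assumes that the key combination is unique in the data set that is merged in.

```python
# Hypothetical extracts in long format.
ppathl = [
    {"pid": 101, "syear": 2015, "hid": 11},
    {"pid": 101, "syear": 2016, "hid": 12},  # person 101 moved to a new household
    {"pid": 102, "syear": 2015, "hid": 11},
]
pl = [
    {"pid": 101, "syear": 2015, "satisfaction": 7},
    {"pid": 101, "syear": 2016, "satisfaction": 8},
]
hl = [
    {"hid": 11, "syear": 2015, "rent": 600},
    {"hid": 12, "syear": 2016, "rent": 750},
]


def left_merge(left, right, keys):
    """Attach the columns of `right` to every row of `left` with matching keys."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged.append({**match, **row})  # rows without a match stay unchanged
    return merged


# Person-level data are merged on pid and syear ...
persons = left_merge(ppathl, pl, ["pid", "syear"])
# ... and household-level data on hid and syear.
persons_hh = left_merge(persons, hl, ["hid", "syear"])

for row in persons_hh:
    print(row)
```

Starting from the tracking file keeps person 102 in the result even though there is no matching row in the person-level extract, while the household information is still attached through hid and syear.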

Versioning and Harmonization

In some cases, variables in long format with the same content but collected differently need to be harmonized to ensure that they remain consistent and comparable over time. Starting with SOEP Core v.34, SOEP offers versioning and harmonization solutions for such variables in all Original Data in long format. The SOEP user community can recognize these versions and harmonizations in the variable name. The “_v” suffix indicates possible differences in a variable. Harmonization suggestions generated by SOEP from the different versions of these variables can be recognized with the “_h” suffix. In general, particular caution is required when using variables marked “_v” or “_h”:

1.) Differences in Response Options

Variables are versioned and harmonized because the response options have changed over time.

2.) Differences in Coding of Response Options

Variables are versioned and harmonized because the coding of the response options has changed over time. Since the values of certain response options can change, it is not possible to easily integrate the various wave-specific variables into a variable in long format. The variable must be appropriately harmonized to be useable.

3.) Content Differences in the Questions

Variables are versioned and harmonized because the questions were asked differently in different years, but the content belongs together. If the content or wording of the question changes, the wave-specific variables cannot easily be integrated into a long variable.

4.) Change of Question Type

Variables are versioned and harmonized because the questions were asked differently in different years, for example as a question with multiple response options and later as a question with a single response option. A possible multiple answer in certain years makes it difficult to easily integrate the wave-specific variables into a variable in long format.

5.) Euro harmonization

Variables are versioned and harmonized because they are metric and were asked as DM amounts before the introduction of the euro. For the long version of the variable, metric variables based on different currencies in different years are harmonized as euro amounts.

6.) Differences in metric variables

Variables are versioned and harmonized if they contain a year and were provided in the wave-specific raw data with different numbers of digits. The years are standardized and presented in the harmonized version with four digits. In addition, possible problems with decimal digits in metric variables from the raw datasets are corrected for the long format variable.

7.) Different respondents

Variables are versioned and harmonized when different groups of respondents have received different survey instruments and the variables have not been integrated in the wave-specific raw data sets. Special samples or a specific filtering in the questionnaire can lead to certain groups of people receiving different questions, which belong together in terms of content. Such different variables are harmonized in the long version of the variable.

A more detailed explanation of the versioning and harmonization concept can be found in the exercise chapter Working with harmonized Variables.
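
Two of the cases above, euro harmonization (5) and the standardization of years to four digits (6), lend themselves to a small sketch. The code is not the SOEP’s actual harmonization routine: the function names are invented, the cut-off year 2001 is taken from the statement that income information up to 2001 is converted, and two-digit years are assumed to belong to the 1900s. The conversion rate of 1.95583 DM per euro is the official fixed rate.

```python
DM_PER_EURO = 1.95583  # official fixed conversion rate


def harmonize_amount(amount, syear, last_dm_year=2001):
    """Return the amount in euros; amounts up to `last_dm_year` are assumed to be DM."""
    if syear <= last_dm_year:
        return round(amount / DM_PER_EURO, 2)
    return amount


def harmonize_year(value, century=1900):
    """Expand a two-digit year to four digits; other values pass through unchanged."""
    if 0 <= value < 100:
        return century + value  # assumption: two-digit years belong to the 1900s
    return value


print(harmonize_amount(1000, 1999))
print(harmonize_amount(1000, 2005))
print(harmonize_year(84), harmonize_year(1984))
```

After such a step, values from different waves can be stored in one long-format variable with a single unit and a single coding.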

Core Data

Tracking Data

Tracking data are the basis for linking your research-relevant variables. In addition to various demographic information, they provide information on how the interview was conducted. Treat these data sets as your starting point: use the person and household numbers in the tracking data to merge in the variables relevant to your research.

Dataset   Label                     Format  Identifier (ID)  Additional Identifier
ppathl    Individual Tracking File  long    pid, syear       hid, cid, parid
hpathl    Household Tracking File   long    hid, syear       cid
pbrutto   Gross Individual Data     long    pid, syear       hid, cid, intid, hhnrold
hbrutto   Gross Household Data      long    hid, syear       cid, intid1, intid
pbr_exit  Cumulated Exit            long    pid, syear       hid, cid, hhnrold

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions. (pid=persnr; hid=hhnrakt; cid=hhnr).

hpathl “Household Tracking File” (long): HPATHL consists of all waves of the raw data sets HPATH and HHRF. For all years since 1984, it contains information on all households that have ever participated in the SOEP survey. HPATHL is important for delimiting the unit of analysis (household), especially in longitudinal analyses. It is particularly suitable for household analyses and can be used to pre-select specific households.

ppathl “Individual Tracking File” (long): PPATHL consists of all waves of the raw data sets PPATH and PHRF. For all years since 1984, it contains information on all persons who have ever lived in a SOEP household at the time of a survey (i.e. all respondents, but also children under 17 years of age and persons who have never given an interview). PPATHL is important for delimiting the units of analysis (persons), especially in longitudinal analyses. It contains one record for each individual and each year in which that person was a member of a respondent household, and it is keyed on pid and syear, the survey year identifier. It contains the household ID, the time-constant individual characteristics, individual weights, and the response status of the individual in each wave.

pbrutto “Gross Individual Data” (long): PBRUTTO consists of all waves of the raw data sets $PBRUTTO. It covers all respondents who were successfully interviewed for the first time in a wave or were contacted in order to be interviewed again. The data set provides gross information on all SOEP respondents’ interviews as well as their positions in the panel framework.

hbrutto “Gross Household Data” (long): HBRUTTO consists of all waves of the raw data sets $HBRUTTO. It covers all households that were successfully interviewed for the first time in a wave or were contacted in order to be interviewed again. The data set provides gross information on all SOEP households’ interviews as well as their positions in the panel framework.

pbr_exit “Cumulated Exit” (long): The dataset pbr_exit supplements pbrutto for individual dropouts. Individual dropouts are removed from the original pbrutto population, so that pbrutto covers all current household members. Pbr_exit holds the corresponding register information for individuals who dropped out of households.

Original Data

These data sets contain the information provided directly by the respondents. The contents of these variables mirror the contents of the survey instruments. By searching the questionnaires you can determine the exact wording of a question and any filter instructions.

Dataset   Label                                                     Format  Identifier (ID)  Additional Identifier
pl        Personal questionnaire                                    long    pid, syear       hid, cid, intid
hl        Household questionnaire                                   long    hid, syear       cid, intid
biol      Biographical Data                                         long    pid, syear       hid, cid, intid
jugendl   Youth questionnaire for first time respondents at age 17  long    pid, syear       hid, cid, intid
plueckel  Follow-Up Questioning                                     long    pid, syear       hid, cid, intid
abroad¹   Questionnaire for people moved abroad                     long    pid, syear       hid, cid
vpl       Deceased Person                                           long    vpid, syear      hid, cid, intid

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions. (pid=persnr; hid=hhnrakt; cid=hhnr).

pl “Individual questionnaire” (long): The PL data set contains all waves of the $P data sets of SOEP-Core. In addition, it includes all variables of all waves of the data sets $POST and $PAUSL. This means that the PL data set contains all variables of the individual questionnaire for all waves. The individual-specific data of the samples IAB-SOEP Migration and IAB-BAMF-SOEP Refugee Survey are also integrated in the PL data set.

hl “Household questionnaire” (long): HL contains all waves of the data sets $H from SOEP-Core. This means that the HL data set includes all questions of the household questionnaire. The household-specific data of the samples IAB-SOEP Migration and IAB-BAMF-SOEP Refugee Survey are also integrated in the HL data set.

biol “Biographical Data” (long): BIOL contains cumulated individual-level raw data from the biographical questionnaire and from wave-specific biographical modules of the individual questionnaire. BIOL is intended for advanced users who want to complete or modify the generated biographical variables, and it is meant to be used in addition to the generated biographical files.

jugendl “Youth questionnaire for first time respondents at age 17” (long): JUGENDL contains the waves from Q (2000) up to the current wave of $JUGEND of SOEP-Core. Since 2000 (wave Q), first-time respondents aged 16 to 17 have received a separate biographical questionnaire with additional age-group-specific questions, for instance about their relationship to their parents or about what they do in their free time. Up to now, only some of the data collected in this survey have been processed and provided to users in the dataset BIOAGE17. The complete data will be provided in the individual JUGENDL dataset.

plueckel “Follow-Up Questioning” (long): The PLUECKEL data set contains all waves of the $PLUECKE data sets of SOEP-Core. Temporary drop-outs (“gaps”) can cause problems for longitudinal analyses, especially for the stored employment and income data. That is why the SOEP tries to fill in at least some of the central missing information. PLUECKEL is based on a short questionnaire covering information on the previous year, in which the drop-out occurred. It covers questions on job-related changes, the calendar of occupation, income, education, and qualification.

abroad “Questionnaire for people moved abroad” (CS): With the pilot study “Life outside Germany” in 2008, the longitudinal German Socio-Economic Panel Study (SOEP) ventured into completely uncharted methodological territory: it attempted to locate the addresses of former participants who had since emigrated and to survey them with a specially developed written questionnaire on the reasons for their international move. The project was discontinued in 2014 due to insufficient case numbers.

vpl “Questionnaire for Deceased Person” (long): The VPL data set contains all waves of the $VP data sets of SOEP-Core. The VPL file contains information about respondents who lost a person in the previous year. It provides information about the deceased person and the respondent who reported the death.

Survey Data

These data sets contain survey-methodological information for SOEP-Core. They provide, for example, detailed exit information on respondents and the household weighting factors that you need for representative analyses.

Dataset    Label                   Format  Identifier (ID)  Special Identifier
csamp      Sample Definition       long    cid
design     Survey Design           CS      hhnr             intid
exit¹      Cumulative drop-outs    CS      pid              cid, syear
pbr_hhch¹  PBR_HHCH                CS      pid              hid, syear, cid, pnralt, pnrneu, hhnrold
cirdef     Randomized Survey File  long    hhnr

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions. (pid=persnr; hid=hhnrakt; cid=hhnr).

csamp “Sample Definition” (long): The dataset CSAMP [SAMP] contains detailed sampling information for each of the original sampling households at the case level [cid / hhnr].

design “Survey design” (CS): The dataset DESIGN provides information on the stratified sampling of the SOEP in form of two variables. The variable STRAT identifies each of the discrete sampling groups described above. Altogether, the SOEP consists of 40 strata: one stratum in sample A, twenty-seven in sample B, one in sample C, three in sample D, one in sample E, two in sample F, four in sample G, and one in sample H. Unique inclusion probabilities pertain to each of these strata. The variable design contains the inverse of this probability, i.e., the design weight.
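
The relationship between inclusion probability and design weight can be sketched as follows. The probabilities, incomes, and function names are invented for illustration; in practice the weights are read from the data rather than computed by the user.

```python
def design_weight(inclusion_probability):
    """Design weight as the inverse of the inclusion probability."""
    if not 0 < inclusion_probability <= 1:
        raise ValueError("inclusion probability must be in (0, 1]")
    return 1 / inclusion_probability


def weighted_mean(values, weights):
    """Mean of `values` in which each unit counts `weight` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Two hypothetical households from strata with different inclusion probabilities.
probabilities = [0.001, 0.004]
incomes = [2000, 3000]

weights = [design_weight(p) for p in probabilities]
print(weights)
print(weighted_mean(incomes, weights))
```

A household drawn with a lower probability represents more households in the population and therefore receives a larger weight.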

exit “Whereabout-study [Verbleibstudie]” (long): The dataset EXIT delivers the results of the whereabouts study [Verbleibstudie] conducted by Kantar Public (formerly TNS Infratest) in 2008/2009. This study was used to identify reasons for (demographic) dropouts. Deaths identified in the study are recorded in the corresponding variables in PPATH/L [todjahr, todinfo].

pbr_hhch “PBR_HHCH” (long): The dataset pbr_hhch is a subfile of pbrutto, which has been used from 1984 till 2009 to identify individuals with household split-offs for the subsamples A-H.

cirdef “Randomized Survey File” (long): This dataset contains randomized groups of original sampling households [rgroup]. The groups can be used to select representative shares across all subsamples, with full representation of all cross-sectional and longitudinal information (variables) at all levels (case, households, individuals, spells) for the entire SOEP population across waves.

Generated Data

The SOEP team has specially prepared these data sets for you. They are prepared in a research-friendly manner and are subjected to additional plausibility checks and quality controls. They usually combine several variables from different survey instruments and are described in the documentation provided. These data sets therefore cannot be assigned 1:1 to a survey instrument.

Dataset      Label                                                         Format  Identifier (ID)      Additional Identifier
pgen         Generated Individual Data                                     long    pid, syear           hid, cid, pgpartnr
hgen         Generated Household Data                                      long    hid, syear           cid
bioage17¹    Generated biographical youth information                      CS      pid                  hid, syear, cid, bymnr, byvnr, intid
bioagel¹     Generated biographical information                            long    pid, syear, persnre  hid, cid
kidlong¹     Data on children                                              long    pid, syear           hid, cid
pequiv       Cross-national Equivalent File                                long    pid, syear           hid, cid
biobirth¹    Generated biographical information                            CS      pid                  cid, kidpnr01-kidpnr15
bioedu¹      Generated biographical information                            CS      pid                  cid
bioimmig¹    Generated biographical information                            long    pid, syear           hid, cid
biojob¹      Generated biographical information                            CS      pid                  cid
bioparen¹    Generated biographical information                            CS      pid                  cid, fnr, mnr
bioresid¹    Generated biographical information                            CS      pid                  hid, syear, cid, intid
biosib¹      Generated biographical information                            CS      pid                  cid, sibpnr1-sibpnr11
biosoc¹      Generated biographical information                            CS      pid                  hid, syear, cid, intid
biotwin¹     Generated biographical information                            CS      pid                  cid, pnrtwin, pnrtrip, pnrquad
camces¹      Highest Educational Qualification, Migrants Sample M1 and M2  CS      pid                  hid, syear, cid
cogdj¹       Data on cognitive tests (Youth)                               CS      pid                  syear, cid
cognit¹      Data on cognitive potential                                   CS      pid                  syear, cid, intid
gripstr¹     Measures grip strength                                        CS      pid                  syear, cid, intid
hconsum¹     Household Consume Module                                      CS      hid                  syear, cid
health¹      Data on health indicators                                     CS      pid                  syear, cid
hwealth      Wealth Module                                                 long    hid, syear           cid
interviewer  Data on the SOEP Interviewer                                  long    intid, syear         cid
mihinc       Multiple imputed data on monthly household income             long    hid, syear           cid
pflege       Persons needing care within the household                     long    pid, syear           cid
pkal         Individual Calendar                                           long    pid, syear           hid, cid
pwealth      Wealth Module                                                 long    pid, syear           hid
timepref¹    Experiment on time preferences                                CS      pid                  hid, syear, cid
trust        Experiment on trust                                           long    pid                  hid, syear, cid

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions. (pid=persnr; hid=hhnrakt; cid=hhnr).

pgen “Generated Individual Data” (long): PGEN contains all waves of the $PGEN data sets of SOEP-Core. The PGEN file contains user-friendly data at the individual level, consolidated from different sources. Plausibility is validated longitudinally in many respects, so the data are superior to the data in PL in most situations. The file contains one row per survey year for each person with a completed individual or youth questionnaire (the combination of pid and syear is unique).

hgen “Generated Household Data” (long): HGEN contains all waves of the $HGEN data sets of SOEP-Core. In order to minimize computing efforts for the user, the SOEP provides yearly status variables on household level. The HGEN data provides a set of time-consistent variables generated from the SOEP Household Questionnaire. It only includes households who participated in the respective year.
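Because PGEN carries hid and syear as additional identifiers and HGEN uses them as its key, household information can be attached to person-year rows. The following is a minimal sketch with toy rows, assuming only the identifier names from the table above; the content variables p_value and h_value are hypothetical placeholders.

```python
# Toy person-year rows (as in PGEN) and household-year rows (as in HGEN).
# "p_value" and "h_value" are hypothetical content variables.
pgen_rows = [
    {"pid": 101, "hid": 1, "syear": 2010, "p_value": "a"},
    {"pid": 102, "hid": 1, "syear": 2010, "p_value": "b"},
    {"pid": 101, "hid": 2, "syear": 2011, "p_value": "c"},
]
hgen_rows = [
    {"hid": 1, "syear": 2010, "h_value": "x"},
    {"hid": 2, "syear": 2011, "h_value": "y"},
]


def attach_household(person_rows, household_rows):
    """Left-join household rows onto person rows via the key (hid, syear)."""
    lookup = {(h["hid"], h["syear"]): h for h in household_rows}
    merged = []
    for person in person_rows:
        household = lookup.get((person["hid"], person["syear"]), {})
        # Person fields come last so they win if a name exists in both rows.
        merged.append({**household, **person})
    return merged


linked = attach_household(pgen_rows, hgen_rows)
```

Person rows without a matching household row are kept unchanged, which mirrors a left join and makes unmatched cases easy to spot.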

bioage17 “Generated biographical information” (CS): The design of the dataset BIOAGE17 is patterned after the 2001 Youth Questionnaire, which is the standard version for subsequent years. A special group of first time respondents are young persons living in a panel household, who reach the surveying age of 17 years. From this specific group of panel entrants, we are able to obtain some more detailed information on youth and socialisation than from other new sample members.

bioagel “Generated biographical information” (long): The BIOAGEL data files are generated using information collected in the “Mother & Child” and “Parent” questionnaires. BIOAGEL is now provided in one dataset.

kidlong “Data on children” (long): The variables stored in the KIDLONG file are based on the information annually collected and stored in the wave-specific $KIND files. The relevant information is not provided by children themselves but by answers to the questions in the household questionnaire given by the respondent within the household (mostly the head of the household). This data is reaggregated at the person level and stored as child-specific entries in the file $KIND.

pequiv “Cross-national Equivalent File” (long): PEQUIV contains all waves of the $PEQUIV data sets of SOEP-Core. The PEQUIV file is based on the Cross-National Equivalent File (CNEF) with extended income information for the SOEP. This file comprises not only the aggregated income figures provided in the CNEF but also further single income components.

pkal “Individual Calendar” (long): PKAL contains all waves of the $PKAL data sets of SOEP-Core. The PKAL datasets contain calendar variables from the Individual Questionnaire. The dataset includes the activity status on a monthly basis as well as the income status of a person.

biobirth “Generated biographical information” (CS): The file BIOBIRTH provides information on fertility histories of adult respondents in the SOEP. Until 2014 (version 30, wave BD) the data was stored in two separate files: BIOBIRTH containing female fertility histories, and BIOBRTHM providing male fertility histories. Fertility histories in BIOBIRTH provide information on every woman (as well as every man with a panel entry since 2001) who has ever provided at least one successful SOEP interview.

bioedu “Generated biographical information” (CS): The Socio-Economic Panel Study (SOEP) contains a broad range of variables which cover early child education and care, educational participation, educational degrees and other related topics. It is the aim of the BIOEDU dataset to provide ready-made variables on educational transitions and related topics in order to support analyses in a longitudinal perspective.

bioimmig “Generated biographical information” (long): The variables contained in BIOIMMIG deal with questions related to foreigners in (and migrants to) Germany. Specifically, they cover the desire to return to the home country, the presence of relatives in the home country, reasons for coming to Germany, and conditions upon initial arrival in Germany.

biojob “Generated biographical information” (CS): The purpose of BIOJOB is to provide a file that offers the user convenient access to biographical information on past job activities. BIOJOB consists of generated variables as well as plain questionnaire information. Up to now, all but two variables of BIOJOB are time-invariant. Information on occupational changes and on the age at the most recent change of occupation refers to the date of the respondent’s biography interview.

bioparen “Generated biographical information” (CS): The dataset BIOPAREN contains biography entries on the parents and on the social origin of the respondents. The information available in BIOPAREN is obtained in two different ways. On the one hand, BIOPAREN includes the children’s proxy entries on the parents from the Biography Questionnaire and the Youth Questionnaire. On the other hand, it contains the direct entries from the parents when the respondent lives in the same household as his or her parents. Please note that BIOPAREN focuses on the social parent. Biological parent identifiers can be found in BIOBIRTH.

bioresid “Generated biographical information” (CS): In 1994 questions with a focus on occupancy were introduced to the Biographical Questionnaire asking for the duration of residence in the current dwelling and any second residence. The information surveyed in the Biographical Questionnaire is stored in the file BIORESID.

biosib “Generated biographical information” (CS): BIOSIB provides information on siblings living within the SOEP households. The data set contains the person numbers of all siblings in an observed family. It includes information on their sex, their year of birth, the number of siblings, the individual’s position within the birth order, and on the relationship between the observed siblings.

biosoc “Generated biographical information” (CS): Contains retrospective data on youth and socialization. Respondents of all ages describe aspects of their life at the age of 15, including their relationship with parents, grades in school, the federal state where they last attained educational qualifications, detailed information on vocational qualifications, as well as intentions to complete further education or vocational training. Questions concerning military and alternative services are also included in this data set.

biotwin “Generated biographical information” (CS): The file BIOTWIN contains all twins that were ever identified within the SOEP. To be classified as a twin, a person is required to have exactly the same age as his or her sibling (year and month of birth), have a relationship to the head of the household that indicates that he or she and a second person are siblings, and have the same mother (as far as a pointer to the mother is available). In addition to twins, the BIOTWIN data set also records triplets and quadruplets.

camces “Highest Educational Qualification, Migrants Sample M1 and M2” (CS): The CAMCES-File provides information about Computer-Assisted Measurement and Coding of Educational Qualifications in Surveys.

cogdj “Data on cognitive tests (Youth)” (CS): In SOEP 2006, a separate questionnaire with cognitive tests for adolescents was used for the first time: “Lust auf DJ”. In this case, “DJ” stands for “Thinking Sports and Youth (Denksport und Jugend)”, but was also specifically selected to arouse the more common association of “Disc Jockey”. The questionnaire “Lust auf DJ” was created for and administered to all interviewees aged 16 to 17 years.

cognit “Data on cognitive potential” (long): In the 2006 survey year, for the first time, short cognitive tests were carried out with a subsample of the SOEP. The goal was to employ a robust set of instruments that could be administered easily by trained interviewers within just a few minutes. In COGNIT06 users are provided with the aggregated sum scores (total values for three time packages, so-called “parcels” of 30, 60 and 90 seconds).

gripstr “Measures grip strength (left and right hand)” (long): The data on grip strength from the survey year 2012 is now included in the GRIPSTR dataset.

hconsum “HH consume module” (CS): We were faced with three methodological challenges in generating the final consumption data. Firstly, due to the design of the consumption module, inconsistent answers arose between the monthly and annual amounts spent for consumption. Secondly, we encountered the well-known phenomenon of missing data, here in particular item nonresponse. And thirdly, consumption data are usually blurred by heaping. For researchers who do not want their consumption variables to include changes from all steps of data preparation, the new data set “HCONSUM” contains not only the prepared consumption variables but also flag variables providing researchers the opportunity to select individual solutions.

health “Data on health indicators” (long): In 2002, the SOEP health module in the Individual Questionnaire was revised and has since been repeated every two years. In the HEALTH file users find, e.g., the generated variables on height and weight with imputation flags and a user-friendly, longitudinally checked generated variable for the Body Mass Index (BMI).

hwealth “Wealth module” (long): The generated SOEP wealth data is stored in two separate data files called PWEALTH for information at the individual level and HWEALTH for correspondingly aggregated data at the household level. HWEALTH contains all information on the household level; it is purely the result of aggregating the person-level information in PWEALTH. However, for all persons with valid household-level information who refused to respond to the Individual Questionnaire (partial unit non-response), imputations have been carried out and the results are included in HWEALTH.

interviewer “Data on the SOEP Interviewer” (long): The SOEP does not only aim at collecting high-quality data on the living conditions and well-being of households; as a by-product of internal quality assurance processes, it increasingly lends itself as an empirical source for survey research. The purpose of the INTERVIEWER file is to give users convenient access to all available longitudinal information on the SOEP interviewers.

mihinc “Multiple imputed data on monthly household income” (long): The dataset MIHINC contains the complete imputation results and is separately available. To be compatible with methods for analysing multiply imputed data, MIHINC is constructed in the so-called stacked or MIM dataset format. It contains the following variables: HHNRAKT, SVYYEAR, MJ, MI, IHINC and IMPFLAG. Since 1995, there are ten imputed values for the current household income for every survey household in all survey years.
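In a stacked format, the imputed copies of the data are placed underneath each other, and an estimate is typically computed once per imputation and then averaged. The sketch below illustrates this for a simple mean. It rests on assumptions that should be checked against the MIHINC documentation: that MJ numbers the imputation and that IHINC holds the imputed income. The toy data use two imputations instead of ten to keep the example short.

```python
from collections import defaultdict
from statistics import mean

# Toy stacked rows. Assumption: "mj" is the imputation number and "ihinc"
# the imputed monthly household income; the values are invented.
rows = [
    {"hhnrakt": 1, "svyyear": 2010, "mj": 1, "ihinc": 2000.0},
    {"hhnrakt": 2, "svyyear": 2010, "mj": 1, "ihinc": 3000.0},
    {"hhnrakt": 1, "svyyear": 2010, "mj": 2, "ihinc": 2200.0},
    {"hhnrakt": 2, "svyyear": 2010, "mj": 2, "ihinc": 3000.0},
]


def pooled_mean(stacked_rows):
    """Average the per-imputation means (point estimate only, no variance)."""
    by_imputation = defaultdict(list)
    for row in stacked_rows:
        by_imputation[row["mj"]].append(row["ihinc"])
    per_imputation = [mean(values) for values in by_imputation.values()]
    return mean(per_imputation)


estimate = pooled_mean(rows)
```

This covers only the pooled point estimate; a full analysis would also combine within- and between-imputation variance, which dedicated multiple-imputation routines handle.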

pflege “Persons needing care within the household” (long): Since wave B (1985) the SOEP Household Questionnaire includes questions on household members in need of care. In order to support analyses on an individual level, this information has been restructured and stored in the cumulative file PFLEGE.

pwealth “Wealth module” (long): In the year 2002, the Individual Questionnaire included for the first time a special module focusing on wealth. This section included questions on seven different wealth components: Owner-occupied property (including debt), other property (including debt), financial assets, private pensions (including life insurance and building savings contracts), business assets, tangible assets and consumer credit. The generated SOEP wealth data is stored in two separate data files called PWEALTH for information at the individual level and HWEALTH for correspondingly aggregated data at the household level. Wealth-related variable names in the file PWEALTH consist of six digits. The first digit tells the user which wealth component is referred to, and the second to sixth digits provide more detailed information about possible filter information, the personal share, the gross amount, and the amount of any outstanding debt. In principle, a digit is coded “1” if a given variable does indeed contain this specific piece of information and “0” otherwise. The wealth information in the SOEP questionnaire is surveyed at the individual level and thus also imputed or edited at the individual level (although checked against household information for consistency).

timepref “Experiment on time preferences” (CS): Following on the behavioral experiment on trust and trustworthiness carried out in the 2003, 2004, and 2005 SOEP surveys, the experiment “time preferences” was run in 2006. In this experiment on economic behavior, respondents were asked to decide how they would want to receive €200 in prize money: if they would want to receive it immediately by check, or if they would want to wait and receive a larger amount later, that is, with interest.

trust “Experiment on trust” (long): This data set stems from the economic behavior experiment on trust and trustworthiness in the survey years 2003, 2004 and 2005, which measures trust by means of an investment game. This is a one-off game for two actors who relate to each other anonymously. The first player receives a credit of ten points and can transfer any number of these points to the second player. Each transferred point is doubled. The second player also receives a credit of ten points. After receiving the (doubled) points from the first player, the second player decides how much of his or her own credit to transfer to the first player (zero to ten points). As with the first transfer, the points are doubled when they reach the recipient. After the decision of the second player, the game ends and both players are paid their income (one point corresponds to one euro; the sum is sent out as a cheque a few days later). The TRUST data set contains the information from all three waves in which the behavioral experiment was conducted.
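The payoff arithmetic of the game can be made concrete with a short sketch. It reads the description as an investment game in which each player starts with ten points, may transfer zero to ten of them, and every transferred point is doubled for the recipient. The function and constant names are illustrative and do not correspond to variables in the TRUST data set.

```python
from typing import Tuple

ENDOWMENT = 10   # points credited to each player at the start
MULTIPLIER = 2   # every transferred point is doubled for the recipient


def payoffs(sent_by_first: int, sent_by_second: int) -> Tuple[int, int]:
    """Return the final points (= euros) of the first and the second player."""
    for amount in (sent_by_first, sent_by_second):
        if not 0 <= amount <= ENDOWMENT:
            raise ValueError("transfers must lie between 0 and 10 points")
    first = ENDOWMENT - sent_by_first + MULTIPLIER * sent_by_second
    second = ENDOWMENT - sent_by_second + MULTIPLIER * sent_by_first
    return first, second


# No trust at all leaves both players with their endowment;
# full mutual transfers double what each player ends up with.
no_trust = payoffs(0, 0)
full_trust = payoffs(10, 10)
```

For example, if the first player transfers 4 points and the second returns 3, the first player ends with 10 - 4 + 6 = 12 points and the second with 10 - 3 + 8 = 15 points.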

Spell Data

Spell, duration or event history data are used frequently in the social sciences. In the strict sense of the word, spell data describe time periods with a defined start and end. General information about the data structure of spell data can be found in the chapter Data Structure in spell format (spell).
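To illustrate what a period with a defined start and end means in practice, the following sketch collapses a monthly status sequence into spells. It is a minimal illustration under stated assumptions: months are numbered consecutively, and the field names state, begin and end are chosen for this example rather than taken from a SOEP spell file.

```python
def months_to_spells(monthly_states):
    """Collapse (month_number, state) pairs into spells with begin and end."""
    spells = []
    for month, state in sorted(monthly_states):
        last = spells[-1] if spells else None
        # Extend the current spell only if the state is unchanged and the
        # month follows directly; a gap or a new state opens a new spell.
        if last and last["state"] == state and last["end"] == month - 1:
            last["end"] = month
        else:
            spells.append({"state": state, "begin": month, "end": month})
    return spells


# Toy calendar: month 5 is missing, so the last spell starts anew in month 6.
calendar = [(1, "employed"), (2, "employed"), (3, "unemployed"),
            (4, "employed"), (6, "employed")]
spells = months_to_spells(calendar)
```

Each resulting row describes one uninterrupted episode, which is the unit of analysis in duration and event history models.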

Working with spell data:

Working with spell data (pdf):

Working with spell data (do-files):

How to generate spell data from data in wide format: Based on the Migration Biographies of the IAB-SOEP Migration Sample:

Generating spell data:

Dataset | Label | Format | Identifier (ID) | Additional Identifier
artkalen | Spell data from the activity calendar | spell | pid | cid
biocouplm | Generated biographical information | spell | pid | cid, coupid
biocouply | Generated biographical information | spell | pid | cid
biomarsm | Generated biographical information | spell | pid | cid
biomarsy | Generated biographical information | spell | pid | cid
einkalen | [deprecated] Spell data on income | spell | pid | cid
lifespell | Spell Information on the Pre- and Post-Survey History of SOEP-Respondents | spell | pid | cid
migspell | Migration history | spell | pid | cid
pbiospe | Generated biographical information | spell | pid | cid
refugspell | Migration history | spell | pid | cid
sozkalen | [deprecated] Spell data on social benefits | spell | hid, cid |

artkalen “Spell data from the activity calendar” (long): ARTKALEN contains monthly spells for events starting in January 1983. This is in contrast to PBIOSPE, where spells are recorded in yearly durations and events prior to 1983 are included. The information on activity status is collected on a monthly basis in the yearly Individual Questionnaire and stored in the file ARTKALEN.

biocouplm “Generated biographical information” (long): With BIOCOUPLM the SOEP provides consistent and continuous partnership histories for nearly all adult respondents. BIOCOUPLM is built on the prospective information at the time of each interview. The relationship histories are collected on a monthly basis from all adult SOEP participants since their entry into the SOEP.

biocouply “Generated biographical information” (long): With BIOCOUPLY the SOEP provides consistent and continuous partnership histories for nearly all adult respondents. BIOCOUPLY is built on retrospective and prospective information at the time of each interview. The relationship histories are provided on an annual basis.

biomarsm “Generated biographical information” (long): With BIOMARSM the SOEP provides consistent and continuous marital histories for nearly all adult respondents. BIOMARSM is built on the prospective information at the time of each interview. The marital histories are collected on a monthly basis from all adult SOEP participants since their entry into the SOEP.

biomarsy “Generated biographical information” (long): With BIOMARSY the SOEP provides consistent and continuous marital histories for nearly all adult respondents. BIOMARSY is built on retrospective and prospective information at the time of each interview. The marital histories are provided on an annual basis.

einkalen “[deprecated] Spell data on income” (long): The income calendar is used to gain information about sources of income throughout the year. The respondent checks off for each month all appropriate sources of income.

lifespell “Spell Information on the Pre- and Post-Survey History of SOEP-Respondents”: The SOEP team regularly conducts drop-out studies to identify the whereabouts of attritors. These studies draw on official register data and allow us to determine whether a person is still living in Germany, is deceased, or has moved abroad since the last SOEP interview. The information is combined in the spell file LIFESPELL. This dataset reports all available information on the pre- and the post-survey history of all persons who have ever been a member of a SOEP household.

migspell “Migration history” (long): MIGSPELL is derived from the migration biographies, which are collected from each new respondent of the IAB-SOEP-Migration-Samples M1 and M2. It contains data on the moves of foreign-born migrants as well as on the stays abroad of German-born respondents.

pbiospe “Generated biographical information” (long): The spell file PBIOSPE is based on the information on activity status over the life course, which is collected as a matrix from every respondent answering the Biography Questionnaire. The observations start at the age of 15 and end at the current age (up to age 65). To update the ongoing occupational career in PBIOSPE, information from the yearly Individual Questionnaire is also used.

refugspell “Migration history” (long): For migration biographies in the refugee samples, we created the spell data set REFUGSPELL. The variables in MIGSPELL and REFUGSPELL are derived from different instruments and only partially overlap. The data structure allows the data set to be linked with MIGSPELL if desired.

sozkalen “[deprecated] Spell data on social benefits”: The file SOZKALEN provides spell data on households receiving social assistance. It defines the beginning, end, and censoring status of every period in which a household received one of three different types of assistance. The file is built from the calendar information, which is collected retrospectively for the previous year and covers the years 1992-2000. It therefore contains information on a monthly basis.

Missing Conventions

Survey variables might be missing, i.e. without a valid code or value, for different reasons. In the SOEP, negative values are not valid for any variable, but are used instead to code different reasons for missing information. Missing values fall into two groups: they may originate in the respondent’s answer or in the survey design. On the one hand, the respondent may refuse to answer, may not know the answer, or may report an invalid value; on the other hand, the interview design may exclude respondents with certain characteristics from some questions (e.g. men will never be asked if they are pregnant). The following codes are used:

Code Label
-1 no answer / don’t know
-2 does not apply
-3 implausible value
-4 Inadmissible multiple response
-5 Not included in this version of the questionnaire
-6 Version of questionnaire with modified filtering
-8 Question not part of the survey program this year¹

¹Only applicable for datasets in long format.

A person might refuse to answer a question, which happens more often with sensitive questions (e.g. income-related questions), or may simply not know the answer. In such a case, the missing code is “-1” for “no answer / don’t know”. Note that the SOEP does not distinguish between a refusal to answer and a true “don’t know”.

Information may also be missing because a question is not asked when it is not relevant for a specific person, e.g. owner-occupiers are not asked about the amount of rent they pay. In such cases, the question does not apply to this person, and the variable receives the code “-2 does not apply”.

Sometimes invalid answers are encountered, for instance when respondents fill out a PAPI questionnaire themselves or the interviewer mistypes an answer, e.g. a person cannot work more than 168 hours a week. In such a case, multiple checks are carried out, and if the inconsistency remains, the variable is recoded to “-3 implausible value”.

Some questions offer several answer options, of which respondents are asked to pick exactly one. In the SOEP PAPI instruments, respondents sometimes ignore this request and provide more than one answer, e.g. they mark both “very good” and “good” when asked about their current health status. In such cases, if the correct answer cannot be determined from the questionnaire itself, the variable is given the code “-4 Inadmissible multiple response”.

With the extension of the SOEP in recent years, entirely new samples have been added to the core. In these samples, some questions are left out completely, e.g. to shorten the questionnaire or because the focus of the sample is different, as in some of the related studies. In such a case, the variable is set to “-5 Not included in this version of the questionnaire” for the entire subsample.

With the use of CAPI, recent developments include an “integrated” person questionnaire, i.e. the biography part and the “regular” part of the questionnaire are administered as one. Some of the questions in the biography part are repeated in the regular part. While in PAPI mode the respondent answers the same question twice, CAPI makes it possible to route the respondent past a question that has already been asked. These cases are very rare; if they occur, they receive the code “-6 Version of questionnaire with modified filtering”.
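Since the negative codes are not valid values, they need to be set aside before any statistic is computed. The sketch below shows one way to do this in Python. The code labels follow the table above, while the function name and the example variable are hypothetical.

```python
# Missing codes and labels as listed in the code table above.
MISSING_CODES = {
    -1: "no answer / don't know",
    -2: "does not apply",
    -3: "implausible value",
    -4: "inadmissible multiple response",
    -5: "not included in this version of the questionnaire",
    -6: "version of questionnaire with modified filtering",
    -8: "question not part of the survey program this year",
}


def split_missing(value):
    """Return (value, None) for valid values and (None, label) for missing codes."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


# Hypothetical weekly working hours with two missing codes mixed in.
hours = [40, -1, 38, -2]
valid_hours = [h for h in hours if h not in MISSING_CODES]
mean_hours = sum(valid_hours) / len(valid_hours)
```

Averaging the raw list would yield 18.75 hours instead of 39, which shows why the codes must be excluded first. Keeping the label alongside the value also allows analyses of why information is missing.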