Raw Data

Tracking Data

Tracking data are the basis for linking your research-relevant variables. In addition to demographic information, they record how each interview was conducted. Treat these data sets as your starting point: use the person and household numbers in the tracking data to merge in the variables relevant to your research.

Dataset | Label | Format | Identifier (ID) | Additional Identifier
ppfad¹ | Individual Tracking File | wide | pid | cid, $hhnr
hpfad¹ | Household Tracking File | wide | hid | $hhnr
$pbrutto¹ | Gross Individual Data | CS | pid | hid, cid, $hhnrold
$hbrutto¹ | Gross Household Data | CS | hid | cid, intid1, intid
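
The linking step described above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not SOEP code: the rows and the variable names `sex` and `satisfaction` are invented, and only the identifiers `pid` and `hid` come from the table.

```python
# Hypothetical rows from the tracking file (one row per person).
tracking = [
    {"pid": 101, "hid": 11, "sex": 1},
    {"pid": 102, "hid": 11, "sex": 2},
    {"pid": 201, "hid": 27, "sex": 2},
]

# Hypothetical research variables from another data set, also keyed by pid.
answers = [
    {"pid": 101, "satisfaction": 7},
    {"pid": 201, "satisfaction": 9},
]


def link_by_key(base, other, key):
    """Left-join `other` onto `base` using the shared identifier `key`."""
    lookup = {row[key]: row for row in other}
    linked = []
    for row in base:
        merged = dict(row)
        # Persons without a matching record keep only their tracking data.
        merged.update(lookup.get(row[key], {}))
        linked.append(merged)
    return linked


linked = link_by_key(tracking, answers, "pid")
```

Starting from the tracking file keeps every person in the result, including those without a record in the second data set, which is why the join runs from `tracking` to `answers` and not the other way round.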

RAW Data ppfad “Individual Tracking File” (wide): For all years since 1984, the PPATH data set contains information on all persons who have ever lived in a SOEP household at the time of a survey (i.e., all respondents, but also children under 17 years of age and persons who have never given an interview). PPATH is important for delimiting the units of analysis (persons), especially in longitudinal analyses.

RAW Data hpfad “Household Tracking File” (wide): For all years since 1984, the HPATH data set contains information on all households that have ever participated in the SOEP survey at any point in time. HPATH is important for delimiting the unit of analysis (household), especially in longitudinal analyses. HPATH is particularly suitable for household analyses and can be used to pre-select specific households.

RAW Data $pbrutto “Gross Individual Data” (CS): $PBRUTTO covers all respondents who were successfully interviewed for the first time in wave $ or were contacted to be interviewed again in wave $. The data set provides gross cross-sectional information on all SOEP respondents’ interviews as well as their positions in the panel framework.

RAW Data $hbrutto “Gross Household Data” (CS): $HBRUTTO covers all households that were successfully interviewed for the first time in wave $ or were contacted to be interviewed again in wave $. The data sets provide gross cross-sectional information on all SOEP households’ interviews as well as their positions in the panel framework.

Original Data

These data sets contain the information given directly by the respondents. The content of the variables corresponds one-to-one to the content of the survey instruments. By searching the questionnaires, you can find the exact wording of a question and any filter routing.

Dataset | Label | Format | Identifier (ID) | Additional Identifier
$p¹ | Individual questionnaire | CS | pid | hid, syear, cid, intid
$p_mig¹ | IAB-SOEP Migration Sample: Original Individual questionnaire | CS | pid | hid, syear, cid, intid
$p_refugees¹ | Individual questionnaire Refugee Sample, incl. Biography | CS | pid | hid, syear, cid, intid
$pausl¹ | Migrant specific questions in the Individual Questionnaire | CS | pid | hid, cid
$post¹ | East specific questions from the Household questionnaire | CS | hid | cid, intid
$h¹ | Household questionnaire | CS | hid | syear, cid, intid
$h_refugees¹ | Household questionnaire Refugee Sample | CS | hid | syear, cid, intid
$jugend¹ | Youth questionnaire for first time respondents at age 16-17 | CS | pid | hid, syear, cid, intid
$school¹ | Questionnaire: Pre-Teen, 12-13 years old | CS | pid | hid, syear, cid, intid
$school2¹ | Questionnaire: Early Youth, 14-15 years old | CS | pid | hid, syear, cid, intid
$pluecke¹ | Catch-Up Individual (Re-Questioning) | CS | pid | hid, cid, intid
ev¹ | First wealth module | CS | pid | hid, syear, cid
$vp¹ | Questionnaire: the deceased individual | CS | pid | hid, syear, cid, vpersnr, intid
biol | Biographical Data | long | pid, syear | hid, cid, intid

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions (pid=persnr; hid=hhnrakt; cid=hhnr).

RAW Data $post “Individual Questionnaire - East Sample” (CS): The datasets $POST include the complete file of population and variables for the East-German subsample in 1990 [GPOST] and the population-specific set of additional variables for East-Germans in 1991 [HPOST].

RAW Data $p “Individual Questionnaire” (CS): The $P-files contain all variables of the individual questionnaire for the wave $. In addition, the individual-specific data of the samples IAB-SOEP Migration and IAB-BAMF-SOEP Refugee Survey are integrated in the original $P data set.

RAW Data 1984-1995 $pausl “Individual Questionnaire - Foreigners” (CS): The datasets $PAUSL contain population-specific sets of additional variables for foreigners from 1984 to 1995 for the subsamples A-D.

RAW Data 2013-2016 $p_mig “IAB-SOEP Migration Sample: Original Individual questionnaire” (CS): The original data from the survey instrument specific to Sample M can be found in the dataset $P_MIG, which combines the individual and the biographical questionnaire. Since the current version “v34”, the data set is fully integrated, and the variables are also included in generated datasets. Variables equivalent to variables in the individual questionnaire of other samples are included in the dataset $P. Variables equivalent to variables in the biography questionnaire of other samples are included in the respective biography dataset (e.g. BIOMARSM). The comprehensively surveyed migration biography can be found in the new dataset MIGSPELL.

RAW Data only 2016 $p_refugees “IAB-BAMF-SOEP Survey of Refugees in Germany: Original Individual Questionnaire” (CS): The original data from the survey instruments used in Samples M3 and M4 can be found in original format in the dataset $P_REFUGEES, where the individual and the biographical questionnaires are combined. Since the current version “v34”, the variables are integrated in original or generated datasets. Variables equivalent to those in the individual questionnaire of other samples are included in the dataset $P. Also included in $P are all variables that are specific to the refugee questionnaire but will be asked more than once. Variables equivalent to those in the biographical questionnaires in other samples are included in the respective biographical datasets (e.g., BIOMARSM). The comprehensively surveyed migration biography can be found in the new dataset REFUGSPELL.

RAW Data $h “Household questionnaire” (CS): The $H-files contain all questions of the household questionnaire.

RAW Data $h_refugees “Household questionnaire Refugee Sample” (CS): The $H_REFUGEES-files contain all questions of the household refugees questionnaire. Since the current version “v34”, the variables are integrated in original or generated datasets.

RAW Data only 1990 $host “East specific questions from the Household questionnaire” (CS): The dataset $HOST includes the complete file of population and variables for the East-German subsample in 1990 [GHOST].

RAW Data $jugend “Youth questionnaire for first time respondents age 16-17” (CS): Since 2000 (wave Q), first-time respondents between the ages of 16 and 17 have received a separate biographical questionnaire with additional age-group-specific questions, for instance, about their relationship to their parents or about what they do in their free time. Up to now, only some of the data collected from this survey have been processed and provided to users in dataset BIOAGE17. The complete data will be provided in individual $JUGEND datasets.

RAW Data ev “First wealth module” (long): The dataset $EV contains information on assets, surveyed in 1988 [E] at the household level.

RAW Data $school “Questionnaire: Pre-Teen, 12-13 year olds” (CS): Since 2014, the $SCHOOL-files have contained all variables of the “Pre-teen (Schülerinnen und Schüler)” questionnaire. The data sets provide variables about school, home, leisure time, health, self-perception, and relationships with friends, siblings and parents.

RAW Data $school2 “Questionnaire: Early Youth, 14-15 year olds” (CS): Since 2016, the $SCHOOL2-files have contained all variables of the “Early Youth (Frühe Jugend)” questionnaire. The data sets provide variables about self-perception, independence, school, leisure time, and relationships with friends, siblings and parents.

RAW Data $pluecke “Catch-Up Questioning” (CS): Temporary drop-outs (“gaps”) can cause problems for longitudinal analyses. This is especially true for the stored employment and income data. That is why the SOEP tries to fill in at least some of the central missing information. $PLUECKE is a small questionnaire covering information on the year before the drop-out occurred. It covers questions on job-related changes, the calendar of occupation, income, education and qualification.

RAW Data $vp “Questionnaire: the deceased individual” (CS): The $VP-files contain information about respondents who lost a person in the previous year. They provide information about the deceased person and about the respondent who reported the death.

biol “Biographical Data” (long): BIOL contains cumulated individual-level raw data from the biographical questionnaire and from wave-specific biographical modules of the individual questionnaire. BIOL is intended for advanced users who want to use it alongside the generated biographical files to complete (or modify) generated biographical variables.

Survey Data

These data sets contain survey-methodological information for SOEP core. They provide detailed exit information on respondents and the household weighting factors that you need for representative analyses.

Dataset | Label | Format | Identifier (ID) | Special Identifier
phrf¹ | Weighting and staying probabilities | wide | pid | cid
hhrf¹ | Weighting and staying probabilities | wide | hid | cid

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions (pid=persnr; hid=hhnrakt; cid=hhnr).

RAW Data phrf “Weighting and staying probabilities” (wide): For each person, the PHRF-file provides weighting variables for cross-sectional weighting as well as for different kinds of longitudinal weighting. The weighting variables can also be found in PPATHL.

RAW Data hhrf “Weighting and staying probabilities” (wide): For each household, the HHRF-file provides weighting variables for cross-sectional weighting as well as for different kinds of longitudinal weighting. The weighting variables can also be found in HL.

Generated Data

The SOEP team has prepared these data sets in a research-friendly manner and subjected them to additional plausibility checks and quality controls. They usually combine several variables from different survey instruments and are described in the accompanying documentation. For this reason, these data sets cannot be assigned one-to-one to a survey instrument.

Dataset | Label | Format | Identifier (ID) | Additional Identifier
$pgen¹ | Generated Individual Data | CS | pid | hid, cid
$hgen¹ | Generated Household Data | CS | hid | cid
$kind¹ | Data on children (from HH-Questionnaire) | CS | pid | hid, cid
$pequiv¹ | Cross-national Equivalent File | CS | pid | hid, syear, cid
$pkal¹ | Individual Calendar | CS | pid | hid, cid
$pkalost¹ | Individual Calendar | CS | pid | hid, cid

¹In addition to the classic identifiers (pid, hid and cid), these data sets also have the identifiers of older data distribution versions (pid=persnr; hid=hhnrakt; cid=hhnr).

RAW Data $pgen “Generated Individual Data” (CS): The $PGEN-files contain user-friendly data at the individual level that are consolidated from different sources. Their plausibility is validated longitudinally in many respects, so in most situations the data are superior to the data in $P. The file contains one row for each person (persnr is unique) with a completed individual or youth questionnaire.

RAW Data $hgen “Generated Household Data” (CS): To minimize computing effort for the user, the SOEP provides yearly status variables at the household level. The $HGEN data provide a set of time-consistent variables generated from the SOEP household questionnaire. They include only households that participated in the respective year.

RAW Data $kind “Data on children (from HH-Questionnaire)” (CS): The variables in the annual $kind files are based not on answers provided by the children themselves but on answers provided by the head of household. These data are re-aggregated at the person level and saved as child-specific entries in the file $kind. The annual $kind datasets also contain additional information on institutional care and school attendance for children and young people.

RAW Data $pequiv “Cross-national Equivalent File” (CS): The $PEQUIV-file is based on the Cross-National Equivalent File (CNEF) with extended income information for the SOEP. This file comprises not only the aggregated income figures provided in the CNEF but also further single income components.

RAW Data $pkal “Individual Calendar” (CS): The $pkal datasets contain calendar variables from the individual questionnaire. The dataset includes the activity status on a monthly basis as well as the income status of a person.

RAW Data 1990-1991 $pkalost “Individual Calendar” (CS): PKALOST extends the existing calendar information in PKAL. The file contains further current and retrospective calendar information for the East-German population in 1990 and 1991. This calendar information includes retrospective monthly data even for the time before unification.

Naming Convention of Data Sets and Variables

The following explanations only refer to the data sets of the subdirectory “raw” in your distribution file. There is no systematic variable naming for the long files above the subdirectory “raw”. To distinguish the multitude of data sets and variables, the SOEP uses systematic dataset and variable names for data in cross-sectional format. These names provide a lot of information for data users. Example of a data set name:

xp

[Figure: components of the example data set name “xp”]

The first identifier of each data set name is the wave identifier (“x”). It can contain one or two letters.

Each wave or survey year can be assigned using a letter in the alphabet:

1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
a b c d e f g h i j k l m n o p q
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
r s t u v w x y z ba bb bc bd be bf bg bh

As can be seen from the table, the sample data set “xp” contains survey information from the survey year 2007.
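
The table lookup can also be written as a small Python function. The sketch below assumes only what the table shows: single letters a to z stand for 1984 to 2009, and the two-letter identifiers ba to bh stand for 2010 to 2017. Identifiers beyond the table are rejected, because the table does not say how they continue. The function name `wave_to_year` is made up for this illustration.

```python
import string

FIRST_YEAR = 1984      # wave "a", first column of the table
LAST_WAVE_YEAR = 2017  # wave "bh", last column of the table


def wave_to_year(wave):
    """Return the survey year for a wave identifier such as "x" or "bf"."""
    letters = string.ascii_lowercase
    if len(wave) == 1 and wave in letters:
        return FIRST_YEAR + letters.index(wave)
    if len(wave) == 2 and wave[0] == "b" and wave[1] in letters:
        # Two-letter identifiers continue after "z" (2009): "ba" is 2010.
        year = FIRST_YEAR + 26 + letters.index(wave[1])
        if year <= LAST_WAVE_YEAR:
            return year
    raise ValueError(f"unknown wave identifier: {wave!r}")
```

For the sample data set “xp”, `wave_to_year("x")` gives the survey year stated above.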

The second identifier of each data set name is the abbreviation for the respective survey instrument or, for generated data sets, the name of the content (“p”).

  • h= Household
  • hbrutto= Household Gross
  • hgen= Generated Household Data
  • p= Individuals
  • pbrutto= Person Gross
  • p_mig= Migrants
  • pgen= Generated individual data
  • jugend = Youth (Ages 16-17)
  • school= Pre-Teen (Ages 11-12)
  • vp= Deceased Individual
  • luecke= Catch-Up Individual Questionnaire
  • hkind= Information for children from household questionnaire
  • pequiv= Cross National Equivalent File
  • pkal= Calendar

Further examples:

  • bah = Wave „ba“ (Survey year 2010), Household data sets
  • bfschool= Wave „bf“ (Survey year 2015), Pupils data sets
  • zhgen = Wave „z“ (Survey year 2009), Generated Household data sets
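
Splitting a data set name into wave and instrument can be sketched as follows. The instrument abbreviations are copied from the list above and the wave table from the preceding section; the function name `split_dataset_name` and the rule of trying the two-letter wave first are assumptions made for this illustration, not part of the SOEP tooling.

```python
import string

# Instrument abbreviations copied from the list above.
INSTRUMENTS = {
    "h", "hbrutto", "hgen", "p", "pbrutto", "p_mig", "pgen", "jugend",
    "school", "vp", "luecke", "hkind", "pequiv", "pkal",
}

# Wave identifiers from the table: a-z = 1984-2009, ba-bh = 2010-2017.
WAVES = {letter: 1984 + i for i, letter in enumerate(string.ascii_lowercase)}
WAVES.update({"b" + letter: 2010 + i for i, letter in enumerate("abcdefgh")})


def split_dataset_name(name):
    """Split e.g. "bfschool" into (wave, survey year, instrument)."""
    # Try the two-letter wave first, then the one-letter wave. The
    # instrument list decides which of the two readings is valid.
    for width in (2, 1):
        wave, instrument = name[:width], name[width:]
        if wave in WAVES and instrument in INSTRUMENTS:
            return wave, WAVES[wave], instrument
    raise ValueError(f"cannot parse data set name: {name!r}")
```

Checking the remainder against the instrument list is what resolves names that start with “b”: “bhgen” is read as wave “b” plus “hgen”, because “gen” alone is not a listed instrument, while “bhp” is read as wave “bh” plus “p”.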

Variable names in the SOEPcore data files follow basic conventions. First, there are datasets with “speaking” variable names, where the name itself conveys something about the information stored in the variable. This is usually the case for generated datasets.

For the original datasets such as $H, $P and $KIND, the variable names are built around the unit of analysis (individual: “p”, household: “h”, child: “k”). The wave in which the data were collected precedes this indicator, and the reference to the question in the original survey instrument follows it (see Figure 9 for an overview).

[Figure 9: the WUQI variable naming convention]

Example for a variable name: bfp0103

[Figure: components of the example variable name “bfp0103”]

The first identifier of a variable name is the wave (e.g. “bf”). Every wave, that is, every survey year, can be assigned to a specific letter of the alphabet:

1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
a b c d e f g h i j k l m n o p q
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
r s t u v w x y z ba bb bc bd be bf bg bh

As can be seen from the table, the variable “bfp0103” contains information from the survey year 2015.

The second identifier of a variable is the abbreviation for the respective survey instrument or the type of information (“p”).

  • h= Household
  • hbrutto= Household gross
  • hgen= Generated household data
  • p= Individual data
  • pbrutto= Person gross
  • p_mig= Person migrants (M1 and M2)
  • pgen= Generated individual data
  • jugend= Youth (Ages 16-17)
  • school= Pre-Teen (Ages 11-12)
  • vp= Deceased Individual
  • luecke= Catch-Up Individual
  • hkind= Children information from the household questionnaire
  • pequiv= Cross National Equivalent File
  • pkal= Calendar

The third identifier of a variable name describes the question number (“01“) and a possible fourth identifier describes the position of the answer category (“03“).

[Figure: example question from the individual questionnaire]

The example variable “bfp0103” describes “satisfaction with work”. The variable was collected in 2015 (“bf”) and can be found in the individual questionnaire (“p”). In the associated individual questionnaire, it appears in the first question (“01”) at the third position among the answer categories (“03”).

More examples:

  • ap06 = Wave “a” (survey year 1984), Individual Dataset, Question 6
  • th1603 = Wave “t” (survey year 2003), Household Dataset, Question 16, Item 3
  • lp10312 = Wave “l” (survey year 1995), Individual Dataset, Question 103, Item 12
  • bap15604 = Wave “ba” (survey year 2010), Individual Dataset, Question 156, Item 4
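
The WUQI pattern (wave, unit, question, item) can be decoded with a short regular expression. The sketch below is an illustration only: the function name `parse_wuqi` is made up, and the rule for separating question and item (names with more than three digits end in a two-digit item) is an assumption inferred from the examples in this section, not a documented SOEP rule.

```python
import re
import string

# Wave identifiers from the table: a-z = 1984-2009, ba-bh = 2010-2017.
WAVES = {letter: 1984 + i for i, letter in enumerate(string.ascii_lowercase)}
WAVES.update({"b" + letter: 2010 + i for i, letter in enumerate("abcdefgh")})

# Wave, then the unit of analysis (h, p or k), then the digits that
# encode the question number and, optionally, the item.
WUQI = re.compile(r"^(b[a-h]|[a-z])([hpk])(\d+)$")


def parse_wuqi(name):
    """Decode a variable name such as "bfp0103" into its components."""
    match = WUQI.match(name)
    if match is None:
        raise ValueError(f"not a WUQI variable name: {name!r}")
    wave, unit, digits = match.groups()
    # Assumption: with more than three digits, the last two are the item.
    if len(digits) > 3:
        question, item = int(digits[:-2]), int(digits[-2:])
    else:
        question, item = int(digits), None
    return {"wave": wave, "year": WAVES[wave], "unit": unit,
            "question": question, "item": item}
```

The regular expression tries the two-letter wave first and falls back to a single letter, so a name like “bh06” is read as wave “b”, household, question 6 and not as wave “bh”.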

Since the data structure is getting richer every year, we extended the common variable naming convention WUQI, starting with wave “bh” (2017). Additionally, we provide our users with an “instrument” variable that contains the survey instrument for each unit of analysis.

Extended Variable Naming Convention

[Figure: the extended variable naming convention WU_Q_I_q]

Note

The extended variable naming convention WU_Q_I_q has been applied since version v34. It is used only for data sets from wave bh onwards and only for the datasets $p, $h and $kind.

What's new: We added underscores between the unit of analysis, the question identifier and the item identifier to separate them visually. In addition, we introduced a questionnaire identifier, which is likewise separated from the item by an underscore. This new naming scheme is used only if the survey instrument differs from the “original” instrument.

When working with the data set in wide-format related to the respective survey wave, it should first be noted that it is generally not clear from the variable names in which instruments the information was collected. This would not be possible at all, given the number of associated questionnaires. It is only possible to derive from the integration hierarchy in which questionnaires the variable was not collected.

Main SOEP (A-L1 <- L2-3 <- N) <- Migration Sample (M1, M2) <- Refugee Sample (M3, M4) <- Refugee Sample (M5)

Because the SOEP comprises different samples, some sample groups receive sample-specific questions, for example the migration sample that started in 2013. For that group, we created an extended individual questionnaire that contains migrant-specific questions alongside the standard SOEP questions asked every year. There are two ways to find out where a variable var1 was collected:

First, the metadata-based codebook answers this question. Second, you can use the following Stata command:

fre instrument if var1!=-5
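
For readers who do not work in Stata, the same tabulation can be sketched in Python with the standard library. The rows below are invented for illustration. Like the Stata command, the sketch simply excludes all cases in which var1 is coded -5 and counts the instrument codes of the remaining cases.

```python
from collections import Counter

# Hypothetical rows of a $p data set: the instrument code plus the
# variable of interest.
rows = [
    {"instrument": 50, "var1": -5},
    {"instrument": 54, "var1": 3},
    {"instrument": 54, "var1": 1},
    {"instrument": 57, "var1": 2},
    {"instrument": 51, "var1": -5},
]

# Count instruments only for rows where var1 is not coded -5,
# mirroring: fre instrument if var1 != -5
collected_in = Counter(
    row["instrument"] for row in rows if row["var1"] != -5
)
```

The keys of `collected_in` are the instrument codes in which var1 has values other than -5, and the counts are the number of cases per instrument.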

Let's take a look at the variable bhp109_01_q57:

  • bh= Year 2017
  • p= Person questionnaire
  • 109= Question 109
  • _01= First Item
  • _q57= ?

To know which questionnaire is the right one, you have to take a look at the instrument variable.

Value | Questionnaire
50 | 2017 Individual Questionnaire (A-L1 ; PAPI) [soep-core-2017-pe]
51 | 2017 Individual Questionnaire (A-L3 ; CAPI) [soep-core-2017-pe2]
52 | 2017 Individual Questionnaire (L2-L3 ; CAWI) [soep-core-2017-pe3]
53 | 2017 Individual Questionnaire (N; CAPI) [soep-core-2017-pe4]
54 | 2017 Individual Questionnaire (M1-M2 Re-Surveyed; CAPI) [soep-core-2017-p-m12]
55 | 2017 Questionnaire Individual-Biography (M1-M2 First-Surveyed; CAPI) [soep-core-2017-pb-m12-erst]
56 | 2017 Questionnaire Individual-Biography (M3-M5 First-Surveyed; CAPI) [soep-core-2017-pb-m345-erst]
57 | 2017 Questionnaire Individual-Biography (M3-M4 Re-Surveyed; CAPI) [soep-core-2017-pb-m34-wieder]
58 | 2017 Biography Questionnaire (A-L1 First-Surveyed; PAPI) [soep-core-2017-ll]
59 | 2017 Biography Questionnaire (A-L3; N First-Surveyed; CAPI) [soep-core-2017-ll2]

The instrument variable for identifying the exact questionnaire can be found in the respective data set. The value q57 in the example identifies the individual biography questionnaire for re-surveyed respondents of samples M3/M4 as the source of the variable. To see the exact question, open the individual biography questionnaire for refugees (Re-Surveyed), look up question number 109, and find the first item. The variable bhp109_01_q57 was collected with the following question:

Q109: When was the beginning of the integration course?

  • 1 Year
  • 2 Month
  • 99 No Details

Using the variable name and the instrument variable, you can easily identify the corresponding question in the corresponding questionnaire:

  • bhp109_01_q57
  • bh= Year 2017
  • p= Individual questionnaire
  • 109= Question 109
  • _01= First Item
  • _q57= 2017 Questionnaire Individual-Biography (M3-M4 Re-Surveyed; CAPI) [soep-core-2017-pb-m34-wieder]
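
The decomposition above can be sketched in Python. The instrument codes are copied from the table of 2017 individual instruments; the function name `parse_extended` and the regular expression are assumptions made for this illustration. In particular, the pattern accepts the name with or without an underscore after the unit of analysis, because the example variable has none, and it does not check the wave against the wave table.

```python
import re

# Instrument codes copied from the table of 2017 individual instruments.
INSTRUMENT_IDS = {
    50: "soep-core-2017-pe",
    51: "soep-core-2017-pe2",
    52: "soep-core-2017-pe3",
    53: "soep-core-2017-pe4",
    54: "soep-core-2017-p-m12",
    55: "soep-core-2017-pb-m12-erst",
    56: "soep-core-2017-pb-m345-erst",
    57: "soep-core-2017-pb-m34-wieder",
    58: "soep-core-2017-ll",
    59: "soep-core-2017-ll2",
}

# Wave, unit of analysis, question, optional item, optional questionnaire.
EXTENDED = re.compile(
    r"^(?P<wave>b[a-z]|[a-z])(?P<unit>[hpk])_?(?P<question>\d+)"
    r"(?:_(?P<item>\d+))?(?:_q(?P<questionnaire>\d+))?$"
)


def parse_extended(name):
    """Decode a WU_Q_I_q variable name such as "bhp109_01_q57"."""
    match = EXTENDED.match(name)
    if match is None:
        raise ValueError(f"not an extended variable name: {name!r}")
    parts = match.groupdict()
    # Convert the numeric components; missing ones stay None.
    for key in ("question", "item", "questionnaire"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts


parts = parse_extended("bhp109_01_q57")
source = INSTRUMENT_IDS[parts["questionnaire"]]
```

Here `parts` holds the wave, unit, question, item and questionnaire code of the example variable, and `source` is the identifier of the questionnaire in which it was collected.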