Dataset Identifiers

Because the SOEP data are organized on different observational levels, any analysis requires combining datasets through matching or merging, and merging requires identifiers. The central individual identifier is pid, which is fixed over time and across datasets. Since a person may change households at any point in time, yearly household identifiers called hid are also necessary; which identifier is used for matching depends on the datasets involved. Finally, each individual (respondents as well as children) can be traced back to an original household from the very first wave, either as a member or as a split-off. This household’s ID, which stays fixed no matter how often a person changes households, is called cid. In addition, observations in long-format data are differentiated by survey year, which is stored in the syear variable. The SOEP provides further identifiers in the various datasets to identify respondents and to offer additional possibilities for merging. An excerpt of these additional identifiers is listed below:

Please note that these are not all identifier variables. The name of the identifier variable can change depending on the dataset used.

  • parid “Unchanging Individual identifier of Partner (PID)”

  • pgpartnr “Individual Identifier of Partner”

  • coupid “Couple Identifier”

  • intid “Interviewer Identifier”

  • intid1 “Identifier of First Interviewer”

  • vpid “Individual Identifier of Deceased Individual”

  • mnr “Individual Identifier Mother”

  • fnr “Individual Identifier Father”

  • kidpnr01-kidpnr19 “Individual Identifier nth Child”

  • sibpnr1-sibpnr11 “Individual Identifier, nth Sibling”

  • pnrtwin “Individual Identifier 2nd Sibling”

  • pnrtrip “Individual Identifier 3rd Sibling”

  • pnrquad “Individual Identifier 4th Sibling”
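As a minimal sketch of the merging logic described at the top of this section, the following Python example attaches household-level information to person-level records in long format. The record layouts, all values, and the helper name merge_on are invented for illustration; only the identifier names pid, hid and syear come from the text.

```python
# Toy long-format records; all values are invented for illustration.
persons = [
    {"pid": 101, "syear": 2019, "hid": 11, "age": 40},
    {"pid": 101, "syear": 2020, "hid": 12, "age": 41},  # same person, new household
    {"pid": 102, "syear": 2019, "hid": 11, "age": 38},
]
households = [
    {"hid": 11, "syear": 2019, "hh_size": 2},
    {"hid": 12, "syear": 2020, "hh_size": 1},
]


def merge_on(left, right, keys):
    """Left-join two lists of dicts on the given key variables (hypothetical helper)."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        # Values from the left-hand row win if a variable exists in both sources.
        merged.append({**match, **row})
    return merged


# Household information is attached via the yearly household ID plus survey year.
person_household = merge_on(persons, households, keys=("hid", "syear"))
```

Because pid stays fixed while hid can change from year to year, the person with pid 101 is matched to a different household record in each survey year.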

Partner Identifier

Partner identification (parid and pgpartnr)

The partner indicators (parid from ppathl and pgpartnr from pgen) define couples in SOEP households and thus make analyses on the dyadic level possible. Persons without a spouse or (cohabiting) partner receive the missing code “-2” (= does not apply). The assignment of the partner ID within households is based on four sources of information: (1) a question in the person file that asks (unmarried) respondents to identify their partner in the household; (2) the household matrix reported by the head of household at the beginning of the interview (stell from pbrutto); (3) the partnership biography in the life history calendar reported by new respondents; and (4) self-reports on marital status and life events such as marriage, moving in with a partner, or separation. In unclear cases, for instance due to temporary non-response, longitudinal information from previous and subsequent waves is also considered. Moreover, parid is self-consistent between two individuals: if one person’s parid points to a second person, the second person’s parid points back to the first. For analyses of partner relationships, this information can be used to link all persons with their respective partners, so that information on both partners can be stored in a common dataset. parid covers all persons who have ever participated in the SOEP. pgpartnr from the pgen dataset contains the same information but is restricted to the pgen population, that is, persons with a person interview.
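The linking of partners described above might look like the following sketch. It assumes a long-format extract with one row per person and survey year; the income variable and all values are made up, while pid, parid, syear and the missing code -2 are taken from the text.

```python
NO_PARTNER = -2  # missing code "does not apply"

# Toy extract; values are invented for illustration.
people = [
    {"pid": 1, "syear": 2020, "parid": 2, "income": 3000},
    {"pid": 2, "syear": 2020, "parid": 1, "income": 2500},
    {"pid": 3, "syear": 2020, "parid": NO_PARTNER, "income": 1800},
]

by_key = {(row["pid"], row["syear"]): row for row in people}

couples = []
for row in people:
    if row["parid"] < 0:
        continue  # negative codes mark persons without a partner
    partner = by_key.get((row["parid"], row["syear"]))
    if partner is None or partner["parid"] != row["pid"]:
        continue  # partner not in this extract, or link not mutual
    # Store the partner's information next to the person's own.
    couples.append({
        "pid": row["pid"],
        "syear": row["syear"],
        "parid": row["parid"],
        "income": row["income"],
        "partner_income": partner["income"],
    })
```

Each couple appears twice in the result, once from each partner's perspective, which is often convenient for dyadic analyses; keeping only rows with pid < parid would leave one row per couple.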

Monthly Couple Identifier (coupid)

coupid is found only in the biocouplm dataset. Because partnerships can change several times within a year, multiple partnerships within one year can only be recorded correctly if each partnership is assigned to exact months. If a partnership ends in March, for example, its joint coupid ends as well, and a new coupid begins when a new partnership starts, say in June.

Family Identifier

Individual Identifier nth Child (kidpnr01-kidpnr19)

kidpnr01-kidpnr19 in the biobirth dataset contain the time-invariant individual identifiers of the respondent’s biological children (kidpnr[nn], from the first up to the 19th child), provided the child is identifiable in the SOEP. Children are ordered by birth order, that is, by age: kidpnr01 holds the oldest child, and the last filled variable holds the youngest. A child whose age is missing is listed in the first record (kidpnr01); if this applies to more than one child, the others follow in the records directly after kidpnr01. kidpnr[nn] is “-1” if a child was reported in the birth biography but could not be assigned to a SOEP household (children outside the parental household). If no child could be identified in the household context or in the birth biography, the code “-2” is assigned.
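A small sketch of how the codes above could be handled when collecting a respondent's children: the function name child_ids and the example record are hypothetical, and treating an absent variable like “-2” is an assumption made only to keep the example short.

```python
def child_ids(record, max_children=19):
    """Return the usable child identifiers from kidpnr01..kidpnr19, in stored order."""
    ids = []
    for n in range(1, max_children + 1):
        # Assumption: a variable missing from the toy record is treated like -2.
        value = record.get(f"kidpnr{n:02d}", -2)
        if value > 0:  # -1 and -2 are codes, not person identifiers
            ids.append(value)
    return ids


# Invented example: the second child was reported but could not be
# assigned to a SOEP household, so kidpnr02 carries the code -1.
record = {"pid": 500, "kidpnr01": 501, "kidpnr02": -1, "kidpnr03": 503}
linked_children = child_ids(record)
```

The resulting identifiers can then serve as pid values for merging the children's own records.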

Individual Identifier Mother/Father (mnr, fnr)

The personal ID of the parents (fnr and mnr) from bioparen is generated in three steps:

  • The parents of the respondent are identified by the relationship to the head of the household (stell in pbrutto). Ideally, the children’s parents are identified at the time of the first survey of the child. Furthermore, the social parents and not necessarily the biological parents are identified.

  • The parents of the respondent are identified via the mother’s ID and the mother’s partner ID in $$kind. Using these variables, the “oldest” parents are identified. Ideally, these are the parents at the time the child is 17 years old (one year before the first survey).

  • The biological mother ID and father ID of the respondent can be identified in biobirth.

As bioparen aims at identifying the social parents who live in the household when the child is surveyed, the steps above are applied in the order 1-3, with step 1 having the highest priority. If you are interested in biological parents only, please refer to the information in biobirth.
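The priority order of the three steps can be illustrated with a short sketch. This is not the actual generation routine; the function names and the convention that None or a non-positive value means “not identified in this step” are assumptions.

```python
def first_valid(*candidates):
    """Return the first candidate that is a usable (positive) identifier."""
    for value in candidates:
        if value is not None and value > 0:
            return value
    return None


def social_parent_id(from_stell, from_kind, from_biobirth):
    """Apply steps 1-3 in order; step 1 (stell in pbrutto) has the highest priority."""
    return first_valid(from_stell, from_kind, from_biobirth)


# Invented examples: step 1 wins when available, later steps act as fallbacks.
mnr_a = social_parent_id(from_stell=10, from_kind=20, from_biobirth=30)
mnr_b = social_parent_id(from_stell=None, from_kind=20, from_biobirth=30)
mnr_c = social_parent_id(from_stell=None, from_kind=-2, from_biobirth=30)
```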

Individual Identifier, nth Sibling (sibpnr1-sibpnr11)

The variables provide the time-invariant person IDs of the siblings of the individual identified by pid. The sibling relationship is generated from the parent information in biobirth and bioparen. Two persons are defined as siblings if they report the same mother and father, only the same mother, or only the same father. This information on the sibling relationship is stored in sibdef1-sibdef11. In the case of inconsistent information on parents in biobirth and bioparen, bioparen was given the lowest priority. Please note that bioparen uses a social definition of parenthood based on cohabitation. In contrast, biosib contains both biological (biobirth) and social siblings, with a higher priority on biological relations.
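The sibling definition above can be sketched as a comparison of parent identifiers. The returned labels are descriptive strings chosen for this example, not the actual codes of sibdef1-sibdef11, and treating non-positive parent IDs as unknown is an assumption.

```python
def sibling_type(a, b):
    """Classify the relationship of two persons based on shared parent IDs."""
    # Unknown parents (non-positive codes) never count as a match.
    same_mother = a["mnr"] > 0 and a["mnr"] == b["mnr"]
    same_father = a["fnr"] > 0 and a["fnr"] == b["fnr"]
    if same_mother and same_father:
        return "same mother and father"
    if same_mother:
        return "same mother only"
    if same_father:
        return "same father only"
    return None  # not siblings under this definition


# Invented persons for illustration.
anna = {"pid": 1, "mnr": 10, "fnr": 20}
ben = {"pid": 2, "mnr": 10, "fnr": 21}
cara = {"pid": 3, "mnr": 10, "fnr": 20}
dan = {"pid": 4, "mnr": -2, "fnr": -2}
eve = {"pid": 5, "mnr": -2, "fnr": -2}
```

For instance, sibling_type(anna, ben) yields "same mother only", while dan and eve are not treated as siblings because a shared missing code is not a shared parent.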

Individual Identifier Twins, Triplets, Quadruplets (pnrtwin, pnrtrip, pnrtquad)

The IDs pnrtwin, pnrtrip and pnrtquad from biotwin cover all twins that were ever identified within the SOEP. pnrtwin and, in the rare cases where they exist, pnrtrip and pnrtquad contain the individual identifiers of the second, third and fourth sibling in the group. This means that every case in the dataset consists of a group of twins (or triplets or quadruplets). The code “-2” is assigned to pnrtrip and/or pnrtquad if a third or fourth twin sibling doesn’t exist. PERSNR and PNRTWIN, however, should always contain valid codes.

Individual Identifier of Deceased Individual (vpid)

vpid in the vpl dataset contains the individual identifier of a deceased individual and is difficult to interpret because:

  • SOEP respondents in a household may provide information on several deceased persons. These deceased persons may or may not have participated in SOEP.

  • Non-SOEP respondents provide information about one (or even more) deceased SOEP person(s).

So the following scenarios can occur:

                                      | Deceased person in SOEP | Deceased person NOT in SOEP
Respondent is interviewee in SOEP     | PID, VPID               | PID, -/-
Respondent is NOT interviewee in SOEP | -/-, VPID               | -/-, -/-

This means that a person with pid=0 has not been part of the SOEP but has completed a deceased person questionnaire for a deceased person from a SOEP household (vpid is a SOEP ID). When individuals with a SOEP ID (pid) report a deceased person who was not part of the SOEP, special vpids are assigned:

[Figure: example of vpid values (vpid_example.png)]

Numbers in the 90s in vpid are assigned to persons who did not participate in the SOEP, for example a deceased mother who was never a SOEP respondent. In this example, one non-SOEP person died in 2016 and another in 2017. Different 90s numbers (e.g. 98 and 99) are only assigned if more than one non-SOEP person died in the same year. When a SOEP person dies and is reported by another person, the vpid is the pid of the deceased respondent.

Interviewer Identifier

Interviewer ID (intid and intid1)

intid and intid1 are IDs intended to be fixed over time, identifying interviewers across years, households, and questionnaires within datasets. The interviewer ID identifies the interviewer responsible for each respondent. Unlike most other SOEP datasets, the interviewer dataset has no pid or hid to identify observations, but interviewer information can be merged to other datasets using intid. Because IDs changed in the SOEP raw data, the intid of an interviewer may change over time in this and past versions of the interviewer dataset. This can happen at most once per interviewer and is unfortunately not flagged. When the interviewer of a household is replaced or changes over time, intid1 refers to the first interviewer who conducted an interview in the survey household.
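A brief sketch of using intid as a merge key, assuming a respondent-level extract that carries intid and a separate interviewer table: the variable experience_years and all values are invented for illustration.

```python
from collections import defaultdict

# Toy respondent-level records; values are invented.
interviews = [
    {"pid": 1, "syear": 2020, "intid": 7001},
    {"pid": 2, "syear": 2020, "intid": 7001},
    {"pid": 3, "syear": 2020, "intid": 7002},
]
# Toy interviewer table keyed by intid (it has no pid or hid).
interviewers = {
    7001: {"experience_years": 12},
    7002: {"experience_years": 3},
}

# Group respondents by the interviewer who surveyed them.
respondents_by_interviewer = defaultdict(list)
for row in interviews:
    respondents_by_interviewer[row["intid"]].append(row["pid"])

# Attach interviewer characteristics to each respondent record via intid.
with_interviewer = [
    {**row, **interviewers.get(row["intid"], {})} for row in interviews
]
```

Since an interviewer's intid may change once without being flagged, counts per interviewer derived this way should be read with some caution.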

Last change: Jan 13, 2025