Principles of Data Analysis

All SOEPtutorials can be found on our YouTube Channel

The data structure for panel data has three dimensions. The first two, the units of observation (n) and a matrix of dependent and independent variables (y, x), are completely analogous to a cross-sectional design. The third dimension is time (t). Panel data can be stored in one of two formats, “wide” or “long”: in the wide format the variables are indexed by time, and in the long format the units of observation are indexed by time.

Regardless of the format chosen, panel data covering several survey waves are often incomplete, either because individual units drop out (panel mortality) or because data from new panel members are only collected from a later point in time. In both cases the term “unbalanced panel data” is used. The classical panel data structure, by contrast, is “balanced”: every unit has as many observations of the dependent and independent variables as there are waves of data collection. Social science panel data typically comprise many units (large n) but comparatively few waves and therefore few measurement occasions (small t).

When data from a panel study are available, even descriptive analyses are often of particular interest. Identifying changes in a variable over time, and separating interindividual from intraindividual change, can reveal important social facts, particularly when the sample is generalizable. For example, if a constant 15 % of people have an income below the poverty risk level, it is of social scientific interest whether the same persons are affected in every wave, or whether increases and decreases in poverty risk balance each other so that only half of that group was permanently exposed to the risk.

The choice of more complex analysis methods for panel data depends first and foremost on the measurement level of the dependent and independent variables, but also on whether the variables are time-constant (such as gender or migration background) or time-varying. Statistical models for panel data range from structural equation models, various regression models, event history analysis, sequence analysis and latent growth models to causal analyses using matching methods. A particular advantage of panel data is that the chronological order of changes can be modelled, and that the problem of unobserved heterogeneity, which is often encountered in the social sciences, can be reduced substantially, at least in comparison with cross-sectional data.
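
The distinction between balanced and unbalanced panels can be checked mechanically. The following is a minimal sketch in plain Python, assuming long-format records of the form (person ID, survey year, income); the sample values and the helper names `is_balanced` and `missing_waves` are hypothetical and not part of the SOEP data or tooling.

```python
from collections import defaultdict

# Hypothetical long-format records: (person ID, survey year, income).
records = [
    (1, 2015, 1500), (1, 2016, 1500), (1, 2017, 2000),
    (2, 2015, 1000), (2, 2016, 1200),   # person 2 drops out after 2016
    (6, 2016, 2000), (6, 2017, 2000),   # person 6 joins the panel in 2016
]


def waves_by_person(rows):
    """Collect the set of survey years observed for each person."""
    seen = defaultdict(set)
    for pid, year, _ in rows:
        seen[pid].add(year)
    return seen


def is_balanced(rows):
    """Return True if every person is observed in every wave."""
    all_waves = {year for _, year, _ in rows}
    return all(years == all_waves for years in waves_by_person(rows).values())


def missing_waves(rows):
    """Map each incompletely observed person to the waves they are missing."""
    all_waves = {year for _, year, _ in rows}
    return {
        pid: sorted(all_waves - years)
        for pid, years in waves_by_person(rows).items()
        if all_waves - years
    }


print(is_balanced(records))    # False
print(missing_waves(records))  # {2: [2017], 6: [2015]}
```

In this sample, person 2 illustrates panel mortality and person 6 a late entrant; both make the panel unbalanced.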

Cross-sectional data structure (CS)

Cross-sectional data record many subjects at a single point in time. Each person occupies one row and appears only once in such a dataset. Merging cross-sectional SOEP data across waves yields a dataset in wide format.

| Row | ID | wave | sex | income |
|-----|----|------|-----|--------|
| 1   | 1  | 2015 | m   | 1500   |
| 2   | 2  | 2015 | m   | 1000   |
| 3   | 6  | 2015 | f   | 2000   |
| 4   | 8  | 2015 | m   | 5500   |
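
Merging cross-sections across waves into a wide dataset can be sketched as follows. This is an illustration in plain Python under stated assumptions: each wave is held as a dictionary keyed by person ID, the sample values are illustrative, and the function name `merge_to_wide` and its `time_varying` parameter are hypothetical.

```python
# Hypothetical cross-sectional extracts, one per wave, keyed by person ID.
waves = {
    2015: {1: {"sex": "m", "income": 1500}, 2: {"sex": "m", "income": 1000},
           6: {"sex": "f", "income": 2000}, 8: {"sex": "m", "income": 5500}},
    2016: {1: {"sex": "m", "income": 1500}, 2: {"sex": "m", "income": 1200},
           6: {"sex": "f", "income": 2000}, 8: {"sex": "m", "income": 6000}},
    2017: {1: {"sex": "m", "income": 2000}, 2: {"sex": "m", "income": 1200},
           6: {"sex": "f", "income": 2000}, 8: {"sex": "m", "income": 6500}},
}


def merge_to_wide(waves, time_varying=("income",)):
    """Merge cross-sections on ID; time-varying variables get a year suffix."""
    wide = {}
    for year in sorted(waves):
        for pid, values in waves[year].items():
            row = wide.setdefault(pid, {"ID": pid})
            for name, value in values.items():
                # Time-constant variables keep their name and are stored once.
                key = f"{name}{year}" if name in time_varying else name
                row[key] = value
    return [wide[pid] for pid in sorted(wide)]


wide_rows = merge_to_wide(waves)
print(wide_rows[0])
# {'ID': 1, 'sex': 'm', 'income2015': 1500, 'income2016': 1500, 'income2017': 2000}
```

The result has one row per person and one income column per wave.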

Data structure in wide format (wide)

The SOEP data are available with different data structures. In the wide format, a respondent’s repeated responses are displayed in a single row and each response in a separate column. Each column represents a variable. We provide four datasets in the wide format: ppath, phrf, hpath, hhrf.

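| Row | ID | sex | income2015 | income2016 | income2017 |
|-----|----|-----|------------|------------|------------|
| 1   | 1  | m   | 1500       | 1500       | 2000       |
| 2   | 2  | m   | 1000       | 1200       | 1200       |
| 3   | 6  | f   | 2000       | 2000       | 2000       |
| 4   | 8  | m   | 5500       | 6000       | 6500       |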
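
A wide dataset like this can be reshaped into one row per person and year. The sketch below uses plain Python and assumes that time-varying columns end in a four-digit year (as in `income2015`); the function name `wide_to_long` and the pattern `YEAR_SUFFIX` are hypothetical, and the output column `syear` is chosen here to hold the survey year.

```python
import re

wide_rows = [
    {"ID": 1, "sex": "m", "income2015": 1500, "income2016": 1500, "income2017": 2000},
    {"ID": 2, "sex": "m", "income2015": 1000, "income2016": 1200, "income2017": 1200},
    {"ID": 6, "sex": "f", "income2015": 2000, "income2016": 2000, "income2017": 2000},
    {"ID": 8, "sex": "m", "income2015": 5500, "income2016": 6000, "income2017": 6500},
]

# Assumption: a time-varying column is a name followed by a four-digit year.
YEAR_SUFFIX = re.compile(r"^(?P<name>[A-Za-z_]+)(?P<year>\d{4})$")


def wide_to_long(rows):
    """Turn one row per person into one row per person-year."""
    long_rows = []
    for row in rows:
        constant, by_year = {}, {}
        for key, value in row.items():
            match = YEAR_SUFFIX.match(key)
            if match:
                by_year.setdefault(int(match["year"]), {})[match["name"]] = value
            else:
                constant[key] = value
        for year in sorted(by_year):
            long_rows.append({**constant, "syear": year, **by_year[year]})
    return long_rows


long_rows = wide_to_long(wide_rows)
print(len(long_rows))  # 12
print(long_rows[2])    # {'ID': 1, 'sex': 'm', 'syear': 2017, 'income': 2000}
```

Time-constant variables such as `sex` are repeated on every person-year row, which is the price of the simpler row structure.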

Data structure in long format (long)

The long format is a condensed and user-friendly dataset structure for longitudinal analysis. Each person has one row per survey year, so instead of several datasets for the different waves there is a single dataset in which all survey waves are represented. A person can therefore appear more than once in such a dataset, and each row describes one person-year combination.

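| Row | ID | syear | sex | income |
|-----|----|-------|-----|--------|
| 1   | 1  | 2015  | m   | 1500   |
| 2   | 1  | 2016  | m   | 1500   |
| 3   | 1  | 2017  | m   | 2000   |
| 4   | 2  | 2015  | m   | 1000   |
| 5   | 2  | 2016  | m   | 1200   |
| 6   | 2  | 2017  | m   | 1200   |
| 7   | 6  | 2015  | f   | 2000   |
| 8   | 6  | 2016  | f   | 2000   |
| 9   | 6  | 2017  | f   | 2000   |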
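
Because each row in the long format is a person-year, intraindividual change can be computed by comparing consecutive rows of the same person. The following is a minimal sketch in plain Python with a small hypothetical sample (IDs 1, 2 and 6 observed from 2015 to 2017); the function names `has_unique_person_years` and `income_changes` are illustrative and not part of any SOEP tooling.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical long-format sample: (ID, syear, income).
long_rows = [
    (1, 2015, 1500), (1, 2016, 1500), (1, 2017, 2000),
    (2, 2015, 1000), (2, 2016, 1200), (2, 2017, 1200),
    (6, 2015, 2000), (6, 2016, 2000), (6, 2017, 2000),
]


def has_unique_person_years(rows):
    """Check that (ID, syear) identifies each row exactly once."""
    keys = [(pid, year) for pid, year, _ in rows]
    return len(keys) == len(set(keys))


def income_changes(rows):
    """Return {(ID, syear): income change since the person's previous wave}."""
    changes = {}
    ordered = sorted(rows, key=itemgetter(0, 1))
    for pid, group in groupby(ordered, key=itemgetter(0)):
        previous = None
        for _, year, income in group:
            if previous is not None:
                changes[(pid, year)] = income - previous
            previous = income
    return changes


print(has_unique_person_years(long_rows))  # True
print(income_changes(long_rows)[(1, 2017)])  # 500
```

Checking the person-year key first is a cheap safeguard, since duplicated keys would silently distort any within-person comparison.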

Data structure in spell format (spell)

Strictly speaking, spell data describe time periods with a defined start and end. When handling spell data, potential censoring must be taken into account. A spell is censored when its beginning (left-censored) or its end (right-censored) is not known precisely, either because information is missing or because the beginning or end lies outside the period of observation. A person may well have only one spell over a given period: someone who is employed full-time throughout a ten-year period has just the one spell “full-time employed”, whereas in panel data the same person would have ten observations, one per year. A person may also have many spells over a time period, and spells can even overlap, for example working part-time while receiving a disability pension. Spell data are useful for analysing stays in a certain state and transitions into and out of that state.

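| Row | ID | spellnr | spelltype            | begin | end  | censored                |
|-----|----|---------|----------------------|-------|------|-------------------------|
| 1   | 1  | 1       | Retired              | 1983  | 2007 | left and right censored |
| 2   | 1  | 2       | Housewife/husband    | 1983  | 1984 | left censored           |
| 3   | 1  | 3       | Housewife/husband    | 1994  | 1994 | uncensored              |
| 4   | 1  | 4       | Housewife/husband    | 1998  | 1998 | uncensored              |
| 5   | 2  | 1       | Full-Time Employment | 1984  | 1984 | left censored           |
| 6   | 2  | 2       | Full-Time Employment | 1985  | 1985 | uncensored              |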
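
The relationship between spells and person-year observations, and the possibility of overlapping spells, can be sketched as follows. This is a plain-Python illustration that assumes `begin` and `end` are inclusive calendar years; the `Spell` type and the functions `to_person_years` and `overlapping` are hypothetical names introduced here, and the censoring column is left out.

```python
from typing import NamedTuple


class Spell(NamedTuple):
    pid: int
    spellnr: int
    spelltype: str
    begin: int  # first calendar year of the spell (assumed inclusive)
    end: int    # last calendar year of the spell (assumed inclusive)


spells = [
    Spell(1, 1, "Retired", 1983, 2007),
    Spell(1, 2, "Housewife/husband", 1983, 1984),
    Spell(1, 3, "Housewife/husband", 1994, 1994),
    Spell(1, 4, "Housewife/husband", 1998, 1998),
    Spell(2, 1, "Full-Time Employment", 1984, 1984),
    Spell(2, 2, "Full-Time Employment", 1985, 1985),
]


def to_person_years(spells):
    """Expand each spell into one (pid, year, spelltype) record per year."""
    return [
        (s.pid, year, s.spelltype)
        for s in spells
        for year in range(s.begin, s.end + 1)
    ]


def overlapping(spells):
    """Return (pid, spellnr, spellnr) for spells of one person that overlap."""
    pairs = []
    for i, a in enumerate(spells):
        for b in spells[i + 1:]:
            if a.pid == b.pid and a.begin <= b.end and b.begin <= a.end:
                pairs.append((a.pid, a.spellnr, b.spellnr))
    return pairs


print(len(to_person_years(spells)))  # 31
print(overlapping(spells))           # [(1, 1, 2), (1, 1, 3), (1, 1, 4)]
```

The single "Retired" spell expands to 25 yearly records, which shows how much more compact the spell format is for long, uninterrupted states.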