Working with SOEP Data in Open Data Format

The SOEP offers data in the new Open Data Format (ODF). Because the format carries metadata that is accessible in Stata, R, and Python, it is particularly useful for R and Python users who want to load SOEP-core datasets together with their metadata.

This guide demonstrates how to open and work with SOEP data in ODF using R, Python, or Stata.

The Open Data Format

The Open Data Format (ODF) is a metadata-enriched, non-proprietary, and platform-independent data format. It includes metadata in multiple languages stored in DDI Codebook Format. The format is specified as a zip-compressed folder containing a CSV file with the data and an XML file with the metadata. To import and export data in the ODF format, the SOEP provides packages for Stata, R, and Python.
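
To make the container layout concrete, the following sketch builds a tiny ODF-like zip archive in memory and reads it back using only the Python standard library. The file names `data.csv` and `metadata.xml`, the placeholder labels, and the simplified XML structure are assumptions made for illustration; real ODF files use the DDI Codebook schema and should be read with the packages described in this guide.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Illustrative content only: the XML below is a simplified stand-in,
# not the DDI Codebook schema used by real ODF files.
CSV_TEXT = "id,bap87\n1,2\n2,1\n"
XML_TEXT = (
    "<metadata>"
    '<var name="bap87">'
    '<label lang="en">Example label</label>'
    '<label lang="de">Beispiellabel</label>'
    "</var>"
    "</metadata>"
)


def build_archive() -> bytes:
    """Zip one CSV data file and one XML metadata file into a byte string."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", CSV_TEXT)      # assumed file name
        archive.writestr("metadata.xml", XML_TEXT)  # assumed file name
    return buffer.getvalue()


def read_archive(raw: bytes):
    """Return (rows, labels): CSV rows as dicts, labels per variable and language."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        csv_text = archive.read("data.csv").decode("utf-8")
        xml_text = archive.read("metadata.xml").decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = {}
    for var in ET.fromstring(xml_text).iter("var"):
        labels[var.get("name")] = {
            label.get("lang"): label.text for label in var.iter("label")
        }
    return rows, labels


rows, labels = read_archive(build_archive())
print(rows[0])
print(labels["bap87"]["en"])
```

The point of the sketch is only that data and metadata travel together in one archive, so a reader can attach labels in several languages to the columns it loads.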

For further information on the Open Data Format, visit the ODF website.

Using the ODF in R

Installing the Opendataformat Package in R

To get started with SOEP data in the Open Data Format, install the opendataformat package from CRAN using install.packages() and load it with library().

install.packages("opendataformat")
library("opendataformat")

Read a Data File

Now you can use the package functions to work with ODF data files. For demonstration purposes, let’s load the sample data included in the opendataformat-package. To open the documentation/help-files for the read_odf() function you can execute ?read_odf.

path <- system.file("extdata", "data.zip", package="opendataformat")
?read_odf
df <- read_odf(file = path)

You can specify further parameters of the read_odf() function: the languages in which metadata should be imported, the maximum number of rows to import, the number of rows to skip (not counting the header), and the variables/columns to import (as indices or column names). The following line shows the default values:

df <- read_odf(file = path, languages = "all", nrows = Inf, skip = 0, select = NULL)

Display and Use Metadata

To display the metadata of a dataset or a variable in the viewer, use the docu_odf() function.

# Display dataset information
docu_odf(df)
[Figure: R viewer output of docu_odf() for a dataset]

You can also display information for a specific variable:

# Display variable information
docu_odf(df$bap87)
[Figure: R viewer output of docu_odf() for a variable]

You can also choose the language of the displayed metadata, provided the information is available in that language, by setting the languages parameter to a specific language (languages = "de"), or display metadata in all available languages (languages = "all"). The default is the current language, which is normally English. You can set the default language of an ODF data frame using the setlanguage_odf() function (e.g. df <- setlanguage_odf(df, "de")).

# Alternatively, the metadata can be displayed in the console by setting
# `style = "print"` or `style = "console"`, or in both places with `style = "both"`.
docu_odf(df$bap87, style = "print")
# You can also choose the language of the metadata (if that language is available).
docu_odf(df$bap87, languages = "de")
# To display the metadata in all available languages, set languages = "all".
docu_odf(df$bap87, languages = "all")
# You can also set the default language of a dataset.
df <- setlanguage_odf(df, "de")
# Variable information is then displayed in that language by default:
docu_odf(df$bap87)

The function read_odf() reads the data into R as a tibble (data frame). The metadata is stored in the attributes of the ODF tibble object and in the attributes of its columns. To retrieve attributes, you can use the base R functions attr() and attributes().

# display all attributes of a dataframe
attributes(df)
# display all attributes of a variable/column
attributes(df$bap87)

# You can also display a specific attribute of a dataframe:
attributes(df)$label_de
attr(df, "label_de")

# Or of a variable
attributes(df$bap87)$labels_de
attr(df$bap87, "labels_de")

Alternatively, you can use the getmetadata_odf() function from the opendataformat package to retrieve labels and other metadata:

# display the labels of all variables in a dataframe (in the active/current language)
getmetadata_odf(df, type = "labels")

# in a specific language
getmetadata_odf(df, type = "labels", language="de")

# To see which languages are available for the dataframe, set type="languages"
getmetadata_odf(df, type="languages")

# You can also display the value labels of a specific variable:
valuelabels <- getmetadata_odf(df$bap87, type="valuelabels")
valuelabels

# You can also display the descriptions, the URLs, or the variable types.
# Descriptions of all variables:
getmetadata_odf(df, type="description")

# Description of one variable:
getmetadata_odf(df$bap87, type="description")

# URLs:
getmetadata_odf(df, type="url")

# variable types:
getmetadata_odf(df, type="type")

Save a Dataset as an ODF-File

To save a dataset as an ODF file again, the opendataformat package provides the write_odf() function. All metadata stored in the attributes that is compatible with the ODF specification is preserved in the ODF file.

The compatible metadata for the dataframe includes the dataset name in the name-attribute, labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), and a URL in the url-attribute. The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), the variable type in the type-attribute, value labels with a language tag (e.g. labels_en), and a URL in the url-attribute.

You can choose to save the metadata in only one or several specific languages using the languages argument. The default is languages = "all".

# Write the dataframe `df` to the ODF-file `my_datafile.zip` in the current working directory.
write_odf(x=df, file="my_datafile.zip")
# Write the dataframe `df` while keeping only metadata in English.
write_odf(x=df, file="my_datafile.zip", languages="en")

For further instructions on how to work with the ODF-format in R, read the vignette for the opendataformat-package.

Using the ODF in Stata

Installing the Opendataformat Package

To work with SOEP data in the Open Data Format in Stata, install the opendf package from SSC using ssc install.

********** Download and install the opendf package: *********
ssc install opendf

Python Integration

The opendf package requires the Python integration in Stata. You therefore need Stata version 16 or higher and a Python installation on your computer. You can easily test whether Python is working within Stata. If you have problems with the Python integration, you can copy a Python version to your computer using the opendf installpython function from the opendf package:

********** opendf installpython and opendf removepython **********

* For the opendf package to work you need a working Python integration
* in Stata. Therefore you need a Python version on your computer and Stata
* must be able to find it.
* You can test if Python works:
python
print("Hello World")
end

* If you have problems with your Python integration in Stata, you can fix it
* easily (for the opendf package). The opendf package offers
* a function to copy a Python version to your computer into a location where the
* opendf package will find it.
* The function by default downloads Python 3.12 to the Stata ado folder where
* the package files are stored:
opendf installpython

* Alternatively you can specify path and version.
* If you choose another location, the opendf package will not
* automatically find the python.exe. You have to indicate the location each
* time you start Stata.
*opendf installpython, version("3.12") location("C:/.../")

* You can remove the Python version(s) from the default folder or a folder
* you define:
opendf removepython
*opendf removepython, version("3.12") location("C:/.../")

* For further instructions see the help files:
help opendf installpython
help opendf removepython

Read a Data File

Now you can use the package functions to work with ODF data files. For demonstration purposes, we copy the example dataset (example_dataset.zip) from GitHub to the local temp directory:

* Download the example dataset (example_dataset.zip) to the temp directory
copy "https://opendataformat.github.io/files/example_dataset.zip" `c(tmpdir)', replace

With the opendf read function you can load the data file to Stata:

********** opendf read **********

* To read an ODF data file you can use the opendf read function.
* Here the example_dataset.zip file is loaded from the temporary directory.
opendf read "`c(tmpdir)'example_dataset.zip"

You can specify further parameters of the opendf read-function: The range of rows you want to load (excluding the header), the range of columns you want to load, and whether you want to save the dataset directly as .dta-file.

********** opendf read **********

* You can also specify the rows and columns to load using the rowrange and
* colrange options. Here the first ten rows and columns 2-5 are loaded and
* the result is saved as testdata.dta.
opendf read "`c(tmpdir)'example_dataset.zip", rowr(:10) colr(2:5) save("testdata.dta") replace clear

* For more information, see the help file
help opendf read

Display and Use Metadata

To display metadata for the dataset, use the opendf docu function:

* You can display metadata/information using the opendf docu function.
* Display metadata for the dataset:
opendf docu
[Figure: Stata console output of opendf docu]

Or information on a specific variable:

* Display metadata for a variable
opendf docu bap87
[Figure: Stata console output of opendf docu bap87]

You can also choose the languages of the displayed metadata, provided the information is available in those languages, by setting the languages option (e.g. languages("en")), or display metadata in all languages (languages("all")). The default is the active label language. You can set the active language using the label language command.

* To switch to another label/metadata language use the label language command
label language de

* Alternatively you can use the languages option of opendf docu
opendf docu, languages("en")
opendf docu, languages("all")

The metadata is stored in the characteristics of the dataset. To retrieve them you can use the Stata base functions:

* Display characteristics directly
char list
local _url: char _dta[url]
di "`_url'"
local _description_bap87: char bap87[description_de]
di "`_description_bap87'"

Save a Dataset as an ODF-File

To save a dataset as an ODF file, the opendf package provides the opendf write function. All metadata stored in the characteristics that is compatible with the ODF specification is preserved in the ODF file.

The compatible metadata for the dataframe includes the dataset name in the dataset-characteristic, labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), and a URL in the url-characteristic. The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), the variable type in the type-characteristic, value labels with a language tag (e.g. labels_en), and a URL in the url-characteristic.

You can also specify the columns/variables to save and the languages of the metadata. By default, the metadata is saved in all available languages (languages("all")).

********** opendf write **********

* To save the dataset to an ODF file you can use the opendf write function.
* To simply write it to the working directory, just enter the file name.
* If the data file already exists and you want to replace it,
* you need the replace option.
opendf write "testdata.zip", replace

* To save only specific variables, you can use the variables option:
opendf write "testdata.zip", variables(bap9201 bap9001 bap87) replace

* If you want to save the metadata of only one language or particular languages,
* you can use the languages option:
opendf write "testdata.zip", languages("de en") replace
opendf write "testdata.zip", languages("de") replace

* For more information, see the help file
help opendf write

Using the ODF in Python

Installing the Opendataformat Package

To work with SOEP data in the Open Data Format in Python, install the opendataformat library using pip install. Currently, the development version can be installed from GitHub by running pip install with a git+ URL on the command line:

pip install git+https://github.com/opendataformat/python-package-opendataformat.git
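
After installation, it can be useful to confirm that Python can actually find the package before running the examples. The following is a minimal sketch using only the standard library; the helper name `is_installed` is a hypothetical choice for this illustration and is not part of the opendataformat package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("opendataformat"):
    print("opendataformat is available")
else:
    print("opendataformat not found; run the pip command above")
```

The check only locates the module without importing it, so it is cheap and does not fail if one of the package's own dependencies is missing.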

Read an ODF Data File

To use the opendataformat package, first import the library in Python:

import opendataformat as odf

Now you can use the package functions to work with ODF-data files. With the read_odf() function you can load the data file to Python. For demonstration purposes we load the example-dataset from the ODF-Website:

df = odf.read_odf('https://opendataformat.github.io/files/example_dataset.zip')

You can specify further parameters of the read_odf() function: The number of rows you want to read (excluding the header), the number of rows you want to skip (excluding the header), the columns you want to load, the metadata languages you want to load, and which values should be treated as NAs.

df = odf.read_odf(path = 'https://opendataformat.github.io/files/example_dataset.zip', languages = "all", usecols = None, skiprows=None, nrows=None, na_values = None)
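
The names nrows, skiprows, usecols, and na_values match parameters of pandas.read_csv. As a rough illustration of the semantics described above, and not of the package's actual implementation, the following sketch applies the same ideas to a small CSV string using only the standard library. The function `read_rows` and the sample values are hypothetical.

```python
import csv
import io
from itertools import islice


def read_rows(text, usecols=None, skiprows=0, nrows=None, na_values=()):
    """Hypothetical helper: read CSV text with row and column selection.

    skiprows and nrows count data rows, so the header is never skipped.
    Values listed in na_values are replaced by None.
    """
    reader = csv.DictReader(io.StringIO(text))
    stop = None if nrows is None else skiprows + nrows
    result = []
    for row in islice(reader, skiprows, stop):
        columns = usecols if usecols is not None else row.keys()
        result.append(
            {col: (None if row[col] in na_values else row[col]) for col in columns}
        )
    return result


SAMPLE = "id,bap87,bap9201\n1,2,-1\n2,1,3\n3,4,-1\n4,3,2\n"

# Skip the first data row, read the next two, keep two columns,
# and treat "-1" as a missing value.
subset = read_rows(
    SAMPLE, usecols=["id", "bap9201"], skiprows=1, nrows=2, na_values={"-1"}
)
print(subset)
```

Restricting rows and columns at read time is mainly useful for large files, where loading only the variables of interest keeps memory use low.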

Display and Use Metadata

To display metadata for the dataset, use the docu_odf() function. It both displays the metadata dictionary and returns it.

# Display metadata for dataset and assign metadata dictionary to `metadata`
metadata = odf.docu_odf(df)
[Figure: Python output of docu_odf() for a dataset]

You can also use docu_odf() to display the metadata for a variable:

# Display metadata for variable and assign metadata dictionary to `variable_metadata`
variable_metadata = odf.docu_odf(df.bap87)
[Figure: Python output of docu_odf() for a variable]

You can also display a specific metadata field in a specific language. To choose the languages of the displayed metadata, provided the information is available in those languages, set the languages parameter (e.g. languages = "de"). To select a specific metadata field, set the metadata parameter to the corresponding entry (e.g. metadata = "valuelabels").

# Display value labels in German and assign them to `valuelabels_de`
valuelabels_de = odf.docu_odf(df.bap87, metadata = "valuelabels", languages = "de")
[Figure: Python output of docu_odf() for value labels in German]

Save a Dataset as an ODF-File

To save a pandas DataFrame as an ODF file, the opendataformat package provides the write_odf() function. All metadata stored in the attributes that is compatible with the ODF specification is preserved in the ODF file.

The compatible metadata for the dataframe includes the dataset name in the dataset-attribute, labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), and a URL in the url-attribute. The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), the variable type in the type-attribute, value labels with a language tag (e.g. labels_en), and a URL in the url-attribute.

You can also specify the columns/variables to save and the languages of the metadata. By default, the metadata is saved in all available languages (languages = "all").

# Save dataset as `dataset.zip` in the current working directory
odf.write_odf(df, 'dataset.zip')

# You can also keep only the metadata for specific languages using the languages parameter.
# The default value, shown here, keeps the metadata in all languages.
odf.write_odf(df, 'dataset.zip', languages = "all")
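
The language-tag naming convention described above (label_en, description_en, labels_en, next to untagged entries such as type and url) can be illustrated with a plain dictionary. The following sketch shows what restricting metadata to selected languages amounts to conceptually. The dictionary contents and the helper `keep_languages` are hypothetical and do not reflect how write_odf() is implemented.

```python
import re

# Hypothetical metadata for one variable, following the naming convention
# label_<lang>, description_<lang>, labels_<lang>, plus untagged type and url.
variable_metadata = {
    "label_en": "Example label",
    "label_de": "Beispiellabel",
    "labels_en": {1: "Yes", 2: "No"},
    "labels_de": {1: "Ja", 2: "Nein"},
    "type": "numeric",
    "url": "https://example.org/bap87",
}

# Matches keys that carry a two-letter language tag.
LANGUAGE_TAGGED = re.compile(r"^(label|labels|description)_([a-z]{2})$")


def keep_languages(metadata, languages):
    """Drop language-tagged entries whose language is not in `languages`.

    Untagged entries (here: type and url) are always kept.
    """
    kept = {}
    for key, value in metadata.items():
        match = LANGUAGE_TAGGED.match(key)
        if match is None or match.group(2) in languages:
            kept[key] = value
    return kept


english_only = keep_languages(variable_metadata, {"en"})
print(sorted(english_only))
```

Keeping the language in the key name means that entries without a tag stay language-independent, while labels and descriptions can be added or removed per language without touching the rest.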

Last change: Jan 13, 2025