Working with SOEP Data in Open Data Format
The SOEP offers data in the new open data format (opendf). This format includes metadata accessible in Stata and R, making it particularly useful for R-users who want to load SOEP-core datasets with metadata directly in R.
This guide demonstrates how to open and work with SOEP data in opendf format using R or Stata.
The Open Data Format
The open data format (opendf) is a metadata-enriched, non-proprietary, and platform-independent data format. It includes metadata in multiple languages stored in DDI Codebook Format. The format is specified as a zip-compressed folder containing a CSV file with the data and an XML file with the metadata. To import and export data in the opendf format, the SOEP provides packages for R and Stata.
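Because an opendf file is an ordinary zip archive, its contents can be inspected with base R alone. The following is a minimal sketch, not part of the opendataformat package: `peek_opendf_zip()` is a hypothetical helper, and it only assumes that the archive holds one file ending in `.csv` and one ending in `.xml`, as described above.

```r
# Sketch: look inside an opendf zip archive using base R only.
# peek_opendf_zip() is a hypothetical helper, not a package function.
peek_opendf_zip <- function(path, n = 5) {
  # List the archive members without extracting them
  members <- utils::unzip(path, list = TRUE)$Name
  csv_name <- members[grepl("\\.csv$", members, ignore.case = TRUE)][1]
  xml_name <- members[grepl("\\.xml$", members, ignore.case = TRUE)][1]
  if (is.na(csv_name) || is.na(xml_name)) {
    stop("Expected a CSV file and an XML file in the archive")
  }
  # Read the first n data rows straight from the archive
  preview <- utils::read.csv(unz(path, csv_name), nrows = n)
  list(csv = csv_name, xml = xml_name, preview = preview)
}
```

Calling `peek_opendf_zip("my_datafile.zip")` returns the two member names and a small preview of the data. The metadata itself is better read with the package functions described below, since they interpret the XML for you.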
Using the opendf in R
Installing the Opendataformat Package in R
To get started with SOEP data in the open data format, install the opendataformat package using the install_git() function from the devtools package:
#install.packages("devtools")
library(devtools)
devtools::install_git("https://git.soep.de/opendata/r-package-opendataformat.git")
library("opendataformat")
Read a Data File
Now you can use the package functions to work with opendf data files. For demonstration purposes, let’s load the sample data included in the opendataformat-package. To open the documentation/help-files for the read_opendf() function you can execute ?read_opendf.
path <- system.file("extdata", "data.zip", package="opendataformat")
?read_opendf
df <- read_opendf(file = path)
You can specify further parameters of the read_opendf()-function: The languages in which metadata should be imported, the (maximum) number of rows you want to import, the number of rows you want to skip (excluding the header), and the variables/columns you want to import (indices or column names). In the following code line, you see the default parameters:
df <- read_opendf(file = path, languages = "all", nrows = Inf, skip = 0, variables = NULL)
Display and Use Metadata
To display metadata, use the docu_opendf() function. With style="print" or style="console", the metadata is displayed in the console. Alternatively, style="viewer" or style="html" displays the metadata in the viewer. The default is style="both".
#To display dataset information
docu_opendf(df, style="print")
You can additionally display a list of the variables in the dataset:
#You can additionally display a list of the variables in the dataset
docu_opendf(df, variables="yes")
Or information on a specific variable:
#Variable information
docu_opendf(df$bap87)
You can also choose the language of the displayed metadata, provided the information is available in that language, by setting the languages parameter to a particular language (languages="de"), or display metadata in all available languages (languages="all"). The default is the current language (which is normally English). You can set the default language of a dataframe in opendf format using the setLanguage_opendf() function (e.g. df <- setLanguage_opendf(df, "de")).
#You can also choose the language of the metadata (when the language is available).
docu_opendf(df$bap87, languages="de")
#To display the metadata in all available languages, set languages="all"
docu_opendf(df$bap87, languages="all")
#You can also set the default language of a dataset
df<-setLanguage_opendf(df, "de")
#Then variable information will by default be displayed in the respective language:
docu_opendf(df$bap87)
The metadata is stored in the attributes of the opendf-dataframe object and in the attributes of the columns of the opendf-dataframe object. To retrieve attributes you can use the R base functions attr() and attributes().
# display all attributes of a dataframe
attributes(df)
# display all attributes of a variable/column
attributes(df$bap87)
# You can also display a specific attribute of a dataframe:
attributes(df)$label_de
attr(df, "label_de")
# Or of a variable
attributes(df$bap87)$labels_de
attr(df$bap87, "labels_de")
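Since the metadata are plain attributes, language-specific lookups can also be written by hand. The sketch below uses a made-up toy vector rather than real SOEP data. `get_label()` is a hypothetical helper, and storing value labels as a named vector is an assumption made for illustration only.

```r
# Toy variable mimicking the attribute naming shown above (values are made up)
health <- c(1, 2, 2, 1)
attr(health, "label_en") <- "Health"
attr(health, "label_de") <- "Gesundheit"
# Assumption: value labels are held as a named vector
attr(health, "labels_en") <- c("Very good" = 1, "Good" = 2)

# Hypothetical helper: return the label in `lang`, falling back to `fallback`
get_label <- function(x, lang = "en", fallback = "en") {
  # exact = TRUE stops attr() from partially matching a longer attribute name
  label <- attr(x, paste0("label_", lang), exact = TRUE)
  if (is.null(label)) {
    label <- attr(x, paste0("label_", fallback), exact = TRUE)
  }
  label
}

get_label(health, "de")  # label in German
get_label(health, "fr")  # no French label stored, falls back to English
```

The `exact = TRUE` argument matters because `attr()` otherwise allows partial name matching, which can be surprising when attribute names share a prefix.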
Alternatively, you can use the labels_opendf() function from the opendataformat package to retrieve labels and other metadata:
# display the labels of all variables in a dataframe (in the active/current language)
labels_opendf(df)
#in a specific language
labels_opendf(df, language="de")
# To see which languages are available for the dataframe, set retrieve="languages"
labels_opendf(df, retrieve="languages")
# You can also display the value labels of a specific variable:
labels_opendf(df$bap87, valuelabels=TRUE)
# alternative:
labels_opendf(df$bap87, retrieve="valuelabels")
# You can also display the descriptions, the URLs or the variable types :
# Descriptions of all variables:
labels_opendf(df, retrieve="description")
# of one variable:
labels_opendf(df$bap87, retrieve="description")
# URLs:
labels_opendf(df, retrieve="url")
# variable types:
labels_opendf(df, retrieve="type")
Write an opendf-dataframe to a Data-File
To save a dataframe as an opendf file again, the opendataformat package provides the write_opendf() function. All metadata stored in the attributes that is compatible with the opendf specification is preserved in the opendf file.
The compatible metadata for the dataframe includes the dataset name in the name attribute, labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), and a URL in the url attribute. The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), the variable type in the type attribute, value labels with a language tag (e.g. labels_en), and a URL in the url attribute.
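A derived variable has no such attributes until you set them. The following sketch shows one way to attach metadata by hand with base R, using the attribute names listed above. The toy data, the label texts, the URL, and the value stored in `type` are all assumptions for illustration. The accepted type values and the exact structure of value labels should be checked against the package documentation.

```r
# Toy data and a derived variable (all values are made up)
df_toy <- data.frame(age = c(25, 40, 67))
df_toy$senior <- as.integer(df_toy$age >= 65)

# Dataset-level metadata, using the attribute names described above
attr(df_toy, "name") <- "toy_data"
attr(df_toy, "label_en") <- "Toy dataset"
attr(df_toy, "url") <- "https://example.org/toy_data"  # placeholder URL

# Variable-level metadata for the derived column
attr(df_toy$senior, "label_en") <- "Aged 65 or older"
attr(df_toy$senior, "description_en") <- "Indicator derived from age"
attr(df_toy$senior, "type") <- "numeric"                # assumed type value
attr(df_toy$senior, "labels_en") <- c(No = 0, Yes = 1)  # assumed structure

# Use exact = TRUE when reading "name", otherwise attr() could match "names"
attr(df_toy, "name", exact = TRUE)
attributes(df_toy$senior)
```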
You can choose to save metadata in only one or several specific languages using the languages argument. The default is languages="all".
# Write the dataframe `df` to the opendf-file `my_datafile.zip` in the current working directory.
write_opendf(x=df, file="my_datafile.zip")
# Write the dataframe `df` while keeping only metadata in English.
write_opendf(x=df, file="my_datafile.zip", languages="en")
For further instructions on how to work with the opendf-format in R, read the vignette (PDF, HTML) for the opendataformat-package.
Using the opendf in Stata
Installing the Opendataformat Package
To work with SOEP-Data in the open data format in Stata you have to install the opendataformat package in Stata using net install to download the package files from the GitHub repository. You can find the instructions on the GitHub page.
********** Download and Install opendataformat package: *********
net install opendf, from(https://thartl-diw.github.io/opendf/) replace
Python Integration
The opendataformat package requires the Python integration in Stata. You therefore need Stata version 16 or higher and a Python installation on your computer. You can easily test whether Python works within Stata. If you have problems with the Python integration, you can copy a Python version to your computer using the opendf installpython command from the opendataformat package:
********** opendf installpython and opendf removepython **********

* For opendf read to work you need a working Python integration
* in Stata. Therefore you need a Python version on your computer and Stata
* must be able to find it.
* You can test if Python works:
python
print("Hello World")
end

* If you have problems with your Python integration in Stata, you can fix it
* easily (for the opendataformat package). The opendataformat package offers
* a function to copy a Python version to your computer into a location where
* the opendataformat package will find it.
* The function by default downloads Python 3.12 to the Stata ado folder where
* the package files are stored:
opendf installpython

* Alternatively you can specify path and version.
* If you choose another location, the opendataformat package will not
* automatically find the python.exe. You have to indicate the location each
* time you start Stata.
*opendf installpython, version("3.12") location("C:/.../")

* You can remove the Python version(s) from the default folder or a folder
* you define:
opendf removepython
*opendf removepython, version("3.12") location("C:/.../")

* For further instructions see the help files:
help opendf installpython
help opendf removepython
Read a Data File
Now you can use the package functions to work with opendf data files. For demonstration purposes, we copy the test dataset (testdata.zip) from GitHub to the local temp directory:
* Download testdata (testdata.zip) to Temp-Directory
copy "https://raw.githubusercontent.com/thartl-diw/opendf/main/Testdata/testdata.zip" `c(tmpdir)', replace
With the opendf read function you can load the data file to Stata:
********** opendf read **********

* To read an opendf data file you can use the opendf read function.
* Here the testdata.zip file is loaded from the temporary directory.
opendf read "`c(tmpdir)'testdata.zip"
You can specify further options of the opendf read function: the range of rows you want to load (excluding the header), the range of columns you want to load, and whether you want to save the dataset directly as a .dta file.
********** opendf read **********

* To read an opendf data file you can use the opendf read function.
* Here the testdata.zip file is loaded from the temporary directory.
opendf read "`c(tmpdir)'testdata.zip"

* You can also specify the rows and columns to load using the rowrange and
* colrange options. Here the first ten rows and columns 2-5 are loaded.
opendf read "`c(tmpdir)'testdata.zip", rowr(:10) colr(2:5) save("testdata.dta") replace clear

* For more information look into the help file
help opendf read
Display and Use Metadata
To display metadata you can use the opendf docu-function for the dataset information:
* You can display metadata/information using the opendf docu function.
* Display Metadata for Dataset:
opendf docu
Or information on a specific variable:
* Display Metadata for Variable
opendf docu bap87
You can also choose the languages of the displayed metadata, provided the information is available in those languages, by setting the languages option (e.g. languages("en")), or display metadata in all languages (languages("all")). The default is the active label language. You can set the active language using the label language command.
* To switch to another label/metadata language use the label language function
label language de

* Alternatively you can use the languages option of opendf docu
opendf docu, languages("en")
opendf docu, languages("all")
The metadata is stored in the characteristics of the dataset. To retrieve them you can use the Stata base functions:
* Display characteristics directly
char list
local _url: char _dta[url]
di "`_url'"
local _description_bap87: char bap87[description_de]
di "`_description_bap87'"
Write an opendf-dataframe to a Data-File
To save a dataset as an opendf file, the opendataformat package provides the opendf write function. All metadata stored in the characteristics that is compatible with the opendf specification is preserved in the opendf file.
The compatible metadata for the dataset includes the dataset name in the dataset characteristic, labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), and a URL in the url characteristic. The compatible metadata for the variables/columns includes labels with a language tag (e.g. label_en), descriptions with a language tag (e.g. description_en), the variable type in the type characteristic, value labels with a language tag (e.g. labels_en), and a URL in the url characteristic.
You can also specify the columns/variables to save and the languages of the metadata. By default, the metadata is saved in all available languages (languages("all")).
********** opendf write **********

* To save the dataset to an opendf file you can use the opendf write function.
* To simply write it to the working directory, just enter the file name.
* If the data file already exists and you want to replace it,
* you need the replace option.
opendf write "testdata.zip", replace

* To save only specific variables, you can use the variables option:
opendf write "testdata.zip", variables(bap9201 bap9001 bap87) replace

* If you want to save the metadata of only one language or particular languages,
* you can use the languages option:
opendf write "testdata.zip", languages("de en") replace
opendf write "testdata.zip", languages("de") replace

* For more information look into the help file
help opendf write
Last change: Dec 04, 2024