Working with spatial data in R

Prerequisites

Data Access

To work with SOEP’s spatial data you need to have a data distribution contract together with an application for stationary access to a terminal in the Research Data Center (RDC). The forms can be downloaded from here. Having data access granted, you can use the IGEL workstations in the SOEP RDC to get access to the corresponding server infrastructures MORAN and HAUSER. The MORAN server hosts the coordinates of the SOEP households together with some fake coordinates that cannot be distinguished. On this server you can, for example, compute distances or nearest points as shown later in this tutorial. On the HAUSER server you are provided your computational outcome form the MORAN server but without the coordinates. The coordinates of the SOEP households must never be available together with SOEP survey data. More information on how to use the SOEP IGEL technology can be found here.

R

When working with SOEP’s spatial data at the RDC using SOEP IGEL we provide you with the software you need. If you want to work with spatial test data before, you need a certain software setup. To be able to work yourself through the following examples you need to have R and RStudio installed. Further, for working the spatial data in R, you need to have GEOS, GDAL, and PROJ installed, too. When running a Windows OS, simply install the Rtools for Windows from here. MacOS and Linux users are referred to the website of the sf package for installation instructions. You can find them here.

If you are not yet familiar with R we recommend the book R for Data Science by Wickham and Grolemund which is available here for free. Besides that, there two freely available books which are helpful to get you started with spatial data in R:

  • Spatial Data Science by Pebesma and Bivand which is available here

  • Geocomputation with R by Lovelace, Nowosad, and Muenchow which is available here

Coordinate Reference Systems

Any location of an object can be characterised by its coordinates. Besides a coordinate the altitude of an object might be of interest. So locations can be given in 2-dimensional or 3-dimensional space. A 2-dimensional coordinate referenced to two orthogonal axes (say x and y) can be given by either a pair of values (x,y) or by an angle (with respect to one of the axes) and a distance. The pair of coordinates (x,y) is referred to as cartesian coordinates and the latter as polar or spherical coordinates. In a 3-dimensional world, where you define a location on a globe, things become a little more complicated, because earth is an ellipsoid rather than a perfect sphere. In defining a location in a 3-dimensional coordinate system, the coordinate reference system (CRS) is very important. The CRS combines the information about the coordinate system itself and a geodetic datum. The geodetic datum describes how the location of the coordinate system is related to earth. So different CRS will provide different values of coordinates for the same location. Finally, the projection method provides information on how to flatten objects that are on a round surface so they can be viewed on a flat surface. The CRS is often summarised by an EPSG-code (European Petroleum Survey Group Geodesy) allowing for convenient transformations between different CRS. For more details we refer you to Chapter 2 of Spatial Data Science or Chapter 6 of Geocomputation with R. The most frequently used CRS are displayed in the following table.

EPSG code

Note

PRJ4-String

Projection

3857

WGS 84 / Pseudo-Mercator

+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs

Popular Visualisation Pseudo Mercator

4326

WGS 84

+proj=longlat +datum=WGS84 +no_defs +type=crs

(null)

25832

ETRS89 / UTM zone 32N

+proj=utm +zone=32 +ellps=GRS80 +units=m +no_defs +type=crs

Transverse Mercator

25833

ETRS89 / UTM zone 33N

+proj=utm +zone=33 +ellps=GRS80 +units=m +no_defs +type=crs

Transverse Mercator

32632

WGS 84 / UTM zone 32N

+proj=utm +zone=32 +datum=WGS84 +units=m +no_defs +type=crs

Transverse Mercator

32633

WGS 84 / UTM zone 33N

+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +type=crs

Transverse Mercator

The differences in the maps output can be seen in the following figure. The maps show Europe (provided by https://data.opendatasoft.com) in two different CRS.

Europe displayed in different CRS.

The CRS of the fake SOEP households is defined by the EPSG code 32632. The CRS usually used by Google Maps / OpenStreetMap is defined by EPSG code 4326. To check you can use for example https://coordinates-converter.com/.

If you already know about all of the prerequisites and are familiar with R, the sf package as well as the tidyverse, you can directly switch to the complete example provided in section Complete Example. If you are not familiar with all of this, we recommend you go through the following text first.

Reading data

Spatial data comes in a variety of formats, ranging from ASCII-files to more sophisticated data structures like shapefiles or XML-dialects. Fortunately the R package sf provides functions to conveniently read most of the data formats. Besides that, the package and its use is well documented here. The package can easily be installed and loaded by using the following lines of code in R.

install.packages('sf') # installing the package
library(sf) # loading the package

# further packages
install.packages('tidyverse') # a selection of packages
library(tidyverse) # (mainly using dplyr, ggplot2)
install.packages('here') # for working with .RProj and paths
library(here)

Further packages that are useful when working with R are the tidyverse (a selection of several packages) and here (makes working in projects easier). A complete list of packages used to create this page is provided in Appendix at the end of this page.

When working with spatial data in the SOEP we provide you with fake coordinates of the SOEP households in a shapefile format. This format is easy to read using the sf package in R and provides you with all the details you need. The data you bring along is likely to be in a different format. Formats that can be read using st_read are listed in the GDAL Documentation. If you bring along data in other formats consider the haven package for SAS, SPSS, or Stata formats, documented here. Non-spatial data formats however, mostly include no information on the CRS. Thus you need to know and assign it to the data. An example when reading data from spreadsheets is provided below.

Shapefiles

Shapefiles consist of at least three files, namely

  • .shp containing the geometries for example for a point (address) or a polygon (state)

  • .dbf containing the attribute data for each geometry, for exampe, the address data (street, zip code, city) or names of the states

  • .shx containing the indices linking geometries and attribute data

There are usually some more files provided containing further information. One example of shapefiles is provided by the Federal Agency for Cartography and Geodesy (BKG, GeoBasis-DE / BKG 2019). The BKG provides public available data on administrative areas in Germany. The dataset VG250 includes theses administrative units of the hierarchical levels from the country down to municipalities and is available here. Unzipping the folder will provide you with the shapefiles and the documentation of the data. Because the data contains information on different levels (Germany, its states, district, municipalities) these are stored in different layers. To check which information is avaialable in the data, use st_layers. If you do not explicitly provide a layer, st_read will take the first available layer. The function returns the layer names, the according type of geometry, the number of features (geometries) and fields (variables). Reading the German Federal States you need to pick the layer VG250_LAN.

# layers in the data
st_layers(here('Daten', 'vg250_ebenen_0101'))
## Driver: ESRI Shapefile
## Available layers:
##    layer_name geometry_type features fields
## 1   VG250_GEM       Polygon    11135     26
## 2   VG250_KRS       Polygon      431     26
## 3   VG250_LAN       Polygon       35     26
## 4    VG250_LI   Line String    35483      4
## 5    VG250_PK         Point    11003     13
## 6   VG250_RBZ       Polygon       20     26
## 7   VG250_STA       Polygon       11     26
## 8   VG250_VWG       Polygon     4688     26
## 9    VG_DATEN            NA       35      8
## 10   VG_WERTE            NA       32      4
# read the shapefile for Federal States
States <- st_read(here('Daten', 'vg250_ebenen_0101'), layer = 'VG250_LAN', quiet = TRUE)

Taking a look at a the main variables GEN (Geographical name), BEZ (Designation), and geometry (a multipolygon), we are provided with information displayed differently from usual data.frame or tibble like data. The ‘header’ tells you it’s a collection of simple features having 16 features (rows) and 2 fields (variables, excluding the geometry column). The type of geomtry is a multipolygon (set of polygons) in a two-dimensional space (XY). The bounding box (bbox) provides information on the bottom left and top right corner of the box bounding the geometries. Finally, you get information on the coordinate reference system (CRS), summarized by EPSG-Code 25832.

# look at the data
States %>%
  filter(GF == 4) %>% # use land with structure only
  select(GEN, BEZ, geometry) # select necessary variables
## Simple feature collection with 16 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6101444
## projected CRS:  ETRS89 / UTM zone 32N
## First 10 features:
##                    GEN                  BEZ                       geometry
## 1   Schleswig-Holstein                 Land MULTIPOLYGON (((464810.7 61...
## 2              Hamburg Freie und Hansestadt MULTIPOLYGON (((578219 5954...
## 3        Niedersachsen                 Land MULTIPOLYGON (((479451.1 59...
## 4               Bremen     Freie Hansestadt MULTIPOLYGON (((466930.3 58...
## 5  Nordrhein-Westfalen                 Land MULTIPOLYGON (((477607.3 58...
## 6               Hessen                 Land MULTIPOLYGON (((534242 5721...
## 7      Rheinland-Pfalz                 Land MULTIPOLYGON (((416304.5 56...
## 8    Baden-Württemberg                 Land MULTIPOLYGON (((546771.2 55...
## 9               Bayern            Freistaat MULTIPOLYGON (((609387.6 52...
## 10            Saarland                 Land MULTIPOLYGON (((359841 5499...

SOEP Regional Data

Also the fake coordinates for the SOEP households are provided in a shapefile format. The data contains an ID-variable (ID) and the survey year (erhebj) together with a point geometry. The ID provided is not related to any of the SOEP’s hid or pid. It is simply a number ranging from 1 to the corresponding number of rows in the data set. Households can thus only be identified, merged, or selected using their coordinate. Moreover the data also contain the SOEP’s initial sample. Thus containing regional information on respondents and non-respondents. Because of data protection regulations, the SOEP survey data must not be used together with the household’s coordinates. This is why the environment where you are allowed to work with the household’s coordinates is strictly separated. Unlike the VG250 data the SOEP data has the EPSG-code 32632.

SOEP <- st_read(here('Daten', 'soep_v29'))
## Reading layer `soep_hh_korr_nur_fakes_utm32-v29' from data source `/media/H/Projekte/Adhoc_Analysen/GeodatenBSP/Daten/soep_v29' using driver `ESRI Shapefile'
## Simple feature collection with 209734 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 288107 ymin: 5250939 xmax: 888152 ymax: 6055336
## projected CRS:  WGS 84 / UTM zone 32N
SOEP
## Simple feature collection with 209734 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 288107 ymin: 5250939 xmax: 888152 ymax: 6055336
## projected CRS:  WGS 84 / UTM zone 32N
## First 10 features:
##    erhebj ID               geometry
## 1    2000  1 POINT (355929 5665928)
## 2    2000  2 POINT (800240 5821852)
## 3    2000  3 POINT (794464 5831012)
## 4    2000  4 POINT (478831 5431318)
## 5    2000  5 POINT (799318 5825023)
## 6    2000  6 POINT (794752 5833726)
## 7    2000  7 POINT (799149 5824960)
## 8    2000  8 POINT (791594 5827687)
## 9    2000  9 POINT (791657 5827422)
## 10   2000 10 POINT (792280 5826345)

Spreadsheets

When reading spatial data from spreadsheets the data is read by different drivers. The example data is contained in an Excel spreadsheet with three variables X (longitude), Y (latitude), and Object. Using the st_read function the csv-driver from GDAL is chosen automatically, reading 4 columns however. Thus we get rid of the last column first.

POI <- st_read(here('Daten', 'POIs_Berlin.csv'))
## Reading layer `POIs_Berlin' from data source `/media/H/Projekte/Adhoc_Analysen/GeodatenBSP/Daten/POIs_Berlin.csv' using driver `CSV'
POI
##                   X                Y                    Objekt field_4
## 1  13.3776978296166 52.5162788298363         Brandenburger Tor    <NA>
## 2  13.3693019102667 52.5250274843424              Hauptbahnhof    <NA>
## 3  13.3886070846104 52.5121727411876                DIW Berlin    <NA>
## 4  13.4589278721678  52.496825784817              Mulecule Man    <NA>
## 5  13.2887256012712 52.5580119710438           Flughafen Tegel    <NA>
## 6  13.4005469632413 52.4792779435797          Tempelhofer Feld    <NA>
## 7  13.4484578382279 52.4482123689564          Hufeisensiedlung    <NA>
## 8   13.281110837709 52.4473851451014                 FU Berlin    <NA>
## 9  13.1725456943567 52.4300072503147                   Wannsee    <NA>
## 10 13.2128459704725 52.5411476127425         Zitadelle Spandau    <NA>
## 11 13.5727300494297 52.4438163134271          Schloss Köpenick    <NA>
## 12 13.2795658406826 52.5080182126806 Zentraler Omnibus Bahnhof    <NA>
POI <- POI[, -4]

Because this data contains latitude and longitude already, we can simply transform it to a spatial dataset using st_as_sf and the CRS has to be assigned to it.

POI <- st_as_sf(POI, coords = c("X", "Y"), crs = 4326)
POI
## Simple feature collection with 12 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 13.17255 ymin: 52.43001 xmax: 13.57273 ymax: 52.55801
## geographic CRS: WGS 84
## First 10 features:
##               Objekt                  geometry
## 1  Brandenburger Tor  POINT (13.3777 52.51628)
## 2       Hauptbahnhof  POINT (13.3693 52.52503)
## 3         DIW Berlin POINT (13.38861 52.51217)
## 4       Mulecule Man POINT (13.45893 52.49683)
## 5    Flughafen Tegel POINT (13.28873 52.55801)
## 6   Tempelhofer Feld POINT (13.40055 52.47928)
## 7   Hufeisensiedlung POINT (13.44846 52.44821)
## 8          FU Berlin POINT (13.28111 52.44739)
## 9            Wannsee POINT (13.17255 52.43001)
## 10 Zitadelle Spandau POINT (13.21285 52.54115)

Transformations

To be able to work with all three data sets (States, SOEP, and POI) we have to make sure all have the same CRS. Using st_transform we can reproject the data sets to have the same EPSG-code (common_crs).

common_crs <- 25832

SOEP <- st_transform(SOEP, crs = common_crs)
SOEP <- SOEP[SOEP$erhebj == 2005, ] # only use 2005 SOEP data

POI <- st_transform(POI, crs = common_crs)

# States <- st_transform(States, crs = common_crs) aleady correct crs

Sometimes it might be easier for you to work in other programs and you wish to have latitude and longitude data. In this case you can use st_coordinates to transform point geometries into (x,y) coordinates. The below example first transforms the coordinates for the households 1 to 5 in the SOEP data into latitude and longitude data and then creates the (x,y) coordinates from the point geometry.

soep5_lat_lon <-  st_transform(SOEP[1:5,], 4326)
st_coordinates(soep5_lat_lon)
##          X        Y
## 1  6.94659 51.12465
## 2 13.36828 52.48428
## 3 13.40601 52.39064
## 4  8.70727 49.03966
## 5 13.43205 52.48811

Plotting Spatial Data

Besides being able to look up coordinates or objects in google maps or OpenStreetMap, we can use the ggplot2 package included in the tidyverse package for displaying the data. The package provides the geom_sf function to easily plot the data.

# subsetting and plotting the data
States %>% # use the states data
  filter(GEN == 'Berlin') %>% # filter for Berlin only
  ggplot() + # plot base
  geom_sf(fill = 'white') + # add the polygon for Berlin and fill it whtie
  geom_sf(data = POI) + # add the POI data
  geom_sf_text(data = POI, aes(label = Objekt),
               nudge_y = -1000,
               check_overlap = TRUE) + # overlapping labels will not be displayed
  xlab('') + ylab('')
../_images/plot_data-1.png

Frequently Used Operations

This section will provide an overview of some frequently used operations when working with spatial data. The dataset SOEP contains some fake coordinates of household addresses we will work with throughout the examples.

Finding Households in a Specified Area

Suppose we are interested in identifying all the SOEP households located in Berlin. The corresponding polygon for Berlin is provided in the States dataset. Because we are interested in Berlin only, we save the polygon in an own object BE. The function st_contains identifies the row-index in the SOEP dataset that fall within the polygon BE and returns a list (soep_in_berlin). For checks along the way we can look at plots of the data.

BE <- States[States$GEN == 'Berlin', ]
BE %>% select(GEN, BEZ)
## Simple feature collection with 1 feature and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 777974.1 ymin: 5808837 xmax: 823510.5 ymax: 5845580
## projected CRS:  ETRS89 / UTM zone 32N
##       GEN  BEZ                       geometry
## 11 Berlin Land MULTIPOLYGON (((802831.7 58...
soep_in_berlin <- st_contains(BE, SOEP)
soep_in_berlin
## Sparse geometry binary predicate list of length 1, where the predicate was `contains'
##  1: 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, ...
SOEP_BE <- SOEP[unlist(soep_in_berlin), ]

BE %>%
  ggplot() + geom_sf(fill = 'white') + geom_sf(data = SOEP_BE)
../_images/soep_in_BE-1.png

Distances / Areas

To compute distances the function st_distance can be provided a single dataset with n rows providing a n x n matrix of distances of the geometries contained in the data (dist_m). The unit of the distance returned depends on the CRS. In the example provided below the distances of the POIs to the location of the DIW are given in meters.

Providing the function a second dataset of m rows will create a n x m distance matrix (dist_soep_poi). The created object is a matrix with 529 rows (SOEP households in Berlin) and 12 columns (POIs in Berlin).

When computing distances on large data sets it might be helpful to subset the data, because the distance of a household in Munich might be irrelevant to a research question focusing on Berlin or distances smaller than 5000m. According the the CRS, consider specifying the which argument.

Areas can be computed for (multi-)polygons. The function st_area provides the corresponding information.

# distances between the POIs
dist_m <- st_distance(POI)
rownames(dist_m) <- POI$Objekt
colnames(dist_m) <- POI$Objekt

# distance of POIs to DIW Berlin
dist_m['DIW Berlin', ]
## Units: [m]
##         Brandenburger Tor              Hauptbahnhof                DIW Berlin
##                  870.8119                 1941.2940                    0.0000
##              Mulecule Man           Flughafen Tegel          Tempelhofer Feld
##                 5074.7951                 8488.2154                 3751.7696
##          Hufeisensiedlung                 FU Berlin                   Wannsee
##                 8202.7627                10269.1064                17307.5100
##         Zitadelle Spandau          Schloss Köpenick Zentraler Omnibus Bahnhof
##                12364.7837                14651.8215                 7422.6480
# distance between each household and each POI
dist_soep_poi <- st_distance(SOEP_BE, POI)
dim(dist_soep_poi)
## [1] 529  12
# save distances in an object
DIST <- as_tibble(dist_soep_poi)
# add names
names(DIST) <- str_c('distance_to_', str_remove(POI$Objekt, ' '))

# attach distances to data
SOEP_BE <- bind_cols(SOEP_BE, DIST)


# area covered by Berlin
st_area(BE)
## 893060962 [m^2]

Nearest Point

To find the point or feature closest to another one the function st_nearest_feature will return a vector with indices of the nearest feature. In the example below we are looking for the households living closest to the POIs in Berlin. In the second step we compute the corresponding distances between the POI and the household.

nearest_hh <- st_nearest_feature(POI, SOEP_BE)

diag(st_distance(POI, SOEP_BE[nearest_hh, ]))
## Units: [m]
##  [1]  256.5810 1195.8833  789.6812  392.5787 1913.8130  891.4404 1295.8621
##  [8]  520.3830  971.0417  364.4532  250.2833 1194.7832
BE %>% # polygon for Berlin
  ggplot() + geom_sf(fill = 'white') + # plot the Berlin polygon
  geom_sf(data = POI) + # add the POIs
  geom_sf(data = SOEP_BE[nearest_hh, ], col = 'red') # add the nearest household
../_images/nearest-1.png

Within Distance

If your interest is about which households live within a certain distance to a specific point, st_is_within_distance lets you lookup geometries in a given distance (argument dist) and returns a list. The below example looks up households in a 5km distance of the Brandenburger Tor. The plot shows the 5km radius area in yellow, the location of the Brandenburger Tor (black dot) and than households within the distance (red dots).

r_5000 <- st_is_within_distance(POI[POI$Objekt == 'Brandenburger Tor',],
                                SOEP_BE,
                                dist = 5000)
r_5000
## Sparse geometry binary predicate list of length 1, where the predicate was `is_within_distance'
##  1: 1, 3, 6, 12, 21, 25, 28, 39, 54, 55, ...
BE %>% # polygon for Berlin
  ggplot() + geom_sf(fill = 'white') + # plot the Berlin polygon
  geom_sf(data = POI[POI$Objekt == 'Brandenburger Tor',]) + # add the POI
  geom_sf(data = SOEP_BE[unlist(r_5000), ], col = 'red') + # add the nearest household
  geom_sf(data = st_buffer(POI[POI$Objekt == 'Brandenburger Tor',], dist = 5000),
          alpha = 0.2, fill = 'yellow')
../_images/withinDistance-1.png

The question cal also be asked the other way around: How many POIs are within a 5km radius of the SOEP households? This way the function st_is_within_distance returns a list of length equal to the number of SOEP households in Berlin (529). For each household the (row) index for the POI is given. To get the number of POIs in the 5km radius, we can simply ask for the length (the number of row indices) of each list-component. To get the according distances see section Distances / Areas

poi_5000 <- st_is_within_distance(SOEP_BE, POI, dist = 5000)
poi_5000
## Sparse geometry binary predicate list of length 529, where the predicate was `is_within_distance'
## first 10 elements:
##  1: 1, 2, 3, 6
##  2: (empty)
##  3: 1, 3, 4, 6, 7
##  4: 5
##  5: 2, 5, 12
##  6: 1, 2, 12
##  7: 10
##  8: 5, 10, 12
##  9: 8
##  10: 5, 10, 12
N_POI <- as_tibble(sapply(poi_5000, length))
names(N_POI) <- 'n_poi_in_5km'

SOEP_BE <- bind_cols(SOEP_BE, N_POI)

Spatial joins

When you are used to working with SOEP data you will have probably merged / joined data sets using the identifying variables (cid, hid, pid) and the survey year (syear) before. When you are working with spatial data you will have to choose one of the geometry predicate function provided by the sf package. The default is a left join of the two data sets using st_intersects as the geometry predicate function for joining. You can however change this, for example, to join the nearest features, see section Nearest Point. In our example here, we join the nearest SOEP household to each of the points of interest. The geometry column here provides the coordinates from the POI data set.

NEAR <- st_join(POI, SOEP, join = st_nearest_feature)
NEAR
## Simple feature collection with 12 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 783630.5 ymin: 5817059 xmax: 810721.7 ymax: 5831749
## projected CRS:  ETRS89 / UTM zone 32N
## First 10 features:
##               Objekt erhebj    ID                 geometry
## 1  Brandenburger Tor   2005 75759 POINT (796986.8 5827473)
## 2       Hauptbahnhof   2005 69076 POINT (796358.6 5828411)
## 3         DIW Berlin   2005 75759 POINT (797754.3 5827062)
## 4       Mulecule Man   2005 75646 POINT (802628.5 5825649)
## 5    Flughafen Tegel   2005 73266 POINT (790677.7 5831749)
## 6   Tempelhofer Feld   2005 69820 POINT (798787.2 5823455)
## 7   Hufeisensiedlung   2005 69477 POINT (802251.6 5820202)
## 8          FU Berlin   2005 67568 POINT (790892.2 5819422)
## 9            Wannsee   2005 67923 POINT (783630.5 5817059)
## 10 Zitadelle Spandau   2005 68827 POINT (785646.9 5829572)

Export Results

To export your results you can use st_write to create a .csv file. When exporting your results pleace check the requirements here.

path_export <- paste0('/home/', Sys.info()['user'], '/transfer/export/',  Sys.Date())

if(!file.exists(path_export)){
  dir.create(path_export, recursive = TRUE)
}

st_write(SOEP_BE, file.path(path_export, 'Output_SOEP_BE.csv'),
         append = FALSE, overwrite = TRUE)

README <- tibble(name = names(SOEP_BE)[-grep('geometry', names(SOEP_BE))],
                 description = c('erhebj',
                                 'ID',
                                 'distance (in meters) of household to Brandenburger Tor',
                                 'distance (in meters) of household to Hauptbahnhof',
                                 'distance (in meters) of household to DIW-Berlin',
                                 'distance (in meters) of household to Mulecule Man',
                                 'distance (in meters) of household to Flughafen Tegel',
                                 'distance (in meters) of household to Tempelhofer Feld',
                                 'distance (in meters) of household to Hufeisensiedlung',
                                 'distance (in meters) of household to FU Berlin',
                                 'distance (in meters) of household to Wannsee',
                                 'distance (in meters) of household to Zitadelle Spandau',
                                 'distance (in meters) of household to Schloss Köpenick',
                                 'distance (in meters) of household to Zentraler Omnibus Bahnhof',
                                 'number of POIs within 5 km radius of household'))

write.csv(README, file.path(path_export, 'Output_SOEP_BE.csv'),
         row.names = FALSE)

Complete Example

Suppose you want to know which households of the SOEP from survey year 2011 live within a distance of 5000m to the following points of interest (POI):

  • Brandenburger Tor

  • Zitadelle Spandau

  • Wannsee

Besides that, you want to know how far their distance to the corresponding POI is and which household lives closest to the corresponding POI. After computing the informations need you want to export the results for further use on the HAUSER server.

# Global stuff
# ~~~~~~~~~~~~

# packages
library(here)
library(sf)
library(tidyverse)

# global values
survey_year <- 2011
distance <- 5000 # meter
common_crs <- 25832



# Step 1: read the data
# ~~~~~~~~~~~~~~~~~~~~~

# read polygons for Federal States
States <- st_read(here('Daten', 'vg250_ebenen_0101'), layer = 'VG250_LAN', quiet = TRUE)

# read SOEP data
SOEP <- st_read(here('Daten', 'soep_v29'), quiet = TRUE)

# read POI data
POI <- st_read(here('Daten', 'POIs_Berlin.csv'), quiet = TRUE)



# Step 2: Transform data
# ~~~~~~~~~~~~~~~~~~~~~~

# transform SOEP the data
SOEP <- st_transform(SOEP, crs = common_crs)

# transform POIs
POI <- POI[, -4]

POI <- st_as_sf(POI, coords = c("X", "Y"), crs = 4326)
POI <- st_transform(POI, crs = common_crs)



# Step 3: Subset the data
# ~~~~~~~~~~~~~~~~~~~~~~~

# Berlin only
BE <- States %>%
  filter(GEN == 'Berlin')

# interesting POIs only
POI <- POI %>%
  filter(Objekt %in% c("Brandenburger Tor", "Zitadelle Spandau", "Wannsee"))

# SOEP survey year: 2011 only
SOEP <- SOEP %>%
  filter(erhebj == survey_year)

# SOEP housholds in Berlin only
SOEP <- SOEP[unlist(st_contains(BE, SOEP)), ]



# Step 4: Plot the data
# ~~~~~~~~~~~~~~~~~~~~~

BE %>%
  ggplot() + geom_sf() +
  geom_sf(data = SOEP) +
  geom_sf(data = POI, col = 'red') +
  geom_sf(data = st_buffer(POI, dist = distance),
          alpha = 0.2, fill = 'yellow')
../_images/example-1.png
# Step 5: Get the desired information
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# get the household in relevant distance
soep_in_5km <- st_is_within_distance(POI, SOEP, dist = distance)
SOEP_5km <- SOEP[unlist(soep_in_5km), ]

# get the distances
dist_matrix <- as_tibble(st_distance(SOEP_5km, POI))
names(dist_matrix) <- str_c('distance_to_', str_remove(POI$Objekt, ' '))

# attach distances to data
SOEP_5km <- bind_cols(SOEP_5km, dist_matrix)

# find the nearest household
nearest_hh <- st_nearest_feature(POI, SOEP_5km)

SOEP_5km$nearest_to <- ''
SOEP_5km$nearest_to[nearest_hh] <- POI$Objekt



# Step 6: Check your data
# ~~~~~~~~~~~~~~~~~~~~~~~

BE %>%
  ggplot() + geom_sf() +
  geom_sf(data = SOEP) +
  geom_sf(data = POI, col = 'red') +
  geom_sf(data = st_buffer(POI, dist = distance),
          alpha = 0.2, fill = 'yellow') +
  geom_sf(data = SOEP_5km, col = 'blue') +
  geom_sf(data = SOEP_5km[SOEP_5km$nearest_to != '', ], col = 'green')
../_images/example-2.png
# Step 7: Export your data
# ~~~~~~~~~~~~~~~~~~~~~~~~
path_export <- paste0('/home/', Sys.info()['user'], '/transfer/export/',  Sys.Date())

if(!file.exists(path_export)){
  dir.create(path_export, recursive = TRUE)
}

st_write(SOEP_5km, file.path(path_export, 'Output.csv'),
         append = FALSE, overwrite = TRUE)
## Deleting layer `Output' using driver `CSV'
## Writing layer `Output' to data source `/home/hwsteinhauer/transfer/export/2021-01-27/Output.csv' using driver `CSV'
## Updating existing layer Output
## Writing 356 features with 6 fields and geometry type Point.
README <- tibble(name = names(SOEP_5km)[-grep('geometry', names(SOEP_5km))],
                 description = c('erhebj',
                                 'ID',
                                 'distance (in meters) of household to Brandenburger Tor',
                                 'distance (in meters) of household to Wannsee',
                                 'distance (in meters) of household to Zitadelle Spandau',
                                 'household nearest to the POI'))

write.csv(README, file.path(path_export, 'README.csv'),
         row.names = FALSE)

Appendix

Session info - Platform

##  setting  value
##  version  R version 4.0.3 (2020-10-10)
##  os       Ubuntu 20.04.1 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  de_DE.UTF-8
##  ctype    de_DE.UTF-8
##  tz       Europe/Berlin
##  date     2021-01-27

Session info - Packages

##  package     * version date       lib source
##  bookdown    * 0.21    2020-10-13 [1] CRAN (R 4.0.3)
##  dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
##  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 4.0.0)
##  ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.1)
##  gridExtra   * 2.3     2017-09-09 [1] CRAN (R 4.0.0)
##  here        * 1.0.0   2020-11-15 [1] CRAN (R 4.0.3)
##  kableExtra  * 1.3.1   2020-10-22 [1] CRAN (R 4.0.3)
##  knitr       * 1.30    2020-09-22 [1] CRAN (R 4.0.2)
##  pacman      * 0.5.1   2019-03-11 [1] CRAN (R 4.0.2)
##  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
##  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.3)
##  rgdal       * 1.5-19  2021-01-05 [1] CRAN (R 4.0.3)
##  sessioninfo * 1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
##  sf          * 0.9-7   2021-01-06 [1] CRAN (R 4.0.3)
##  sp          * 1.4-4   2020-10-07 [1] CRAN (R 4.0.3)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
##  tibble      * 3.0.4   2020-10-12 [1] CRAN (R 4.0.3)
##  tidyr       * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
##  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 4.0.0)
##
## [1] /home/hwsteinhauer/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Section author: Hans Walter Steinhauer hwsteinhauer@diw.de

Last updated: 2021-02-22