# Fixed Effects Estimation

You want to find out whether certain variables relevant to the labour market, such as work experience or education time, influence a person’s hourly wage. Other variables such as gender or marriage status should also be taken into account. You decide to use the SOEP data to set up a fixed effects estimation model.

**Create an exercise path with four subfolders:**

**Example:**

- H:/material/exercises/do
- H:/material/exercises/output
- H:/material/exercises/temp
- H:/material/exercises/log

These are used to store your script, log files, datasets and temporary datasets. Open an empty do file and define your created paths with globals:

```
***********************************************
* Set relative paths to the working directory
***********************************************
global AVZ "H:\material\exercises"
global MY_IN_PATH "\\hume\rdc-prod\distribution\soep-long\soep.v33.1\stata_en\"
global MY_DO_FILES "$AVZ\do\"
global MY_LOG_OUT "$AVZ\log\"
global MY_OUT_DATA "$AVZ\output\"
global MY_OUT_TEMP "$AVZ\temp\"
```

The global “AVZ” defines the main path. This main path is subdivided using the globals “MY_DO_FILES”, “MY_LOG_OUT”, “MY_OUT_DATA” and “MY_OUT_TEMP”. The global “MY_IN_PATH” contains the path to the SOEP data you ordered.
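The folder structure can also be created from within Stata. The following is a minimal sketch, assuming a hypothetical root folder inside Stata's temporary directory; the global names `AVZ_DEMO` and `DEMO_*` are illustrative and not part of the exercise, so replace them with your own path and names.

```stata
* Hypothetical root folder for illustration; replace with your own path
global AVZ_DEMO "`c(tmpdir)'/soep_exercises_demo"

* Create the root folder and the four subfolders
* (capture suppresses the error if a folder already exists)
capture mkdir "$AVZ_DEMO"
foreach sub in do output temp log {
    capture mkdir "$AVZ_DEMO/`sub'"
}

* Define one global per subfolder (illustrative names)
global DEMO_DO_FILES "$AVZ_DEMO/do/"
global DEMO_LOG_OUT  "$AVZ_DEMO/log/"
global DEMO_OUT_DATA "$AVZ_DEMO/output/"
global DEMO_OUT_TEMP "$AVZ_DEMO/temp/"

* Show where the log files would be written
display "$DEMO_LOG_OUT"
```

Forward slashes are used here because Stata accepts them on all platforms.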

**a) Generate your own SOEPWage.dta data set. The data set should contain information on gross monthly wage, marital status and other personal characteristics.**

To perform your analysis, you need different SOEP variables. The SOEP offers various options for a variable search:

- Search the questionnaires for useful variables. (for more information visit the chapter Variable Search with Questionnaires)
- Find a suitable variable via the topic list of paneldata.org (for more information visit the chapter Topic Search with paneldata.org)
- Search for a suitable variable using a search term in paneldata.org (for more information visit the chapter Variable Search with paneldata.org)
- Use the documentation provided by the generated variables (for more information visit the chapter Documentation of Generated Data)

Use the ppfadl.dta data set as your starting file and load only the variables you need. Your starting file should contain the following variables:

- Person ID "pid"
- Survey year "syear"
- Birth Year "gebjahr"
- The net variable with information on the interview type "netto"
- The weighting variable "phrf"
- The sex of the person "sex"
- Sample Membership "pop"

```
use pid syear sex gebjahr netto pop phrf using "${MY_IN_PATH}/ppfadl.dta", clear
```

Merge the required content variables into your starting data set. You need the following variables for your analysis:

- Employment Status "plb0022"
- Current Gross Labor Income in Euro "pglabgro"
- Actual Work Time Per Week "pgtatzeit"
- Working Experience Full-Time Employment "pgexpft"
- Amount Of Education Or Training In Years "pgbilzeit"
- Marital Status In Survey Year "pgfamstd"

```
merge 1:1 pid syear using "${MY_IN_PATH}/pl.dta", keepus(plb0022) keep(master match) nogen
merge 1:1 pid syear using "${MY_IN_PATH}/pgen.dta", keepus(pglabgro pgtatzeit pgexpft pgbilzeit pgfamstd) keep(master match) nogen
```
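To see what `keep(master match)` does in a 1:1 merge, the following sketch uses two tiny made-up data sets instead of the SOEP files. All values are invented, and the `isid` check is an added precaution, not part of the original exercise.

```stata
* Made-up master data (stands in for ppfadl.dta)
clear
input long pid int syear byte sex
1 2012 1
1 2013 1
2 2012 2
end
tempfile master
save `master'

* Made-up using data (stands in for pgen.dta)
clear
input long pid int syear float pglabgro
1 2012 2500
1 2013 2600
3 2012 1800
end
tempfile using_data
save `using_data'

* Merge: keep all master rows, drop rows that exist only in the using data
use `master', clear
isid pid syear                      // pid and syear must identify each row
merge 1:1 pid syear using `using_data', keepusing(pglabgro) keep(master match)
tabulate _merge                     // inspect the merge result before dropping it
drop _merge
list
```

Person 2 stays in the data with a missing pglabgro, while person 3, who appears only in the using data, is dropped.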

Only keep people who have completed an interview and who live in a private household.

```
* Only select people with completed interviews
keep if inrange(netto, 10, 19)
* Only private households
keep if pop==1 | pop==2
```

Since you are only interested in the period from 2012 to 2016 in your analysis, remove all survey information that does not fall within this period. To finish, save your data set.

```
* Period from 2012 to 2016
keep if syear>=2012 & syear<=2016
* Save the analysis data set
save "${MY_OUT_DATA}/SOEPWage.dta", replace
```

**Exercise 1: Prepare your data set**

**a) Load your created SOEPWage.dta data set. The data set contains information on gross monthly wage, marital status and other personal characteristics.**

```
*** Exercise 1: Prepare your data set
* a) Load data set
use "${MY_OUT_DATA}/SOEPWage.dta", clear
```

**b) Recode all missing values to Stata missings (.)**

```
* b) Recode Missings
mvdecode _all, mv(-8/-1 = .)
```

For more information about the missing codes of SOEP data, visit the chapter Missing Conventions.
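The effect of the recode can be checked on a small made-up data set. This is only an illustration with invented values; the variable names are borrowed from the exercise.

```stata
* Made-up data with negative SOEP-style missing codes
clear
input long pid float pglabgro byte pgfamstd
1 2500  1
2   -2  3
3 1800 -1
end

* Recode the values -8 to -1 to Stata's system missing (.)
mvdecode _all, mv(-8/-1 = .)

* Count the missings per variable after the recode
count if missing(pglabgro)
count if missing(pgfamstd)
list
```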

**c) Generate the variables “hourly wage” (gross monthly wage / (4.33 × weekly working time)) for persons who have earned at least 1 Euro and have worked at least one hour, “Married vs. Unmarried” and age.**

```
* c) Generate Variables
gen wage = pglabgro/(4.33*pgtatzeit) if pglabgro>=1 & pgtatzeit>=1
gen married = 1 if pgfamstd==1 | pgfamstd==6 | pgfamstd==7 | pgfamstd==8
replace married = 0 if inrange(pgfamstd, 2, 5)
gen age = syear - gebjahr
```

**d) Adjust the variable “hourly wage” for outliers: set values smaller than one third of the 1st percentile to 1/3 × the 1st percentile, and set values greater than 3 times the 99th percentile to 3 × the 99th percentile. Then generate the variable lwage = log(wage).**

```
* d) Adjust wage variable
sum wage, detail
replace wage = 1/3*r(p1) if wage<1/3*r(p1)
replace wage = 3*r(p99) if wage>3*r(p99) & wage<.
gen lwage = log(wage)
label variable lwage "Log hourly wage"
save "${MY_OUT_DATA}/SOEPWage_temp.dta", replace
```
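The same adjustment can be tried on simulated wages. The sketch below stores the two cut-offs in scalars right after `summarize`, so they do not depend on the returned results still being in memory; this is a stylistic variation on the code above, not a different method. The data and the scalar names `lo_cut` and `hi_cut` are made up for illustration.

```stata
* Simulated hourly wages with two artificial outliers
clear
set seed 2016
set obs 1000
gen double wage = exp(rnormal(2.5, 0.6))
replace wage = 0.01 in 1            // implausibly low wage
replace wage = 5000 in 2            // implausibly high wage

* Store the cut-offs in scalars (illustrative names)
summarize wage, detail
scalar lo_cut = r(p1)/3
scalar hi_cut = 3*r(p99)

* Set values beyond the cut-offs to the cut-offs, then take logs
replace wage = lo_cut if wage < lo_cut
replace wage = hi_cut if wage > hi_cut & !missing(wage)
gen double lwage = log(wage)
summarize wage lwage
```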

**Exercise 2: Descriptive statistics**

**a) Define the data set as a panel data set.**

```
*** Exercise 2: Descriptive statistics
* a)
xtset pid syear // Declaring data as panel data
```

**b) What percentage of people participate in all five waves? (Use xtdescribe.)**

```
* b)
xtdescribe, patterns(16) // -> unbalanced panel
```

42808 respondents contributed information in waves bc (2012) to bg (2016), and about 40% of them (17069) provided information in all five waves.

**c) Describe the variable “Married” with xttab and xttrans. Take a look at some individual wage developments (pid==30320901, pid==30932501, pid==3101602, pid==3101801) with xtline.**

```
* c)
* Stability of the relationship status
xttab married
```

In 41.37 percent of the person-year observations, married==no. 19717 persons stated at least once between 2012 and 2016 that they were not married, and 25014 persons reported being married at least once during this period. Persons who were ever not married gave the answer “married==no” in 94.69 percent of their observations, whereas persons who were ever married answered “married==yes” in 95.88 percent of their observations. Marital status is therefore very stable within persons.

```
* Transition probabilities
xttrans married, freq
```

96.87 percent of the person-year observations with “married==no” are still not married in the next period. 98.51 percent of the observations of married persons are also married in the following period. Marital status is therefore very stable from one wave to the next.

```
* Individual sequences of "wage"
xtline wage if pid==30320901 | pid==30932501 | pid==3101602 | pid==3101801, overlay
```

The graph compares the hourly wage trajectories of the four respondents.

**Exercise 3: Pooled OLS Regression**

**a) Execute a pooled OLS regression with “Log hourly wage” as dependent variable and “Married”, “Gender”, “Work experience” and “Training time” as independent variables. Interpret the coefficients for “married”, “gender” and “length of training”. Why are these not causal effects?**

```
*** Exercise 3: Pooled OLS Regression
* a) Pooled OLS
reg lwage married sex pgexpft pgbilzeit
```

The variables married, sex and pgbilzeit are most likely correlated with omitted or unobserved variables that also affect the wage. For example, women work more frequently in occupations with lower wages.

**b) Run the regression again with the option “vce(cluster pid)” to get clustered standard errors. How do the standard errors of the coefficients change?**

```
* b) Pooled OLS with cluster standard errors
reg lwage married sex pgexpft pgbilzeit, vce(cluster pid)
```

The standard errors become larger, because clustering accounts for the correlation between repeated observations of the same person.

**Exercise 4: Fixed Effects**

**a) Subtract the person-specific mean value from each variable of the model. Use the “egen” function. Ideally you should also use a loop.**

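```
*** Exercise 4: Fixed Effects
* a) Subtract person-specific averages
gen sample = 1
foreach var in lwage married sex pgexpft pgbilzeit {
    * Person-specific mean, set to missing where the variable itself is missing
    bysort pid: egen `var'Mean = mean(`var')
    replace `var'Mean = . if `var'==.
    * Deviation from the person-specific mean
    gen `var'Demeaned = `var' - `var'Mean
    * Flag observations with a missing value in any model variable
    replace sample = 0 if `var'==.
}
* Set sample to 0 for every observation of a person with any flagged observation
bysort pid (sample): replace sample = sample[1]
```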

**b) Estimate the Fixed Effects model with the previously generated variables. Why is no coefficient estimated for “gender”? How do the coefficients change compared to the pooled OLS estimate? Is the effect of “married” now causally interpretable?**

```
reg lwageDemeaned marriedDemeaned sexDemeaned pgexpftDemeaned pgbilzeitDemeaned, vce(cluster pid) nocons
```

No coefficient is estimated for sex because sex is constant over time for every person, so the demeaned variable is zero for all observations. The coefficient of married is now significant at the 5% level.

**c) Now estimate the Fixed Effects model using the command “xtreg lwage married sex pgexpft pgbilzeit, fe”. What do you notice about the coefficients compared to task 4 b)? And about the standard errors?**

```
* c) xtreg, fe
xtreg lwage married pgexpft pgbilzeit, fe vce(cluster pid)
```

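The coefficients are not identical to those in 4 b). The manual within transformation and xtreg, fe produce the same point estimates only when the person-specific means are computed on exactly the observations that enter the estimation, which is not guaranteed in 4 a) when variables have missing values. The standard errors become larger because model b) does not take the estimation of the person-specific means into account in the standard errors.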
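The relationship between the two approaches can be checked on simulated data. The following sketch assumes a balanced panel without missing values; all variable names (`y`, `x`, `alpha`) and parameter values are made up for illustration. In this setting the manual within transformation and `xtreg, fe` give the same slope coefficient.

```stata
* Simulated balanced panel: 200 persons, 5 waves, no missing values
clear
set seed 12345
set obs 200
gen long pid = _n
gen double alpha = rnormal()              // unobserved person effect
expand 5
bysort pid: gen int syear = 2011 + _n
gen double x = rnormal() + 0.5*alpha      // regressor correlated with alpha
gen double y = 1 + 0.3*x + alpha + rnormal()

* Manual within transformation: subtract the person-specific means
foreach var in y x {
    bysort pid: egen double `var'Mean = mean(`var')
    gen double `var'Demeaned = `var' - `var'Mean
}
regress yDemeaned xDemeaned, noconstant
scalar b_manual = _b[xDemeaned]

* Built-in fixed effects estimator
xtset pid syear
xtreg y x, fe
scalar b_xtreg = _b[x]

display "manual: " b_manual "   xtreg, fe: " b_xtreg
```

Only the point estimates are compared here; the standard errors of the two commands differ because of the degrees-of-freedom correction.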

**d) Now add dummy variables for the years (i.syear). What happens with the effect of “labour market experience”?**

```
* d) xtreg with dummy
xtreg lwage married pgexpft pgbilzeit i.syear, fe vce(cluster pid)
```

The effects of the variables remain significant. However, the model may still be misspecified: the Mincer equation also includes (potential) labour market experience as a squared term.

**e) Now also add squared labour market experience to the model. To what extent does the effect of labour market experience change compared to task 4 d)?**

```
* e) expft squared
xtreg lwage married c.pgexpft##c.pgexpft pgbilzeit i.syear, fe vce(cluster pid)
```

The coefficients of pgexpft and pgexpft^2 remain significant whereas the coefficient for married is no longer significant.

```
graph twoway (function y = _b[pgexpft]*x + _b[c.pgexpft#c.pgexpft]*x*x, range(0 40))
```

The graph shows that the effect of labour market experience decreases after approximately 15 years of professional experience.
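For a quadratic profile b1*x + b2*x^2 with b2 < 0, the maximum lies at x = -b1/(2*b2). The sketch below shows how this point and its standard error could be obtained with `nlcom` after the estimation. It uses simulated data in which the true maximum is at 20 years; the variable name `exper` and all parameter values are assumptions for illustration and do not reproduce the SOEP results.

```stata
* Simulated log wages with a concave experience profile
* True maximum: 0.04 / (2*0.001) = 20 years
clear
set seed 2024
set obs 2000
gen double exper = runiform()*40
gen double lwage = 2 + 0.04*exper - 0.001*exper^2 + rnormal(0, 0.1)

* Quadratic specification with factor-variable notation
regress lwage c.exper##c.exper

* Estimated maximum of the profile, with a delta-method standard error
nlcom (peak: -_b[exper] / (2*_b[c.exper#c.exper]))
matrix peak_est = r(b)
display "Estimated maximum at " peak_est[1,1] " years"
```

After the `xtreg` model from e), the same `nlcom` expression would use pgexpft in place of the hypothetical `exper`.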

**f) Now estimate the model from task 4 e) with longitudinal weights. Why is the number of cases now significantly smaller? Why could the coefficient of “pgbilzeit” have changed?**

Tip

Create your own longitudinal person weights, e.g. a longitudinal person weight from wave A to wave D: take the cross-sectional weight of the starting wave (aphrf) and multiply it by the staying factor of each subsequent wave, as in the following example: `gen adphrf=aphrf*bpbleib*cpbleib*dpbleib`

Since you are looking at the period 2012-2016, you must create a suitable longitudinal weight. To do this, use the phrf data set from the RAW subdirectory. Merge the required variables into your analysis data set and generate the longitudinal weight for your period. To understand the structure of the data distribution file and the location of the different data sets, visit the chapter Data Distribution File. For more information about the weighting data sets and other survey data sets, visit the chapter Survey Data.

```
* f) Fixed Effects weighted
global MY_IN_PATH2 "\\hume\rdc-prod\complete\soep-core\soep.v33.2\stata_en\"
rename pid persnr
merge m:1 persnr using "${MY_IN_PATH2}/phrf.dta", nogen keep(master match) keepus(bcphrf bdpbleib bepbleib bfpbleib bgpbleib)
gen wlong = bcphrf*bdpbleib*bepbleib*bfpbleib*bgpbleib
label variable wlong "Weighting BC-BG"
rename persnr pid
```
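How the product of the starting weight and the staying factors behaves can be seen in a small made-up example with three waves. The values are invented, and the sketch assumes that a staying factor is 0 for a person who was not interviewed in that wave.

```stata
* Made-up weights: person 1 stays in all waves, persons 2 and 3 drop out
clear
input pid bcphrf bdpbleib bepbleib
1 1200 1.05 1.10
2  900 1.02 0
3 1500 0    0
end

* Longitudinal weight = starting-wave weight times all later staying factors
gen double wlong = bcphrf*bdpbleib*bepbleib

* Only persons with a positive weight can enter a weighted estimation
count if wlong > 0
list
```

Under this assumption a single zero staying factor sets the whole longitudinal weight to zero.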

Now estimate the model from 4 e) using the weight you created.

```
xtreg lwage married c.pgexpft##c.pgexpft pgbilzeit i.syear [pw=wlong], fe vce(cluster pid)
```

The number of observations is now much smaller: observations with a zero or missing weight are excluded from the estimation, so only persons with wlong>0 remain. The effect of pgbilzeit is stronger than before. pgbilzeit has a lower effect in the wlong==0 group, where the return to each additional year of education differs. People in the wlong==0 group may not get the return they expected for their additional education on the local labour market and may therefore move, which increases the probability of dropout.