4 Application to the SDR

Iván A. Carrillo and Alan F. Karr

The dataset we use is the restricted SDR data, under a license agreement from NSF. The SDR collects information about employment situation, principal employer, principal job, past employment, recent education, demographics, and disability, among others that vary from wave to wave. We use only information requested in all the waves of interest: 1995, 1997, 1999, 2001, 2003, 2006, and 2008.

To illustrate our methodology, we constructed a model for individuals' salaries over time. The response is the log of salary (in the principal job), with an identity link function and several covariates; modeling the log of salary (as opposed to salary) is standard practice. There are both time-independent covariates (such as gender) and time-dependent ones (such as employment sector). We have four major classes of covariates. The Degree variables are: degree field, years since degree, and age at graduation. The Job variables are: job field or category, sector, postdoc indicator, adjunct faculty indicator, hours worked per week in the principal job, weeks per year in the principal job, how closely the job is related to the doctoral degree, part-time status for different reasons, number of months since starting the principal job, the starting month in the principal job, whether the employer or type of job has changed since the previous wave, and whether that change occurred because the respondent was laid off or the job was terminated. The Person's demographics are: gender, citizenship status, race/ethnicity, presence of children in family, marital status, and spouse's working status. Finally, the "Environment" variables are: years since 1995, state (of employment), and the consumer price index (of the region of employment). The full list of variables, interactions, and categories can be found in Carrillo and Karr (2011). For categorical variables, the reference category is the one with the largest count.

The dataset for our model consists of 59,346 subjects and 190,693 observations, distributed as: $n_{95} =$ 30,234, $n_{97} =$ 30,652, $n_{99} =$ 26,732, $n_{01} =$ 26,778, $n_{03} =$ 24,956, $n_{06} =$ 25,910, and $n_{08} =$ 25,431. These data correspond to non-missing salaries between \$5,000 and \$999,995, for people with consistent ages across the waves and with a non-missing value for the variable indicating whether the (postsecondary educational institution) employer was public or private. The average (cross-sectional) survey weights for these waves are: ${\bar{w}}_{95} =$ 15.37, ${\bar{w}}_{97} =$ 16.28, ${\bar{w}}_{99} =$ 19.96, ${\bar{w}}_{01} =$ 20.74, ${\bar{w}}_{03} =$ 22.71, ${\bar{w}}_{06} =$ 22.93, and ${\bar{w}}_{08} =$ 24.88.
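As a quick consistency check on the figures above, the per-wave counts can be summed and combined with the average weights. The sketch below is illustrative only: the variable names are ours, and the product $n_{j} {\bar{w}}_{j}$ is simply the sum of the cross-sectional weights in wave $j$, which we read as the approximate population size represented by that wave's analysis sample.

```python
# Per-wave sample sizes and average cross-sectional weights, as quoted in the text.
n = {1995: 30234, 1997: 30652, 1999: 26732, 2001: 26778,
     2003: 24956, 2006: 25910, 2008: 25431}
w_bar = {1995: 15.37, 1997: 16.28, 1999: 19.96, 2001: 20.74,
         2003: 22.71, 2006: 22.93, 2008: 24.88}
subjects = 59346

# Total person-wave observations and average number of waves per subject.
total_obs = sum(n.values())
obs_per_subject = total_obs / subjects

# Sum of weights per wave (n_j times the average weight).
weight_sums = {year: n[year] * w_bar[year] for year in n}

print(total_obs)                  # 190693, matching the stated total
print(round(obs_per_subject, 2))  # 3.21
for year, total in weight_sums.items():
    print(year, round(total))
```

On average, each subject contributes a little over three of the seven waves, which is what makes the within-subject correlation structure estimable.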

The survey weights that we use for each wave are the final adjusted weights. These weights are the original design weights adjusted for nonresponse and post-stratification. However, the theory that we developed in Section 3 assumes that the weights are the inverse of the selection probabilities; in other words, the original design weights. This is a mismatch whose effect we plan to investigate in the future. On the other hand, the calculations in the last part of the Appendix (which do not assume anything about the weights) suggest that the effect of this mismatch is small.

The covariates and interactions that we considered were selected because they were suggested either by exploratory analyses or by subject-matter experts at NSF. Carrillo and Karr (2011) present the estimated $β$ coefficients in the model $y_{i j} = \log ({SALARY}_{i j}) = {X^{'}}_{i j} β + ε_{i j}$, where $X_{i j}$ includes the intercept along with the other covariates. This $β$ is the one in model $ξ$ of Formula (3.1), whose properties are discussed in Section 3. The working covariance matrix is estimated to be ${\hat{V}}_{i} = \hat{ϕ} R (\hat{α})$, with $\hat{ϕ} = {\hat{σ}}^{2} = (\sum_{i \in s} \sum_{j = 95}^{08} w_{i j} {\hat{e}}_{i j}^{2}) / (\sum_{i \in s} \sum_{j = 95}^{08} w_{i j} - p) = 0.196$, where ${\hat{e}}_{i j} = y_{i j} - {X^{'}}_{i j} \hat{β}$, $p = 208$ is the number of covariates in $X_{i j}$, and $w_{i j}$ is the cross-sectional weight for subject $i$ at wave $j$ if $i \in s_{j}$ and zero otherwise. The estimate $\hat{α}$ contains the $21 = (7 \times 6) / 2$ estimated auto-correlations ${\hat{α}}_{j j^{'}} = {\hat{α}}_{j^{'} j} = (\sum_{i \in s} \sqrt{w_{i j}} \sqrt{w_{i j^{'}}} {\hat{e}}_{i j} {\hat{e}}_{i j^{'}}) / (\hat{ϕ} [\sum_{i \in s} \sqrt{w_{i j}} \sqrt{w_{i j^{'}}} - p])$, for $j \neq j^{'}$ in 1995, 1997, 1999, 2001, 2003, 2006, 2008, and ${\hat{α}}_{j j} = 1$ for all $j$. These estimated values form the auto-correlation matrix:

$R (\hat{α}) = (\begin{matrix} 1 & {\hat{α}}_{95,97} & {\hat{α}}_{95,99} & {\hat{α}}_{95,01} & {\hat{α}}_{95,03} & {\hat{α}}_{95,06} & {\hat{α}}_{95,08} \\ & 1 & {\hat{α}}_{97,99} & {\hat{α}}_{97,01} & {\hat{α}}_{97,03} & {\hat{α}}_{97,06} & {\hat{α}}_{97,08} \\ & & 1 & {\hat{α}}_{99,01} & {\hat{α}}_{99,03} & {\hat{α}}_{99,06} & {\hat{α}}_{99,08} \\ & & & 1 & {\hat{α}}_{01,03} & {\hat{α}}_{01,06} & {\hat{α}}_{01,08} \\ & & & & 1 & {\hat{α}}_{03,06} & {\hat{α}}_{03,08} \\ sym & & & & & 1 & {\hat{α}}_{06,08} \\ & & & & & & 1 \end{matrix}) = (\begin{matrix} 1 & 0.38 & 0.36 & 0.32 & 0.30 & 0.28 & 0.27 \\ & 1 & 0.42 & 0.36 & 0.33 & 0.32 & 0.31 \\ & & 1 & 0.46 & 0.38 & 0.36 & 0.34 \\ & & & 1 & 0.47 & 0.40 & 0.38 \\ & & & & 1 & 0.49 & 0.44 \\ sym & & & & & 1 & 0.55 \\ & & & & & & 1 \end{matrix}) .$
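The two moment estimators above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used for the SDR analysis: the function name `working_covariance`, the nested-list layout, and the toy inputs are ours. Residuals for subject-wave pairs outside the sample must be supplied as 0.0 with weight 0.

```python
from math import sqrt

def working_covariance(resid, weights, p):
    """Weighted moment estimates of the dispersion phi and the correlation matrix R.

    resid[i][j], weights[i][j]: residual and cross-sectional weight for
    subject i at wave j (weight 0 when subject i is not in wave j's sample).
    p: number of covariates in the mean model.
    """
    n_waves = len(resid[0])

    # Dispersion: weighted residual sum of squares over (sum of weights - p).
    num = sum(w * e * e
              for e_i, w_i in zip(resid, weights)
              for e, w in zip(e_i, w_i))
    den = sum(w for w_i in weights for w in w_i) - p
    phi = num / den

    # Pairwise correlations, using sqrt(w_ij) * sqrt(w_ij') as the pair weight.
    R = [[1.0] * n_waves for _ in range(n_waves)]
    for j in range(n_waves):
        for k in range(j + 1, n_waves):
            cross = sum(sqrt(w[j] * w[k]) * e[j] * e[k]
                        for e, w in zip(resid, weights))
            norm = sum(sqrt(w[j] * w[k]) for w in weights) - p
            R[j][k] = R[k][j] = cross / (phi * norm)
    return phi, R

# Hypothetical toy data: 4 subjects, 2 waves, equal weights, one covariate.
toy_resid = [[0.5, 0.4], [-0.3, 0.2], [0.1, -0.4], [-0.4, -0.3]]
toy_weights = [[10, 10], [10, 10], [10, 10], [10, 10]]
phi_hat, R_hat = working_covariance(toy_resid, toy_weights, p=1)
print(phi_hat, R_hat[0][1])
```

Note that nothing in these formulas forces an estimated correlation into $[-1, 1]$; with very small samples the ratio can fall outside that range, which is not a concern at the sample sizes reported here.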

We now give some conclusions about salaries in the Ph.D. workforce based on the estimated coefficients, which appear in Carrillo and Karr (2011). First of all, a sensible estimate of mean salary considers the intercept, the hours worked per week (whose average is 47), and years since degree (average of 15), so that an estimate of the overall average is $\exp (9.4 + 47 \times 0.038 - 47^{2} \times 0.0003 + 15 \times 0.03 - 15^{2} \times 0.0006) = \$52{,}067$, for a subject with all other continuous covariates equal to zero and in the reference category of every categorical covariate.
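The arithmetic behind this baseline estimate can be reproduced with a short sketch. The variable and function names are ours, and the coefficients are the rounded values quoted in the expression above. Because exponentiation amplifies rounding in the log-scale coefficients, the rounded inputs give roughly \$51,000 rather than exactly the reported \$52,067, which presumably was computed from unrounded estimates.

```python
from math import exp

# Rounded coefficients as quoted in the text (log-salary scale).
intercept = 9.4
b_hours, b_hours_sq = 0.038, -0.0003   # hours worked per week and its square
b_years, b_years_sq = 0.03, -0.0006    # years since degree and its square

def baseline_salary(hours, years):
    """Predicted salary for a reference-category subject, other covariates at zero."""
    log_salary = (intercept
                  + b_hours * hours + b_hours_sq * hours ** 2
                  + b_years * years + b_years_sq * years ** 2)
    return exp(log_salary)

# Evaluate at the sample averages: 47 hours per week, 15 years since degree.
print(round(baseline_salary(47, 15)))
```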

All other things being equal, women's salaries are about 93.4% of men's, whereas race does not seem to have an effect on salaries. The gender $\times$ years since 1995 interaction is not significant; therefore we find no evidence that this salary differential is changing over time. Notice that with a single year's data, we would not be able to evaluate the effect of time. More importantly, using only the data from a single wave, say 2008, we would not be able to assess whether the effect of being female is changing over time.
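Because the response is log salary, a coefficient $β_{k}$ on an indicator covariate translates into a multiplicative factor $\exp (β_{k})$ on salary, with the other covariates held fixed. The sketch below shows the conversion in both directions. The helper names are ours, and the coefficient printed is only the value implied by the rounded 93.4% figure, not the estimate reported in Carrillo and Karr (2011).

```python
from math import exp, log

def ratio_from_coef(beta):
    """Multiplicative effect on salary of a one-unit change in a covariate."""
    return exp(beta)

def coef_from_ratio(ratio):
    """Log-scale coefficient implied by a reported salary ratio."""
    return log(ratio)

# Coefficient implied by the reported female-to-male salary ratio.
implied_gender_coef = coef_from_ratio(0.934)
print(round(implied_gender_coef, 3))                   # -0.068
print(round(ratio_from_coef(implied_gender_coef), 3))  # 0.934
```

For small coefficients the log-scale value is close to the percentage difference itself (here about 6.8 log points versus a 6.6% gap), but the two diverge for larger effects.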

Doctorate holders with a management job have the highest salaries, followed by those in health occupations; on the other hand, those with the lowest salaries are the ones employed in "other" occupations, followed by those in political science.

Among employment sectors, salaries are highest in for-profit industry (20% higher than for the reference category of tenured faculty in public 4-year institutions), followed in order by the federal government, self-employment, and non-profit industry, all of which are higher than the reference category. The lowest salaries are those in two-year colleges and in two- and four-year institutions for which tenure is not applicable.

The largest single negative effect on salaries also occurs within the education sector. Those with positions as adjunct faculty members have salaries that are approximately 59% of the salaries of comparable doctorate holders. Not surprisingly, postdoctoral salaries are only about 74% of the salaries of comparable people in other types of positions.

Sector is also a contributing factor to the hard-to-interpret dependence of salary on the starting month for the current position: salaries are lower for starting months of August and September. Additional analyses show that the monthly effect is present only in the education sector, where, as we have seen, salaries are lower than in industry or government, and in which starting months of August and September are common. Therefore, sector is part of the answer, but not the entire answer. Finer-grained divisions of the education sector, using Carnegie classifications, further reduce, but do not remove, the significance of monthly effects. The SDR does not seem to contain sufficient data to remove the monthly effects entirely, so we have retained the SDR definition of sector.

People with degrees in computing and information sciences have the highest salaries (around 20% higher than in the biological sciences), followed by those in electrical and computer engineering and in economics (approximately 16% higher). Doctorate holders in agricultural and food sciences, environmental life sciences, earth, atmospheric, and ocean sciences, and in "other" social sciences have the lowest salaries. The "other" social sciences are the social sciences excluding economics and political science.

Married people have the highest salaries, followed by those who are in married-like relationships, widowed, separated, divorced, and never married. The latter have salaries only around 89% as high as those of married people; one could argue that there is some association between never having married and age. The presence of children older than two is associated with higher salaries, but the presence of children younger than two is not.

Doctorate holders with jobs only somewhat related to their degree field make around 93% of what people with closely related jobs (the reference category) do. If the job is not related to the doctoral degree as the result of a change in career or professional interests, they make around 82% of what people with closely related jobs do. On the other hand, those with jobs not related for other reasons make only about 76% of what the reference category does.

There is an increase of around 3% in salary for every additional year since doctorate graduation, although the effect diminishes as the number of years grows. We interpret this as the effect of experience. There is a small penalty for receiving the doctorate later in life: for every additional year of age at graduation, salary decreases by 1%.
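The diminishing effect can be made concrete using the rounded linear and quadratic coefficients for years since degree that appear in the baseline salary expression earlier in this section (0.03 and 0.0006). The sketch below is illustrative: the function name and the turning point are our own derivation from those rounded values, not quantities reported by the authors, and the turning point in particular is sensitive to rounding.

```python
# Rounded log-scale coefficients for years since degree, as quoted earlier.
b_years, b_years_sq = 0.03, -0.0006

def marginal_effect(years):
    """Approximate proportional change in salary per additional year,
    i.e. the derivative of b_years * t + b_years_sq * t**2 at t = years."""
    return b_years + 2 * b_years_sq * years

# Years since degree at which the fitted quadratic stops increasing.
turning_point = -b_years / (2 * b_years_sq)

for t in (0, 5, 15, 25):
    print(t, round(100 * marginal_effect(t), 1), "% per year")
print("turning point (years):", round(turning_point, 1))
```

Under these rounded values the per-year gain falls from about 3% at graduation to about 1.2% at the sample average of 15 years, and flattens out around 25 years.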

We also found that the regional Consumer Price Index (CPI) is significant. The higher the CPI, the higher the salary. We could not use the CPI associated with the labor market of employment because the SDR data do not identify geography beyond the state. We included the state in the model as a proxy for cost of living; the state effect is highly significant and some state coefficients are among the highest overall. The highest salaries are in California, Washington D.C. and its suburbs, and New York City and its suburbs. On the other hand, the lowest salaries are in Puerto Rico, Vermont, Montana, Maine, Idaho, South Dakota, North Dakota, and in the Territories/Abroad.

Having a part-time job due to being retired or semi-retired is significant and appears in several significant interactions. Because of this, we do not think that the available data present the full picture about retirement, for example, for people who are (semi-)retired and yet have full-time jobs.

Finally, we analyzed residuals; Figures 4.1 and 4.2 show a Box and Whisker plot of standardized residuals by year and a spaghetti plot of standardized residuals, respectively.

Figure 4.1 shows that the model fits reasonably well for all the reference years as most of the standardized residuals lie between -2 and 2. Also, the distributions of residuals do not seem to greatly differ from year to year.

Figure 4.1 Box and Whisker plot of standardized residuals by year

From Figure 4.2 we also conclude that the model fits reasonably well for most people, as most of the lines fluctuate between -2 and 2. Nonetheless, there are a few people for whom the model seems to greatly over-predict in 2003, and a few for whom that happens in 2006. We included several terms in the model to correct this issue, but none did so completely.

Figure 4.2 Spaghetti plot of standardized residuals

The last thing we tried was to produce exploratory classification trees for these residual blips. We found that, in the dataset available, the only thing related to them was the survey mode. The blips in 2003 are disproportionately high for web responses, and the blips in 2006 are disproportionately high for CATI responses. We conclude that either there is a mode effect in these two years or those respondents have something different, in those years, that is not included in the available variables.

Finally, the plot of fitted values versus observed values (which can be found in Carrillo and Karr 2011) tells a similar story. For most observations the model performs well, apart from those few cases in 2003 and 2006 for which there is large over-estimation.


Date modified: 2017-09-20
