4 Application to the SDR
Iván A. Carrillo and Alan F. Karr
The dataset we use is the
restricted SDR data, under a license agreement from NSF. The SDR collects
information about employment situation, principal employer, principal job, past
employment, recent education, demographics, and disability, among others that
vary from wave to wave. We use only information requested in all the waves of
interest: 1995, 1997, 1999, 2001, 2003, 2006, and 2008.
To illustrate our
methodology, we constructed a model for individuals' salaries over time. The
response is the log of salary (in the principal job), with an identity link
function, and several covariates; modeling log of salary (as opposed to salary)
is a standard practice. There are both time-independent covariates (such as
gender) and time-dependent ones (such as employment sector). We have four major
classes of covariates. The Degree variables are: degree field, years
since degree, and age at graduation. The Job variables are: job field or
category, sector, postdoc indicator, adjunct faculty indicator, hours worked
per week in the principal job, weeks per year in the principal job, how closely
the job is related to the doctoral degree, part-time status for different reasons, number of
months since starting the principal job, the starting month in the principal
job, whether the employer or type of job has changed since the previous wave, and
whether that change occurred because the respondent was laid off
or the job was terminated. The Person's demographics are: gender, citizenship
status, race/ethnicity, presence of children in family, marital status, and
spouse's working status. Finally, the "Environment" variables are: years since
1995, state (of employment), and the consumer price index (of the region of
employment). The full list of variables, interactions, and categories can be
found in Carrillo and Karr (2011). For categorical variables, the reference
category is the one with the largest count.
The dataset for our model
consists of 59,346 subjects and 190,693 observations, distributed across the seven waves as: 30,234, 30,652, 26,732, 26,778, 24,956, 25,910,
and 25,431.
Those data correspond to non-missing salaries between $5,000 and $999,995, for
people with consistent ages across the waves, and with a non-missing value for
the variable indicating whether the (postsecondary educational institution)
employer was public or private. The average (cross-sectional) survey weights for
those waves are: 15.37, 16.28, 19.96, 20.74, 22.71, 22.93,
and 24.88.
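As a quick consistency check on the figures above, the following minimal sketch sums the wave counts and multiplies each count by its average weight. It assumes the counts are listed in wave order and that each average weight is taken over exactly the observations retained for the model; under those assumptions the product approximates the sum of the weights in that wave. The variable names are illustrative only.

```python
# Wave-level figures quoted in the text (assumed to be listed in wave order).
waves = [1995, 1997, 1999, 2001, 2003, 2006, 2008]
n_obs = [30234, 30652, 26732, 26778, 24956, 25910, 25431]
avg_weight = [15.37, 16.28, 19.96, 20.74, 22.71, 22.93, 24.88]
n_subjects = 59346

# Total observations and average number of waves observed per subject.
total_obs = sum(n_obs)
obs_per_subject = total_obs / n_subjects

# Assumption: count * mean weight approximates the sum of weights per wave,
# which holds if the mean is computed over these same observations.
weight_totals = {w: n * a for w, n, a in zip(waves, n_obs, avg_weight)}

for w in waves:
    print(w, round(weight_totals[w]))
print(total_obs, round(obs_per_subject, 2))
```

The rising average weight alongside falling counts is what one would expect when a shrinking sample represents a population of similar or growing size.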
The survey weights that
we use for each wave are the final adjusted weights. These weights are the
original design weights adjusted for nonresponse and post-stratification.
However, the theory that we developed in Section 3 assumes that the weights are
the inverse of the selection probabilities; in other words, the original design
weights. This is a mismatch whose effect we plan to investigate in the future.
On the other hand, the calculations in the last part of the Appendix (which do
not assume anything about the weights) suggest that the effect of this mismatch
is small.
The covariates and
interactions that we considered were selected because they were suggested
either by exploratory analyses or by the subject matter experts at the NSF.
Carrillo and Karr (2011) present the estimated coefficients of the model, in which the covariate vector includes the intercept along with the other
covariates. This is the model in Formula (3.1), whose properties are discussed
in Section 3. The estimate of the working covariance matrix uses the number of covariates, 208, and
the cross-sectional weight for each subject at each wave where one applies, and zero otherwise. The estimated correlation parameter contains the estimated auto-correlations for each pair of distinct waves among 1995,
1997, 1999, 2001, 2003, 2006, and 2008. These estimated values form the
auto-correlation matrix.
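The structure of such a matrix can be sketched as follows. This is only an illustration under the assumption that there is one auto-correlation parameter per pair of distinct waves; the function name `build_corr_matrix` is hypothetical, and the value 0.5 is a placeholder, not an estimate from the paper.

```python
from itertools import combinations

waves = [1995, 1997, 1999, 2001, 2003, 2006, 2008]

def build_corr_matrix(waves, pair_corr):
    """Return a symmetric matrix with unit diagonal from pairwise correlations."""
    k = len(waves)
    idx = {w: i for i, w in enumerate(waves)}
    # Start from the identity: each wave is perfectly correlated with itself.
    mat = [[1.0 if r == c else 0.0 for c in range(k)] for r in range(k)]
    for (a, b), rho in pair_corr.items():
        mat[idx[a]][idx[b]] = rho
        mat[idx[b]][idx[a]] = rho  # enforce symmetry
    return mat

# Seven waves give 7 * 6 / 2 = 21 distinct pairs.
pairs = list(combinations(waves, 2))

# Placeholder value only (hypothetical), NOT an estimate from the paper.
placeholder = {p: 0.5 for p in pairs}
R = build_corr_matrix(waves, placeholder)

print(len(pairs))
for row in R:
    print(row)
```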
We now give some
conclusions about salaries in the Ph.D. workforce based on the estimated
coefficients, which appear in Carrillo and Karr (2011). First of all, a
sensible estimate of mean salary considers the intercept, the hours worked per
week (whose average is 47), and years since degree (average of 15); the
resulting estimate of the overall average applies to a subject with all other continuous
covariates equal to zero and in the reference category of all categorical covariates.
All other things being
constant, women's salaries are about 93.4% those of men, whereas race does not
seem to have an effect on salaries. The gender × years
since 1995 interaction is not significant; therefore this salary differential
is not changing over time. Notice that with a single year's data, we would not
be able to evaluate the effect of time. Even more important than that, using
only the data from a single wave, say 2008, we would not be able to assess
whether the effect of being female is changing over time.
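Because the response is the log of salary with an identity link, a coefficient on the log scale corresponds to a multiplicative ratio on the salary scale. The sketch below shows the conversion in both directions for the 93.4% figure quoted above. The function names are illustrative, and the coefficient is back-calculated from the quoted ratio rather than taken from the fitted model.

```python
import math

def coef_to_ratio(beta):
    """Salary ratio implied by a coefficient in a log-salary model."""
    return math.exp(beta)

def ratio_to_coef(ratio):
    """Log-scale coefficient implied by a salary ratio."""
    return math.log(ratio)

def percent_change(beta):
    """Percentage difference in salary implied by a coefficient."""
    return 100.0 * (math.exp(beta) - 1.0)

# Back-calculated from the 93.4% ratio in the text (not a reported estimate).
beta_female = ratio_to_coef(0.934)

print(round(beta_female, 4))
print(round(coef_to_ratio(beta_female), 3))
print(round(percent_change(beta_female), 1))
```

The same conversion applies to the other percentages reported in this section, such as those for adjunct faculty and postdoctoral positions.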
Doctorate holders with a
management job have the highest salaries, followed by those in health
occupations; on the other hand, those with the lowest salaries are the ones
employed in "other" occupations, followed by those in political science.
Among employment sectors,
salaries are highest in for-profit industry (20% higher than for the reference
category of tenured faculty in public 4-year institutions), followed in order
by the federal government, self-employment, and non-profit industry, all of which
are higher than the reference category. The lowest salaries are those in
two-year colleges and in two- and four-year institutions for which tenure is
not applicable.
The highest single
negative effect on salaries also occurs within the education sector. Those with
positions as adjunct faculty members have salaries that are approximately 59%
of the salaries of comparable doctorate holders. Not surprisingly, postdoctoral
salaries are only about 74% of the salaries of comparable people in other types
of positions.
Sector is also a
contributing factor to the hard-to-interpret dependence of salary on the
starting month for the current position: salaries are lower for starting months
of August and September. Additional analyses show that the monthly effect is
present only in the education sector, where, as we have seen, salaries are
lower than in industry or government, and in which starting months of August
and September are common. Therefore, sector is part of the answer, but not the
entire answer. Finer-grained divisions of the education sector, using Carnegie
classifications, further reduce, but do not remove, the significance of monthly
effects. The SDR does not seem to contain sufficient data to remove the monthly
effects entirely, so we have retained the SDR definition of sector.
People with degrees in
computing and information sciences have the highest salaries (around 20% higher
than in the biological sciences), followed by those in electrical and computer
engineering and in economics (approximately 16% higher). Doctorate holders in
agricultural and food sciences, environmental life sciences, earth,
atmospheric, and ocean sciences, and in "other" social sciences have the lowest
salaries. The "other" social sciences are the social sciences excluding
economics and political science.
Married people have the
highest salaries, followed by those who are in married-like relationships,
widowed, separated, divorced, and never married. The latter have salaries only
around 89% as high as the married ones; one could argue that there is some
association between never married and age. The presence of children older than
two is associated with higher salaries, but the presence of children younger
than two is not.
Doctorate holders with
jobs only somewhat related to their degree field make around 93% of what people
with closely related jobs (the reference category) do. If the job is not
related to the doctoral degree as the result of a change in career or
professional interests, they make around 82% of what people with closely
related jobs do. On the other hand, those with jobs not related for other
reasons make only about 76% of what the reference category does.
There is an increase of
around 3% for every additional year since doctorate graduation, although the
effect diminishes as the number of years grows. We interpret this as the
effect of experience. There is a small penalty for receiving the doctorate
later in life; for every additional year of age at graduation, salary
decreases by 1%.
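In a log-salary model, a per-year effect compounds multiplicatively over several years. The following minimal sketch applies this to the age-at-graduation penalty, assuming the 1% figure corresponds to a per-year salary ratio of 0.99 and ignoring any interactions. The function name `relative_salary` is illustrative. The years-since-degree effect is not treated the same way here because the text notes it diminishes over time and the size of that diminishing term is not given in this section.

```python
import math

def relative_salary(ratio_per_year, years):
    """Salary ratio after compounding a per-year ratio over a number of years."""
    return ratio_per_year ** years

# Assumption: a 1% decrease per year of age at graduation means a ratio of 0.99.
penalty_5 = relative_salary(0.99, 5)

# Equivalent computation on the log scale: exp(years * log(ratio)).
penalty_5_log = math.exp(5 * math.log(0.99))

print(round(penalty_5, 4), round(penalty_5_log, 4))
```

Under these assumptions, graduating five years later is associated with a salary slightly above 95% of the comparison salary, all else being equal.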
We also found that the
regional Consumer Price Index (CPI) is significant. The higher the CPI, the
higher the salary. We could not use the CPI associated with the labor market of
employment because the SDR data do not identify geography beyond the state. We
included the state in the model as a proxy for cost of living; the state effect
is highly significant and some state coefficients are among the highest
overall. The highest salaries are in California, Washington D.C. and its
suburbs, and New York City and its suburbs. On the other hand, the lowest
salaries are in Puerto Rico, Vermont, Montana, Maine, Idaho, South Dakota,
North Dakota, and in the Territories/Abroad.
Having a part-time job
due to being retired or semi-retired is significant and appears in several significant
interactions. Because of this, we do not think that the available data present
the full picture about retirement, for example, for people who are
(semi-)retired and yet have full-time jobs.
Finally, we analyzed
residuals; Figures 4.1 and 4.2 show a Box and Whisker plot of standardized
residuals by year and a spaghetti plot of standardized residuals, respectively.
Figure 4.1 shows that the
model fits reasonably well for all the reference years as most of the
standardized residuals lie between -2 and 2. Also, the distributions of
residuals do not seem to greatly differ from year to year.

Figure 4.1 Box and Whisker plot of standardized
residuals by year
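The visual check described for Figure 4.1 can also be made numerically by computing, for each year, the share of standardized residuals that lie between -2 and 2. The sketch below is one way to do this; the function name `share_within` is illustrative and the toy residuals are made up for demonstration, not taken from the SDR.

```python
from collections import defaultdict

def share_within(residuals, bound=2.0):
    """Share of standardized residuals with |r| <= bound, by year.

    residuals: iterable of (year, standardized_residual) pairs.
    """
    inside = defaultdict(int)
    total = defaultdict(int)
    for year, r in residuals:
        total[year] += 1
        if abs(r) <= bound:
            inside[year] += 1
    return {year: inside[year] / total[year] for year in total}

# Made-up toy values (hypothetical), used only to exercise the function.
toy = [
    (2003, 0.3), (2003, -1.1), (2003, -4.2), (2003, 1.9),
    (2006, 0.0), (2006, 2.5),
]
shares = share_within(toy)
for year in sorted(shares):
    print(year, shares[year])
```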
From Figure 4.2 we also
conclude that the model fits reasonably well for most people, as most of the
lines fluctuate between -2 and 2. Nonetheless, there are a few people for whom
the model seems to greatly over-predict in 2003 and a few for whom
that happens in 2006. We included several terms in the model to correct this
issue, but none did so completely.

Figure 4.2 Spaghetti plot of standardized residuals
The last thing we tried
was to produce exploratory classification trees for these residual blips. We
found that, in the dataset available, the only thing related to them was the
survey mode. The blips in 2003 are disproportionately high for web responses,
and the blips in 2006 are disproportionately high for CATI responses. We
conclude that either there is a mode effect in these two years or those
respondents have something different, in those years, that is not included in
the available variables.
Finally, the plot of
fitted values versus observed (which can be found in Carrillo and Karr 2011)
also shows a similar story. For most observations the model performs well,
apart from those few cases in 2003 and 2006 for which there is large
over-estimation.