Administrative Records for Survey Methodology 3
Chair | Dr Asaph Young Chun (US Census Bureau) |
Coordinator 1 | Professor Mike Larsen (George Washington University) |
Coordinator 2 | Dr Ingegerd Jansson (Statistics Sweden) |
Coordinator 3 | Dr Manfred Antoni (Institute for Employment Research) |
Coordinator 4 | Dr Daniel Fuss (Leibniz Institute for Educational Trajectories) |
Coordinator 5 | Dr Corinna Kleinert (Leibniz Institute for Educational Trajectories) |
As the use of administrative data in education surveys increases, we need to think through and address the associated issues. A challenging situation arises when education survey data come from students and administrative data are used to supplement the sample survey data. A useable case rule must be developed that identifies the key information items, or combinations of items, needed for a sample member to qualify as a unit respondent. A unit respondent in a survey is typically an interview respondent, but here we define a unit respondent based on the data available, regardless of their source. The useable case rule can sometimes be satisfied with data from only a subset of the sources and can exclude interview data. This presentation uses data from the 2007-08 National Postsecondary Student Aid Study (NPSAS:08) to examine trade-offs between unit respondent sample size and data record completeness when defining unit respondents.
There are at least two approaches to defining a unit respondent when data from administrative sources exist in addition to survey responses. One approach is to define a unit respondent as a sample member who completed the survey interview under some rule for determining what constitutes ‘complete’. This approach is the most common, and unit response rates computed this way follow the American Association for Public Opinion Research (AAPOR) definition and the United States National Center for Education Statistics (NCES) Statistical Standards. Nonresponse weight adjustments are used to compensate for the nonrespondents and to reduce the potential for unit nonresponse bias. Data collected from other sources may then be used to fill in any missing item values for these respondents. Another approach is to define a unit respondent as a sample member with sufficient data from any source to be judged complete. Filling in data from other sources where available, together with logical and statistical imputation, compensates for missing data and reduces the potential for item nonresponse bias. When an interview-based respondent definition is used and the response rate is low, there is the potential for a large amount of unit nonresponse bias due to interview nonresponse. When a useable-case respondent definition is used, there is the potential for a large amount of item nonresponse bias due to missing items, especially when a subset of items is available from only one source, such as the interview.
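To make the useable-case idea concrete, the following is a minimal sketch of how such a rule might be operationalised; the item names and the choice of key items are invented for illustration and are not taken from NPSAS:08.

```python
# Hypothetical illustration of a useable-case rule: a sample member counts as a
# unit respondent if every key item can be filled from at least one source
# (interview or administrative records). Item names are invented.

KEY_ITEMS = ["enrollment_status", "tuition_amount", "aid_received"]

def usable_case(interview: dict, admin: dict) -> bool:
    """Return True if each key item is available from the interview or admin data."""
    for item in KEY_ITEMS:
        if interview.get(item) is None and admin.get(item) is None:
            return False
    return True

# No interview at all, but administrative records cover the key items, so this
# sample member still qualifies as a unit respondent under the rule.
print(usable_case({}, {"enrollment_status": "full-time",
                       "tuition_amount": 9200,
                       "aid_received": 3500}))   # True
```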
In this presentation, we will compare these two approaches. Of particular interest is the approaches’ differing use of weight adjustment and imputation to compensate for nonrespondents and to reduce potential nonresponse bias. While weighting and imputation have been compared in the past, we will examine this comparison in the context of education data when administrative data are available, allowing more imputation and less weight adjustment. We will discuss the advantages and disadvantages of both approaches, as well as potential concerns with adopting a unit respondent definition that does not require an interview.
The UK has several commercial database companies that synthesise multiple sources of data (surveys, administrative databases, plus other sources from the burgeoning ‘big data’ ecosystem) to impute address-level and household-level characteristics. But how accurate are they, and can they be used to improve the efficiency of survey sample designs?
In this paper, we describe a project in which a large, very high quality random sample survey dataset is used to verify commercial imputations with regard to household type, size, and age profile. After providing key descriptive statistics with regard to the imputations’ sensitivity and specificity, we then use this information to simulate optimal sample designs for hypothetical but realistic objectives such as (i) a sample of households with children aged 0-4; (ii) a sample of households renting in the private sector; and (iii) a sample of people aged 75+. These designs take into account real costs, likely response rates as a function of data collection method, and the statistical consequences of varying sampling fractions between strata. We finish with a set of general conclusions about the usefulness of commercial databases in the design of high quality UK sample surveys.
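As a rough illustration of how sensitivity and specificity feed into such design calculations, the sketch below uses invented figures (not results from the project) to show how the eligibility rate within the flagged and unflagged strata can be derived before deciding how heavily to oversample flagged addresses.

```python
# Illustrative sketch with invented numbers: how the sensitivity and specificity
# of a commercial "household contains a child aged 0-4" flag translate into the
# expected eligibility rate in each stratum of an address-based design.

prevalence  = 0.12   # assumed true share of households with a child aged 0-4
sensitivity = 0.60   # P(flagged | eligible), as estimated against the survey
specificity = 0.85   # P(not flagged | not eligible)

# Share of all addresses the commercial database flags, and the eligibility
# rate within the flagged and unflagged strata (Bayes' rule).
p_flagged = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
hit_rate_flagged   = sensitivity * prevalence / p_flagged
hit_rate_unflagged = (1 - sensitivity) * prevalence / (1 - p_flagged)

print(f"Flagged stratum:   {p_flagged:.1%} of addresses, "
      f"{hit_rate_flagged:.1%} eligible")
print(f"Unflagged stratum: {1 - p_flagged:.1%} of addresses, "
      f"{hit_rate_unflagged:.1%} eligible")
```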
The Survey of Income and Program Participation (SIPP) was redesigned for the 2014 panel. With this redesign came the opportunity to use new modeling methods along with administrative records to improve imputations in the SIPP. As an initial step toward this transition, this methodology was applied to select high-level branching variables that we call ‘topic flags.’ Topic flags indicate whether a certain section of questions (e.g. about Social Security receipt) was relevant for a respondent. They summarize screener questions and monthly-level data into an annual indicator of employment, social insurance programs, means-tested programs, health insurance, and more. Missing topic flags are imputed using a parametric method called Sequential Regression Multivariate Imputation (SRMI). Unlike hot-deck imputation, which can control for only a limited number of characteristics, SRMI can condition on many more variables, and the variables used in the model-based imputation can also come from household, spouse, or parent characteristics. Moreover, our models include data from administrative records, which helps to mitigate the problem of survey data that are not missing at random. This paper describes our modeling process and its advantages over more traditional imputation methods such as hot-deck imputation, and demonstrates the usefulness of linking administrative data into the models.
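The sketch below illustrates the chained-regression idea behind SRMI for a single binary topic flag; it is not the SIPP production code, and the variables and data are invented. With several incomplete variables, the regression-and-draw step would cycle across the variables over several iterations, each regression conditioning on the current imputed values of the others, including linked administrative indicators.

```python
# Minimal sketch of the sequential-regression imputation idea behind SRMI,
# applied to one binary topic flag; data and variable names are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Predictors: e.g. age from the survey and a linked administrative indicator.
X = np.column_stack([rng.normal(40, 12, n),        # age (survey)
                     rng.integers(0, 2, n)])       # admin record of receipt
flag = (X[:, 1] + rng.normal(0, 0.5, n) > 0.5).astype(float)  # true topic flag
flag[rng.random(n) < 0.2] = np.nan                 # 20% item nonresponse

miss = np.isnan(flag)                              # originally missing positions
# Fit a regression on observed cases; in a full SRMI chain this step is
# repeated for each incomplete variable over several cycles.
model = LogisticRegression().fit(X[~miss], flag[~miss].astype(int))
p = model.predict_proba(X[miss])[:, 1]
flag[miss] = rng.random(miss.sum()) < p            # stochastic draw, not best guess

print(f"Imputed {miss.sum()} missing topic flags")
```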