# Load necessary libraries
library(auxvecLASSO)
library(survey)
#> Warning: package 'survey' was built under R version 4.5.1
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#>
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#>
#> dotchart
library(sampling)
#> Warning: package 'sampling' was built under R version 4.5.1
#>
#> Attaching package: 'sampling'
#> The following objects are masked from 'package:survival':
#>
#> cluster, strata
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.1
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Set seed for reproducibility
set.seed(123)
Auxiliary variables can greatly improve performance when using models in survey data analyses, for example in contexts like survey calibration, imputation or prediction. Traditional methods for variable selection, such as the commonly used stepwise forward selection method, have several drawbacks: they are greedy/myopic, sensitive to small changes in data, and may often lead to selecting irrelevant variables due to inflated Type I Errors.
In contrast, the LASSO (Least Absolute Shrinkage and Selection Operator) offers several advantages: simultaneous selection and shrinkage, stability even in high-dimensional settings, computational efficiency, and discouraging overfitting, leading to models that tend to generalize better to new data. In short, viewing auxiliary variable selection as a prediction/statistical learning problem can offer many advantages (see also, for example, Caughey & Hartman (2017) doi:10.2139/ssrn.3494436 and Mainzer et al. (2024) PMC7615727.
The auxvecLASSO
package provides tools for selecting
auxiliary variables using LASSO and conducting diagnostics related to
survey calibration. The main function
selection_auxiliary_variables_lasso_cv()
allows users to
perform LASSO-penalized regression (logistic regression for binary
outcomes or linear regression for continuous outcomes) with
cross-validation to select auxiliary variables for modeling one or more
outcome variables. It allows for the inclusion of all two-way
interactions among the auxiliary variables and the option to enforce a
hierarchical structure by keeping certain variables in the model through
the use of zero penalty factors.
A relative strength of the package is the accompanying function
assess_aux_vector()
, which can be used to evaluate the
performance of candidate auxiliary vectors.
In this vignette, we demonstrate how to apply the key functions in
the auxvecLASSO
package using the api
dataset
from the survey
package.
In this example, we will:
Select auxiliary variables using LASSO
Evaluate the chosen auxiliary vector
We will work with the api
data from the
survey
package. It contains survey data about the API
scores of a population of schools and additional auxiliary variables.
Based on the apipop
dataset, we will draw a stratified
sample which will serve as the point of departure for our modeling.
We begin by using the apipop
dataset and add some binary
variables to it to make calculations as easy as possible. Auxiliary
variables can also contain more than two categories, and even be
continuous/numerical. See the survey
package documentation
for more information on allowed formats for auxiliary variables.
# Load the population data
data(api)
# Load the population data file and add binary variables
api_pop <- apipop
api_pop$api00_bin <- as.factor(ifelse(api_pop$api00 > 650, 1, 0))
api_pop$growth_bin <- as.factor(ifelse(api_pop$growth > 25, 1, 0))
api_pop$meals_bin <- as.factor(ifelse(api_pop$meals > 40, 1, 0))
api_pop$ell_bin <- as.factor(ifelse(api_pop$ell > 15, 1, 0))
api_pop$hsg_bin <- as.factor(ifelse(api_pop$hsg > 20, 1, 0))
api_pop$full_bin <- as.factor(ifelse(api_pop$full > 90, 1, 0))
api_pop$sch.wide_bin <- as.factor(ifelse(api_pop$sch.wide == "Yes", 1, 0))
api_pop$awards_bin <- as.factor(ifelse(api_pop$awards == "Yes", 1, 0))
api_pop$comp.imp_bin <- as.factor(ifelse(api_pop$comp.imp == "Yes", 1, 0))
api_pop$stype <- as.factor(api_pop$stype)
# Keep only relevant variables
api_pop <- api_pop |>
dplyr::select("cds", "stype", "enroll", ends_with("_bin"))
Next, we draw a stratified sample based on stype to create a sample file.
strata_sample <- sampling::strata(
api_pop, # The population dataset
stratanames = c("stype"), # Stratification variable
size = c(150, 200, 175), # Desired sample sizes per stratum
method = "srswor" # Sampling without replacement
)
api_sample <- getdata(api_pop, strata_sample)
api_sample$SamplingWeight <- 1 / api_sample$Prob
We will also add a response indicator to the sample file.
# Artificial logistic regression coefficients (for both main effects and interactions)
coefficients <- c(
"api00_bin" = 0.5,
"growth_bin" = -0.3,
"ell_bin" = -0.6,
"interaction1" = 0.9, # Interaction term api00_bin:growth_bin
"interaction2" = 1.2 # Interaction term api00_bin:ell_bin
)
# Create interaction terms for logistic model
api_sample$interaction1 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$growth_bin)
api_sample$interaction2 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$ell_bin)
# Calculate the logit (log-odds) based on the artificial coefficients and interaction terms
logit <- -1.2 +
coefficients["api00_bin"] * as.numeric(api_sample$api00_bin) +
coefficients["growth_bin"] * as.numeric(api_sample$growth_bin) +
coefficients["ell_bin"] * as.numeric(api_sample$ell_bin) +
coefficients["interaction1"] * api_sample$interaction1 +
coefficients["interaction2"] * api_sample$interaction2
# Compute the logistic probabilities
api_sample$response_prob <- 1 / (1 + exp(-logit)) # Logistic function
# Simulate the response variable (1 for response, 0 for non-response)
api_sample$response <- as.factor(rbinom(n = nrow(api_sample), size = 1, prob = api_sample$response_prob))
# --- Check the summary of the sample ---
summary(api_sample)
#> cds enroll api00_bin growth_bin meals_bin ell_bin
#> Length:525 Min. : 118.0 0:271 0:290 0:265 0:263
#> Class :character 1st Qu.: 408.2 1:254 1:235 1:260 1:262
#> Mode :character Median : 700.5
#> Mean : 855.4
#> 3rd Qu.:1202.5
#> Max. :3282.0
#> NA's :7
#> hsg_bin full_bin sch.wide_bin awards_bin comp.imp_bin stype ID_unit
#> 0:225 0:278 0:140 0:238 0:212 E:175 Min. : 35
#> 1:300 1:247 1:385 1:287 1:313 H:150 1st Qu.:1312
#> M:200 Median :2830
#> Mean :2971
#> 3rd Qu.:4521
#> Max. :6183
#>
#> Prob Stratum SamplingWeight interaction1
#> Min. :0.03958 Min. :1.000 Min. : 5.033 Min. :1.000
#> 1st Qu.:0.03958 1st Qu.:1.000 1st Qu.: 5.033 1st Qu.:1.000
#> Median :0.19646 Median :2.000 Median : 5.090 Median :2.000
#> Mean :0.14480 Mean :2.048 Mean :11.798 Mean :2.137
#> 3rd Qu.:0.19868 3rd Qu.:3.000 3rd Qu.:25.263 3rd Qu.:2.000
#> Max. :0.19868 Max. :3.000 Max. :25.263 Max. :4.000
#>
#> interaction2 response_prob response
#> Min. :1.000 Min. :0.6225 0: 60
#> 1st Qu.:2.000 1st Qu.:0.7503 1:465
#> Median :2.000 Median :0.8455
#> Mean :2.086 Mean :0.8684
#> 3rd Qu.:2.000 3rd Qu.:0.9569
#> Max. :4.000 Max. :0.9983
#>
Variables in our analyses will have different purposes:
Outcome variables [the response indicator and central survey variables] (the response indicator together with survey variables used to evaluate point estimates and standard errors where unknown population totals make it hard to evaluate bias/MSE and to use these as auxiliary variables)
Candidate auxiliary variables (variables that might be selected to the auxiliary vector)
When evaluating auxiliary vector performance, we will also need:
Register variables (variables with known population totals - these are assumed to be good proxies of central survey variables)
Domain variables (variables that delineate subsets of the population for which we want to compute/evaluate estimates)
In this example, we will use the following variables:
The response indicator response and enroll are outcome variables, where enroll is assumed to be a survey variable.
The variables api00_bin, growth_bin, meals_bin, meals_bin, ell_bin, hsg_bin, avg.ed_bin, full_bin, comp.imp_bin and stype are candidate auxiliary variables.
The variable sch.wide_bin is assumed to be a register variable.
The stratification variable stype and awards_bin are domain variables in this example.
The select_auxiliary_variables_lasso_cv()
function can
be used to perform LASSO-based variable selection for binary and
continuous outcomes. The LASSO variable selection is made through one
call of the function select_auxiliary_variables_lasso_cv()
.
In this example, we will demand that stype be included
in the auxiliary vector since it is the stratification variable.
# Apply LASSO for selecting auxiliary variables
lasso_result <- select_auxiliary_variables_lasso_cv(
df = api_sample,
outcome_vars = c("response", "enroll"),
auxiliary_vars = c(
"api00_bin", "growth_bin", "comp.imp_bin", "meals_bin",
"meals_bin", "ell_bin", "hsg_bin", "full_bin", "stype"
),
must_have_vars = "stype", # Include the domain variable (stype) in the model
check_twoway_int = TRUE, # Include two-way interactions
nfolds = 5, # Use 5-fold cross-validation
verbose = FALSE, # Don't print progress messages
standardize = TRUE, # Standardize the predictors
return_models = FALSE, # Don't return the models, only the selection results
parallel = FALSE # Run without parallel processing
)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
# Display the results
lasso_result
#>
#> =========================
#> LASSO Auxiliary Variable Selection
#> =========================
#>
#> Selected Variables (21):
#> api00_bin0, growth_bin1, stypeH, stypeM, hsg_bin1:full_bin1, hsg_bin1, full_bin1, ell_bin1, api00_bin1:growth_bin1, api00_bin1:comp.imp_bin1, api00_bin1:hsg_bin1, api00_bin1:stypeH, comp.imp_bin1:stypeH, meals_bin1:ell_bin1, ell_bin1:hsg_bin1, hsg_bin1:stypeH, full_bin1:stypeH, full_bin1:stypeM, api00_bin1, comp.imp_bin1, meals_bin1
#>
#> By Outcome:
#> - response : 7 variables
#> api00_bin0, growth_bin1, stypeH, stypeM, hsg_bin1:full_bin1, hsg_bin1, full_bin1
#> - enroll : 20 variables
#> ell_bin1, stypeH, stypeM, api00_bin1:growth_bin1, api00_bin1:comp.imp_bin1, api00_bin1:hsg_bin1, api00_bin1:stypeH, comp.imp_bin1:stypeH, meals_bin1:ell_bin1, ell_bin1:hsg_bin1, hsg_bin1:full_bin1, hsg_bin1:stypeH, full_bin1:stypeH, full_bin1:stypeM, api00_bin1, growth_bin1, comp.imp_bin1, hsg_bin1, meals_bin1, full_bin1
#>
#> Selected Lambdas:
#> - response : 0.02082
#> - enroll : 12.88
#>
#> Penalty Factors:
#> Zero-penalty (must-keep): 2
#> Regular-penalty: 43
#>
#> Stored Models: 0
#>
#> Goodness-of-Fit:
#> Outcome: response
#> Cross-validated:
#> cv_error: 0.6052
#> cv_error_sd: 0.0135
#> Full data:
#> deviance_explained: 0.1303
#> auc: 0.8122
#> accuracy: 0.8857
#> brier_score: 0.0892
#> Coefficients at Lambda Min:
#> api00_bin0 -1.8730
#> stypeH -0.4799
#> stypeM 0.4307
#> growth_bin1 0.2189
#> hsg_bin1:full_bin1 -0.0040
#> api00_bin1 0.0000
#> comp.imp_bin1 0.0000
#> meals_bin1 0.0000
#> ell_bin1 0.0000
#> hsg_bin1 0.0000
#> full_bin1 0.0000
#> api00_bin1:growth_bin1 0.0000
#> api00_bin1:comp.imp_bin1 0.0000
#> api00_bin1:meals_bin1 0.0000
#> api00_bin1:ell_bin1 0.0000
#> api00_bin1:hsg_bin1 0.0000
#> api00_bin1:full_bin1 0.0000
#> api00_bin1:stypeH 0.0000
#> api00_bin1:stypeM 0.0000
#> growth_bin1:comp.imp_bin1 0.0000
#> growth_bin1:meals_bin1 0.0000
#> growth_bin1:ell_bin1 0.0000
#> growth_bin1:hsg_bin1 0.0000
#> growth_bin1:full_bin1 0.0000
#> growth_bin1:stypeH 0.0000
#> growth_bin1:stypeM 0.0000
#> comp.imp_bin1:meals_bin1 0.0000
#> comp.imp_bin1:ell_bin1 0.0000
#> comp.imp_bin1:hsg_bin1 0.0000
#> comp.imp_bin1:full_bin1 0.0000
#> comp.imp_bin1:stypeH 0.0000
#> comp.imp_bin1:stypeM 0.0000
#> meals_bin1:ell_bin1 0.0000
#> meals_bin1:hsg_bin1 0.0000
#> meals_bin1:full_bin1 0.0000
#> meals_bin1:stypeH 0.0000
#> meals_bin1:stypeM 0.0000
#> ell_bin1:hsg_bin1 0.0000
#> ell_bin1:full_bin1 0.0000
#> ell_bin1:stypeH 0.0000
#> ell_bin1:stypeM 0.0000
#> hsg_bin1:stypeH 0.0000
#> hsg_bin1:stypeM 0.0000
#> full_bin1:stypeH 0.0000
#> full_bin1:stypeM 0.0000
#>
#> Outcome: enroll
#> Cross-validated:
#> cv_error: 188129.5
#> cv_error_sd: 13594.59
#> Full data:
#> r_squared: 0.4676
#> mse: 171872.7
#> rmse: 414.5753
#> mae: 305.8876
#> Coefficients at Lambda Min:
#> stypeH 1181.2387
#> stypeM 480.4663
#> comp.imp_bin1:stypeH -224.4012
#> api00_bin1:stypeH -218.1802
#> hsg_bin1:stypeH -216.1054
#> hsg_bin1:full_bin1 -164.2081
#> full_bin1:stypeH -91.9793
#> ell_bin1 60.7008
#> api00_bin1:growth_bin1 -35.2461
#> api00_bin1:hsg_bin1 -22.4905
#> full_bin1:stypeM -21.6063
#> ell_bin1:hsg_bin1 16.5929
#> meals_bin1:ell_bin1 11.6447
#> api00_bin1:comp.imp_bin1 -2.9260
#> api00_bin0 0.0000
#> api00_bin1 0.0000
#> growth_bin1 0.0000
#> comp.imp_bin1 0.0000
#> meals_bin1 0.0000
#> hsg_bin1 0.0000
#> full_bin1 0.0000
#> api00_bin1:meals_bin1 0.0000
#> api00_bin1:ell_bin1 0.0000
#> api00_bin1:full_bin1 0.0000
#> api00_bin1:stypeM 0.0000
#> growth_bin1:comp.imp_bin1 0.0000
#> growth_bin1:meals_bin1 0.0000
#> growth_bin1:ell_bin1 0.0000
#> growth_bin1:hsg_bin1 0.0000
#> growth_bin1:full_bin1 0.0000
#> growth_bin1:stypeH 0.0000
#> growth_bin1:stypeM 0.0000
#> comp.imp_bin1:meals_bin1 0.0000
#> comp.imp_bin1:ell_bin1 0.0000
#> comp.imp_bin1:hsg_bin1 0.0000
#> comp.imp_bin1:full_bin1 0.0000
#> comp.imp_bin1:stypeM 0.0000
#> meals_bin1:hsg_bin1 0.0000
#> meals_bin1:full_bin1 0.0000
#> meals_bin1:stypeH 0.0000
#> meals_bin1:stypeM 0.0000
#> ell_bin1:full_bin1 0.0000
#> ell_bin1:stypeH 0.0000
#> ell_bin1:stypeM 0.0000
#> hsg_bin1:stypeM 0.0000
#>
#> Interaction Metadata:
#> Interaction terms selected (11):
#> hsg_bin1:full_bin1, api00_bin1:growth_bin1, api00_bin1:comp.imp_bin1, api00_bin1:hsg_bin1, api00_bin1:stypeH, comp.imp_bin1:stypeH, meals_bin1:ell_bin1, ell_bin1:hsg_bin1, hsg_bin1:stypeH, full_bin1:stypeH, full_bin1:stypeM
#> Main effects present in selected interactions: hsg_bin1, full_bin1, api00_bin1, growth_bin1, comp.imp_bin1, stypeH, meals_bin1, ell_bin1, stypeM
#> Full formula tested: ~ (api00_bin + growth_bin + comp.imp_bin + meals_bin + meals_bin + ell_bin + hsg_bin + full_bin + stype)^2 -1
The output from the
select_auxiliary_variables_lasso_cv()
function includes
several key components:
This is the list of all variables (main effects and interaction terms) that were selected by the LASSO model. Variables like api00_bin0, growth_bin1, stypeH, stypeM, etc., represent the individual effects of the auxiliary variables (binary or continuous) used in the model. api00_bin1:comp.imp_bin1, growth_bin1:stypeH, etc., represent interactions between two variables. These are considered important for predicting the outcome and have non-zero coefficients.
response and enroll: These are the two outcomes modeled separately. This breakdown helps you see which variables are more relevant for each specific outcome.
Lambda is a regularization parameter used in LASSO to control the strength of the penalty.
Two variables have been forced into the model (must-keep) with a penalty factor of zero. These are variables that must stay in the model regardless of the regularization. Please note that “variables” in this case refers to the columns of the model matrix. In our example, stype was stated as a must-have variable. This variable is a factor with three levels, where the first level has been removed in the modeling phase to avoid the dummy variable trap.
The remaining 43 “variables” are subject to regularization and can potentially be shrunk to zero depending on the strength of the lambda penalty.
This section describes the model fit at the chosen
lambda.min
(the lambda that minimizes cross-validation
error). Note that the set of goodness-of-fit measures are different
between binary and continuous outcomes.
For response:
Cross-validated error: 0.6052 with standard deviation 0.0135. This indicates the average prediction error across folds during cross-validation.
Deviance explained: 0.1303. The deviance explained is a measure of the goodness-of-fit, indicating that the model explains some of the variance in the response variable.
AUC (Area Under the Curve): 0.8122. AUC is a measure of the model’s ability to distinguish between the two classes (response = 0 or 1). The obtained value suggests that the model has good discriminatory power.
Accuracy: 0.8857. The model correctly classifies 96.19% of the observations, which is quite high, although it should be noted that the response indicator is imbalanced in our toy example.
Brier Score: 0.0892. This is a measure of the accuracy of probabilistic predictions. A lower Brier score indicates better calibration of predicted probabilities.
For enroll:
Cross-validated error: 188,129.5 with standard deviation 13,594.59. This is the model’s average prediction error across folds for the enroll outcome.
R-squared: 0.4676. This means the model explains 46.76% of the variance in the enroll outcome. R-squared is a measure of how well the independent variables explain the variation in the dependent variable.
MSE (Mean Squared Error): 171,872.7. A lower MSE indicates a better fit.
RMSE (Root Mean Squared Error): 414.5753. A higher RMSE indicates worse predictive accuracy, as it represents the standard deviation of the prediction errors.
MAE (Mean Absolute Error): 305.8876. This is the average magnitude of the prediction errors. The lower the MAE, the better.
This section displays the values of the coefficients (betas) at the
selected lambda.min
. These coefficients represent the
effect of each variable in the model at the chosen regularization
strength. For response, variables like
api00_bin0 and growth_bin1 have
non-zero coefficients, suggesting they have a significant relationship
with response. For enroll, several
variables have large coefficients, including stypeH,
stypeM, and comp.imp_bin1:stypeH,
which seem to significantly influence the prediction of
enroll.
Variables with a zero coefficient have been regularized out of the
model, meaning they didn’t contribute to the prediction at the chosen
lambda.min
.
This section gives an overview of the results concerning interaction terms and the full formula tested.
Summary
This output demonstrates how LASSO performs both feature selection and regularization to reduce the model complexity and improve generalization. The model for response performs quite well, whereas the enroll model has room for improvement in terms of predictive power.
Next, we will assess the calibration of auxiliary variables using the
assess_aux_vector()
function. This function provides a
comprehensive assessment of auxiliary vector calibration performance for
a given survey design. The function can also calibrate weights based on
a specified calibration formula and population totals. The function
returns a detailed list of diagnostics, which can help to evaluate and
improve survey weights.
Based on the results from the LASSO modeling, together with a domain-expertise judgment of the statistical relationships, data sufficiency, and what usually works in survey analysis and calibration, the following auxiliary vector is assessed:
Main effects:
stype (all levels), growth_bin, api100_bin, ell_bin,
comp.imp_bin, hsg_bin
Interaction effects:
comp.imp_bin x stype, api00_bin x stype,
hsg_bin x stype, api00_bin x
growth_bin
To make the example as realistic as possible, the response set from the sample file is analyzed (since, in practice, we don’t have survey variable information for non-respondents).
We will also create a survey design object to pass to the assessment function.
design <- survey::svydesign(
ids = ~1,
data = api_resp,
strata = ~stype,
weights = ~SamplingWeight
)
Since we are considering register variable diagnostics, register variable means need to be calculated.
register_pop_means <- list(
total = list(sch.wide_bin = mean(api_pop$sch.wide_bin == "1")),
by_domain = list(
stype = tapply(
api_pop$sch.wide_bin, api_pop$stype,
function(x) mean(x == "1")
),
awards_bin = tapply(
api_pop$sch.wide_bin, api_pop$awards_bin,
function(x) mean(x == "1")
)
)
)
We also need to calculate population totals for the weight
calibration to be possible. This can be done using the package function
generate_population_totals
which requires a calibration
formula as input.
## --- Define the calibration formula ---
calibration_formula <- ~ stype + growth_bin + api00_bin + ell_bin + comp.imp_bin + hsg_bin + api00_bin:stype + hsg_bin:stype + comp.imp_bin:stype + api00_bin:growth_bin
## --- Calculate population totals with a single terms object built on the POPULATION ---
pop_totals <- generate_population_totals(
population_df = api_pop,
calibration_formula
)
We use the objects created in the preparations phase and pass them as arguments to the assessment function.
Please note that the assessment function can be used to compare different auxiliary vectors by passing different calibration formulae (for example, on with and one without interaction terms) and matching population totals to it.
result_diagnostics <- assess_aux_vector(
design = design,
df = api_resp,
already_calibrated = FALSE,
calibration_formula = calibration_formula,
calibration_pop_totals = pop_totals$population_totals,
register_vars = c("sch.wide_bin"),
register_pop_means = register_pop_means,
survey_vars = c("enroll", "full_bin"),
domain_vars = c("stype", "awards_bin"),
diagnostics = c(
"weight_variation",
"register_diagnostics",
"survey_diagnostics"
),
verbose = FALSE
)
## --- Display output ---
result_diagnostics
#>
#> ===================================
#> Auxiliary Vector Assessment Summary
#> ===================================
#>
#> Weight Variation Metrics:
#> min 3.176759
#> max 38.641693
#> median 6.444397
#> mean 13.320430
#> sd 10.894298
#> range 35.464934
#> coefficient_of_variation 0.817864
#> gini_index 0.416277
#> entropy 5.835159
#> skewness 0.806881
#> kurtosis -1.105973
#> bottom_1pct 3.365732
#> top_1pct 38.641693
#>
#> Register Diagnostics (Total):
#> - sch.wide_bin :
#> Mean: 0.824194
#> SE: 0.014359
#> RSE: 0.017421
#> Bias: -0.002735
#> MSE: 0.000214
#> p(Bias): 0.848931
#>
#> Register Diagnostics (By Domain):
#> * by stype :
#> - sch.wide_bin :
#> Domain: stype=E
#> Mean: 0.893511
#> SE: 0.018643
#> RSE: 0.020865
#> Bias: 0.000274
#> MSE: 0.000348
#> p(Bias): 0.988271
#> Domain: stype=H
#> Mean: 0.539416
#> SE: 0.031145
#> RSE: 0.057739
#> Bias: -0.018200
#> MSE: 0.001301
#> p(Bias): 0.558984
#> Domain: stype=M
#> Mean: 0.734369
#> SE: 0.025401
#> RSE: 0.034589
#> Bias: -0.004334
#> MSE: 0.000664
#> p(Bias): 0.864525
#> * by awards_bin :
#> - sch.wide_bin :
#> Domain: awards_bin=0
#> Mean: 0.460754
#> SE: 0.046257
#> RSE: 0.100394
#> Bias: -0.010385
#> MSE: 0.002248
#> p(Bias): 0.822358
#> Domain: awards_bin=1
#> Mean: 1.000000
#> SE: 0.000000
#> RSE: 0.000000
#> Bias: 0.000000
#> MSE: 0.000000
#> p(Bias): 1.000000
#>
#> Survey Diagnostics (Total):
#> - enroll :
#> Mean: 603.552505
#> SE: 12.749225
#> RSE: 0.021124
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> - full_bin :
#> Mean: 0.515931
#> SE: 0.028311
#> RSE: 0.054874
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#>
#> Survey Diagnostics (By Domain):
#> * by stype :
#> - enroll :
#> Domain: stype=E
#> Mean: 402.578483
#> SE: 12.861493
#> RSE: 0.031948
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: stype=H
#> Mean: 1345.400458
#> SE: 57.483026
#> RSE: 0.042726
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: stype=M
#> Mean: 908.561555
#> SE: 30.641798
#> RSE: 0.033726
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> - full_bin :
#> Domain: stype=E
#> Mean: 0.551476
#> SE: 0.038187
#> RSE: 0.069246
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: stype=H
#> Mean: 0.434076
#> SE: 0.043144
#> RSE: 0.099392
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: stype=M
#> Mean: 0.422270
#> SE: 0.034253
#> RSE: 0.081116
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> * by awards_bin :
#> - enroll :
#> Domain: awards_bin=0
#> Mean: 785.267252
#> SE: 32.154192
#> RSE: 0.040947
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: awards_bin=1
#> Mean: 520.801388
#> SE: 14.316490
#> RSE: 0.027489
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> - full_bin :
#> Domain: awards_bin=0
#> Mean: 0.435763
#> SE: 0.045506
#> RSE: 0.104428
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
#> Domain: awards_bin=1
#> Mean: 0.554710
#> SE: 0.035744
#> RSE: 0.064437
#> Bias: NA
#> MSE: NA
#> p(Bias): NA
The function assess_aux_vector()
produces a structured
summary of diagnostics that help you understand how your calibration or
auxiliary vector adjustment performed. The output is organized into
three main sections:
This block describes how the analysis weights are distributed. Stable, well-behaved weights improve the reliability of estimates.
min / max / median / mean / sd / range: Basic
distribution of weights.
Here: weights range from 3.18 to
38.64 with mean 13.32 and sd
10.89 = relatively wide spread.
coefficient_of_variation (CV) = sd / mean. Rules
of thumb: CV ≲ 0.5 is comfortable; CV > 1 suggests heavy
dispersion.
Here: 0.82 = notable dispersion but not
extreme.
gini_index (0–1): inequality of weights (0 =
equal).
Here: 0.416 = moderate inequality.
entropy: higher = more uniform weights (units
depend on log base).
Here: 5.84.
skewness / kurtosis: shape diagnostics (skewness
> 0 means long right tail; kurtosis < 0 is flatter than
normal).
Here: skewness 0.81 (some high-weight tail),
kurtosis −1.11 (flatter).
bottom_1pct / top_1pct: extreme percentiles for
quick outlier check.
Here: 1st pct 3.37, 99th pct
38.64.
For each register variable supplied with population means, the function reports design-based estimates, uncertainty, and bias relative to the benchmark.
Reported statistics per variable (and per domain, if requested):
Mean: weighted sample estimate.
SE: standard error of the estimate.
RSE = SE / Mean (relative SE).
Bias = (Weighted mean − Population mean).
MSE: mean squared error ([Bias]^2 + Var).
p(Bias): two-sided test that Bias = 0 (large values imply no detectable bias).
Based on the output, the calibrated estimates match the register means closely (no evidence of significant biases). For example at the total level, the bias of the estimate of the mean sch.wide_bin is −0.0027 with a p-value of p=0.849, which indicates that the chosen auxiliary vector consists of valuable auxiliary variables. Similar results are shown at the domain-level. As a test, the domain awards_bin=1 was included, in which all rows have sch.wide_bin=1 - the output mean estimate of exactly 1.000 is consistent with the given data structure.
The assessment function provides design-based estimates and precision for the survey variable estimates. Bias/MSE/p-values are NA because there is no benchmark to compare against.
Overall mean enrollment was 603.6 students (RSE = 2.1%), and the proportion of full schools was 0.516 (RSE = 5.5%). Domain-level RSEs generally remained below 10%, supporting reliable comparisons across stype and awards_bin.
The auxvecLASSO
package is designed to help survey
practitioners treat auxiliary variable selection as a statistical
learning problem. In this vignette, we walked through two key components
of the package:
Variable selection with LASSO
(select_auxiliary_variables_lasso_cv()
), which identifies
promising auxiliary variables (and interactions) for modeling survey
outcomes.
Calibration diagnostics
(assess_aux_vector()
), which evaluates how a chosen
auxiliary vector performs in practice, using metrics on weight
distribution, register variable alignment, and survey estimate
precision.
Together, these tools provide a workflow for:
generating candidate auxiliary vectors, and
checking whether the resulting weights behave sensibly and lead to accurate, stable estimates.
While this vignette illustrated the functions on a simple dataset, the same approach extends to larger and more complex surveys. Readers are encouraged to adapt the demonstrated steps to their own data, paying special attention to the diagnostics: stable weights, low bias on register variables, and acceptable precision on survey estimates are all signs that the chosen auxiliary vector is working well.
For further details, advanced options, and examples, consult the package documentation.