---
title: "Introduction to the auxvecLASSO package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{intro-to-auxvecLASSO}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
# Load necessary libraries
library(auxvecLASSO)
library(survey)
library(sampling)
library(dplyr)

# Set seed for reproducibility
set.seed(123)
```

# Introduction

Auxiliary variables can greatly improve performance when using models in survey data analyses, for example in contexts like survey calibration, imputation or prediction. Traditional methods for variable selection, such as the commonly used stepwise forward selection method, have several drawbacks: they are greedy/myopic, sensitive to small changes in the data, and may often lead to selecting irrelevant variables due to inflated Type I errors. In contrast, the LASSO (Least Absolute Shrinkage and Selection Operator) offers several advantages: simultaneous selection and shrinkage, stability even in high-dimensional settings, computational efficiency, and discouragement of overfitting, leading to models that tend to generalize better to new data. In short, viewing auxiliary variable selection as a prediction/statistical learning problem can offer many advantages (see also, for example, Caughey & Hartman (2017) [doi:10.2139/ssrn.3494436](https://doi.org/10.2139/ssrn.3494436) and Mainzer et al. (2024) [PMC7615727](https://doi.org/10.1002/bimj.202200291)).

The `auxvecLASSO` package provides tools for selecting auxiliary variables using LASSO and conducting diagnostics related to survey calibration. The main function `select_auxiliary_variables_lasso_cv()` allows users to perform LASSO-penalized regression (logistic regression for binary outcomes or linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables.
It allows for the inclusion of all two-way interactions among the auxiliary variables and the option to enforce a hierarchical structure by keeping certain variables in the model through the use of zero penalty factors. A relative strength of the package is the accompanying function `assess_aux_vector()`, which can be used to evaluate the performance of candidate auxiliary vectors.

In this vignette, we demonstrate how to apply the key functions in the `auxvecLASSO` package using the `api` dataset from the `survey` package. In this example, we will:

1. Select auxiliary variables using LASSO
2. Evaluate the chosen auxiliary vector

# Dataset

We will work with the `api` data from the `survey` package. It contains survey data about the API scores of a population of schools and additional auxiliary variables. Based on the `apipop` dataset, we will draw a stratified sample which will serve as the point of departure for our modeling.

We begin by using the `apipop` dataset and add some binary variables to it to make calculations as easy as possible. Auxiliary variables can also contain more than two categories, and even be continuous/numerical. See the `survey` package documentation for more information on allowed formats for auxiliary variables.
```{r, create_pop_file}
# Load the population data
data(api)

# Load the population data file and add binary variables
api_pop <- apipop
api_pop$api00_bin <- as.factor(ifelse(api_pop$api00 > 650, 1, 0))
api_pop$growth_bin <- as.factor(ifelse(api_pop$growth > 25, 1, 0))
api_pop$meals_bin <- as.factor(ifelse(api_pop$meals > 40, 1, 0))
api_pop$ell_bin <- as.factor(ifelse(api_pop$ell > 15, 1, 0))
api_pop$hsg_bin <- as.factor(ifelse(api_pop$hsg > 20, 1, 0))
api_pop$full_bin <- as.factor(ifelse(api_pop$full > 90, 1, 0))
api_pop$sch.wide_bin <- as.factor(ifelse(api_pop$sch.wide == "Yes", 1, 0))
api_pop$awards_bin <- as.factor(ifelse(api_pop$awards == "Yes", 1, 0))
api_pop$comp.imp_bin <- as.factor(ifelse(api_pop$comp.imp == "Yes", 1, 0))
api_pop$stype <- as.factor(api_pop$stype)

# Keep only relevant variables
api_pop <- api_pop |>
  dplyr::select("cds", "stype", "enroll", ends_with("_bin"))
```

Next, we draw a stratified sample based on **stype** to create a sample file.

```{r, create_sample}
strata_sample <- sampling::strata(
  api_pop,                   # The population dataset
  stratanames = c("stype"),  # Stratification variable
  size = c(150, 200, 175),   # Desired sample sizes per stratum
  method = "srswor"          # Simple random sampling without replacement
)

api_sample <- sampling::getdata(api_pop, strata_sample)
api_sample$SamplingWeight <- 1 / api_sample$Prob
```

We will also add a response indicator to the sample file.
```{r dataimport}
# Artificial logistic regression coefficients (for both main effects and interactions)
coefficients <- c(
  "api00_bin" = 0.5,
  "growth_bin" = -0.3,
  "ell_bin" = -0.6,
  "interaction1" = 0.9,  # Interaction term api00_bin:growth_bin
  "interaction2" = 1.2   # Interaction term api00_bin:ell_bin
)

# Create interaction terms for the logistic model
# Note: as.numeric() on a factor returns the level codes (1/2),
# which is acceptable for this artificial simulation
api_sample$interaction1 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$growth_bin)
api_sample$interaction2 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$ell_bin)

# Calculate the logit (log-odds) based on the artificial coefficients and interaction terms
logit <- -1.2 +
  coefficients["api00_bin"] * as.numeric(api_sample$api00_bin) +
  coefficients["growth_bin"] * as.numeric(api_sample$growth_bin) +
  coefficients["ell_bin"] * as.numeric(api_sample$ell_bin) +
  coefficients["interaction1"] * api_sample$interaction1 +
  coefficients["interaction2"] * api_sample$interaction2

# Compute the logistic probabilities
api_sample$response_prob <- 1 / (1 + exp(-logit))  # Logistic function

# Simulate the response variable (1 for response, 0 for non-response)
api_sample$response <- as.factor(rbinom(n = nrow(api_sample), size = 1, prob = api_sample$response_prob))

# --- Check the summary of the sample ---
summary(api_sample)
```

# Selecting Auxiliary Variables via LASSO

Variables in our analyses serve different purposes:

- Outcome variables (the response indicator together with central survey variables; for survey variables, unknown population totals make it hard to evaluate bias/MSE or to use them as auxiliary variables)
- Candidate auxiliary variables (variables that might be selected into the auxiliary vector)

When evaluating auxiliary vector performance, we will also need:

- Register variables (variables with known population totals; these are assumed to be good proxies of central survey variables)
- Domain variables (variables that delineate subsets of the population for which we want to compute/evaluate estimates)

In this example, we will use the following variables:

- The response indicator **response** and **enroll** are outcome variables, where **enroll** is assumed to be a survey variable.
- The variables **api00_bin**, **growth_bin**, **meals_bin**, **ell_bin**, **hsg_bin**, **full_bin**, **comp.imp_bin** and **stype** are candidate auxiliary variables.
- The variable **sch.wide_bin** is assumed to be a register variable.
- The stratification variable **stype** and **awards_bin** are domain variables in this example.

## Variable selection using auxvecLASSO

The `select_auxiliary_variables_lasso_cv()` function performs LASSO-based variable selection for binary and continuous outcomes in a single call. In this example, we will demand that **stype** be included in the auxiliary vector since it is the stratification variable.
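Before calling the package function, it may help to see the underlying mechanics. The sketch below is not the package's implementation; it is a minimal `glmnet::cv.glmnet()` example on toy data (all names and coefficients are illustrative) showing how a zero penalty factor exempts a must-keep column from shrinkage, which is the same idea used for `must_have_vars`:

```r
library(glmnet)
set.seed(1)

# Toy data, purely illustrative (not the api data)
x <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
colnames(x) <- c("keep_me", paste0("x", 2:5))
y <- rbinom(200, 1, plogis(0.8 * x[, 1] - 0.5 * x[, 2]))

# A penalty factor of 0 exempts the first column from shrinkage
pf <- c(0, rep(1, 4))
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 5, penalty.factor = pf)

# Rows with non-zero coefficients at lambda.min are the "selected" columns;
# keep_me always survives because it is never penalized
coef(cvfit, s = "lambda.min")
```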
```{r lassomodeling}
# Apply LASSO for selecting auxiliary variables
lasso_result <- select_auxiliary_variables_lasso_cv(
  df = api_sample,
  outcome_vars = c("response", "enroll"),
  auxiliary_vars = c(
    "api00_bin", "growth_bin", "comp.imp_bin", "meals_bin",
    "ell_bin", "hsg_bin", "full_bin", "stype"
  ),
  must_have_vars = "stype",  # Force the stratification variable (stype) into the model
  check_twoway_int = TRUE,   # Include two-way interactions
  nfolds = 5,                # Use 5-fold cross-validation
  verbose = FALSE,           # Don't print progress messages
  standardize = TRUE,        # Standardize the predictors
  return_models = FALSE,     # Don't return the models, only the selection results
  parallel = FALSE           # Run without parallel processing
)

# Display the results
lasso_result
```

## Understanding the Output

The output from the `select_auxiliary_variables_lasso_cv()` function includes several key components:

### Selected variables

This is the list of all variables (main effects and interaction terms) that were selected by the LASSO model. Entries like **api00_bin0**, **growth_bin1**, **stypeH** and **stypeM** represent the individual effects of the (binary or continuous) auxiliary variables used in the model, while **api00_bin1:comp.imp_bin1**, **growth_bin1:stypeH**, etc., represent interactions between two variables. These terms are considered important for predicting the outcome and have non-zero coefficients.

### By outcome

**response** and **enroll**: These are the two outcomes modeled separately. This breakdown helps you see which variables are more relevant for each specific outcome.

### Selected lambdas

**Lambda** is the regularization parameter used in LASSO to control the strength of the penalty.

### Penalty factors

Two variables have been forced into the model (must-keep) with a penalty factor of zero. These are variables that must stay in the model regardless of the regularization. Please note that "variables" in this case refers to the columns of the model matrix.
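To see what these model-matrix columns look like, the dummy expansion of a factor can be inspected directly with base R, using the sample data created above:

```r
# A factor with three levels (E, H, M) expands to two dummy columns
# (stypeH and stypeM); the first level, E, is absorbed into the
# intercept as the reference category
head(model.matrix(~ stype, data = api_sample))
```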
In our example, **stype** was stated as a must-have variable. This variable is a factor with three levels, where the first level has been removed in the modeling phase to avoid the dummy variable trap. The remaining 43 "variables" are subject to regularization and can potentially be shrunk to zero depending on the strength of the lambda penalty.

### Goodness-of-fit

This section describes the model fit at the chosen `lambda.min` (the lambda that minimizes cross-validation error). Note that the set of goodness-of-fit measures differs between binary and continuous outcomes.

#### Metrics

For **response**:

- **Cross-validated error**: 0.6052 with standard deviation 0.0135. This indicates the average prediction error across folds during cross-validation.
- **Deviance explained**: 0.1303. The deviance explained is a measure of the goodness-of-fit, indicating that the model explains some of the variance in the **response** variable.
- **AUC (Area Under the Curve)**: 0.8122. AUC is a measure of the model's ability to distinguish between the two classes (response = 0 or 1). The obtained value suggests that the model has good discriminatory power.
- **Accuracy**: 0.8857. The model correctly classifies 88.57% of the observations, which is quite high, although it should be noted that the response indicator is imbalanced in our toy example.
- **Brier score**: 0.0892. This is a measure of the accuracy of probabilistic predictions. A lower Brier score indicates better calibration of predicted probabilities.

For **enroll**:

- **Cross-validated error**: 188,129.5 with standard deviation 13,594.59. This is the model's average prediction error across folds for the **enroll** outcome.
- **R-squared**: 0.4676. This means the model explains 46.76% of the variance in the **enroll** outcome. R-squared is a measure of how well the independent variables explain the variation in the dependent variable.
- **MSE (Mean Squared Error)**: 171,872.7. A lower MSE indicates a better fit.
- **RMSE (Root Mean Squared Error)**: 414.5753. The RMSE is the square root of the MSE and represents the typical size of the prediction errors on the original scale; lower values indicate better predictive accuracy.
- **MAE (Mean Absolute Error)**: 305.8876. This is the average magnitude of the prediction errors. The lower the MAE, the better.

#### Coefficients at Lambda Min

This section displays the values of the coefficients (betas) at the selected `lambda.min`. These coefficients represent the effect of each variable in the model at the chosen regularization strength. For **response**, variables like **api00_bin0** and **growth_bin1** have non-zero coefficients, suggesting a strong association with **response**. For **enroll**, several variables have large coefficients, including **stypeH**, **stypeM**, and **comp.imp_bin1:stypeH**, which appear to strongly influence the prediction of **enroll**. Variables with a zero coefficient have been regularized out of the model, meaning they didn't contribute to the prediction at the chosen `lambda.min`.

### Interaction metadata

This section gives an overview of the results concerning interaction terms and the full formula tested.

**Summary**

This output demonstrates how LASSO performs both feature selection and regularization to reduce model complexity and improve generalization. The model for **response** performs quite well, whereas the **enroll** model has room for improvement in terms of predictive power.

# Assessing Auxiliary Vector Calibration and Diagnostics

## Choice of auxiliary vector to examine

Next, we will assess the calibration of auxiliary variables using the `assess_aux_vector()` function. This function provides a comprehensive assessment of auxiliary vector calibration performance for a given survey design. The function can also calibrate weights based on a specified calibration formula and population totals. The function returns a detailed list of diagnostics, which can help to evaluate and improve survey weights.
Based on the results from the LASSO modeling, together with a domain-expertise judgment of the statistical relationships, data sufficiency, and what usually works in survey analysis and calibration, the following auxiliary vector is assessed:

- **Main effects**:\
  **stype (all levels), growth_bin, api00_bin, ell_bin, comp.imp_bin, hsg_bin**
- **Interaction effects**:\
  **comp.imp_bin × stype, api00_bin × stype, hsg_bin × stype, api00_bin × growth_bin**

## Preparations

To make the example as realistic as possible, the response set from the sample file is analyzed (since, in practice, we don't have survey variable information for non-respondents).

```{r, create_resp-set}
api_resp <- api_sample[api_sample$response == "1", ]
```

We will also create a survey design object to pass to the assessment function.

```{r, svydesign}
design <- survey::svydesign(
  ids = ~1,
  data = api_resp,
  strata = ~stype,
  weights = ~SamplingWeight
)
```

Since we are considering register variable diagnostics, register variable means need to be calculated.

```{r, cal_reg_means}
register_pop_means <- list(
  total = list(sch.wide_bin = mean(api_pop$sch.wide_bin == "1")),
  by_domain = list(
    stype = tapply(
      api_pop$sch.wide_bin, api_pop$stype,
      function(x) mean(x == "1")
    ),
    awards_bin = tapply(
      api_pop$sch.wide_bin, api_pop$awards_bin,
      function(x) mean(x == "1")
    )
  )
)
```

We also need to calculate population totals for the weight calibration to be possible. This can be done using the package function `generate_population_totals`, which requires a calibration formula as input.
```{r, calib_totals}
## --- Define the calibration formula ---
calibration_formula <- ~ stype + growth_bin + api00_bin + ell_bin +
  comp.imp_bin + hsg_bin + api00_bin:stype + hsg_bin:stype +
  comp.imp_bin:stype + api00_bin:growth_bin

## --- Calculate population totals with a single terms object built on the POPULATION ---
pop_totals <- generate_population_totals(
  population_df = api_pop,
  calibration_formula
)
```

## Calling the assessment function

We use the objects created in the preparations phase and pass them as arguments to the assessment function. Please note that the assessment function can be used to compare different auxiliary vectors by passing different calibration formulae (for example, one with and one without interaction terms) and matching population totals.

```{r assess_vec}
result_diagnostics <- assess_aux_vector(
  design = design,
  df = api_resp,
  already_calibrated = FALSE,
  calibration_formula = calibration_formula,
  calibration_pop_totals = pop_totals$population_totals,
  register_vars = c("sch.wide_bin"),
  register_pop_means = register_pop_means,
  survey_vars = c("enroll", "full_bin"),
  domain_vars = c("stype", "awards_bin"),
  diagnostics = c(
    "weight_variation",
    "register_diagnostics",
    "survey_diagnostics"
  ),
  verbose = FALSE
)

## --- Display output ---
result_diagnostics
```

## Understanding the Output

The function `assess_aux_vector()` produces a structured summary of diagnostics that help you understand how your calibration or auxiliary vector adjustment performed. The output is organized into three main sections:

### Weight variation metrics

This block describes how the analysis weights are distributed. Stable, well-behaved weights improve the reliability of estimates.

- **min / max / median / mean / sd / range**: Basic distribution of the weights.\
  *Here:* weights range from **3.18** to **38.64** with mean **13.32** and sd **10.89**, a relatively wide spread.
- **coefficient_of_variation (CV)** = sd / mean. Rules of thumb: CV ≲ 0.5 is comfortable; CV \> 1 suggests heavy dispersion.\
  *Here:* **0.82**, notable dispersion but not extreme.
- **gini_index** (0–1): inequality of the weights (0 = perfectly equal).\
  *Here:* **0.416**, moderate inequality.
- **entropy**: higher values indicate more uniform weights (units depend on the log base).\
  *Here:* **5.84**.
- **skewness / kurtosis**: shape diagnostics (skewness \> 0 means a long right tail; kurtosis \< 0 is flatter than normal).\
  *Here:* **skewness 0.81** (some high-weight tail), **kurtosis −1.11** (flatter).
- **bottom_1pct / top_1pct**: extreme percentiles for a quick outlier check.\
  *Here:* 1st percentile **3.37**, 99th percentile **38.64**.

### Register variable diagnostics

For each *register variable* supplied with population means, the function reports design-based estimates, uncertainty, and bias relative to the benchmark. Reported statistics per variable (and per domain, if requested):

- **Mean**: weighted sample estimate.
- **SE**: standard error of the estimate.
- **RSE** = SE / Mean (relative standard error).
- **Bias** = Weighted mean − Population mean.
- **MSE**: mean squared error (Bias² + Var).
- **p(Bias)**: two-sided test of Bias = 0 (large values imply no detectable bias).

Based on the output, the calibrated estimates match the register means closely (no evidence of significant bias). For example, at the total level, the bias of the estimated mean of **sch.wide_bin** is −0.0027 with a p-value of 0.849, which indicates that the chosen auxiliary vector consists of valuable auxiliary variables. Similar results are shown at the domain level. As a test, the domain **awards_bin = 1** was included, in which all rows have **sch.wide_bin = 1**; the output mean estimate of exactly 1.000 is consistent with the given data structure.

### Survey variable diagnostics

The assessment function provides design-based estimates and precision for the survey variable estimates. Bias/MSE/p-values are NA because there is no benchmark to compare against.
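The design-based estimates and RSEs in this section can also be reproduced with the `survey` package itself. The sketch below computes the weighted mean and relative SE of **enroll** on the uncalibrated design object created earlier, so the numbers will differ somewhat from the calibrated output above:

```r
# Weighted mean and relative standard error (RSE = SE / estimate)
# on the uncalibrated design created in the preparations step
est <- survey::svymean(~enroll, design, na.rm = TRUE)
rse <- as.numeric(survey::SE(est)) / as.numeric(coef(est))
est
rse
```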
Overall mean enrollment was 603.6 students (RSE = 2.1%), and the proportion of schools with a high share of fully qualified teachers (**full_bin** = 1) was 0.516 (RSE = 5.5%). Domain-level RSEs generally remained below 10%, supporting reliable comparisons across *stype* and *awards_bin*.

# Conclusion

The `auxvecLASSO` package is designed to help survey practitioners treat auxiliary variable selection as a statistical learning problem. In this vignette, we walked through two key components of the package:

- **Variable selection with LASSO** (`select_auxiliary_variables_lasso_cv()`), which identifies promising auxiliary variables (and interactions) for modeling survey outcomes.
- **Calibration diagnostics** (`assess_aux_vector()`), which evaluates how a chosen auxiliary vector performs in practice, using metrics on weight distribution, register variable alignment, and survey estimate precision.

Together, these tools provide a workflow for:

1. generating candidate auxiliary vectors, and
2. checking whether the resulting weights behave sensibly and lead to accurate, stable estimates.

While this vignette illustrated the functions on a simple dataset, the same approach extends to larger and more complex surveys. Readers are encouraged to adapt the demonstrated steps to their own data, paying special attention to the diagnostics: stable weights, low bias on register variables, and acceptable precision on survey estimates are all signs that the chosen auxiliary vector is working well. For further details, advanced options, and examples, consult the package documentation.