---
title: "Cram Bandit Helpers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cram Bandit Helpers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(cramR)
library(data.table)
```

# 🌟 What is this article about?

To use `cram_bandit()`, users must supply a matrix of **action selection probabilities** \( \pi_t(X_j, A_j) \), with one entry for each combination of policy update \( t \) and context \( j \) in the historical dataset.

While some environments log these probabilities directly, many contextual bandit libraries (such as [`contextual`](https://github.com/Nth-iteration-labs/contextual)) only store **policy parameters** (e.g., regression coefficients) without explicit probability tracking.

This article explains how **Cram Bandit Helpers** reconstruct \( \pi_t(X_j, A_j) \) from these parameters for common policies:


| Policy Type     | Class Name                               |
|------------------|------------------------------------------|
| Epsilon-Greedy   | `BatchContextualEpsilonGreedyPolicy`     |
| LinUCB Disjoint with \( \varepsilon \)-greedy exploration           | `BatchLinUCBDisjointPolicyEpsilon`       |
| Thompson Sampling| `BatchContextualLinTSPolicy`             |


Both **theoretical formulas** and **practical code snippets** are provided.

---

# 🛠️ Policy parameters explained

When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm \(k\) maintains **summary statistics** (parameters) to estimate the expected reward:

- \( A_k \) is the **Gram matrix**:  
  \[
  A_k = X_k^T X_k
  \]
  where \(X_k\) is the matrix of feature vectors (contexts) for all rounds where arm \(k\) was selected.  
  ➔ **Interpretation**: \(A_k\) captures the amount of information (and correlation structure) about the features for arm \(k\). It plays the role of a "feature covariance matrix."

- \( b_k \) is the **response vector**:  
  \[
  b_k = X_k^T y_k
  \]
  where \(y_k\) are the observed rewards for arm \(k\).  
  ➔ **Interpretation**: \(b_k\) captures the relationship between the observed rewards and the contexts for arm \(k\).

These sufficient statistics allow the policy to compute the **Least Squares estimate** for the reward model:

\[
\theta_k = A_k^{-1} b_k
\]

where:

- \(\theta_k\) is the estimated coefficient vector that predicts the expected reward of arm \(k\) as a function of the context.

Thus:

- \(A_k\) tells us **how confident** we are about \(\theta_k\) (smaller eigenvalues of \(A_k\) imply more uncertainty).
- \(b_k\) provides the **observed signal** (reward-weighted context features) to fit \(\theta_k\).

The policy selects an action based on the \( \theta_k \) of each arm \( k \) and then observes the reward associated with this choice, which is used to update the policy parameters \( A_k \) and \( b_k \).
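
As a minimal sketch (not the internal code of `contextual` or `cramR`), here is how these sufficient statistics are typically maintained and used for one arm; the dimensions and values are illustrative:

```r
# Sufficient-statistic bookkeeping for a single arm (illustrative).
d <- 3                      # number of context features
A <- diag(d)                # A_k, commonly initialised to the identity
b <- rep(0, d)              # b_k, initialised to zero

# Point estimate of the reward model for this arm
theta <- solve(A, b)        # theta_k = A_k^{-1} b_k

# Predicted reward for a new context x
x <- c(1, 0.5, -0.2)
pred <- sum(x * theta)      # X_t^T theta_k

# After playing the arm and observing reward y, update the statistics
y <- 1
A <- A + x %*% t(x)         # A_k <- A_k + x x^T
b <- b + y * x              # b_k <- b_k + y x
```

Initialising \( A_k \) to the identity acts as a small ridge penalty and keeps the matrix invertible before any data have been observed.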

---

# ✨ Epsilon-Greedy Policy

### 🤖 Theoretical computation

In Epsilon-Greedy, with exploration rate \( \varepsilon \), the probability of selecting one of the best arms is:

\[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} \]

The probability of selecting an arm that is not among the best arms is:

\[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} \]

where:

- Best arms are those with maximal estimated rewards.
- \( K \) is the total number of available arms.

We define the least squares estimate as:

\[ \theta_k = A_k^{-1} b_k \quad \text{(Least Squares estimate)} \]

where:

- \( A_k \) is the Gram matrix \( X_k^T X_k \)
- \( b_k \) is the vector \( X_k^T Y_k \)

Best arms are identified via the estimated expected reward:

\[ \text{Expected reward} = X_t^T \theta_k \]
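
For concreteness, applying these formulas with \( \varepsilon = 0.1 \), \( K = 3 \) arms, and a single best arm gives:

```r
eps    <- 0.1
K      <- 3
n_best <- 1   # number of arms tied for the best estimated reward

p_best  <- (1 - eps) / n_best + eps / K   # 0.9 + 0.0333 = 0.9333
p_other <- eps / K                        # 0.0333
p_best + (K - 1) * p_other                # probabilities sum to 1
```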


### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_c_eps_greedy(eps, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes \( \theta_k \) for each arm
- Calculates expected rewards \( X_t^T \theta_k \)
- Identifies the best arms
- Applies the above formula
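
For intuition, the sketch below re-implements this logic for a single round. It illustrates the formula above and is not the package source; it assumes `A_list` and `b_list` are per-arm lists of Gram matrices and response vectors, as described earlier.

```r
# Illustrative sketch (not the cramR implementation) of the epsilon-greedy
# selection probability for a single round.
prob_eps_greedy_one <- function(eps, A_list, b_list, x, chosen_arm) {
  K <- length(A_list)
  # Estimated expected reward X_t^T theta_k for every arm
  rewards <- vapply(seq_len(K), function(k) {
    theta_k <- solve(A_list[[k]], b_list[[k]])
    sum(x * theta_k)
  }, numeric(1))
  best <- which(rewards == max(rewards))   # set of best arms (ties allowed)
  if (chosen_arm %in% best) {
    (1 - eps) / length(best) + eps / K
  } else {
    eps / K
  }
}

# Tiny example with two arms and a 2-dimensional context
A_list <- list(diag(2), 2 * diag(2))
b_list <- list(c(1, 0), c(0, 1))
prob_eps_greedy_one(eps = 0.1, A_list, b_list, x = c(1, 0.5), chosen_arm = 1)
#> arm 1 is the unique best arm, so this returns (1 - 0.1) / 1 + 0.1 / 2 = 0.95
```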


---

# 🔒 LinUCB Disjoint Policy with \( \varepsilon \)-Greedy

### 🤖 Theoretical computation

LinUCB selects arms based on **Upper Confidence Bounds (UCBs)**:

\[ \text{UCB}_k(X_t) = \mu_k(X_t) + \alpha \sigma_k(X_t) \]

where:

- \( \mu_k(X_t) = X_t^T \theta_k \)
- \( \sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t} \) measures uncertainty
- \( \alpha \) controls the exploration strength

The action probabilities follow the same structure as in Epsilon-Greedy, but with UCB scores in place of plain expected rewards; that is, the probability of selecting one of the best arms is:

\[ P(A_t | X_t) = (1 - \varepsilon) \times \frac{1}{\# \text{best arms}} + \varepsilon \times \frac{1}{K} \]

The probability of selecting an arm that is not among the best arms is:

\[ P(A_t | X_t) = \varepsilon \times \frac{1}{K} \]

where "best arms" are those with highest UCB scores.

### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_ucb_disjoint(alpha, eps, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes \( \theta_k \)
- Computes \( \mu_k(X_t) \) and \( \sigma_k(X_t) \)
- Identifies arms maximizing \( \text{UCB}_k(X_t) \)
- Applies the Epsilon-Greedy selection formula
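
As with Epsilon-Greedy, the snippet below is an illustrative single-round sketch of this computation rather than the package source, using the same per-arm `A_list` / `b_list` conventions:

```r
# Illustrative sketch (not the cramR implementation) of the UCB-based
# selection probability for a single round.
prob_ucb_eps_one <- function(alpha, eps, A_list, b_list, x, chosen_arm) {
  K <- length(A_list)
  ucb <- vapply(seq_len(K), function(k) {
    A_inv   <- solve(A_list[[k]])
    theta_k <- A_inv %*% b_list[[k]]
    mu      <- sum(x * theta_k)                    # X_t^T theta_k
    sigma   <- sqrt(drop(t(x) %*% A_inv %*% x))    # sqrt(X_t^T A_k^{-1} X_t)
    mu + alpha * sigma
  }, numeric(1))
  best <- which(ucb == max(ucb))                   # arms with the highest UCB
  if (chosen_arm %in% best) (1 - eps) / length(best) + eps / K else eps / K
}
```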

---

# 🤓 Thompson Sampling (LinTS)

### 🤖 Theoretical computation

In Thompson Sampling, a coefficient vector is drawn from each arm's posterior and the arm with the largest sampled expected reward is chosen. The probability that arm \( A_t \) is selected given context \( X_t \) is therefore:

\[ P(A_t | X_t) = \mathbb{P}\left( \theta_{A_t}^T X_t > \theta_{k}^T X_t \quad \forall k \neq A_t \right) \]

where \( \theta_k \sim \mathcal{N}(A_k^{-1} b_k, \sigma^2 A_k^{-1}) \).

This requires **computing a multivariate probability**, which we approximate via **adaptive numerical integration**.
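
For intuition, the same probability can be approximated by plain Monte Carlo sampling, as in the sketch below. This is an illustration only: the package helper uses adaptive numerical integration, not sampling, and `MASS::mvrnorm()` is used here just to generate the posterior draws.

```r
# Monte Carlo check of P(the chosen arm has the largest sampled reward).
# Illustration only; the cramR helper relies on adaptive numerical integration.
library(MASS)  # for mvrnorm()

prob_thompson_mc <- function(sigma, A_list, b_list, x, chosen_arm,
                             n_draws = 10000) {
  K <- length(A_list)
  # One column per arm: sampled rewards theta_k^T X_t with
  # theta_k ~ N(A_k^{-1} b_k, sigma^2 A_k^{-1})
  sampled <- sapply(seq_len(K), function(k) {
    A_inv <- solve(A_list[[k]])
    draws <- mvrnorm(n_draws, mu = drop(A_inv %*% b_list[[k]]),
                     Sigma = sigma^2 * A_inv)
    draws %*% x
  })
  mean(apply(sampled, 1, which.max) == chosen_arm)
}
```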

### 📊 Code helper

In `cramR`, this is implemented by:

```r
get_proba_thompson(sigma, A_list, b_list, contexts, chosen_arms)
```

This function:

- Computes posterior means and variances
- Integrates over the space where chosen arm \( A_t \) has the highest sampled reward
- Returns clipped probabilities for numerical stability


---

# πŸ‘¨β€πŸ’» Practical Workflow

When using your bandit policy in practice:

1. Record action choices, contexts, and policy parameters (e.g., \( A \), \( b \))
2. Calculate the action selection probabilities. If your policy is among the ones presented above, you can rely on our helper functions to build \( \pi \).
3. Feed `pi`, `arm`, and `reward` into `cram_bandit()` to evaluate your policy (see the sketch below).
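
A hypothetical end-to-end sketch is shown below. The objects `A_hist` and `b_hist` (per-update snapshots of the per-arm statistics), the exploration rate, and the orientation of `pi` are illustrative assumptions, and it is assumed that the helper returns one probability per logged round; check the `cram_bandit()` documentation for the exact dimensions it expects.

```r
# Hypothetical sketch: reconstruct pi from logged parameters, then evaluate.
# A_hist[[t]], b_hist[[t]]: per-arm statistics after policy update t (assumed logging format)
# contexts: one row per round; chosen_arms, rewards: one entry per round

n_updates <- length(A_hist)
pi <- matrix(NA_real_, nrow = n_updates, ncol = nrow(contexts))
for (t in seq_len(n_updates)) {
  pi[t, ] <- get_proba_c_eps_greedy(
    eps         = 0.1,           # illustrative exploration rate
    A_list      = A_hist[[t]],
    b_list      = b_hist[[t]],
    contexts    = contexts,
    chosen_arms = chosen_arms
  )
}

result <- cram_bandit(pi = pi, arm = chosen_arms, reward = rewards)
```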

---

# 🧪 Estimand Calculation in `cram_bandit_sim()`

The following only concerns the simulation framework we implemented for benchmarking purposes.

Once the policies are reconstructed, we compute their true expected value, referred to as the estimand, by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.

This is done via:

```r
compute_estimand(data_group, list_betas, policy, policy_name, batch_size, bandit)
```

Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.


# 📂 Useful Links

- [`contextual`](https://github.com/Nth-iteration-labs/contextual) package: original framework
- `cram_bandit()`: Cram evaluation for contextual bandits
- `cram_bandit_sim()`: Full simulation engine with automatic pi estimation

---

# 🌟 Acknowledgments

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in [`contextual`](https://github.com/Nth-iteration-labs/contextual), while enabling reproducible Cram-based evaluation.


```{r cleanup-autograph, include=FALSE}
autograph_files <- list.files(tempdir(), pattern = "^__autograph_generated_file.*\\.py$", full.names = TRUE)
if (length(autograph_files) > 0) {
  try(unlink(autograph_files, recursive = TRUE, force = TRUE), silent = TRUE)
}
```