---
title: "Model Selection in Two-Way Fixed Effects Models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Model Selection Between TWFE and ETWFE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 7,
  fig.height = 4.5
)
```

## Introduction

The two-way fixed effects (TWFE) model has long been a standard approach to staggered difference-in-differences (DID) designs. However, TWFE can produce biased estimates of the average treatment effect (ATT) if treatment effects are heterogeneous across time or treatment cohorts.

Wooldridge (2025) shows that an "extended" TWFE (ETWFE) model can eliminate bias due to heterogeneity. The ETWFE model lets the treatment interact with time and cohort dummies. It also compares newly treated units only to units that have not yet been treated, or will never be treated – eliminating "forbidden comparisons" with units that have already been treated.

When heterogeneity is substantial and the sample size is adequate to detect it, ETWFE can produce highly informative estimates than TWFE. ETWFE estimates are less biased and can be more precise as well. Inspection of interactions can reveal how effects vary across cohorts and time.

Under some circumstances, however, ETWFE can also make estimates in worse. If there is no heterogeneity, or if the heterogeneity is too mild and the sample is too small to detect it reliably, then two problems may arise:

1.  ETWFE estimates of the ATT may be more variable than TWFE estimates—with larger standard errors, and larger mean squared error if the increase in variance exceeds any reduction in bias. The increase in variance comes from ETWFE using a smaller analytic sample and adding interactions that may not be necessary.

2.  The interaction effects may be mostly or entirely noise rather than signal. That is, most of the apparent variance across time or cohorts may be estimation error rather than true heterogeneity.

Under these circumstances, the simpler TWFE model may be preferable.

The `selectTWFE` package addresses these issues in three ways

1.  `selectTWFE` chooses between the TWFE and ETWFE models using Cochran's (1954) **Q** statistic, which tests whether treatment effects vary significantly across cohort-time cells. `selectTWFE` chooses ETWFE when heterogeneity is significant and selects TWFE if the null hypothesis of homogeneity cannot be rejected.

2.  `selectTWFE` estimates how much of the effects variance across time or cohorts represents true heterogeneity rather than estimation error. It estimats this using the heterogeneity fraction statistic, $I^2$ (Higgins et al., 2003).

3.  To reduce noise, `selectTWFE` uses empirical Bayes to shrink time- and cohort-specific effects toward the average treatment effect (Efron & Morris, 1975). It also widens confidence intervals around the time- and cohort-specific effects, using the Bonferroni correction to account for multiple estimates.

## Installation

```{r install, eval=FALSE}
# Install dependencies
install.packages(c("etwfe", "fixest", "ggplot2", "scales"))

# Install selectTWFE (once on CRAN)
install.packages("selectTWFE")

# Or from GitHub
devtools::install_github("your-username/selectTWFE")
```

## Example 1: Real Data (mpdta)

We use the `mpdta` dataset included in `selectTWFE`, which contains log teen employment in 500 US counties from 2003-2007, with staggered state-level minimum wage increases.

```{r mpdta, message=FALSE, warning=FALSE}
library(selectTWFE)

data("mpdta", package = "selectTWFE")

result_mpdta <- select_twfe(
  fml    = lemp ~ lpop,
  tvar   = year,
  gvar   = first.treat,
  data   = mpdta,
  ivar   = countyreal,
  cgroup = "never",
  vcov   = ~countyreal
)

print(result_mpdta)
```

The Cochran's Q test shows significant evidence of heterogeneity (p-value \< 0.05), so ETWFE is selected. Note that the ATT estimates from the two models are similar (-0.042 vs -0.037), but ETWFE has a lower standard error. The heterogeneity fraction $I^2 = 0.77$ indicates that roughly three-quarters of the variance in cohort-time estimates reflects true heterogeneity rather than sampling noise.

```{r mpdta-plot, message=FALSE, warning=FALSE}
plot(result_mpdta)
```

The event study suggests a growing negative effect over time. This is only visible in the 2004 and 2006 cohorts since the 2007 cohort has only one year post-treatment. Note that all cohort-time estimates are shrunk toward the ATT and have confidence intervals widened to adjust for multiple estimates.

## Example 2: Simulation --- Homogeneous Effects (TWFE Should Win)

When treatment effects are constant across cohorts and time periods, vanilla TWFE is correctly specified and ETWFE wastes degrees of freedom on interaction terms that are all equal. The Q test should find no significant heterogeneity and select TWFE.

```{r sim-homogeneous, message=FALSE, warning=FALSE}
set.seed(42)
n_units   <- 500
n_periods <- 6

# Assign cohorts: treated in period 3, 4, or 5; some never treated
cohort  <- sample(c(3, 4, 5, NA), n_units, replace = TRUE,
                  prob = c(.25, .25, .25, .25))
unit_fe <- rnorm(n_units, sd = 1)
time_fe <- rnorm(n_periods, sd = 0.5)

panel <- expand.grid(unit = 1:n_units, period = 1:n_periods)
panel$cohort  <- cohort[panel$unit]
panel$unit_fe <- unit_fe[panel$unit]
panel$time_fe <- time_fe[panel$period]
panel$treated <- !is.na(panel$cohort) & panel$period >= panel$cohort

# Homogeneous treatment effect = 1.0 for all cohorts and periods
panel$y    <- panel$unit_fe + panel$time_fe +
              1.0 * panel$treated + rnorm(nrow(panel), sd = 0.5)
panel$gvar <- ifelse(is.na(panel$cohort), Inf, panel$cohort)

result_homo <- select_twfe(
  fml    = y ~ 1,
  tvar   = period,
  gvar   = gvar,
  data   = panel,
  ivar   = unit,
  cgroup = "never",
  vcov   = ~unit
)

print(result_homo)
```

As expected, TWFE is selected. The Cochran's Q test finds no significant evidence of heterogeneity (p-value ≥ 0.05), since the true effects are homogeneous. TWFE is also more precise because the never-treated group is large, making the restriction to the smaller ETWFE comparison group costlier than any bias reduction benefit.

## Example 3: Simulation --- Substantial Heterogeneous Effects (ETWFE Should Win)

When treatment effects vary substantially across cohorts, TWFE is biased and the Q test detects significant heterogeneity, leading to selection of ETWFE.

```{r sim-heterogeneous, message=FALSE, warning=FALSE}
set.seed(42)

# Substantial heterogeneous treatment effects: 0.5, 1.0, 2.0 by cohort
effect_map <- c("3" = 0.5, "4" = 1.0, "5" = 2.0)
effects <- ifelse(
  is.na(panel$cohort), 0,
  effect_map[as.character(panel$cohort)]
)
panel$y <- panel$unit_fe + panel$time_fe +
           effects * panel$treated + rnorm(nrow(panel), sd = 0.5)

result_hetero <- select_twfe(
  fml    = y ~ 1,
  tvar   = period,
  gvar   = gvar,
  data   = panel,
  ivar   = unit,
  cgroup = "never",
  vcov   = ~unit
)

print(result_hetero)
plot(result_hetero)
```

The Cochran's Q test shows highly significant heterogeneity (small p-value), so ETWFE is selected. This is correct because the true effects vary substantially (0.5, 1.0, 2.0 by cohort), causing TWFE to be biased.

## Example 4: Simulation --- Very Mild Heterogeneity with Small Sample

When treatment effects differ only slightly across cohorts and the sample is small, the Q test may or may not detect heterogeneity depending on the noise level and realized cohort composition. We illustrate with very mild heterogeneity (cohort effects of 0.90, 1.0, and 1.10) in a small sample (30 units).

Note that, while the Q test selects the "wrong" model here, the selected TWFE model actually estimates the ATT more accurately than the "right" ETWFE model. The TWFE point estimate is closer to the true ATT of 1, and the TWFE has a smaller standard error as well. This illustrates the general point that in small samples a simpler model can produce better estimates even when it is not quite correct (Hastie, Tibshirani, and Friedman 2009; the bias-variance tradeoff).

```{r sim-mild, message=FALSE, warning=FALSE}
set.seed(42)
n_units_small <- 30  # Very small sample

# Assign cohorts and build panel with small sample
cohort_small  <- sample(c(3, 4, 5, NA), n_units_small, replace = TRUE,
                        prob = c(.25, .25, .25, .25))
unit_fe_small <- rnorm(n_units_small, sd = 1)
time_fe_small <- rnorm(n_periods, sd = 0.5)

panel_small <- expand.grid(unit = 1:n_units_small, period = 1:n_periods)
panel_small$cohort  <- cohort_small[panel_small$unit]
panel_small$unit_fe <- unit_fe_small[panel_small$unit]
panel_small$time_fe <- time_fe_small[panel_small$period]
panel_small$treated <- !is.na(panel_small$cohort) & panel_small$period >= panel_small$cohort

# Very mild heterogeneous treatment effects
effect_map_veryMild <- c("3" = 0.90, "4" = 1.0, "5" = 1.10)
effects_veryMild <- ifelse(
  is.na(panel_small$cohort), 0,
  effect_map_veryMild[as.character(panel_small$cohort)]
)
panel_small$y <- panel_small$unit_fe + panel_small$time_fe +
                 effects_veryMild * panel_small$treated + rnorm(nrow(panel_small), sd = 0.5)
panel_small$gvar <- ifelse(is.na(panel_small$cohort), Inf, panel_small$cohort)

result_mild <- select_twfe(
  fml    = y ~ 1,
  tvar   = period,
  gvar   = gvar,
  data   = panel_small,
  ivar   = unit,
  cgroup = "never",
  vcov   = ~unit
)

print(result_mild)
```

## Monte Carlo Validation

For a comprehensive Monte Carlo study that validates the Q criterion across many replications and sample sizes, see `paper/monte_carlo.R` in the package repository. That script runs a 3×2 factorial design (3 heterogeneity levels × 2 sample sizes) with 1,000 replications per cell. The results confirm that the Q test correctly detects heterogeneity: rejection rates are low under homogeneity and high under true heterogeneity. The results confirm that the Q test outperforms an alternative model-selection criterion that tries to minimize the MSE of the ATT.

## Appendix

This appendix describes in more detail the statistics used by `selectTWFE`.

### Cochran's Q Test

Cochran's Q statistic measures heterogeneity across cohort-time effects:

$$Q = \sum_k w_k(\hat{\delta}_k - \bar{\delta}_w)^2$$

where $w_k = 1/\hat{\sigma}_k^2$ are inverse-variance weights, $\hat{\delta}_k$ are the cohort-time effects, and $\bar{\delta}_w$ is the precision-weighted mean effect.

Under the null hypothesis of homogeneous effects, $Q \sim \chi^2_{k-1}$ where $k$ is the number of cohort-time cells.

**Decision rule:** - If p-value $< \alpha$ (default $\alpha = 0.05$): Evidence of significant heterogeneity → **Select ETWFE** - If p-value $\geq \alpha$: No significant heterogeneity → **Select TWFE**

Users can adjust the significance level via the `alpha` parameter (e.g., `alpha = 0.01` for conservative testing, `alpha = 0.10` for exploratory analysis).

### The Heterogeneity Fraction ($I^2$)

When ETWFE is selected, `selectTWFE` also reports the **heterogeneity fraction** $I^2$ (Higgins et al. 2003):

$$I^2 = \max\!\left(0,\; \frac{Q - (K-1)}{Q}\right)$$

where $Q$ is Cochran's heterogeneity statistic and $K$ is the number of cohort-time cells. $I^2$ ranges from 0 (all variance is sampling noise) to 1 (all variance is genuine heterogeneity).

### Empirical Bayes Shrinkage

When ETWFE is selected and `shrink_heterogeneity = TRUE` (the default), the cohort-time ATT estimates shown in the event study plot are shrunk toward the grand mean ATT using the empirical Bayes formula:

$$\hat{\theta}_i^{EB} = \bar{\theta} + \left(1 - \frac{\sigma^2_i}{\sigma^2_i + \hat{\tau}^2}\right)(\hat{\theta}_i - \bar{\theta})$$

Precisely estimated cells shrink least; noisy cells shrink most. This produces a more honest picture of treatment effect heterogeneity.

The standard errors of the cohort-time effects are also shrunk by a factor of $\sqrt{1 - B_i}$, where $B_i = \sigma^2_i / (\sigma^2_i + \hat\tau^2)$ is the shrinkage factor. This is the naive empirical Bayes posterior SD, $\sqrt{(1 - B_i)\sigma^2_i}$. The original (unshrunk) SEs are retained in the `se_raw` column of the returned `estimates_shrunk` data frame.

### Confidence Intervals on the Event Study Plot

The error bars on the event study plot are Bonferroni-adjusted 95% confidence intervals. With $K$ cohort-time cells displayed, each interval uses the critical value $z_{1 - 0.025/K}$ rather than $1.96$, so that the familywise probability of at least one interval failing to cover its target is at most 5% under a global null. The plot also shows a solid horizontal reference line at the estimated ATT.