Working with household survey microdata involves a great deal of repetitive processing: recoding categorical variables, building indicators, joining external data, and computing weighted estimates. Each researcher writes their own version of these transformations, and the code is rarely shared or documented in a way that others can reuse.
metasurvey addresses this problem by providing a metaprogramming layer on top of the survey package (Lumley 2004). Instead of writing ad hoc scripts, you build a pipeline of transformations that is recorded, documented, and reusable.

The pipeline has three levels:

- Steps (step_compute(), step_recode(), and related functions) that transform the microdata.
- Recipes that bundle steps so they can be shared and reapplied.
- Workflows that wrap estimators from the survey package (svymean, svytotal, svyby) and produce the final tables.

The package handles the survey design (stratification, clusters, replicate weights) automatically through the Survey object.
The user focuses on the substantive analysis; metasurvey takes care of
the infrastructure.
A Survey object groups the microdata together with
metadata about weights, edition, and survey type. We use a sample of
real microdata from Uruguay’s Encuesta Continua de Hogares
(ECH, Continuous Household Survey) 2023, published by the Instituto
Nacional de Estadística (INE).
The ECH is a rotating-panel household survey. Key variables used in this vignette include e26 (sex), e27 (age), POBPCOAC (labor force status), HT11 (household income), and W_ANO (the annual expansion weight).
library(metasurvey)
library(data.table)
# Load a sample of real ECH 2023 microdata (200 households, ~500 persons)
dt <- fread(system.file("extdata", "ech_2023_sample.csv", package = "metasurvey"))
# Create Survey object
svy <- Survey$new(
data = dt,
edition = "2023",
type = "ech",
engine = "data.table",
weight = add_weight(annual = "W_ANO")
)

The add_weight() function maps periodicity labels (e.g., "annual", "monthly") to the weight column names in the data. This lets the same recipe work across different survey editions.
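For a survey carrying more than one weight, several periodicities can be registered at once. A minimal sketch; W_MES is a hypothetical monthly-weight column, not present in this sample data:

```r
# Hypothetical: map two periodicities to their weight columns.
# W_MES is an illustrative column name, not part of this sample.
weights <- add_weight(annual = "W_ANO", monthly = "W_MES")
```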
You can inspect the data at any time:
head(get_data(svy), 3)
#> ID nper anio mes region dpto nom_dpto e26 e27 e30 e51_2
#> <int> <int> <int> <int> <int> <int> <char> <int> <int> <int> <int>
#> 1: 34561 1 2023 1 1 1 Montevideo 2 26 1 6
#> 2: 34561 2 2023 1 1 1 Montevideo 2 45 7 6
#> 3: 34561 3 2023 1 1 1 Montevideo 2 7 4 1
#> POBPCOAC SUBEMPLEO HT11 pobre06 W_ANO
#> <int> <int> <num> <int> <int>
#> 1: 2 0 55429.6 0 57
#> 2: 4 0 55429.6 0 57
#> 3: 1 0 55429.6 0 57

Steps are lazy by default: they are recorded but not executed until bake_steps() is called. This lets you build and inspect the complete pipeline before running it, and reuse the same recorded steps across survey editions.
Use step_compute() to create derived variables. Each computation is recorded in the step history along with its comment, so the transformation stays documented and reproducible.
svy <- step_compute(svy,
# Convert income to thousands for readability
ht11_thousands = HT11 / 1000,
# Create employment indicator following ILO definitions
employed = ifelse(POBPCOAC == 2, 1, 0),
# Working age population (14+ years, ECH standard)
working_age = ifelse(e27 >= 14, 1, 0),
comment = "Basic labor force indicators"
)

You can group computations using the .by parameter (similar to data.table):
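For instance, a grouped computation; a sketch, where mean_income_dept matches the grouped variable listed later in the recipe documentation:

```r
# Mean household income within each department.
# With .by, the aggregate is computed per group and repeated on every
# row of that group, as in a data.table grouped assignment.
svy <- step_compute(svy,
  mean_income_dept = mean(HT11, na.rm = TRUE),
  .by = "dpto",
  comment = "Mean household income by department"
)
```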
Use step_recode() to create categorical variables from
conditions. Conditions are evaluated top to bottom, and
the first match applies.
# Recode labor force status (POBPCOAC) into meaningful categories
svy <- step_recode(svy, labor_status,
POBPCOAC == 2 ~ "Employed",
POBPCOAC %in% 3:5 ~ "Unemployed",
POBPCOAC %in% 6:10 ~ "Inactive",
.default = "Not classified",
comment = "Labor force status - ILO standard"
)
# Create standard age groups for labor statistics
svy <- step_recode(svy, age_group,
e27 < 25 ~ "Youth (14-24)",
e27 < 45 ~ "Adult (25-44)",
e27 < 65 ~ "Mature (45-64)",
.default = "Elderly (65+)",
.to_factor = TRUE, # Convert to factor
ordered = TRUE, # Ordered factor
comment = "Age groups for labor analysis"
)
# Recode sex into descriptive labels
svy <- step_recode(svy, gender,
e26 == 1 ~ "Male",
e26 == 2 ~ "Female",
.default = "Other",
comment = "Gender classification"
)

Use step_filter() to subset rows based on logical conditions. Multiple conditions are combined with AND. Like other steps, filters are lazy by default.
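A sketch restricting the sample to the working-age population; the condition mirrors the working_age indicator defined above:

```r
# Keep only respondents aged 14 and over (ECH working-age standard)
svy <- step_filter(svy,
  e27 >= 14,
  comment = "Restrict to working-age population (14+)"
)
```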
Rename variables for clarity or consistency:
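A sketch, assuming a step_rename() step that takes new = old pairs; the baked output below shows e26 and e27 under the names sex_code and age:

```r
# Give questionnaire codes descriptive names
svy <- step_rename(svy,
  sex_code = e26,
  age = e27,
  comment = "Descriptive names for sex and age"
)
```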
Remove variables that are no longer needed:
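A sketch, assuming a step_remove() step. It is not applied to svy here, since later examples still print these columns:

```r
# Illustrative only: drop questionnaire items not needed for analysis
svy_slim <- step_remove(svy,
  e30, e51_2,
  comment = "Drop unused questionnaire items"
)
```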
Use step_join() to merge in external reference data. This is useful for adding reference information such as price deflators, poverty lines, or geographic classifications.
Note that real ECH microdata already includes nom_dpto
and region. Here we demonstrate a join with a poverty-line
lookup as an example:
# Poverty lines by region (illustrative values in UYU, 2023)
poverty_lines <- data.table(
region = 1:3,
poverty_line = c(19000, 12500, 11000),
region_name = c("Montevideo", "Interior loc. >= 5000", "Interior loc. < 5000")
)
svy <- step_join(svy,
poverty_lines,
by = "region",
type = "left",
comment = "Add poverty lines by region"
)

Call bake_steps() to execute all pending transformations:
svy <- bake_steps(svy)
head(get_data(svy), 3)
#> region dpto ID nper anio mes nom_dpto sex_code age e30 e51_2
#> <int> <int> <int> <int> <int> <int> <char> <int> <int> <int> <int>
#> 1: 1 1 34561 1 2023 1 Montevideo 2 26 1 6
#> 2: 1 1 34561 2 2023 1 Montevideo 2 45 7 6
#> 3: 1 1 34678 1 2023 1 Montevideo 1 48 1 6
#> POBPCOAC SUBEMPLEO HT11 pobre06 W_ANO ht11_thousands employed
#> <int> <int> <num> <int> <int> <num> <num>
#> 1: 2 0 55429.6 0 57 55.4296 1
#> 2: 4 0 55429.6 0 57 55.4296 0
#> 3: 2 0 152958.8 0 54 152.9588 1
#> labor_status age_group gender poverty_line region_name poverty_line.y
#> <char> <fctr> <char> <num> <char> <num>
#> 1: Employed Mature (45-64) Female 19000 Montevideo 19000
#> 2: Unemployed Mature (45-64) Female 19000 Montevideo 19000
#> 3: Employed Mature (45-64) Male 19000 Montevideo 19000
#> region_name.y
#> <char>
#> 1: Montevideo
#> 2: Montevideo
#> 3: Montevideo

The step history is preserved for documentation and reproducibility:
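For example, the recorded steps can be listed with get_steps(); a sketch, assuming each Step object exposes a comment field:

```r
# Each Step records the operation and the comment supplied by the user
steps <- get_steps(svy)
length(steps)
sapply(steps, function(s) s$comment)
```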
You can visualize the transformation pipeline as a directed graph. The interactive DAG is not rendered in this vignette to keep the package size small. Run view_graph() in your R session to get an interactive graph showing each step as a node, with edges tracing the execution order and the variables each step depends on.
Once the data is prepared, use workflow() to compute
survey estimates. The function wraps estimators from the
survey package (Lumley 2004) and returns tidy results
with standard errors and coefficients of variation.
Important: The survey object must be passed inside a
list().
# Estimate mean household income
result <- workflow(
list(svy),
survey::svymean(~HT11, na.rm = TRUE),
estimation_type = "annual"
)
result
#> stat value se cv confint_lower
#> <char> <num> <num> <num> <num>
#> 1: survey::svymean: HT11 110389.3 4076.154 0.03692528 102400.1
#> confint_upper
#> <num>
#> 1: 118378.4

The output includes:

- value: Point estimate
- se: Standard error
- cv: Coefficient of variation
- confint_lower, confint_upper: Confidence interval bounds
- stat: Estimator and variable name (including the factor level for categorical variables)

You can compute several statistics in a single call:
results <- workflow(
list(svy),
survey::svymean(~HT11, na.rm = TRUE),
survey::svytotal(~employed, na.rm = TRUE),
survey::svymean(~labor_status, na.rm = TRUE),
estimation_type = "annual"
)
results
#> stat value se cv
#> <char> <num> <num> <num>
#> 1: survey::svymean: HT11 1.103893e+05 4.076154e+03 0.03692528
#> 2: survey::svytotal: employed 1.426200e+04 6.756206e+02 0.04737208
#> 3: survey::svymean: labor_statusEmployed 5.551576e-01 2.610113e-02 0.04701571
#> 4: survey::svymean: labor_statusInactive 3.860646e-01 2.551750e-02 0.06609646
#> 5: survey::svymean: labor_statusUnemployed 5.877773e-02 1.309115e-02 0.22272291
#> confint_lower confint_upper
#> <num> <num>
#> 1: 1.024001e+05 1.183784e+05
#> 2: 1.293781e+04 1.558619e+04
#> 3: 5.040004e-01 6.063149e-01
#> 4: 3.360512e-01 4.360780e-01
#> 5: 3.311956e-02 8.443591e-02

Compute estimates for subpopulations using survey::svyby():
# Mean income by gender
income_by_gender <- workflow(
list(svy),
survey::svyby(~HT11, ~gender, survey::svymean, na.rm = TRUE),
estimation_type = "annual"
)
income_by_gender
#> stat value se cv
#> <char> <num> <num> <num>
#> 1: survey::svyby: HT11 [gender=Female] 111107.9 5955.676 0.05360263
#> 2: survey::svyby: HT11 [gender=Male] 109608.9 5519.291 0.05035439
#> confint_lower confint_upper gender
#> <num> <num> <char>
#> 1: 99434.99 122780.8 Female
#> 2: 98791.31 120426.5 Male

The coefficient of variation (CV) measures the reliability of estimates. A lower CV indicates more precise estimates. Following the INE Uruguay guidelines (Instituto Nacional de Estadística (INE) 2023):
| CV range | Quality category | Recommendation |
|---|---|---|
| < 5% | Excellent | Use without restrictions |
| 5%–10% | Very good | Use with confidence |
| 10%–15% | Good | Use for most purposes |
| 15%–25% | Acceptable | Use with caution, noting limitations |
| 25%–35% | Poor | Use only for general trends |
| >= 35% | Unreliable | Do not publish |
Use evaluate_cv() to classify estimate quality:
# Check quality of mean income estimate
cv_percentage <- results$cv[1] * 100
quality <- evaluate_cv(cv_percentage)
cat("CV:", round(cv_percentage, 2), "%\n")
#> CV: 3.69 %
cat("Quality:", quality, "\n")
#> Quality: Excellent

For official statistics, always report the point estimate together with its standard error (or confidence interval), the CV, and the resulting quality category.
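A quality label can be attached to every row of a workflow result; a sketch using the cv column, evaluate_cv(), and data.table assignment syntax:

```r
# Label each estimate with its INE quality category.
# evaluate_cv() is applied row by row via sapply, since it takes a
# single CV percentage at a time.
results[, quality := sapply(cv * 100, evaluate_cv)]
results[, .(stat, cv, quality)]
```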
Recipes bundle transformation steps for reproducibility and sharing. Once you have developed a working pipeline, you can convert it into a recipe that can be:
Create a recipe from the steps you have developed:
# Convert current steps to a recipe
labor_recipe <- steps_to_recipe(
name = "ECH Labor Force Indicators",
user = "National Statistics Office",
svy = svy,
description = paste(
"Standard labor force indicators following ILO definitions.",
"Creates employment status, age groups, and gender classifications."
),
steps = get_steps(svy),
topic = "labor_statistics"
)
class(labor_recipe)
#> [1] "Recipe" "R6"
labor_recipe$name
#> [1] "ECH Labor Force Indicators"

Or you can define a recipe from scratch:
minimal_recipe <- recipe(
name = "Basic Demographics - ECH",
user = "analyst",
svy = survey_empty(type = "ech", edition = "2023"),
description = "Basic demographic recoding for ECH microdata",
topic = "demographics",
# Define steps inline
step_recode(
gender,
e26 == 1 ~ "Male",
e26 == 2 ~ "Female",
.default = "Other"
),
step_recode(
age_group,
e27 < 14 ~ "Child",
e27 < 65 ~ "Adult",
.default = "Senior"
)
)

Once published, anyone can retrieve a recipe by ID and apply it to their data:
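A sketch of the retrieval side; the function names and arguments (get_recipe(), add_recipe(), bake_recipes()) are assumptions about the recipe API, so check the package reference if they differ:

```r
# Names are illustrative of the recipe API; consult the reference.
rec <- get_recipe(
  svy_type = "ech",
  svy_edition = "2023",
  topic = "labor_statistics"
)
svy2 <- add_recipe(svy, rec)
svy2 <- bake_recipes(svy2)
```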
Recipes automatically document their transformations:
doc <- labor_recipe$doc()
names(doc)
#> [1] "meta" "input_variables" "output_variables" "pipeline"
# Input variables required
doc$input_variables
#> [1] "HT11" "POBPCOAC" "e27" "e26" "region"
# Output variables created
doc$output_variables
#> [1] "ht11_thousands" "employed" "working_age" "mean_income_dept"
#> [5] "labor_status" "age_group" "gender" "mapping"

metasurvey provides global settings that can be adjusted to suit your workflow:
# Check current lazy-processing setting
lazy_default() # TRUE = steps recorded but not executed immediately
#> [1] TRUE
# Check data-copy behavior
use_copy_default() # TRUE = operate on copies (safer but slower)
#> [1] TRUE
# View available computation engines
show_engines() # "data.table", "dplyr", etc.
#> [1] "data.table" "tidyverse" "dplyr"

You can modify settings for the current session:
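The setter counterparts are sketched below; the names mirror the getters above but are assumptions, so verify them in the package reference:

```r
# Execute steps immediately instead of recording them (assumed setter)
set_lazy_processing(FALSE)
# Modify the underlying data in place, faster but less safe (assumed setter)
set_use_copy(FALSE)
# Switch the computation engine
set_engine("dplyr")
```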
The sample data used in this vignette comes from the Encuesta Continua de Hogares (ECH) 2023, published by Uruguay's Instituto Nacional de Estadística (INE). The full microdata is available from the INE website.
The ech
package by Gabriela Mathieu and Richard Detomasi was an important
inspiration for metasurvey. While ech provides ready-to-use
functions for computing socioeconomic indicators, metasurvey takes a
different approach: it provides a metaprogramming layer that lets users
define, share, and reproduce their own processing pipelines.
Now that you understand the basics, you can explore these guides:
- workflow(), RecipeWorkflow, and publishable estimates
- RotativePanelSurvey and PoolSurvey