This vignette accompanies the deprecation of rMIDAS. Existing projects can keep using rMIDAS, but new development should move to rMIDAS2. The source repository for the successor package is https://github.com/MIDASverse/rMIDAS2.
rMIDAS2 is the successor to rMIDAS. It re-implements the MIDAS multiple imputation algorithm with several improvements:
| | rMIDAS | rMIDAS2 |
|---|---|---|
| Backend | TensorFlow (Python, via reticulate) | PyTorch (Python, via local HTTP API) |
| Runtime R dependency on reticulate | Yes | No |
| Preprocessing | Manual (convert()) | Automatic |
| Python versions | 3.6–3.10 | 3.9+ |
| TensorFlow required | Yes (< 2.12) | No |
The API is deliberately simpler: most pipelines that required four function calls in rMIDAS need just one or two in rMIDAS2.
rMIDAS required configuring a reticulate Python environment with TensorFlow:
# --- rMIDAS ---
library(rMIDAS)
# Python environment configured automatically on first load,
# or manually via set_python_env()

rMIDAS2 uses a standalone Python server – no reticulate needed at runtime.
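The rMIDAS2 setup step might look roughly like the sketch below. It assumes that install_backend() and stop_server(), both named in the summary table at the end of this vignette, can be called without arguments. The interactive() guard and the backend_installed flag are additions for this sketch only, so the snippet does nothing when the package is absent or when run from a script.

```r
# --- rMIDAS2 (sketch) ---
# Assumption: install_backend() takes no required arguments.
backend_installed <- FALSE

# Guard added for this sketch: only run in an interactive session
# where the rMIDAS2 package is available.
if (interactive() && requireNamespace("rMIDAS2", quietly = TRUE)) {
  library(rMIDAS2)
  install_backend()            # set up the Python environment for the server
  backend_installed <- TRUE
  # When you have finished imputing, shut the server down with stop_server().
}
```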
rMIDAS required explicit preprocessing with convert(), where you had to specify which columns were binary and which were categorical:
# --- rMIDAS ---
data(adult)
adult_conv <- convert(adult,
                      bin_cols = c("income"),
                      cat_cols = c("workclass", "marital_status"),
                      minmax_scale = TRUE)

rMIDAS2 detects column types automatically – just pass your data frame directly to midas_fit() or midas().
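This vignette does not describe how rMIDAS2 decides on column types. To give a feel for what automatic detection replaces, here is one plausible heuristic in base R. The helper guess_col_type() and the toy data are hypothetical and are not rMIDAS2's actual logic.

```r
# Hypothetical helper: guesses a column's type the way an automatic
# preprocessor might. This is NOT rMIDAS2's actual detection logic.
guess_col_type <- function(x) {
  n_unique <- length(unique(x[!is.na(x)]))  # distinct observed values
  if (n_unique <= 2) {
    "binary"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "categorical"
  }
}

# Toy data with missing values, loosely modelled on the adult columns.
toy <- data.frame(
  age       = c(39, 50, NA, 53),
  workclass = c("Private", "State-gov", "Self-emp", NA),
  income    = c("<=50K", ">50K", NA, "<=50K"),
  stringsAsFactors = FALSE
)

col_types <- vapply(toy, guess_col_type, character(1))
print(col_types)
```

With rMIDAS, the equivalent information had to be supplied by hand through bin_cols and cat_cols.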
rMIDAS used train():
# --- rMIDAS ---
mid <- train(adult_conv,
             training_epochs = 20L,
             layer_structure = c(256, 256, 256),
             input_drop = 0.8,
             learn_rate = 0.0004,
             seed = 89L)

rMIDAS2 uses midas_fit() (or the all-in-one midas()):
# --- rMIDAS2 ---
fit <- midas_fit(adult,
                 epochs = 20L,
                 hidden_layers = c(256L, 128L, 64L),
                 corrupt_rate = 0.8,
                 lr = 0.001,
                 seed = 89L)

Parameter name changes:
| rMIDAS (train()) | rMIDAS2 (midas_fit()) | Notes |
|---|---|---|
| training_epochs | epochs | |
| layer_structure | hidden_layers | Default changed from 256-256-256 to 256-128-64 |
| input_drop | corrupt_rate | |
| learn_rate | lr | Default changed from 0.0004 to 0.001 |
| dropout_level | dropout_prob | |
| train_batch | batch_size | Default changed from 16 to 64 |
| cont_adj | num_adj | |
| softmax_adj | cat_adj | |
| binary_adj | bin_adj | |
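When porting many train() calls, the renames can be applied mechanically. The sketch below encodes the mapping from the parameter table as a named vector and renames the elements of an argument list. The helper translate_args() is hypothetical and belongs to neither package.

```r
# Old-to-new argument names, copied from the parameter table.
train_to_midas_fit <- c(
  training_epochs = "epochs",
  layer_structure = "hidden_layers",
  input_drop      = "corrupt_rate",
  learn_rate      = "lr",
  dropout_level   = "dropout_prob",
  train_batch     = "batch_size",
  cont_adj        = "num_adj",
  softmax_adj     = "cat_adj",
  binary_adj      = "bin_adj"
)

# Hypothetical helper: renames the elements of a list of train() arguments.
# Names without a mapping (e.g. seed) pass through unchanged.
translate_args <- function(args) {
  old <- names(args)
  hit <- old %in% names(train_to_midas_fit)
  names(args)[hit] <- unname(train_to_midas_fit[old[hit]])
  args
}

old_args <- list(training_epochs = 20L, input_drop = 0.8,
                 learn_rate = 0.0004, seed = 89L)
new_args <- translate_args(old_args)
print(names(new_args))
```

Renaming preserves behaviour only for arguments you set explicitly. Because the defaults for hidden_layers, lr, and batch_size changed, a call that relied on the old defaults needs those values passed explicitly to reproduce the rMIDAS configuration.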
rMIDAS used complete(), as in imps <- complete(mid, m = 5).
rMIDAS2 uses midas_transform():
# --- rMIDAS2 ---
imps <- midas_transform(fit, m = 10)
# Returns a list of 10 data.frames
head(imps[[1]])

Or skip midas_fit() + midas_transform() entirely and use the all-in-one midas(), which takes the data together with m and the training arguments: midas(data, m, epochs, ...).
The combine() interface has changed:
rMIDAS took a formula and a list of completed data frames, as in combine("income ~ age + hours_per_week", imps).
rMIDAS2 takes a model ID and an outcome variable name. Independent variables default to all other columns:
# --- rMIDAS2 ---
combine(fit, y = "income")
# Specify predictors explicitly:
combine(fit, y = "income", ind_vars = c("age", "hours_per_week"))

The output format is the same: a data frame with columns term, estimate, std.error, statistic, df, and p.value.
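To make those output columns concrete, here is a base R sketch of Rubin's rules for a single coefficient. It is not the code either package runs. The helper pool_rubin() is hypothetical, the input numbers are made up, and combine() may apply a different degrees-of-freedom adjustment than the classic formula used here.

```r
# Rubin's rules for one coefficient, given per-imputation estimates and
# standard errors. Base R only.
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)               # pooled point estimate
  w     <- mean(se^2)              # within-imputation variance
  b     <- var(est)                # between-imputation variance
  t_var <- w + (1 + 1 / m) * b     # total variance
  r     <- (1 + 1 / m) * b / w     # relative increase in variance
  df    <- (m - 1) * (1 + 1 / r)^2 # classic Rubin degrees of freedom
  stat  <- q_bar / sqrt(t_var)
  data.frame(estimate  = q_bar,
             std.error = sqrt(t_var),
             statistic = stat,
             df        = df,
             p.value   = 2 * pt(-abs(stat), df))
}

# Made-up estimates of one coefficient from m = 5 completed datasets.
est <- c(0.52, 0.48, 0.55, 0.50, 0.47)
se  <- c(0.10, 0.11, 0.09, 0.10, 0.12)
pooled <- pool_rubin(est, se)
print(pooled)
```

The between-imputation term is what separates this from averaging five regressions: it widens the standard error to reflect uncertainty about the missing values.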
rMIDAS required re-specifying the data and column types:
# --- rMIDAS ---
overimpute(adult,
           binary_columns = c("income"),
           softmax_columns = c("workclass", "marital_status"),
           training_epochs = 20L,
           spikein = 0.3)

rMIDAS2 runs overimputation on an already-fitted model, called as overimpute(model, mask_frac).
rMIDAS2 adds imp_mean(), which computes the element-wise mean across all imputations – useful as a quick single point estimate.
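In rMIDAS2 this is a single call, imp_mean(model) according to the summary table. The base R sketch below shows what an element-wise mean over imputations amounts to for numeric columns. The toy data and the helper elementwise_mean() are hypothetical, and how imp_mean() treats categorical columns is not covered here.

```r
# Two toy "imputations" with identical shape; values are illustrative.
imp1 <- data.frame(age = c(30, 41), hours_per_week = c(40, 38))
imp2 <- data.frame(age = c(32, 41), hours_per_week = c(44, 38))
imps <- list(imp1, imp2)

# Element-wise mean: add the data frames cell by cell, then divide by m.
# Works for all-numeric data frames of the same dimensions.
elementwise_mean <- function(imps) {
  Reduce(`+`, imps) / length(imps)
}

mean_df <- elementwise_mean(imps)
print(mean_df)
```

A single averaged dataset understates uncertainty, so it suits quick inspection rather than inference. For pooled estimates, use combine().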
Below is a full rMIDAS pipeline and its rMIDAS2 equivalent.
library(rMIDAS)
data(adult)
adult <- adult[1:1000, ]
# 1. Preprocess
adult_conv <- convert(adult,
                      bin_cols = c("income"),
                      cat_cols = c("workclass", "marital_status"),
                      minmax_scale = TRUE)
# 2. Train
mid <- train(adult_conv, training_epochs = 20L, seed = 89L)
# 3. Generate imputations
imps <- complete(mid, m = 5)
# 4. Analyse
combine("income ~ age + hours_per_week", imps)

| Task | rMIDAS | rMIDAS2 |
|---|---|---|
| Install Python env | Automatic / set_python_env() | install_backend() |
| Preprocess data | convert(data, bin_cols, cat_cols) | Not needed |
| Train model | train(data, training_epochs, ...) | midas_fit(data, epochs, ...) |
| Generate imputations | complete(model, m) | midas_transform(model, m) |
| Train + impute (one step) | Not available | midas(data, m, epochs, ...) |
| Mean imputation | Not available | imp_mean(model) |
| Rubin’s rules | combine(formula, df_list) | combine(model, y, ind_vars) |
| Overimputation | overimpute(data, ...) | overimpute(model, mask_frac) |
| Shutdown | Not needed | stop_server() |
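Putting the rMIDAS2 column of the table together, the four-step rMIDAS pipeline shown earlier could be ported roughly as follows. This is a sketch that uses only the calls and argument names appearing in this vignette. The wrapper run_rmidas2_pipeline() is hypothetical, and it assumes the functions are exported from rMIDAS2 and that the backend has already been installed.

```r
# Hypothetical wrapper so that nothing runs until it is called.
# Assumes `data` has the columns income, age, and hours_per_week.
run_rmidas2_pipeline <- function(data) {
  # 1. Fit (no convert() step: column types are detected automatically)
  fit <- rMIDAS2::midas_fit(data, epochs = 20L, seed = 89L)

  # 2. Generate imputations (a list of m data frames)
  imps <- rMIDAS2::midas_transform(fit, m = 5)

  # 3. Analyse: pool a regression of income on two predictors
  pooled <- rMIDAS2::combine(fit, y = "income",
                             ind_vars = c("age", "hours_per_week"))

  # 4. Shut down the Python server
  rMIDAS2::stop_server()

  list(imputations = imps, pooled = pooled)
}

# Usage, once rMIDAS2 and its backend are installed:
# res <- run_rmidas2_pipeline(adult[1:1000, ])
```

Steps 1 and 2 can presumably be collapsed into one call to midas(data, m, epochs, ...). The object it returns is not described here, so the sketch keeps the two-step form.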