Migrating from rMIDAS to rMIDAS2

This vignette accompanies the deprecation of rMIDAS. Existing projects can keep using rMIDAS, but new development should move to rMIDAS2. The source repository for the successor package is https://github.com/MIDASverse/rMIDAS2.

Why rMIDAS2?

rMIDAS2 is the successor to rMIDAS. It re-implements the MIDAS multiple imputation algorithm with several improvements:

| | rMIDAS | rMIDAS2 |
|---|---|---|
| Backend | TensorFlow (Python, via reticulate) | PyTorch (Python, via local HTTP API) |
| Runtime R dependency on reticulate | Yes | No |
| Preprocessing | Manual (convert()) | Automatic |
| Python versions | 3.6–3.10 | 3.9+ |
| TensorFlow required | Yes (< 2.12) | No |

The API is deliberately simpler: most pipelines that required four function calls in rMIDAS need just one or two in rMIDAS2.

Installation

# Remove rMIDAS (optional -- it can coexist)
# remove.packages("rMIDAS")

# Install rMIDAS2
install.packages("rMIDAS2")

# One-time Python backend setup
library(rMIDAS2)
install_backend()
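
Because the two packages can coexist, it can help to confirm which of them are installed before you start editing scripts. The following is a minimal sketch using only base R; midas_pkgs_installed() is a hypothetical helper name and is not part of either package.

```r
# Hypothetical helper: report which MIDAS packages are installed,
# without attaching them or touching any Python environment.
midas_pkgs_installed <- function(pkgs = c("rMIDAS", "rMIDAS2")) {
  # system.file() returns "" when a package is not installed
  vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
}

midas_pkgs_installed()
```

The result is a named logical vector with one entry per package.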

Side-by-side comparison

1. Setup

rMIDAS relied on a reticulate Python environment with TensorFlow, configured either automatically on first load or manually:

# --- rMIDAS ---
library(rMIDAS)
# Python environment configured automatically on first load,
# or manually via set_python_env()

rMIDAS2 uses a standalone Python server – no reticulate needed at runtime:

# --- rMIDAS2 ---
library(rMIDAS2)
install_backend()        # one-time setup
# The server starts automatically when you call any imputation function

2. Data preparation

rMIDAS required explicit preprocessing with convert(), where you had to specify which columns were binary and which were categorical:

# --- rMIDAS ---
data(adult)
adult_conv <- convert(adult,
                      bin_cols = c("income"),
                      cat_cols = c("workclass", "marital_status"),
                      minmax_scale = TRUE)

rMIDAS2 detects column types automatically – just pass your data frame directly:

# --- rMIDAS2 ---
# No convert() step needed. Pass raw data to midas() or midas_fit().
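
If you previously listed bin_cols and cat_cols by hand, you may want a rough check of which columns look binary or categorical before handing the raw data over. The sketch below is a simple base R heuristic written for this purpose. It is an assumption for illustration, not the rule rMIDAS2 applies internally, and guess_col_type() and the toy data are hypothetical.

```r
# Hypothetical heuristic, NOT rMIDAS2's actual detection rule:
# exactly two distinct observed values -> "binary";
# factor, character or logical         -> "categorical";
# anything else                        -> "numeric".
guess_col_type <- function(col) {
  n_unique <- length(unique(col[!is.na(col)]))
  if (n_unique == 2) {
    "binary"
  } else if (is.factor(col) || is.character(col) || is.logical(col)) {
    "categorical"
  } else {
    "numeric"
  }
}

# Toy data, made up for illustration
toy <- data.frame(
  age       = c(25, 38, NA, 52),
  income    = c("<=50K", ">50K", "<=50K", NA),
  workclass = c("Private", "State-gov", NA, "Self-emp"),
  stringsAsFactors = FALSE
)

vapply(toy, guess_col_type, character(1))
```

Comparing this output with the column lists from your old convert() call is a quick way to spot columns that may deserve a closer look after migration.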

3. Training

rMIDAS used train():

# --- rMIDAS ---
mid <- train(adult_conv,
             training_epochs = 20L,
             layer_structure = c(256, 256, 256),
             input_drop      = 0.8,
             learn_rate      = 0.0004,
             seed            = 89L)

rMIDAS2 uses midas_fit() (or the all-in-one midas()):

# --- rMIDAS2 ---
fit <- midas_fit(adult,
                 epochs        = 20L,
                 hidden_layers = c(256L, 128L, 64L),
                 corrupt_rate  = 0.8,
                 lr            = 0.001,
                 seed          = 89L)

Parameter name changes:

| rMIDAS (train()) | rMIDAS2 (midas_fit()) | Notes |
|---|---|---|
| training_epochs | epochs | |
| layer_structure | hidden_layers | Default changed from 256-256-256 to 256-128-64 |
| input_drop | corrupt_rate | |
| learn_rate | lr | Default changed from 0.0004 to 0.001 |
| dropout_level | dropout_prob | |
| train_batch | batch_size | Default changed from 16 to 64 |
| cont_adj | num_adj | |
| softmax_adj | cat_adj | |
| binary_adj | bin_adj | |
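
If your scripts store train() arguments in a list, the renames listed above can be applied mechanically. This is a minimal base R sketch; rename_train_args() and arg_map are hypothetical names introduced here, and the mapping is copied from the parameter list above.

```r
# Old-to-new argument names, copied from the parameter list above
arg_map <- c(
  training_epochs = "epochs",
  layer_structure = "hidden_layers",
  input_drop      = "corrupt_rate",
  learn_rate      = "lr",
  dropout_level   = "dropout_prob",
  train_batch     = "batch_size",
  cont_adj        = "num_adj",
  softmax_adj     = "cat_adj",
  binary_adj      = "bin_adj"
)

# Hypothetical helper: rename the elements of a list of train() arguments.
# Names without an entry in arg_map (e.g. seed) are left unchanged.
rename_train_args <- function(args, map = arg_map) {
  old <- names(args)
  hit <- old %in% names(map)
  names(args)[hit] <- unname(map[old[hit]])
  args
}

old_args <- list(training_epochs = 20L, input_drop = 0.8,
                 learn_rate = 0.0004, seed = 89L)
new_args <- rename_train_args(old_args)
names(new_args)
```

Renaming only carries over values you set explicitly. Arguments you left at their rMIDAS defaults will pick up the new rMIDAS2 defaults, so set hidden_layers, lr, and batch_size yourself if you need the old behaviour.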

4. Generating imputations

rMIDAS used complete():

# --- rMIDAS ---
imps <- complete(mid, m = 10)
# Returns a list of 10 data.frames
head(imps[[1]])

rMIDAS2 uses midas_transform():

# --- rMIDAS2 ---
imps <- midas_transform(fit, m = 10)
# Returns a list of 10 data.frames
head(imps[[1]])

Or skip midas_fit() + midas_transform() entirely and use the all-in-one midas():

# --- rMIDAS2 (all-in-one) ---
result <- midas(adult, m = 10, epochs = 20)
head(result$imputations[[1]])

5. Rubin’s rules regression

The combine() interface has changed.

rMIDAS took a formula and a list of completed data frames:

# --- rMIDAS ---
combine("income ~ age + hours_per_week", imps)

rMIDAS2 takes the fitted model (rather than a list of data frames) and the name of the outcome variable. Independent variables default to all other columns:

# --- rMIDAS2 ---
combine(fit, y = "income")

# Specify predictors explicitly:
combine(fit, y = "income", ind_vars = c("age", "hours_per_week"))

The output format is the same: a data frame with columns term, estimate, std.error, statistic, df, and p.value.
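
For readers who want to see what the pooling step involves, Rubin’s rules for a single coefficient can be sketched in base R. This illustrates the standard formulas only. It is not taken from the rMIDAS2 source, it may differ from what combine() does (for example in any degrees-of-freedom adjustment), and pool_rubin() and the input numbers are made up for illustration.

```r
# Pool m estimates of one coefficient and their standard errors
# using Rubin's rules (hypothetical helper, illustrative only).
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)              # pooled point estimate
  w     <- mean(se^2)             # within-imputation variance
  b     <- var(est)               # between-imputation variance
  t_var <- w + (1 + 1 / m) * b    # total variance
  # Classic Rubin degrees of freedom (no small-sample adjustment)
  df    <- (m - 1) * (1 + w / ((1 + 1 / m) * b))^2
  stat  <- q_bar / sqrt(t_var)
  data.frame(estimate  = q_bar,
             std.error = sqrt(t_var),
             statistic = stat,
             df        = df,
             p.value   = 2 * pt(-abs(stat), df))
}

# Made-up estimates from m = 5 imputed datasets
pool_rubin(est = c(0.52, 0.48, 0.55, 0.50, 0.47),
           se  = c(0.10, 0.11, 0.10, 0.12, 0.10))
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation term reflects the extra uncertainty due to missing data.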

6. Overimputation diagnostic

rMIDAS required re-specifying the data and column types:

# --- rMIDAS ---
overimpute(adult,
           binary_columns  = c("income"),
           softmax_columns = c("workclass", "marital_status"),
           training_epochs = 20L,
           spikein = 0.3)

rMIDAS2 runs overimputation on an already-fitted model:

# --- rMIDAS2 ---
# Named oi_diag to avoid masking base R's diag()
oi_diag <- overimpute(fit, mask_frac = 0.1)
oi_diag$mean_rmse
oi_diag$rmse     # per-column RMSE
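
The idea behind overimputation is to hide a fraction of the observed values, impute them, and compare the imputations with the values that were hidden. The base R sketch below shows that logic on made-up data, using column-mean imputation as a stand-in for the MIDAS model. It is an illustration under those assumptions, not the rMIDAS2 implementation.

```r
set.seed(89)

# Toy complete data (made up for illustration)
truth <- data.frame(x = rnorm(200, 50, 10), y = rnorm(200, 5, 2))
mask_frac <- 0.1
n <- nrow(truth)

# Hide a random 10% of the rows in each column
masked <- truth
for (col in names(masked)) {
  hide <- sample(n, size = round(mask_frac * n))
  masked[[col]][hide] <- NA
}

# Stand-in imputer: replace each hidden value with the column mean.
# A fitted MIDAS model would take this role in practice.
imputed <- masked
for (col in names(imputed)) {
  miss <- is.na(imputed[[col]])
  imputed[[col]][miss] <- mean(imputed[[col]], na.rm = TRUE)
}

# Per-column RMSE over the hidden cells only, then the average
rmse <- vapply(names(truth), function(col) {
  hidden <- is.na(masked[[col]])
  sqrt(mean((imputed[[col]][hidden] - truth[[col]][hidden])^2))
}, numeric(1))

rmse
mean(rmse)
```

A lower RMSE means the imputed values sit closer to the values that were hidden.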

7. Mean imputation (new in rMIDAS2)

rMIDAS2 adds imp_mean(), which computes the element-wise mean across all imputations – useful as a quick single point estimate:

# --- rMIDAS2 only ---
mean_df <- imp_mean(fit)
head(mean_df)
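
An element-wise mean over a list of completed data frames can be sketched in base R as follows. The toy data are made up and contain numeric columns only. How imp_mean() treats categorical columns is not covered in this vignette, so this should be read as an illustration of the idea rather than a description of the function.

```r
# Stand-in for a list of m = 3 completed data frames (numeric columns only)
toy_imps <- list(
  data.frame(age = c(30, 41), hours = c(40, 35)),
  data.frame(age = c(32, 41), hours = c(38, 35)),
  data.frame(age = c(34, 41), hours = c(42, 35))
)

# Sum the data frames cell by cell, then divide by the number of imputations
toy_mean <- Reduce(`+`, toy_imps) / length(toy_imps)
toy_mean
```

Averaging discards the between-imputation variability, so for inference the pooled results from combine() are generally the better choice.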

8. Cleanup

rMIDAS2 runs a background Python server that should be stopped when you are done:

# --- rMIDAS2 ---
stop_server()
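
In longer scripts, an error partway through can skip the cleanup call. One way to guard against that is to register the cleanup with on.exit(). The sketch below uses stand-in functions so that it runs without rMIDAS2 installed. run_with_cleanup(), fake_impute(), and fake_stop_server() are hypothetical names, and in a real script the cleanup argument would be stop_server.

```r
# Stand-ins so the sketch runs without rMIDAS2 installed
server_running <- TRUE
fake_stop_server <- function() server_running <<- FALSE
fake_impute <- function() stop("imputation failed")

# Run `work`, and call `cleanup` afterwards whether or not `work` errors
run_with_cleanup <- function(work, cleanup) {
  on.exit(cleanup(), add = TRUE)  # runs on normal return and on error
  work()
}

# The simulated failure is caught by try(); cleanup has still run
try(run_with_cleanup(fake_impute, fake_stop_server), silent = TRUE)
server_running
```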

Complete migration example

Below is a full rMIDAS pipeline and its rMIDAS2 equivalent.

rMIDAS (old)

library(rMIDAS)

data(adult)
adult <- adult[1:1000, ]

# 1. Preprocess
adult_conv <- convert(adult,
                      bin_cols  = c("income"),
                      cat_cols  = c("workclass", "marital_status"),
                      minmax_scale = TRUE)

# 2. Train
mid <- train(adult_conv, training_epochs = 20L, seed = 89L)

# 3. Generate imputations
imps <- complete(mid, m = 5)

# 4. Analyse
combine("income ~ age + hours_per_week", imps)

rMIDAS2 (new)

library(rMIDAS2)

data(adult)
adult <- adult[1:1000, ]

# 1. Fit and impute (no preprocessing needed)
result <- midas(adult, m = 5, epochs = 20, seed = 89L)

# 2. Analyse
combine(result, y = "income", ind_vars = c("age", "hours_per_week"))

# 3. Clean up
stop_server()

Quick-reference cheat sheet

| Task | rMIDAS | rMIDAS2 |
|---|---|---|
| Install Python env | Automatic / set_python_env() | install_backend() |
| Preprocess data | convert(data, bin_cols, cat_cols) | Not needed |
| Train model | train(data, training_epochs, ...) | midas_fit(data, epochs, ...) |
| Generate imputations | complete(model, m) | midas_transform(model, m) |
| Train + impute (one step) | Not available | midas(data, m, epochs, ...) |
| Mean imputation | Not available | imp_mean(model) |
| Rubin’s rules | combine(formula, df_list) | combine(model, y, ind_vars) |
| Overimputation | overimpute(data, ...) | overimpute(model, mask_frac) |
| Shutdown | Not needed | stop_server() |