Version: 0.2.0 Author: Mick Mioduszewski Date: 2026-06-01
The analytics organisation ingests external datasets from multiple suppliers on a recurring basis. Verifying that each new delivery is complete, structurally correct, and consistent with prior deliveries is currently slow and largely manual. Analysts spend time on repetitive visual inspection instead of higher-value work, and quality problems risk being missed or caught late.
dqcheckr is a purpose-built R package that automates
this verification process. For each dataset arrival it produces a
single, plain-English HTML report that tells data management staff:
Results are also stored in a local SQLite database so that quality trends can be tracked over time across many deliveries.
A data officer runs a single command for each arriving dataset:
The tool finds the two most recent files in the dataset’s folder, checks the newer one, compares it to the older one, writes a timestamped HTML report to a reports folder, and prints a one-line summary to the console. The report opens automatically in the browser.
Files can also be identified explicitly by naming a current and a previous file, rather than relying on folder modification times.
Scale: the tool is designed for up to 10 dataset arrivals per day across all dataset types.
Data completeness
Data validity
Data integrity
Data consistency
Schema contract (when an expected column list is configured)
Custom checks (when a custom checks file is configured for a dataset)
HTML report — one file per run, self-contained (no internet required to view). Contains:
SQLite database — a single database shared across all datasets. Records summary statistics for every run, enabling long-term trend queries without reading the source files again.
dqcheckr/
├── R/
│ ├── run_check.R # main entry point: run_dq_check()
│ ├── ingest.R # read CSV / FWF; trim whitespace
│ ├── snapshot.R # SQLite write + read_recent_snapshots()
│ ├── drift.R # compare_snapshots(), list_snapshots(), drift helpers
│ ├── checks_generic.R # QC-01..15 and SC-01..02
│ ├── checks_custom.R # source and call custom_checks(df)
│ ├── compare.R # CP-01..08
│ ├── report.R # quarto::quarto_render() wrapper
│ ├── utils.R # dq_result(), load_config(), infer_col_type(), helpers
│ └── dqcheckr-package.R # package-level documentation
├── inst/
│ ├── templates/
│ │ ├── report.qmd # main HTML report template
│ │ └── drift_report.qmd # snapshot drift report template
│ ├── config/
│ │ ├── dqcheckr.yml
│ │ └── example_dataset.yml
│ └── demonstrations/
│ ├── demo.R # named-file mode demo (CSV + FWF)
│ ├── demo2.R # folder-scan mode demo with version comparison
│ ├── README.md
│ ├── config/ # global + starwars_csv + starwars_fwf configs
│ ├── config2/ # global + starwars_folder configs
│ ├── custom2/ # starwars_custom.R (human-specific checks)
│ ├── data/ # starwars.csv, starwars.fwf
│ └── data2/ # starwars_v1.csv (original), starwars_v2.csv (perturbed)
├── tests/testthat/
│ ├── test-utils.R
│ ├── test-checks.R
│ ├── test-compare.R
│ ├── test-ingest.R
│ ├── test-snapshot.R
│ ├── test-drift.R
│ └── test-integration.R
├── vignettes/
│ ├── dqcheckr.Rmd # user guide
│ └── specification.Rmd # this document
├── DESCRIPTION
└── NAMESPACE
Global config — dqcheckr.yml
snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"
default_rules:
# Single-snapshot thresholds
max_missing_rate: 0.05 # FAIL if column missing rate exceeds this
max_non_numeric_rate: 0.01 # FAIL if non-numeric rate in numeric column exceeds this
min_row_count: 0 # FAIL if row count below this (0 = disabled)
# max_row_count: 1000000 # optional FAIL if row count exceeds this
# max_file_size_mb: 500 # optional FAIL if file size exceeds this
# max_z_score: 3.0 # optional QC-15 outlier detection (Z-score)
# iqr_fence_multiplier: 1.5 # optional QC-15 outlier detection (IQR fence)
# Version comparison thresholds
max_row_count_change_pct: 0.10
max_numeric_mean_shift_pct: 0.20
max_missing_rate_change_pp: 2.0 # percentage points
max_non_numeric_rate_change_pp: 1.0
missing_rate_change_severity: warn # "warn" or "fail" — severity for CP-03 breaches
# Schema flags
flag_new_columns: true # set false to suppress new-column warnings in CP-02
flag_dropped_columns: true # set false to suppress dropped-column warnings in CP-02
flag_type_changes: true # set false to suppress type-change warnings in CP-02
flag_column_order_change: true # set false to skip CP-08 entirely
column_order_severity: warn # "warn" or "fail" — overrides format default for CP-08Per-dataset config —
<dataset_name>.yml
dataset_name: "customer_accounts"
folder: "data/incoming/customer_accounts/"
format: csv # csv | fwf
encoding: "UTF-8"
delimiter: "," # csv only
fwf_widths: ~ # fwf only
fwf_col_names: ~ # fwf only
fwf_skip: 0 # fwf only — header rows to skip (default 0)
# Optional: explicit file paths override folder-scan version detection
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"
# Optional: schema contract
expected_columns:
- id
- name
- country_code
- account_status
- account_balance
- created_date
# Optional: key columns that must be unique
key_columns:
- id
# Optional rule overrides (merged over global defaults)
rule_overrides:
max_missing_rate: 0.02
max_row_count_change_pct: 0.05
max_non_numeric_rate: 0.005
min_row_count: 1000
# Optional per-column rules
column_rules:
country_code:
allowed_values: ["GB", "US", "DE", "FR"]
pattern: "^[A-Z]{2}$"
account_status:
allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
account_balance:
min_value: 0
max_value: 1000000
# Optional custom checks
custom_checks_file: "custom/customer_accounts_checks.R"Two modes:
folder are used. Newest = current,
second-newest = previous.current_file (and
optionally previous_file) are set in the dataset config,
those paths are used directly. Useful when file naming conventions do
not reflect delivery order, or when re-running a check for a specific
pair of files.If only a current file is available (either via folder scan or explicit config), the run proceeds in single-file mode: version comparison and vs-previous schema checks are skipped.
readr::read_delim() or readr::read_fwf().Applied to each column of the trimmed data frame. Rules applied in order:
| Priority | Condition | Inferred type |
|---|---|---|
| 1 | All non-null/non-empty values parse as a known date format | date |
| 2 | ≥ 90% of non-null/non-empty values coerce to numeric | numeric |
| 3 | Otherwise | character |
| — | All values null or empty | unknown |
Known date formats tested: %Y-%m-%d,
%d/%m/%Y, %m/%d/%Y, %Y%m%d,
%d-%m-%Y.
The 90% threshold means a column with up to 10% bad (non-numeric) values is still classified as numeric, making those bad values visible to QC-11.
Each check returns one or more dq_result objects:
dq_result(check_id, check_name, column = NA, status, observed, threshold = NA, message)
# status: "PASS" | "WARN" | "FAIL" | "INFO"Single-snapshot checks (QC series)
| ID | Name | Applies to | FAIL condition | WARN condition |
|---|---|---|---|---|
| QC-01 | Missing rate | Every column | missing_rate > max_missing_rate |
— |
| QC-02 | Empty column | Every column | Column is 100% empty | — |
| QC-03 | Duplicate rows | Table | — | Any fully-duplicate rows exist |
| QC-04 | Row count | Table | — | — (INFO) |
| QC-05 | Column count | Table | — | — (INFO) |
| QC-06 | Inferred type | Every column | — | — (INFO) |
| QC-07 | Numeric stats | Numeric columns | — | — (INFO: min/max/mean/SD) |
| QC-08 | Distinct value count | Character columns | — | — (INFO) |
| QC-09 | Allowed values | Configured columns | Any value outside allowed_values |
— |
| QC-10 | Numeric bounds | Configured columns | Any value outside min_value/max_value |
— |
| QC-11 | Non-numeric values | Numeric columns | non_numeric_rate > max_non_numeric_rate |
non_numeric_rate > warn_non_numeric_rate |
| QC-12 | Key uniqueness | Configured key_columns (single or composite) |
Any duplicate value in key | — |
| QC-13 | Pattern / regex | Configured columns | Any value not matching pattern |
— |
| QC-14 | Row count bounds | Table | row_count < min_row_count (if > 0);
row_count > max_row_count (if set); file size >
max_file_size_mb (if set) |
— |
| QC-15 | Outlier detection | Numeric columns | Value exceeds Z-score (max_z_score) or IQR fence
(iqr_fence_multiplier) threshold |
— (skipped if neither threshold is configured) |
Schema contract checks (SC series) — only when
expected_columns is configured
| ID | Name | FAIL condition |
|---|---|---|
| SC-01 | Unexpected column | Column in file but not in expected_columns |
| SC-02 | Missing expected column | Column in expected_columns but absent from file |
Version comparison checks (CP series) — skipped in single-file mode
| ID | Name | Flag condition | Severity |
|---|---|---|---|
| CP-01 | Row count change | abs(pct_change) > max_row_count_change_pct (WARN if
previous has 0 rows) |
WARN |
| CP-02a | New columns | Columns in current not in previous (suppressible via
flag_new_columns: false) |
WARN |
| CP-02b | Dropped columns | Columns in previous not in current (suppressible via
flag_dropped_columns: false) |
WARN |
| CP-02c | Column type changes | Type of a common column changed (suppressible via
flag_type_changes: false) |
WARN |
| CP-03 | Missing rate change | Increase > max_missing_rate_change_pp pp per column;
severity controlled by missing_rate_change_severity |
WARN (default) / FAIL |
| CP-04 | Numeric mean shift | Shift > max_numeric_mean_shift_pct per column |
WARN |
| CP-05 | New distinct values | Values in current not in previous (categorical columns) | INFO |
| CP-06 | Dropped distinct values | Values in previous not in current (categorical columns) | INFO |
| CP-07 | Non-numeric rate change | Increase > max_non_numeric_rate_change_pp pp per
column; always emits a result |
WARN |
| CP-08 | Column order change | Column order differs from previous (suppressible via
flag_column_order_change: false); severity overridable via
column_order_severity |
WARN (CSV) / FAIL (FWF) |
CP-02 is split into three separate results (CP-02a, CP-02b, CP-02c) so each type of schema change can be independently read in the report and suppressed via config. The
new_cols_vs_previousandmissing_cols_vs_previouscolumns in thesnapshotstable are derived directly from the CP-02a/b results — they are not computed separately.
custom_checks_file in dataset config points to an
.R file.baseenv() (not the global env).custom_checks(df) returning a list of
dq_result objects.dq_result() is explicitly injected and available
without qualification.resolve_col_type())
require the dqcheckr:: prefix.column_snapshots (see C.8).Database and tables created automatically on first run via
init_snapshot_db().
Table: snapshots
| Column | Type | Notes |
|---|---|---|
id |
INTEGER PK | Auto-increment |
dataset_name |
TEXT | |
run_timestamp |
TEXT | YYYY-MM-DD HH:MM:SS |
file_name |
TEXT | Basename of current file |
row_count |
INTEGER | |
col_count |
INTEGER | |
check_pass_count |
INTEGER | |
check_warn_count |
INTEGER | |
check_fail_count |
INTEGER | |
check_info_count |
INTEGER | |
overall_status |
TEXT | FAIL > WARN > PASS >
INFO |
new_cols_vs_previous |
TEXT | Comma-separated; NULL if none or no previous |
missing_cols_vs_previous |
TEXT | Comma-separated; NULL if none or no previous |
new_cols_vs_schema |
TEXT | Comma-separated; NULL if none or expected_columns not
set |
missing_cols_vs_schema |
TEXT | Comma-separated; NULL if none or expected_columns not
set |
comparison_mode |
TEXT | "comparison" or "single" (single-file runs
have no previous) |
render_status |
TEXT | "success" or "failed" (updated if Quarto
render fails) |
type_changed_cols_vs_previous |
TEXT | Comma-separated col (prev->curr) pairs; NULL if
none |
Table: column_snapshots
| Column | Type | Notes |
|---|---|---|
id |
INTEGER PK | Auto-increment |
snapshot_id |
INTEGER FK | → snapshots.id |
column_name |
TEXT | Column name, or check_id for custom checks |
dq_check |
TEXT | Stat name or custom check_id |
value |
TEXT | Always stored as text |
threshold |
TEXT | NULL if no threshold applies |
severity_on_breach |
TEXT | FAIL, WARN, or NULL if informational |
Rows written per column type:
dq_check |
Written for | threshold |
severity_on_breach |
|---|---|---|---|
inferred_type |
All columns | NULL | NULL |
missing_count |
All columns | NULL | NULL |
missing_rate |
All columns | max_missing_rate |
FAIL |
distinct_count |
All columns | NULL | NULL |
numeric_parseable_mean |
Numeric only | NULL | NULL |
numeric_sd |
Numeric only | NULL | NULL |
numeric_min |
Numeric only | NULL | NULL |
numeric_max |
Numeric only | NULL | NULL |
non_numeric_count |
Numeric only | NULL | NULL |
non_numeric_rate |
Numeric only | max_non_numeric_rate |
FAIL |
Custom check rows: one row per dq_result with a non-NA
column value. dq_check = check_id,
value = observed, threshold =
threshold, severity_on_breach = status if
WARN/FAIL, else NULL.
Standard run:
run_dq_check(
dataset_name, # matches <dataset_name>.yml in config_dir
config_dir = ".", # contains dqcheckr.yml and <dataset_name>.yml
open_report = TRUE # open in browser when interactive
)
# Returns invisibly: list(status, report_path, snapshot_id)Report filename:
<dataset_name>_<YYYYMMDD_HHMMSS>.html
Snapshot drift analysis (compare any two historical snapshots):
compare_snapshots(
dataset_name,
snapshot_id_prev = NULL, # defaults to second-most-recent
snapshot_id_curr = NULL, # defaults to most-recent
db_path = NULL, # defaults to snapshot_db from dqcheckr.yml
config_dir = ".",
report = TRUE, # render HTML drift report
open_report = interactive()
)
# Returns invisibly: named list with dataset_name, snap_prev, snap_curr,
# table_drift, schema_changes, missing_rate_changes, non_numeric_changes,
# mean_shifts, distinct_changes
list_snapshots(dataset_name = NULL, db_path)
# Returns a data frame of snapshot records; dataset_name = NULL returns all datasetsConsole output:
[dqcheckr] customer_accounts: FAIL - 0 warning(s), 2 failure(s). Report: reports/customer_accounts_20260410_143022.html
| Situation | Behaviour |
|---|---|
| Folder does not exist | Stop; no output written |
| No files in folder | Stop; no output written |
| Explicit file path not found | Stop; no output written |
| Only one file available | Single-file mode; comparison and vs-previous schema skipped |
| File cannot be parsed | Stop with filename and parse error |
| SC-01 / SC-02 violations | FAIL recorded; run continues; report and snapshot written |
| Custom checks file missing | Stop with message |
custom_checks() not defined |
Stop with message |
custom_checks() runtime error |
Stop with R error detail |
| SQLite write fails | Warning emitted; HTML report still written |
| Report render fails | Warning emitted; snapshot marked
render_status = "failed" |
snapshot_db path missing |
Created automatically by init_snapshot_db() |
| Package | Role |
|---|---|
readr |
CSV / FWF ingestion |
DBI + RSQLite |
Snapshot store |
quarto |
Report rendering (Quarto CLI required separately) |
knitr |
Template execution during Quarto render |
kableExtra |
Styled HTML tables in report templates |
ggplot2 + gridExtra |
Trend charts in report templates |
yaml |
Config parsing |
dplyr |
Pipe operator (%>%) used in report templates |
tidyr |
Data reshaping for trend charts in report templates |
rlang |
Structured error conditions |
All on CRAN. pointblank and dataCompareR
were evaluated and not used — all logic is implemented natively.
rmarkdown is listed in Suggests for vignette
compilation only; it is no longer used for report rendering.
| # | Issue | Detail |
|---|---|---|
| D-01 | Explicit file naming — partial spec | The dataset config allows current_file and
previous_file to override folder-scan detection, but the
behaviour when one is set and the other is not has not been fully
specified. Implemented as: current_file set but
previous_file absent runs in single-file mode. |
Resolved in v0.2.0: column_order_severity config key
overrides the format-based default for CP-08. |
||
Resolved in v0.2.0: QC-12 now supports composite keys via a
key_columns list. QC-03 retains full-row duplicate
detection. |
| # | Feature |
|---|---|
| B-01 | Cross-column consistency rules — conditional rules
between columns (e.g. if status = CLOSED then
close_date must not be null) |
| B-02 | Date column recency check — flag if the most recent value in a date column is older than N days |
| B-04 | post_run_hook parameter —
user-supplied function for email/Slack/S3 delivery after a run |
| B-05 | batch_run(config_dir) — run all
configured datasets in sequence |
| B-06 | Additional input formats — Parquet
(arrow), Excel, database tables via DBI |
| B-08 | Report themes — organisation branding via a
theme parameter |
| # | Limitation |
|---|---|
| L-01 | type_inference_threshold in
rule_overrides |
| L-02 | Date formats tested are hardcoded to five common patterns; exotic formats will be misclassified as character |
| L-03 | The report appendix reads from in-memory col_stats, not
from SQLite. If the SQLite write fails but the report succeeds, the
appendix and the database can show slightly different values in edge
cases. Report render failures are non-fatal from v0.2.0: the snapshot is
preserved and marked render_status = "failed". |
| L-04 | Performance scales linearly with rows but super-linearly with columns. Benchmarked on macOS / aarch64 / R 4.6.0 with 1 M-row synthetic data: 10 cols → 47.6 s (21,031 rows/s); 20 cols → 91.9 s (10,881 rows/s); 49 cols → 138.9 s (7,200 rows/s). Very wide files (100+ columns) will be substantially slower. |