dqcheckr — Software Specification

Part C — Technical Specification

C.1 Package structure

dqcheckr/
├── R/
│   ├── run_check.R          # main entry point: run_dq_check()
│   ├── ingest.R             # read CSV / FWF; trim whitespace
│   ├── snapshot.R           # SQLite write + read_recent_snapshots()
│   ├── drift.R              # compare_snapshots(), list_snapshots(), drift helpers
│   ├── checks_generic.R     # QC-01..15 and SC-01..02
│   ├── checks_custom.R      # source and call custom_checks(df)
│   ├── compare.R            # CP-01..08
│   ├── report.R             # quarto::quarto_render() wrapper
│   ├── utils.R              # dq_result(), load_config(), infer_col_type(), helpers
│   └── dqcheckr-package.R   # package-level documentation
├── inst/
│   ├── templates/
│   │   ├── report.qmd       # main HTML report template
│   │   └── drift_report.qmd # snapshot drift report template
│   ├── config/
│   │   ├── dqcheckr.yml
│   │   └── example_dataset.yml
│   └── demonstrations/
│       ├── demo.R           # named-file mode demo (CSV + FWF)
│       ├── demo2.R          # folder-scan mode demo with version comparison
│       ├── README.md
│       ├── config/          # global + starwars_csv + starwars_fwf configs
│       ├── config2/         # global + starwars_folder configs
│       ├── custom2/         # starwars_custom.R (human-specific checks)
│       ├── data/            # starwars.csv, starwars.fwf
│       └── data2/           # starwars_v1.csv (original), starwars_v2.csv (perturbed)
├── tests/testthat/
│   ├── test-utils.R
│   ├── test-checks.R
│   ├── test-compare.R
│   ├── test-ingest.R
│   ├── test-snapshot.R
│   ├── test-drift.R
│   └── test-integration.R
├── vignettes/
│   ├── dqcheckr.Rmd         # user guide
│   └── specification.Rmd    # this document
├── DESCRIPTION
└── NAMESPACE

C.2 Configuration

Global config — dqcheckr.yml

snapshot_db: "data/snapshots.sqlite"
report_output_dir: "reports/"

default_rules:
  # Single-snapshot thresholds
  max_missing_rate: 0.05            # FAIL if column missing rate exceeds this
  max_non_numeric_rate: 0.01        # FAIL if non-numeric rate in numeric column exceeds this
  min_row_count: 0                  # FAIL if row count below this (0 = disabled)
  # max_row_count: 1000000          # optional FAIL if row count exceeds this
  # max_file_size_mb: 500           # optional FAIL if file size exceeds this
  # max_z_score: 3.0                # optional QC-15 outlier detection (Z-score)
  # iqr_fence_multiplier: 1.5       # optional QC-15 outlier detection (IQR fence)

  # Version comparison thresholds
  max_row_count_change_pct: 0.10
  max_numeric_mean_shift_pct: 0.20
  max_missing_rate_change_pp: 2.0   # percentage points
  max_non_numeric_rate_change_pp: 1.0
  missing_rate_change_severity: warn  # "warn" or "fail" — severity for CP-03 breaches

  # Schema flags
  flag_new_columns: true            # set false to suppress new-column warnings in CP-02
  flag_dropped_columns: true        # set false to suppress dropped-column warnings in CP-02
  flag_type_changes: true           # set false to suppress type-change warnings in CP-02
  flag_column_order_change: true    # set false to skip CP-08 entirely
  column_order_severity: warn       # "warn" or "fail" — overrides format default for CP-08

Per-dataset config — <dataset_name>.yml

dataset_name: "customer_accounts"
folder: "data/incoming/customer_accounts/"
format: csv                 # csv | fwf
encoding: "UTF-8"
delimiter: ","              # csv only
fwf_widths: ~               # fwf only
fwf_col_names: ~            # fwf only
fwf_skip: 0                 # fwf only — header rows to skip (default 0)

# Optional: explicit file paths override folder-scan version detection
# current_file: "data/incoming/customer_accounts/20260410.csv"
# previous_file: "data/incoming/customer_accounts/20260401.csv"

# Optional: schema contract
expected_columns:
  - id
  - name
  - country_code
  - account_status
  - account_balance
  - created_date

# Optional: key columns that must be unique
key_columns:
  - id

# Optional rule overrides (merged over global defaults)
rule_overrides:
  max_missing_rate: 0.02
  max_row_count_change_pct: 0.05
  max_non_numeric_rate: 0.005
  min_row_count: 1000

# Optional per-column rules
column_rules:
  country_code:
    allowed_values: ["GB", "US", "DE", "FR"]
    pattern: "^[A-Z]{2}$"
  account_status:
    allowed_values: ["ACTIVE", "CLOSED", "SUSPENDED"]
  account_balance:
    min_value: 0
    max_value: 1000000

# Optional custom checks
custom_checks_file: "custom/customer_accounts_checks.R"

C.3 File version detection

Two modes:

Folder scan (default): the two most recently modified files in folder are used. Newest = current, second-newest = previous.
Explicit naming: if current_file (and optionally previous_file) are set in the dataset config, those paths are used directly. Useful when file naming conventions do not reflect delivery order, or when re-running a check for a specific pair of files.

If only a current file is available (either via folder scan or explicit config), the run proceeds in single-file mode: version comparison and vs-previous schema checks are skipped.

C.4 Ingestion and whitespace trimming

File is read entirely as character columns via readr::read_delim() or readr::read_fwf().
Leading and trailing whitespace is trimmed from every value immediately after reading.
The trimmed data frame is used for all subsequent checks and comparisons. The raw file is never modified.

C.5 Type inference

Applied to each column of the trimmed data frame. Rules applied in order:

Priority	Condition	Inferred type
1	All non-null/non-empty values parse as a known date format	`date`
2	≥ 90% of non-null/non-empty values coerce to numeric	`numeric`
3	Otherwise	`character`
—	All values null or empty	`unknown`

Known date formats tested: %Y-%m-%d, %d/%m/%Y, %m/%d/%Y, %Y%m%d, %d-%m-%Y.

The 90% threshold means a column with up to 10% bad (non-numeric) values is still classified as numeric, making those bad values visible to QC-11.

C.6 Quality checks

Each check returns one or more dq_result objects:

dq_result(check_id, check_name, column = NA, status, observed, threshold = NA, message)
# status: "PASS" | "WARN" | "FAIL" | "INFO"

Single-snapshot checks (QC series)

ID	Name	Applies to	FAIL condition	WARN condition
QC-01	Missing rate	Every column	`missing_rate > max_missing_rate`	—
QC-02	Empty column	Every column	Column is 100% empty	—
QC-03	Duplicate rows	Table	—	Any fully-duplicate rows exist
QC-04	Row count	Table	—	— (INFO)
QC-05	Column count	Table	—	— (INFO)
QC-06	Inferred type	Every column	—	— (INFO)
QC-07	Numeric stats	Numeric columns	—	— (INFO: min/max/mean/SD)
QC-08	Distinct value count	Character columns	—	— (INFO)
QC-09	Allowed values	Configured columns	Any value outside `allowed_values`	—
QC-10	Numeric bounds	Configured columns	Any value outside `min_value`/`max_value`	—
QC-11	Non-numeric values	Numeric columns	`non_numeric_rate > max_non_numeric_rate`	`non_numeric_rate > warn_non_numeric_rate`
QC-12	Key uniqueness	Configured `key_columns` (single or composite)	Any duplicate value in key	—
QC-13	Pattern / regex	Configured columns	Any value not matching `pattern`	—
QC-14	Row count bounds	Table	`row_count < min_row_count` (if > 0); `row_count > max_row_count` (if set); file size > `max_file_size_mb` (if set)	—
QC-15	Outlier detection	Numeric columns	Value exceeds Z-score (`max_z_score`) or IQR fence (`iqr_fence_multiplier`) threshold	— (skipped if neither threshold is configured)

Schema contract checks (SC series) — only when expected_columns is configured

ID	Name	FAIL condition
SC-01	Unexpected column	Column in file but not in `expected_columns`
SC-02	Missing expected column	Column in `expected_columns` but absent from file

Version comparison checks (CP series) — skipped in single-file mode

ID	Name	Flag condition	Severity
CP-01	Row count change	`abs(pct_change) > max_row_count_change_pct` (WARN if previous has 0 rows)	WARN
CP-02a	New columns	Columns in current not in previous (suppressible via `flag_new_columns: false`)	WARN
CP-02b	Dropped columns	Columns in previous not in current (suppressible via `flag_dropped_columns: false`)	WARN
CP-02c	Column type changes	Type of a common column changed (suppressible via `flag_type_changes: false`)	WARN
CP-03	Missing rate change	Increase > `max_missing_rate_change_pp` pp per column; severity controlled by `missing_rate_change_severity`	WARN (default) / FAIL
CP-04	Numeric mean shift	Shift > `max_numeric_mean_shift_pct` per column	WARN
CP-05	New distinct values	Values in current not in previous (categorical columns)	INFO
CP-06	Dropped distinct values	Values in previous not in current (categorical columns)	INFO
CP-07	Non-numeric rate change	Increase > `max_non_numeric_rate_change_pp` pp per column; always emits a result	WARN
CP-08	Column order change	Column order differs from previous (suppressible via `flag_column_order_change: false`); severity overridable via `column_order_severity`	WARN (CSV) / FAIL (FWF)

CP-02 is split into three separate results (CP-02a, CP-02b, CP-02c) so each type of schema change can be independently read in the report and suppressed via config. The new_cols_vs_previous and missing_cols_vs_previous columns in the snapshots table are derived directly from the CP-02a/b results — they are not computed separately.

C.7 Custom checks

custom_checks_file in dataset config points to an .R file.
Sourced into an isolated environment whose parent is baseenv() (not the global env).
Must define custom_checks(df) returning a list of dq_result objects.
dq_result() is explicitly injected and available without qualification.
All other dqcheckr exports (e.g. resolve_col_type()) require the dqcheckr:: prefix.
Receives the trimmed data frame.
Errors (missing file, missing function, runtime error) stop the run.
Results appear in the “Custom Checks” report section and are written to column_snapshots (see C.8).

C.8 SQLite schema

Database and tables created automatically on first run via init_snapshot_db().

Table: snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`dataset_name`	TEXT
`run_timestamp`	TEXT	`YYYY-MM-DD HH:MM:SS`
`file_name`	TEXT	Basename of current file
`row_count`	INTEGER
`col_count`	INTEGER
`check_pass_count`	INTEGER
`check_warn_count`	INTEGER
`check_fail_count`	INTEGER
`check_info_count`	INTEGER
`overall_status`	TEXT	`FAIL` > `WARN` > `PASS` > `INFO`
`new_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`missing_cols_vs_previous`	TEXT	Comma-separated; NULL if none or no previous
`new_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set
`missing_cols_vs_schema`	TEXT	Comma-separated; NULL if none or `expected_columns` not set
`comparison_mode`	TEXT	`"comparison"` or `"single"` (single-file runs have no previous)
`render_status`	TEXT	`"success"` or `"failed"` (updated if Quarto render fails)
`type_changed_cols_vs_previous`	TEXT	Comma-separated `col (prev->curr)` pairs; NULL if none

Table: column_snapshots

Column	Type	Notes
`id`	INTEGER PK	Auto-increment
`snapshot_id`	INTEGER FK	→ `snapshots.id`
`column_name`	TEXT	Column name, or check_id for custom checks
`dq_check`	TEXT	Stat name or custom check_id
`value`	TEXT	Always stored as text
`threshold`	TEXT	NULL if no threshold applies
`severity_on_breach`	TEXT	`FAIL`, `WARN`, or NULL if informational

Rows written per column type:

`dq_check`	Written for	`threshold`	`severity_on_breach`
`inferred_type`	All columns	NULL	NULL
`missing_count`	All columns	NULL	NULL
`missing_rate`	All columns	`max_missing_rate`	`FAIL`
`distinct_count`	All columns	NULL	NULL
`numeric_parseable_mean`	Numeric only	NULL	NULL
`numeric_sd`	Numeric only	NULL	NULL
`numeric_min`	Numeric only	NULL	NULL
`numeric_max`	Numeric only	NULL	NULL
`non_numeric_count`	Numeric only	NULL	NULL
`non_numeric_rate`	Numeric only	`max_non_numeric_rate`	`FAIL`

Custom check rows: one row per dq_result with a non-NA column value. dq_check = check_id, value = observed, threshold = threshold, severity_on_breach = status if WARN/FAIL, else NULL.

C.9 Main entry points

Standard run:

run_dq_check(
  dataset_name,        # matches <dataset_name>.yml in config_dir
  config_dir = ".",    # contains dqcheckr.yml and <dataset_name>.yml
  open_report = TRUE   # open in browser when interactive
)
# Returns invisibly: list(status, report_path, snapshot_id)

Report filename: <dataset_name>_<YYYYMMDD_HHMMSS>.html

Snapshot drift analysis (compare any two historical snapshots):

compare_snapshots(
  dataset_name,
  snapshot_id_prev = NULL,   # defaults to second-most-recent
  snapshot_id_curr = NULL,   # defaults to most-recent
  db_path          = NULL,   # defaults to snapshot_db from dqcheckr.yml
  config_dir       = ".",
  report           = TRUE,   # render HTML drift report
  open_report      = interactive()
)
# Returns invisibly: named list with dataset_name, snap_prev, snap_curr,
#   table_drift, schema_changes, missing_rate_changes, non_numeric_changes,
#   mean_shifts, distinct_changes

list_snapshots(dataset_name = NULL, db_path)
# Returns a data frame of snapshot records; dataset_name = NULL returns all datasets

Console output:

[dqcheckr] customer_accounts: FAIL - 0 warning(s), 2 failure(s). Report: reports/customer_accounts_20260410_143022.html

C.10 Error handling

Situation	Behaviour
Folder does not exist	Stop; no output written
No files in folder	Stop; no output written
Explicit file path not found	Stop; no output written
Only one file available	Single-file mode; comparison and vs-previous schema skipped
File cannot be parsed	Stop with filename and parse error
SC-01 / SC-02 violations	FAIL recorded; run continues; report and snapshot written
Custom checks file missing	Stop with message
`custom_checks()` not defined	Stop with message
`custom_checks()` runtime error	Stop with R error detail
SQLite write fails	Warning emitted; HTML report still written
Report render fails	Warning emitted; snapshot marked `render_status = "failed"`
`snapshot_db` path missing	Created automatically by `init_snapshot_db()`

C.11 Dependencies

Package	Role
`readr`	CSV / FWF ingestion
`DBI` + `RSQLite`	Snapshot store
`quarto`	Report rendering (Quarto CLI required separately)
`knitr`	Template execution during Quarto render
`kableExtra`	Styled HTML tables in report templates
`ggplot2` + `gridExtra`	Trend charts in report templates
`yaml`	Config parsing
`dplyr`	Pipe operator (`%>%`) used in report templates
`tidyr`	Data reshaping for trend charts in report templates
`rlang`	Structured error conditions

All on CRAN. pointblank and dataCompareR were evaluated and not used — all logic is implemented natively. rmarkdown is listed in Suggests for vignette compilation only; it is no longer used for report rendering.

#	Issue	Detail
D-01	Explicit file naming — partial spec	The dataset config allows `current_file` and `previous_file` to override folder-scan detection, but the behaviour when one is set and the other is not has not been fully specified. Implemented as: `current_file` set but `previous_file` absent runs in single-file mode.
~~D-02~~	~~CP-08 column order — WARN vs FAIL severity~~	Resolved in v0.2.0: `column_order_severity` config key overrides the format-based default for CP-08.
~~D-03~~	~~QC-03 duplicate rows — no key scope~~	Resolved in v0.2.0: QC-12 now supports composite keys via a `key_columns` list. QC-03 retains full-row duplicate detection.

#	Feature
B-01	Cross-column consistency rules — conditional rules between columns (e.g. if `status = CLOSED` then `close_date` must not be null)
B-02	Date column recency check — flag if the most recent value in a date column is older than N days
~~B-03~~	~~Multi-column key uniqueness~~
B-04	`post_run_hook` parameter — user-supplied function for email/Slack/S3 delivery after a run
B-05	`batch_run(config_dir)` — run all configured datasets in sequence
B-06	Additional input formats — Parquet (`arrow`), Excel, database tables via `DBI`
~~B-07~~	~~CP-03 severity~~
B-08	Report themes — organisation branding via a `theme` parameter

#	Limitation
L-01	~~Type inference uses a fixed 90% threshold~~ — resolved in v0.1.1; configurable via `type_inference_threshold` in `rule_overrides`
L-02	Date formats tested are hardcoded to five common patterns; exotic formats will be misclassified as character
L-03	The report appendix reads from in-memory `col_stats`, not from SQLite. If the SQLite write fails but the report succeeds, the appendix and the database can show slightly different values in edge cases. Report render failures are non-fatal from v0.2.0: the snapshot is preserved and marked `render_status = "failed"`.
L-04	Performance scales linearly with rows but super-linearly with columns. Benchmarked on macOS / aarch64 / R 4.6.0 with 1 M-row synthetic data: 10 cols → 47.6 s (21,031 rows/s); 20 cols → 91.9 s (10,881 rows/s); 49 cols → 138.9 s (7,200 rows/s). Very wide files (100+ columns) will be substantially slower.

dqcheckr — Software Specification

Mick Mioduszewski

2026-06-01

Part A — Purpose and Intent

Why this exists

What we are building

Design principles

Part B — Business-Level Description

B.1 How it is used

B.2 What is checked

B.3 What is compared to the previous delivery

B.4 What is produced