The derive_* functions convert raw UKB columns into
analysis-ready variables. This vignette covers the disease phenotype
derivation pipeline:
| Step | Function(s) | Purpose |
|---|---|---|
| 1 | derive_missing() | Handle “Do not know” / “Prefer not to answer” |
| 2 | derive_covariate() | Convert types; summarise covariates |
| 3 | derive_cut() | Bin continuous variables into groups |
| 4 | derive_selfreport() | Self-reported disease status + date |
| 5 | derive_hes() | HES inpatient ICD-10 status + date |
| 6 | derive_first_occurrence() | First Occurrence field status + date |
| 7 | derive_cancer_registry() | Cancer registry status + date |
| 8 | derive_death_registry() | Death registry ICD-10 status + date |
| 9 | derive_icd10() | Combine any subset of sources (wrapper) |
| 10 | derive_case() | Merge self-report + ICD-10 into final case definition |

All functions accept a data.frame or
data.table and return a data.table. For
data.table input, new columns are added by
reference (no copy); data.frame input is converted
to data.table internally before modification.
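
The by-reference behaviour described above can be illustrated with plain data.table code. This is a minimal sketch, not the package's implementation: `add_flag()` is a hypothetical helper that mimics how a derive_* function might treat the two input types.

```r
library(data.table)

# Hypothetical helper (not part of the package): adds one column,
# converting data.frame input to data.table first.
add_flag <- function(data) {
  if (!is.data.table(data)) data <- as.data.table(data)  # conversion makes a copy
  data[, flag := TRUE]                                   # := adds the column by reference
  data[]
}

dt <- data.table(eid = 1:3)
add_flag(dt)
dt_modified <- "flag" %in% names(dt)   # the caller's data.table gained the column

df <- data.frame(eid = 1:3)
out <- add_flag(df)
df_modified <- "flag" %in% names(df)   # the original data.frame is untouched
```

Because a data.frame is never modified in place, assigning the result back (`df <- derive_*(df, ...)`) works for both input types.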
In production, replace ops_toy() with extract_batch() followed by decode_values() and decode_names(). See vignette("decode"). Column names below use the RAP raw format (p{field}_i{instance}_a{array}) as returned by ops_toy() and extract_batch() before decoding.
UKB uses special labels such as "Do not know" and
"Prefer not to answer" to distinguish refusal from true
missing data. derive_missing() converts these to
NA (default) or retains them as "Unknown" for
modelling.
Performance:
derive_missing() uses data.table::set() for in-place replacement, so no column copies are made regardless of dataset size.
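
For readers unfamiliar with data.table::set(), the sketch below shows the general in-place replacement technique on toy data. It is an illustration under assumptions, not the internals of derive_missing(): the column values and the `non_response` vector are made up for the example.

```r
library(data.table)

# Toy data; the labels mirror the UKB non-response categories named above
dt <- data.table(
  p20116_i0 = c("Never", "Prefer not to answer", "Current", "Do not know")
)
non_response <- c("Do not know", "Prefer not to answer")  # assumed label list

# Overwrite matching cells with NA in place, one column at a time
for (col in names(dt)) {
  hit <- which(dt[[col]] %in% non_response)
  if (length(hit) > 0L) {
    set(dt, i = hit, j = col, value = NA_character_)
  }
}
```

Only the matched cells are overwritten; the column vector itself is not reallocated.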
To keep non-response as a model category:
To add custom labels beyond the built-in list:
derive_covariate() converts categorical columns to
factor and prints a distribution summary for each.
df <- derive_covariate(
  df,
  as_factor = c(
    "p31",        # sex
    "p20116_i0",  # smoking_status_i0
    "p1558_i0"    # alcohol_intake_frequency_i0
  ),
  factor_levels = list(
    p20116_i0 = c("Never", "Previous", "Current")
  )
)

derive_cut() creates a new factor column by binning a
continuous variable into quantile-based or custom groups.
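
To make the two binning modes concrete, here is a base-R sketch using cut() and quantile() on toy values. It only illustrates the idea; how derive_cut() treats interval boundaries (for example, whether 25 falls in "Normal" or "Overweight") is not stated in this vignette, so the `right = FALSE` choice below is an assumption.

```r
# Toy BMI values (not UKB data)
bmi <- c(17.2, 22.4, 27.9, 33.1, NA)

# Custom groups: three interior cut points give four bins
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE   # assumed: intervals closed on the left
)

# Quantile-based groups: quartile boundaries are computed from the data
x <- c(-3.1, -1.2, 0.4, 2.8, 5.0, 7.3, 1.1, -2.0)
q <- quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE)
x_cat <- cut(x, breaks = q, labels = paste0("Q", 1:4), include.lowest = TRUE)
```
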
df <- derive_cut(
  df,
  col    = "p21001_i0",  # body_mass_index_bmi_i0
  n      = 4,
  breaks = c(18.5, 25, 30),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  name   = "bmi_cat"
)
df <- derive_cut(
  df,
  col    = "p22189",  # townsend_deprivation_index_at_recruitment
  n      = 4,
  labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
  name   = "tdi_cat"
)

derive_selfreport() searches UKB self-reported
non-cancer illness (field 20002) or cancer (field 20001) columns for a
disease label matching a regex, then returns binary status and the
earliest report date. Column detection is automatic from field IDs.
# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
  name  = "dm",
  regex = "type 2 diabetes"
)

# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
  name  = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)

This adds two columns per call:
| Column | Type | Description |
|---|---|---|
| dm_selfreport | logical | TRUE if any instance matched |
| dm_selfreport_date | IDate | Earliest report date |

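
The logic behind these two columns (match a regex in any slot, then keep the earliest matching date) can be sketched in base R. The layout below is hypothetical: two label columns with one date column each, toy values only. It does not reproduce how derive_selfreport() locates or pairs the real UKB columns.

```r
# Hypothetical decoded labels in two slots, with one report date per slot
labels <- data.frame(
  a0 = c("type 2 diabetes", "asthma", NA),
  a1 = c("hypertension", "type 2 diabetes", NA),
  stringsAsFactors = FALSE
)
dates <- data.frame(
  a0 = as.Date(c("2005-03-01", "1999-06-15", NA)),
  a1 = as.Date(c("2001-01-01", "2010-09-30", NA))
)

# Logical matrix: TRUE where a slot's label matches the regex (NA never matches)
hit <- sapply(labels, function(x) grepl("type 2 diabetes", x, ignore.case = TRUE))
status <- rowSums(hit) > 0

# Keep only dates from matching slots, then take the row-wise minimum
date_num <- sapply(dates, as.numeric)
date_num[!hit] <- NA
earliest <- apply(date_num, 1, function(r) {
  if (all(is.na(r))) NA_real_ else min(r, na.rm = TRUE)
})
earliest <- as.Date(earliest, origin = "1970-01-01")
```

Note that dates from non-matching slots are masked before the minimum is taken, so an earlier report of a different illness does not leak into the result.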
derive_hes() scans UKB Hospital Episode Statistics
ICD-10 codes (field 41270, stored as a JSON array per participant) and
matches the earliest corresponding date from field 41280.
# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")
# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")
# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")

The match argument controls how codes are compared:
| match | Behaviour | Example |
|---|---|---|
| "prefix" (default) | Code starts with pattern | "E11" matches "E110", "E119" |
| "exact" | Full 3- or 4-character match | "E11" matches only "E11" |
| "regex" | Full regular expression | "^E1[01]" |

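
The three modes correspond to three familiar base-R comparisons. The helper below, `match_icd10()`, is hypothetical and exists only to illustrate the table; the package's own matching code may differ in detail.

```r
# Hypothetical helper illustrating the three comparison modes
match_icd10 <- function(codes, pattern, match = c("prefix", "exact", "regex")) {
  match <- match.arg(match)
  switch(match,
    prefix = startsWith(codes, pattern),  # code begins with the pattern
    exact  = codes == pattern,            # code equals the pattern
    regex  = grepl(pattern, codes)        # pattern is a regular expression
  )
}

codes <- c("E110", "E119", "E11", "E109", "I10")
res_prefix <- match_icd10(codes, "E11")
res_exact  <- match_icd10(codes, "E11", match = "exact")
res_regex  <- match_icd10(codes, "^E1[01]", match = "regex")
```

In this sketch the prefix mode also accepts the bare three-character code, since every string starts with itself.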
UKB First Occurrence fields (p131xxx) record the earliest date a condition was observed across all linked sources — self-report, HES inpatient, GP records, and death registry — pre-integrated by UKB. Look up your disease in the UKB Field Finder.
# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")

derive_cancer_registry() searches the cancer registry
ICD-10 field (40006) and optionally filters by histology (field 40011)
and behaviour (field 40012).
# ICD-10 only
df <- derive_cancer_registry(df,
  name  = "skin_cancer",
  icd10 = "^C44"
)
# With histology and behaviour filters
df <- derive_cancer_registry(df,
  name      = "scc",
  icd10     = "^C44",
  histology = c(8070L, 8071L, 8072L),
  behaviour = 3L  # 3 = malignant
)

derive_death_registry() searches primary (field 40001)
and secondary (field 40002) causes of death for ICD-10 codes.
df <- derive_death_registry(df, name = "mi", icd10 = "I21")
df <- derive_death_registry(df, name = "dm", icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")

derive_icd10() is a high-level wrapper that calls any
combination of the source-specific functions above and merges their
outputs into a single status column and earliest date. This is the
recommended approach for multi-source ascertainment.
# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
  name   = "dm",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"
)
# Cancer outcome: cancer registry
df <- derive_icd10(df,
  name      = "lung",
  icd10     = "^C3[34]",
  match     = "regex",
  source    = "cancer_registry",
  behaviour = 3L
)

Intermediate source columns are retained alongside the combined result:
| Column | Type | Description |
|---|---|---|
| dm_icd10 | logical | TRUE if positive in any specified source |
| dm_icd10_date | IDate | Earliest date across all sources |
| dm_hes | logical | HES status |
| dm_hes_date | IDate | HES date |
| dm_fo | logical | First Occurrence status |
| dm_fo_date | IDate | First Occurrence date |
| dm_death | logical | Death registry status |
| dm_death_date | IDate | Death registry date |

derive_case() merges the self-report and ICD-10 flags
into a unified case status, with the earliest date across both sources
taken via pmin().
Output columns:
| Column | Type | Description |
|---|---|---|
| dm_status | logical | TRUE if positive in self-report OR ICD-10 |
| dm_date | IDate | Earliest date across all sources (pmin) |

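
The merge rule (logical OR for status, pmin() for the date) can be shown with base R on toy vectors. The variable names below are illustrative only and plain Date is used instead of IDate; this is a sketch of the rule, not the body of derive_case().

```r
# Hypothetical per-source flags and dates for four participants
sr_flag  <- c(TRUE, FALSE, FALSE, TRUE)
sr_date  <- as.Date(c("2004-05-01", NA, NA, "2012-01-10"))
icd_flag <- c(TRUE, TRUE, FALSE, FALSE)
icd_date <- as.Date(c("2002-11-20", "2015-07-04", NA, NA))

# A case is anyone positive in either source
status <- sr_flag | icd_flag

# Element-wise earliest date; na.rm = TRUE keeps the one available date
# when the other source is missing, and returns NA when both are missing
case_date <- pmin(sr_date, icd_date, na.rm = TRUE)
```

Without `na.rm = TRUE`, pmin() would return NA for any participant who is positive in only one source, which would silently drop their event date.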
Why the earliest date matters:
dm_date is the direct input to derive_timing(), derive_age(), and derive_followup(); it is the chronological anchor of every downstream survival analysis. See vignette("derive-survival").
- ?derive_missing, ?derive_covariate, ?derive_cut
- ?derive_selfreport, ?derive_hes, ?derive_first_occurrence
- ?derive_cancer_registry, ?derive_death_registry
- ?derive_icd10, ?derive_case
- vignette("derive-survival"): timing, age at event, follow-up
- vignette("decode"): decoding column names and values