1 Introduction

This vignette aims to illustrate the use of pubh functions for typical regression analysis in Public Health. In particular, the vignette shows the use of the following functions from pubh:

cosm_reg for reporting tables of coefficients.
cross_tbl for reporting tables of descriptive statistics by exposure of interest.
multiple for adjusting confidence intervals and p-values for post-hoc analysis.
get_r2 for accessing \(R^2\) or pseudo-\(R^2\) from regression models.
glm_coef for some special cases of regression models.

The advantages and limitations of glm_coef are:

Recognises the main models used in epidemiology/public health.
Automatically back-transforms estimates and confidence intervals, when the model requires it.
Can use robust standard errors for the calculation of confidence intervals.
- Standard errors are used by default.
- The use of standard errors is restricted by the following classes of objects (models): gee, glm and survreg.
Can display nice labels for the names of the parameters.
Returns a data frame that can be modified and/or exported as tables for publications (with further editing).

We start by loading relevant packages and setting the theme for the plots (as suggested in the Template of this package):

rm(list = ls())
library(dplyr)
library(rstatix)
library(easystats)
library(pubh)
library(sjlabelled)

theme_set(see::theme_lucid(base_size = 10))
theme_update(legend.position = "top")
options('huxtable.knit_print_df' = FALSE)
options('huxtable.autoformat_number_format' = list(numeric = "%5.2f"))
knitr::opts_chunk$set(comment = NA)

2 Multiple Linear Regression

There is no need to exponentiate the results for continuous outcomes unless the outcome fits in the log scale. In our first example, we want to estimate the effect of smoking and race on babies’ birth weights.

We can generate factors and assign labels in the same pipe stream:

data(birthwt, package = "MASS")
birthwt <- birthwt |>
  mutate(
    smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
    race = factor(race, labels = c("White", "African American", "Other"))
    ) |>
  var_labels(
    bwt = 'Birth weight (g)',
    smoke = 'Smoking status',
    race = 'Race'
    )

Is good to start with some basic descriptive statistics, so we can compare the birth weight between groups.

Graphical analysis:

birthwt |>
  box_plot(bwt ~ smoke, fill = ~ race)

Another way to compare the means between the groups is with gen_bst_df, which estimates the means of corresponding bootstrapped CIs.

birthwt |>
  gen_bst_df(bwt ~ race|smoke) |>
  as_hux() |> theme_pubh()

Birth weight (g)	LowerCI	UpperCI	Race	smoke
3428.75	3228.14	3635.42	White	Non-smoker
2826.85	2666.18	2996.39	White	Smoker
2854.50	2536.03	3139.63	African American	Non-smoker
2504.00	2124.68	2853.81	African American	Smoker
2815.78	2623.95	2998.48	Other	Non-smoker
2757.17	2292.10	3136.92	Other	Smoker

We fit a linear model.

model_norm <- lm(bwt ~ smoke + race, data = birthwt)

Note: Model diagnostics are not be discussed in this vignette.

Traditional output from the model:

model_norm |> Anova()

Anova Table (Type II tests)

Response: bwt
            Sum Sq  Df F value    Pr(>F)    
smoke      7322575   1 15.4588 0.0001191 ***
race       8712354   2  9.1964 0.0001557 ***
Residuals 87631356 185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model_norm |> parameters()

Parameter               | Coefficient |     SE |             95% CI | t(185) |      p
-------------------------------------------------------------------------------------
(Intercept)             |     3334.95 |  91.78 | [3153.89, 3516.01] |  36.34 | < .001
smoke [Smoker]          |     -428.73 | 109.04 | [-643.86, -213.60] |  -3.93 | < .001
race [African American] |     -450.36 | 153.12 | [-752.45, -148.27] |  -2.94 | 0.004 
race [Other]            |     -452.88 | 116.48 | [-682.67, -223.08] |  -3.89 | < .001


Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

model_norm |> performance()

# Indices of model performance

AIC      |     AICc |      BIC |    R2 | R2 (adj.) |    RMSE |   Sigma
----------------------------------------------------------------------
3012.223 | 3012.551 | 3028.432 | 0.123 |     0.109 | 680.924 | 688.246

Table of coefficients for publication:

model_norm |>
  tbl_regression() |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_norm), font_size = 9)

Variable	Beta	95% CI	p-value
Smoking status			<0.001
Non-smoker	—	—
Smoker	-429	-644, -214
Race			<0.001
White	—	—
African American	-450	-752, -148
Other	-453	-683, -223
R2 = 0.123

model_norm |> 
  glm_coef(labels = model_labels(model_norm)) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_norm), font_size = 9)

Parameter	Coefficient	Pr(>\|t\|)
Constant	3334.95 (3153.89, 3516.01)	< 0.001
Smoking status: Smoker	-428.73 (-643.86, -213.6)	< 0.001
Race: African American	-450.36 (-752.45, -148.27)	0.004
Race: Other	-452.88 (-682.67, -223.08)	< 0.001
R2 = 0.123

Function glm_coef allows the use of robust standard errors.

model_norm |>
  glm_coef(se_rob = TRUE, labels = model_labels(model_norm)) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(paste(
    get_r2(model_norm), "\n",
    "CIs and p-values estimated with robust standard errors."),
    font_size = 9)

Parameter	Coefficient	Pr(>\|t\|)
Constant	3334.95 (3144.36, 3525.53)	< 0.001
Smoking status: Smoker	-428.73 (-652.88, -204.58)	< 0.001
Race: African American	-450.36 (-734.09, -166.63)	0.002
Race: Other	-452.88 (-701.4, -204.35)	< 0.001
R2 = 0.123 CIs and p-values estimated with robust standard errors.

We can use functions from easystats: estimate_means calculates the mean birth weight for each combination with corresponding confidence intervals.

model_norm |>
  estimate_means(c("race", "smoke"))

Estimated Marginal Means

race             |      smoke |    Mean |     SE |             95% CI
---------------------------------------------------------------------
White            | Non-smoker | 3334.95 |  91.78 | [3153.89, 3516.01]
African American | Non-smoker | 2884.59 | 141.34 | [2605.74, 3163.44]
Other            | Non-smoker | 2882.07 |  86.32 | [2711.77, 3052.37]
White            |     Smoker | 2906.22 |  86.21 | [2736.14, 3076.30]
African American |     Smoker | 2455.86 | 150.74 | [2158.48, 2753.24]
Other            |     Smoker | 2453.34 | 122.81 | [2211.05, 2695.63]

Marginal means estimated at race

We can plot the output from estimate_means using a recipe from see:

model_norm |> 
  estimate_means(by = c("race", "smoke")) |> 
  plot() |> 
  gf_labs(
    x = "", y = "Birth weight (g)", title = "",
    col = "Smoking status"
  )

We have more flexibility and construct a more tidy plot too:

model_norm |>
  estimate_means(c("race", "smoke")) |>
  gf_point(Mean ~ race) |>
  gf_errorbar(CI_low + CI_high ~ race, width = 0.3) |>
  gf_facet_wrap(~smoke) |>
  gf_labs(
    x = "Race",
    y = "Birth weight (g)"
  )

When the explanatory variables are categorical, another option is emmip from the emmeans package. We can include CIs in emmip but as estimates are connected, the resulting plots look messy, so I recommend emmip to look at the trace.

emmip(model_norm, smoke ~ race) |>
  gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")

3 Logistic Regression

For logistic regression, we are interested in odds ratios. We will examine the effect of fibre intake on the development of coronary heart disease.

data(diet, package = "Epi")
diet <- diet |>
  mutate(
    chd = factor(chd, labels = c("No CHD", "CHD"))
  ) |>
  var_labels(
    chd = "Coronary Heart Disease",
    fibre = "Fibre intake (10 g/day)"
    )

We start with descriptive statistics:

diet |> estat(~ fibre|chd) |>
  as_hux() |> theme_pubh()

	Coronary Heart Disease	N	Min	Max	Mean	Median	SD	CV
Fibre intake (10 g/day)	No CHD	288.00	0.60	5.35	1.75	1.69	0.58	0.33
	CHD	45.00	0.76	2.43	1.49	1.51	0.40	0.27

diet |> na.omit() |>
  copy_labels(diet) |>
  box_plot(fibre ~ chd)

We fit a linear logistic model:

model_binom <- glm(chd ~ fibre, data = diet, family = binomial)

Summary:

model_binom |> parameters(exponentiate = TRUE)

Parameter   | Odds Ratio |   SE |       95% CI |     z |     p
--------------------------------------------------------------
(Intercept) |       0.95 | 0.58 | [0.30, 3.16] | -0.08 | 0.936
fibre       |       0.33 | 0.12 | [0.15, 0.66] | -2.94 | 0.003


Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
  computed using a Wald z-distribution approximation.

model_binom |> performance()

# Indices of model performance

AIC     |    AICc |     BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log | Score_spherical |   PCP
--------------------------------------------------------------------------------------------------------
257.535 | 257.571 | 265.151 |     0.028 | 0.337 | 1.000 |    0.381 |    -6.359 |           0.017 | 0.773

Table of coefficients for publication:

model_binom |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_binom), font_size = 9)

Variable	OR	95% CI	p-value
Fibre intake (10 g/day)	0.33	0.15, 0.66	0.001
Tjur's R2 = 0.028

Effect plot:

model_binom |>
  estimate_means(by = "fibre") |>
  gf_line(Probability ~ fibre) |>
  gf_ribbon(CI_low + CI_high ~ fibre) |>
  gf_labs(
    x = "Fibre intake (10 g/day)",
    y = "Probability of CHD"
  )

3.1 Matched Case-Control Studies: Conditional Logistic Regression

We will look at a matched case-control study on the effect of oestrogen use and history of gall bladder disease on the development of endometrial cancer.

data(bdendo, package = "Epi") 
bdendo <- bdendo |>
  mutate(
    cancer = factor(d, labels = c('Control', 'Case')),
    gall = factor(gall, labels = c("No GBD", "GBD")),
    est = factor(est, labels = c("No oestrogen", "Oestrogen"))
  ) |>
  var_labels(
    cancer = 'Endometrial cancer',
    gall = 'Gall bladder disease',
    est = 'Oestrogen'
  )

We start with descriptive statistics:

bdendo |> 
  mutate(
    cancer = relevel(cancer, ref = "Case"),
    gall = relevel(gall, ref = "GBD"),
    est = relevel(est, ref = "Oestrogen")
  ) |>
  copy_labels(bdendo) |>
  select(cancer, gall, est) |> 
  tbl_strata(
    strata = est,
    .tbl_fun = ~ .x |>
      tbl_summary(by = gall)
  ) |> 
  cosm_sum(bold = TRUE, head_label = " ") |> 
  theme_pubh(2) |> 
  set_align(1, everywhere, "center")

	Oestrogen		No oestrogen
	GBD N = 29	No GBD N = 154	GBD N = 12	No GBD N = 120
Endometrial cancer
Case	13 (45%)	43 (28%)	4 (33%)	3 (2.5%)
Control	16 (55%)	111 (72%)	8 (67%)	117 (98%)

bdendo |>
  gf_percents(~ cancer|gall, fill = ~ est, position = "dodge", alpha = 0.6) |>
  gf_labs(
    y = "Percent",
    x = "",
    fill = ""
  )

We fit the conditional logistic model:

require(survival, quietly = TRUE)
model_clogit <- clogit(cancer == 'Case' ~ est * gall + strata(set), data = bdendo)

model_clogit |>
  glm_coef(labels = c("Oestrogen/No oestrogen", "GBD/No GBD",
                      "Oestrogen:GBD Interaction")) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_clogit), font_size = 9)

Parameter	Odds ratio	Pr(>\|z\|)
Oestrogen/No oestrogen	14.88 (4.49, 49.36)	< 0.001
GBD/No GBD	18.07 (3.2, 102.01)	0.001
Oestrogen:GBD Interaction	0.13 (0.02, 0.9)	0.039
Nagelkerke's R2 = 0.305

3.2 Ordinal Logistic Regression

We use data about house satisfaction.

#require(MASS, quietly = TRUE)
data(housing, package = "MASS")
housing <- housing |>
  var_labels(
    Sat = "Satisfaction",
    Infl = "Perceived influence",
    Type = "Type of rental",
    Cont = "Contact"
    )

We fit the ordinal logistic model:

model_ord <- MASS::polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)

model_ord |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_ord), font_size = 9)

Variable	OR	95% CI	p-value
Perceived influence			<0.001
Low	—	—
Medium	1.76	1.44, 2.16
High	3.63	2.83, 4.66
Type of rental			<0.001
Tower	—	—
Apartment	0.56	0.45, 0.71
Atrium	0.69	0.51, 0.94
Terrace	0.34	0.25, 0.45
Contact			<0.001
Low	—	—
High	1.43	1.19, 1.73
Nagelkerke's R2 = 0.108

Effect plots:

model_ord |>
  estimate_means(by = c("Infl", "Cont", "Type")) |>
  mutate(
    Mean = inv_logit(Mean),
    CI_low = inv_logit(CI_low),
    CI_high = inv_logit(CI_high)
  ) |>
  gf_errorbar(CI_low + CI_high ~ Infl, width = 0.2) |>
  gf_point(Mean ~ Infl, size = 0.7) |>
  gf_facet_grid(~ Type + Cont) |>
  gf_labs(
    x = "Perceived influence",
    y = "Satisfaction"
  )

emmip(model_ord, Infl ~ Type |Cont) |>
  gf_labs(x = "Type of rental", col = "Perceived influence")

4 Poisson Regression

For Poisson regression, we are interested in incidence rate ratios. We will examine the effect of sex, ethnicity, and age group on the number of absent days from school in a year.

data(quine, package = "MASS")
levels(quine$Eth) <- c("Aboriginal", "White")
levels(quine$Sex) <- c("Female", "Male")

quine <- quine |>
  var_labels(
    Days = "Number of absent days",
    Eth = "Ethnicity",
    Age = "Age group"
    )

Descriptive statistics:

quine |> 
  cross_tbl(by = "Eth") |> 
  theme_pubh(2) |> 
  add_footnote("n (%); Median (IQR)", font_size = 9)

	Ethnicity
	Aboriginal N = 69	White N = 77	Overall N = 146
Sex
Female	38 (55%)	42 (55%)	80 (55%)
Male	31 (45%)	35 (45%)	66 (45%)
Age group
F0	13 (19%)	14 (18%)	27 (18%)
F1	20 (29%)	26 (34%)	46 (32%)
F2	20 (29%)	20 (26%)	40 (27%)
F3	16 (23%)	17 (22%)	33 (23%)
Lrn
AL	40 (58%)	43 (56%)	83 (57%)
SL	29 (42%)	34 (44%)	63 (43%)
Number of absent days	15 (6, 32)	7 (4, 16)	11 (5, 23)
n (%); Median (IQR)

quine |>
  box_plot(Days ~ Age|Sex, fill = ~ Eth)

We start by fitting a standard Poisson linear regression model:

model_pois <- glm(Days ~ Eth + Sex + Age, family = poisson, data = quine)

model_pois |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_pois), font_size = 9)

Variable	IRR	95% CI	p-value
Ethnicity			<0.001
Aboriginal	—	—
White	0.59	0.54, 0.64
Sex			0.012
Female	—	—
Male	1.11	1.02, 1.21
Age group			<0.001
F0	—	—
F1	0.80	0.70, 0.91
F2	1.42	1.26, 1.60
F3	1.35	1.19, 1.53
Nagelkerke's R2 = 0.896

model_pois |> performance()

# Indices of model performance

AIC      |     AICc |      BIC | Nagelkerke's R2 |   RMSE | Sigma | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------
2342.982 | 2343.586 | 2360.883 |           0.896 | 14.940 | 1.000 |    -7.983 |           0.049

4.1 Negative-binomial

The assumption is that the mean is equal to the variance. If that is the case, deviance should be close to the degrees of freedom of the residuals. We can check for overdispersion:

model_pois |> check_overdispersion()

# Overdispersion test

       dispersion ratio =   13.890
  Pearson's Chi-Squared = 1944.553
                p-value =  < 0.001

Overdispersion detected.

Thus, we have overdispersion. One option is to use a negative binomial distribution.

model_negbin <- MASS::glm.nb(Days ~ Eth + Sex + Age, data = quine)

model_negbin |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_negbin), font_size = 9)

Variable	IRR	95% CI	p-value
Ethnicity			<0.001
Aboriginal	—	—
White	0.57	0.41, 0.77
Sex			0.7
Female	—	—
Male	1.07	0.77, 1.47
Age group			0.023
F0	—	—
F1	0.69	0.43, 1.09
F2	1.20	0.75, 1.89
F3	1.29	0.80, 2.06
Nagelkerke's R2 = 0.21

model_negbin |> performance()

# Indices of model performance

AIC      |     AICc |      BIC | Nagelkerke's R2 |   RMSE | Sigma | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------
1109.653 | 1110.464 | 1130.538 |           0.210 | 15.087 | 1.000 |    -3.908 |           0.067

Effect plot:

model_negbin |>
  estimate_means(by = c("Age", "Eth", "Sex")) |>
  gf_errorbar(CI_low + CI_high ~ Age, width = 0.2) |>
  gf_point(Mean ~ Age, size = 1) |>
  gf_facet_wrap(~ Eth + Sex) |>
  gf_labs(
    x = "Age group",
    y = "Number of absent days"
  )

emmip(model_negbin, Eth ~ Age|Sex) |>
  gf_labs(y = "Number of absent days", x = "Age group", col = "Ethnicity")

4.2 Adjusting CIs and p-values for multiple comparisons

We adjust for multiple comparisons:

multiple(model_negbin, ~ Age|Eth)$df

   contrast        Eth ratio   SE null z.ratio p.value lower.CL upper.CL
1   F1 / F0 Aboriginal  0.69 0.16    1   -1.57    0.40     0.38     1.26
2   F2 / F0 Aboriginal  1.20 0.28    1    0.77    0.86     0.66     2.17
3   F2 / F1 Aboriginal  1.73 0.35    1    2.65    0.04     1.02     2.92
4   F3 / F0 Aboriginal  1.29 0.31    1    1.04    0.72     0.69     2.40
5   F3 / F1 Aboriginal  1.86 0.40    1    2.89    0.02     1.07     3.21
6   F3 / F2 Aboriginal  1.08 0.23    1    0.34    0.99     0.62     1.88
7   F1 / F0      White  0.69 0.16    1   -1.57    0.40     0.38     1.26
8   F2 / F0      White  1.20 0.28    1    0.77    0.86     0.66     2.17
9   F2 / F1      White  1.73 0.35    1    2.65    0.04     1.02     2.92
10  F3 / F0      White  1.29 0.31    1    1.04    0.72     0.69     2.40
11  F3 / F1      White  1.86 0.40    1    2.89    0.02     1.07     3.21
12  F3 / F2      White  1.08 0.23    1    0.34    0.99     0.62     1.88

We can see the comparison graphically with:

multiple(model_negbin, ~ Age|Eth)$fig_ci |>
  gf_labs(x = "IRR")

5 Survival Analysis

We will use an example on the effect of thiotepa versus placebo on the development of bladder cancer.

data(bladder)

Warning in data(bladder): data set 'bladder' not found

bladder <- bladder |>
  mutate(times = stop,
         rx = factor(rx, labels=c("Placebo", "Thiotepa"))
         ) |>
  var_labels(times = "Survival time",
             rx = "Treatment")

5.1 Parametric method

model_surv <- survreg(Surv(times, event) ~ rx, data = bladder)

Using robust standard errors:

model_surv |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo", "Scale"), se_rob = TRUE) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh(1) |>
  add_footnote(get_r2(model_surv), font_size = 9)

Parameter	Survival time ratio	Pr(>\|z\|)
Treatment: Thiotepa/Placebo	1.64 (0.89, 3.04)	0.116
Scale	1 (0.85, 1.18)	0.992
Nagelkerke's R2 = 0.02

In this example, the scale parameter is not statistically different from one, meaning the hazard is constant, and thus, we can use the exponential distribution:

model_exp <- survreg(Surv(times, event) ~ rx, data = bladder, dist = "exponential")

model_exp |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo"), se_rob = TRUE) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_exp), font_size = 9)

Parameter	Survival time ratio	Pr(>\|z\|)
Treatment: Thiotepa/Placebo	1.64 (0.85, 3.16)	0.139
Nagelkerke's R2 = 0.02

Interpretation: Patients receiving Thiotepa live on average 64% more than those in the Placebo group.

Using naive standard errors:

model_exp |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo")) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_exp), font_size = 9)

Parameter	Survival time ratio	Pr(>\|z\|)
Treatment: Thiotepa/Placebo	1.64 (1.11, 2.41)	0.012
Nagelkerke's R2 = 0.02

model_exp |>
  estimate_means(by = "rx") |>
  gf_errorbar(CI_low + CI_high ~ rx, width = 0.3) |>
  gf_point(Mean ~ rx, size = 1) |>
  gf_labs(
    x = "Treatment",
    y = "Survival time"
  )

5.2 Cox proportional hazards regression

model_cox <-  coxph(Surv(times, event) ~ rx, data = bladder)

model_cox |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_cox), font_size = 9)

Variable	HR	95% CI	p-value
Treatment			0.022
Placebo	—	—
Thiotepa	0.64	0.44, 0.94
Nagelkerke's R2 = 0.016

model_cox |>
  estimate_means(by = "rx")

Estimated Marginal Means

rx       | Mean |   SE |       95% CI
-------------------------------------
Placebo  | 1.00 | 0.00 | [1.00, 1.00]
Thiotepa | 0.64 | 0.13 | [0.44, 0.94]

Marginal means estimated at rx

Interpretation: Patients receiving Thiotepa are 64% less likely of dying than those in the Placebo group.

Regression Examples

Josie Athens

2024-08-17