An Introduction to AssocBin

This vignette introduces the main features of the AssocBin package. It begins with a high-level overview of the basic functions and their uses before examining ways to customize the package behaviour. The examples are interspersed with relevant theory to motivate the package, and a full treatment can be found in Salahub and Oldford, 2025.

Basic use

It’s easiest to understand the use of AssocBin in the context of exploring a data set. Included in the package is a version of the heart disease data from the UCI machine learning data repository.

Heart data

The heart data can be loaded from the package using:

data(heart)

Here is a summary of its structure:

str(heart)

## 'data.frame':    920 obs. of  15 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "atypical","non-angina",..: 4 3 3 2 1 1 3 3 3 3 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ restecg : Factor w/ 3 levels "hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "down","flat",..: 1 2 2 1 3 3 1 3 2 1 ...
##  $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal    : Factor w/ 3 levels "normal","fixed",..: 2 1 3 1 1 1 1 1 3 3 ...
##  $ num     : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
##  $ study   : chr  "cleveland" "cleveland" "cleveland" "cleveland" ...

It contains 920 observations of 15 variables collected on patients referred to various hospitals around the world to undergo a series of measurements of heart function in order to relate them to the presence of coronary heart disease. The variables are:

age: age
sex: sex
cp: clinical description of any chest pain
trestbps: resting blood pressure on hospital admission
chol: blood serum cholesterol concentration
fbs: indicator of whether fasting blood sugar is greater than 120 mg/dl
restecg: classification of heart waves at rest as measured by an electrocardiogram
thalach: maximum heart rate achieved in an exercise test
exang: whether the exercise test induced angina
oldpeak: ST heart wave depression induced by the exercise test
slope: the slope of the ST heart wave peak during the exercise test
ca: the count of calcified major blood vessels in the heart identified by fluoroscopic imaging
thal: categorization of any defects in heart circulation induced by exercise as measured by thallium scintigraphy
num: count of major blood vessels in the heart with a narrowing of greater than 50%
study: the location of the patient’s testing

Of particular interest is the num variable, the original response in the study which collected the data (Detrano et al., 1989). It counts the number of diseased coronary vessels, where the presence of disease is defined as a narrowing of the vessel by more than 50% from a healthy baseline. Patients with num=0 have hearts without serious coronary artery disease, and the severity of disease increases with each integer increase of num as more blood vessels are significantly blocked.

For simplicity, we’ll clean the data somewhat by removing mostly missing variables and dropping incomplete observations for the rest of the vignette.

heartClean <- heart
heartClean$thal <- NULL
heartClean$ca <- NULL
heartClean$slope <- NULL
heartClean <- na.omit(heartClean)
str(heartClean)

## 'data.frame':    740 obs. of  12 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "atypical","non-angina",..: 4 3 3 2 1 1 3 3 3 3 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ restecg : Factor w/ 3 levels "hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ num     : Factor w/ 5 levels "0","1","2","3",..: 1 3 2 1 1 1 4 1 3 2 ...
##  $ study   : chr  "cleveland" "cleveland" "cleveland" "cleveland" ...
##  - attr(*, "na.action")= 'omit' Named int [1:180] 306 331 335 338 348 369 376 379 385 390 ...
##   ..- attr(*, "names")= chr [1:180] "306" "331" "335" "338" ...

Exploration using basic functions

The simplest ways to use the AssocBin package to explore a data set are the DepSearch and depDisplay functions. DepSearch performs all pairwise comparisons between variables using recursive random binning and returns the results in a DepSearch S3 object. depDisplay generates a departure display, a two-dimensional histogram highlighting areas of high and low density, for a given variable pair.

Dual categorical variables

We start by comparing a pair of variables directly. With depDisplay, we can inspect the relationship between patient sex and num in a departure display:

depDisplay(heartClean$sex, heartClean$num)

Optional arguments can be supplied to change plot features following the plot naming conventions.

SexVsNum <- depDisplay(heartClean$sex, heartClean$num, xlab = "Sex", 
                       ylab = "Number of arteries >50% obstructed", 
                       pch = 20)

Labels and point types aside, reading this plot requires a basic understanding of the underlying algorithm. sex and num are both categorical variables, and so the departure display is a particular way of encoding the contingency table between them. Explicitly:

rbind(cbind(table(num = heartClean$num, sex = heartClean$sex), total = table(heartClean$num)),
            total = c(table(heartClean$sex), nrow(heartClean)))

##       female male total
## 0        131  226   357
## 1         26  178   204
## 2          7   72    79
## 3          8   70    78
## 4          2   20    22
## total    174  566   740

Each coloured cell, or bin, in the departure display corresponds to a count in the table excluding the columns and rows labelled total, which provide the marginal distributions. The width and height of each bin reflect these distributions and are proportional to the corresponding column and row totals respectively. The area of each bin is therefore proportional to the expected proportion of points it contains under the assumption of independence (when the joint distribution is the product of the marginal distributions). Saturation and hue communicate how severely the observed counts exceed or fall short of this expected count.

Take, for example, the bin labelled ‘female’ horizontally and ‘0’ vertically. The width of this bin is given by the count of female patients (174) divided by the total number of patients (740) then multiplied by the width of the plotting area. This means it occupies a relative width of \(w= 174/740 = 0.235\) of the plot width. Its height is similarly determined by the count of patients without any coronary artery disease (CAD) divided by the total number and it has a relative height of \(h= 357/740 = 0.482\) to the plot.

Under independence, the joint probability \(P(\text{sex}=x, \text{num}=y)\) obeys the factorization \[P(\text{sex}=x, \text{num}=y) = P(\text{sex}=x) P(\text{num}=y)\] and so the expected count of patients in our example bin, female patients without CAD, is given by \[\frac{357}{740} \times \frac{174}{740} \times 740 = 83.9.\] Referring to the analogous bin in the contingency table, we have observed 131. As this is a larger number than expected, the bin is given a red hue (blue-shaded bins indicate fewer observations in a bin than expected). The saturation of this shading is determined by the magnitude of the standardized Pearson residual. For bin \(i\) with expected count \(e_i\), observed count \(o_i\), relative width \(w_i\), and relative height \(h_i\) this is defined as \[r_i = \frac{o_{i} - e_{i}}{\sqrt{e_{i}(1 - w_i)(1 - h_i)}}.\] The standardized Pearson residuals are a corrected version of the typical Pearson residuals for contingency tables which approximately follow a standard normal distribution under independence. This fact is used in the departure display to determine the saturation: no saturation is applied to standardized residuals which have an absolute value less than 2, and a colour ramp is applied above this which achieves its deepest saturation at 4. For our example bin, the standardized residual is \[\frac{131 - 83.9}{\sqrt{83.9 \left ( 1 - \frac{174}{740} \right ) \left ( 1 - \frac{357}{740} \right )}} = 8.17,\] which is well beyond the upper end of the colour ramp and so receives the deepest possible saturation.
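The arithmetic above can be reproduced for every bin at once in base R. The following is a minimal sketch that copies the counts from the contingency table and applies the formulas for the relative widths, relative heights, expected counts, and standardized Pearson residuals; the object names (counts, w, h, expected, stdRes) are chosen here for illustration and are not part of AssocBin.

```r
# contingency table of num (rows) by sex (columns), copied from the table above
counts <- matrix(c(131, 26, 7, 8, 2,
                   226, 178, 72, 70, 20),
                 ncol = 2,
                 dimnames = list(num = as.character(0:4),
                                 sex = c("female", "male")))
total <- sum(counts)

# relative widths (sex margin) and relative heights (num margin) of the bins
w <- colSums(counts) / total
h <- rowSums(counts) / total

# expected counts under independence: total * h_i * w_j
expected <- total * outer(h, w)
dimnames(expected) <- dimnames(counts)

# standardized Pearson residuals for every bin
stdRes <- (counts - expected) / sqrt(expected * outer(1 - h, 1 - w))

round(expected["0", "female"], 1) # expected count of the example bin
round(stdRes["0", "female"], 2)   # standardized residual of the example bin
```

Carried out without intermediate rounding, the residual of the example bin is about 8.16; the value of 8.17 in the text results from plugging in the rounded expected count of 83.9. The same residuals are returned in the stdres element of base R's chisq.test(counts).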

The same process applied to this example bin is applied to all other bins to obtain their hues and saturations before \(o_i\) points are overlaid at randomly chosen positions within each bin to add a second visual display of density. In this way, the departure display communicates visually the departure of the observed counts from what we would expect if the two variables were independent. Areas of deep red saturation indicate regions with far more points, and areas of deep blue regions with far fewer points, than we would expect under typical sampling variation. The display therefore draws our attention to the areas that the model of independence does not explain well.

In the example bin, we can see the model of independence does not describe the observed pattern well: many more female patients lack CAD and many more male patients have CAD than we would expect under independence. Note the SexVsNum <- assignment in the depDisplay call. Aside from plotting, the function passes the resulting bins invisibly to allow further exploration. As each bin is stored as a list of features, these are not very easy to inspect:

str(SexVsNum, 1)

## List of 10
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7
##  $ :List of 7

str(SexVsNum[[1]])

## List of 7
##  $ x      : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ y      : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bnds   :List of 2
##   ..$ x: num [1:2] 0 174
##   ..$ y: num [1:2] 0 357
##  $ expn   : num 83.9
##  $ n      : int 131
##  $ depth  : num 1
##  $ stopped: logi TRUE

Helper functions allow us to compute aggregate and individual bin statistics, however. For example, to compute the \(\chi^2\) test statistic for independence, simply call binChi.

binChi(SexVsNum)

## $residuals
##  [1]  5.1360485 -3.1718170 -2.6858023 -2.4145554 -1.3950709 -2.8477085
##  [7]  1.7586302  1.4891569  1.3387626  0.7735042
## 
## $stat
## [1] 67.23966
## 
## $nbins
## [1] 10

This computes the \(\chi^2\) statistic and Pearson residuals for the bins. The correct degrees of freedom for this statistic are returned by other helper functions, but more on that later.
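As a quick check on how these pieces relate, the statistic is the sum of the squared Pearson residuals. The sketch below copies the residuals printed above into a plain vector (named res here for illustration) rather than calling the package.

```r
# Pearson residuals as printed by binChi(SexVsNum) above
res <- c(5.1360485, -3.1718170, -2.6858023, -2.4145554, -1.3950709,
         -2.8477085, 1.7586302, 1.4891569, 1.3387626, 0.7735042)

# the chi-squared statistic is the sum of squared residuals
stat <- sum(res^2)
stat
```

Up to the rounding of the printed residuals, this matches the value reported in $stat.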

Comparisons involving continuous variables

When one or both of the variables are continuous, the output of depDisplay changes in a few important ways even though it is read the same way. Consider a comparison of age and num.

set.seed(1235) # more on this later
# the depDisplay function also has a method for data.frames
AgeVsNum <- depDisplay(x = heartClean, pair="age:num", xlab = "Age", 
                       ylab = "Number of arteries >50% obstructed", 
                       pch = 20, col = adjustcolor('gray50', alpha.f=0.5))

Or, for a pair of continuous variables, thalach and oldpeak.

set.seed(812)
# store the bins under a name matching the pair being compared
ThalachVsOldpeak <- depDisplay(heartClean$thalach, heartClean$oldpeak, 
                               xlab = "Maximum heart rate during exercise",
                               ylab = "ST wave depression during exercise",
                               pch = 20, col = adjustcolor('gray50', alpha.f=0.5))

Note that in both of these cases, the bins no longer sit on a simple grid. Adding borders makes this even clearer (with the side effect of making this statistical graphic look a bit like a Piet Mondrian piece).

set.seed(812)
# same pair as above, now with bin borders drawn
ThalachVsOldpeak <- depDisplay(heartClean$thalach, heartClean$oldpeak, 
                               xlab = "Maximum heart rate during exercise",
                               ylab = "ST wave depression during exercise",
                               pch = 20, col = adjustcolor('gray50', alpha.f=0.5),
                               border = "black")

As before, the labels on the axis of the categorical variable denote the relative sizes of each labelled category. In contrast, the continuous margins are rather chaotic and worth discussing.

When both variables are categorical, the joint distribution can be fully described by the joint probabilities of each bin. When one, or both, of the variables being compared are continuous, representing the joint distribution between the two is more complicated. No single contingency table fully represents their joint distribution because aggregation obscures variation at finer resolutions. As well, any constant grid applied to every data set will have blind spots: patterns which it lacks power to detect. To create a set of bins to display and measure continuous data, then, we need a dynamic algorithm to build a bivariate histogram for a given data set.

Creating such a histogram can be done in many ways (see Chapter 2.3 here for a brief survey), but there are advantages to constructing them using random recursive splits (see Salahub and Oldford, 2025). These splits occur in a stepwise fashion, where each bin is split at each step until a set of stop criteria are satisfied. In the case of random recursive splits to measure association, natural stop criteria are based on the size of the bin which is proportional to the number of points we expect it to contain.

Of course, this requires that we know the expected count of each bin. We can accomplish this by converting continuous margins to their ranks, thereby ensuring a uniform distribution along the corresponding axis. To give a sense of the original distribution, the axis therefore displays the five number summary of the data at the corresponding ranks to give the minimum, maximum, median, and quartiles.
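To make the construction concrete, here is a toy version of random recursive binning on the rank scale written in base R. It is only a sketch under simplifying assumptions (a fair coin chooses the margin to split, the split point is a uniformly chosen integer coordinate, and the only stop criteria are a small expected count or an empty bin); it is not the implementation used by AssocBin, and the name randomBin is hypothetical.

```r
set.seed(42)
n <- 200
u <- rnorm(n)
v <- u + rnorm(n)

# ranks place each margin on the lattice 1, ..., n (ties broken at random)
rx <- rank(u, ties.method = "random")
ry <- rank(v, ties.method = "random")

# the five number summary of the original values, used to label a rank axis
fivenum(u)

# recursively split a bin at a random integer coordinate on a random margin
# idx: indices of the points in the bin; xb, yb: (lower, upper] rank bounds
randomBin <- function(idx, xb, yb, minExp = 10) {
  expn <- diff(xb) * diff(yb) / n # expected count is proportional to area
  if (expn <= minExp || length(idx) == 0) { # stop criteria
    return(list(list(xb = xb, yb = yb, n = length(idx), expn = expn)))
  }
  if (runif(1) < 0.5) { # split the horizontal margin
    at <- xb[1] + sample.int(diff(xb) - 1, 1)
    below <- rx[idx] <= at
    c(randomBin(idx[below], c(xb[1], at), yb, minExp),
      randomBin(idx[!below], c(at, xb[2]), yb, minExp))
  } else { # split the vertical margin
    at <- yb[1] + sample.int(diff(yb) - 1, 1)
    below <- ry[idx] <= at
    c(randomBin(idx[below], xb, c(yb[1], at), minExp),
      randomBin(idx[!below], xb, c(at, yb[2]), minExp))
  }
}

bins <- randomBin(seq_len(n), c(0, n), c(0, n))
obs <- vapply(bins, function(b) b$n, numeric(1))
expd <- vapply(bins, function(b) b$expn, numeric(1))

# Pearson chi-squared statistic over the final bins
chiStat <- sum((obs - expd)^2 / expd)
```

Because the ranks are uniform on each margin, the expected count of a bin under independence is simply its area divided by n, and the expected counts of the final bins sum to n however the splits happen to fall.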

With the construction understood, the interpretation of these plots continues largely the same as in the dual categorical case. In the plot of age and num, we can see dark red areas in the top right and bottom left corners and light blue areas in the bottom right and top left, suggesting that the number of blocked arteries tends to increase with age in the patients for this study. For thalach and oldpeak, the opposite trend is shown. In both cases, the saturation is much lighter than the case of sex and num, suggesting weaker associations for these latter two comparisons.

All variables at once

Instead of exploring the data piecemeal using pairs chosen one at a time, we can assess the associations between all pairs with one call to the DepSearch (for Dependence Search) function.

heartAssociations <- DepSearch(heartClean)

This returns a DepSearch object, which contains the generated bins for all pairs of variables in the dataset along with details such as the degrees of freedom of the binning, the number of bins, the \(\chi^2\) statistic for each pair, and the \(p\)-value of that statistic. These results can then be viewed at a high level using the associated summary method.

summary(heartAssociations)

## All 66 pairs in heartClean recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              21              35              10 
## 
## 52 pairs are significant at 5% and 42 pairs are significant at 1%
## 
## Most significant  10  pairs:
## study:chol  (1.7e-70) 
## study:restecg  (4.7e-57) 
## study:num  (1.3e-38) 
## num:exang  (1.5e-38) 
## num:cp  (2.3e-38) 
## exang:cp  (1.8e-35) 
## study:thalach  (2.7e-24) 
## exang:oldpeak  (1.4e-22) 
## study:age  (9.6e-21) 
## num:oldpeak  (1.6e-20)

Triplet plots which display the original data, the rank data, and the bins which form the basis of each \(p\)-value can be inspected using plot. By default, this displays the top five strongest associations.

plot(heartAssociations)

The indices of the pairs to display can be specified by the which argument. Note that values given to which specify the indices of the pairs when placed in order from strongest to weakest association, so that plot(heartAssociations, which = 1:5) produces the same plot as the default call. As there are 66 pairs in this data, the five weakest associations can be displayed by specifying which = 62:66.

plot(heartAssociations, which = 62:66)

By providing the data on the original scale, the rank scale, and as it is ‘seen’ by the algorithm through the binning of the ranks, an analyst can quickly understand the structure of any dependence between a pair of variables. Moreover, as all pairs are evaluated using \(p\)-values, comparisons between all pairs are fair regardless of the data types of each pair.

It should be noted here that these \(p\)-values are computed only approximately. As explored in Salahub and Oldford, 2025, the rank margins vary less than uniformly distributed margins because they lie on a lattice. As a result, the \(\chi^2\) statistics observed are smaller than would be expected for truly uniform data under the classical \(\chi^2\) test based on arbitrary partitions, which takes \[df = K - 1.\] This creates overly conservative \(p\)-values in the case of comparisons involving one or more continuous variables. Extensive simulations carried out using different approximations found that a simple approximation inspired by contingency tables works quite well to account for this.

For a contingency table with \(R\) rows and \(C\) columns, we account for the constrained row and column totals by subtracting a degree of freedom from each. So, supposing \(RC = K\) (the total number of bins), the degrees of freedom are not given by \(K-1\) but instead \[df = (R-1)(C-1).\] Taking this same idea to the dual continuous case where recursive binning has generated \(K\) bins, we ignore the arbitrary and mis-aligned nature of the bins and instead treat the \(K\) bins like the result of a regular grid with its implied contingency table along rows and columns. This suggests the approximation \[df = (\sqrt{K} - 1)^2,\] which works surprisingly well in practice given how arbitrary the choice is. Similarly, when one variable is categorical on \(M\) categories and the other is continuous, the same line of thinking leads to a formula using the average number of bins per category as \[df = (M-1) \left ( \frac{K}{M} - 1 \right ).\] Optionally, the argument ptype can be set in the call to DepSearch to change the \(p\)-value approximation used. The other options include the conservative \(K-1\), a gamma approximation to the distribution, and fitted degrees of freedom based on a large empirical study.
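The three degrees of freedom formulas can be collected in a small helper to see how they behave. This is only an illustrative sketch: approxDf and its argument names are hypothetical and not part of AssocBin, and the \(p\)-value at the end simply applies the contingency table formula to the sex and num statistic computed earlier.

```r
# hypothetical helper implementing the three approximations above
# K: number of bins; M: number of categories (mixed case);
# R, C: numbers of rows and columns (dual categorical case)
approxDf <- function(K, type = c("conCon", "catCon", "catCat"),
                     M = NULL, R = NULL, C = NULL) {
  type <- match.arg(type)
  switch(type,
         conCon = (sqrt(K) - 1)^2,       # dual continuous
         catCon = (M - 1) * (K / M - 1), # one categorical, one continuous
         catCat = (R - 1) * (C - 1))     # dual categorical
}

dfConCon <- approxDf(K = 16, type = "conCon")        # (4 - 1)^2 = 9
dfCatCon <- approxDf(K = 20, type = "catCon", M = 4) # 3 * (5 - 1) = 12
dfSexNum <- approxDf(type = "catCat", R = 5, C = 2)  # 4 * 1 = 4

# upper tail probability of the sex and num statistic from binChi
pSexNum <- pchisq(67.23966, df = dfSexNum, lower.tail = FALSE)
```

In each case the approximate degrees of freedom fall below the conservative \(K - 1\) (15, 19, and 9 for the three calls above), which is what makes the resulting \(p\)-values less conservative.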

Customizing AssocBin

For people who want to experiment with recursive binning, AssocBin offers plenty of room for customization. While the default settings split bins randomly until they reach a certain minimum size, by changing the scoring function and stop criteria, very different behaviours are possible. Several optional arguments to DepSearch control these aspects of binning: stopCriteria allows for stop criteria to be set, catCon allows specification of the splitting function to use on the continuous margin of mixed pairs with one categorical and one continuous variable, and conCon allows specification of the splitting function to use for dual continuous margins.

Stop criteria

The simplest of these to use and specify is stopCriteria, which is supported by the helper function makeCriteria. This helper captures the arguments passed to it and stores these as a single logical expression which is then parsed and evaluated within each bin to determine whether splitting should continue. As a result, the arguments must reference one or more of the named bin features:

x: vector giving the horizontal coordinates of observations within the bin,
y: vector giving the vertical coordinates of observations within the bin,
bnds: a list of two vectors, x and y which give the horizontal and vertical extents of the bin,
expn: the expected number of points in the bin,
n: the observed number of points in the bin, and
depth: the number of recursive splits required to create the bin from the initial bin containing all points

Arguments passed to makeCriteria which reference objects not included in this list rely on lexical scoping within R, and so should be used deliberately and with care. Generally, the stop criteria can be constructed with a simple call such as

stopCrits <- makeCriteria(depth >= 10, # maximum depth of 10
                          expn <= 10, # smallest possible bin size of 5
                          n < 1 # don't split empty bins
                          )
stopCrits

## [1] "depth >= 10 | expn <= 10 | n < 1 | stopped"
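The mechanism described above, capturing unevaluated arguments and joining them into one logical expression that is later evaluated inside a bin, can be imitated in a few lines of base R. The functions makeCriteriaSketch and checkBin below are hypothetical stand-ins written for illustration; they mimic the printed output but are not the package's own code.

```r
# capture the unevaluated arguments and join them with "|"
makeCriteriaSketch <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  strs <- vapply(exprs, function(e) paste(deparse(e), collapse = " "),
                 character(1))
  paste(c(strs, "stopped"), collapse = " | ")
}

# evaluate the criteria string using the bin's features as variables
checkBin <- function(bin, criteria) eval(parse(text = criteria), envir = bin)

crit <- makeCriteriaSketch(depth >= 10, expn <= 10, n < 1)

# two made-up bins: one large enough to keep splitting, one too small
openBin <- list(expn = 83.9, n = 131, depth = 1, stopped = FALSE)
smallBin <- list(expn = 8, n = 6, depth = 4, stopped = FALSE)

checkBin(openBin, crit)  # FALSE: no criterion met, splitting continues
checkBin(smallBin, crit) # TRUE: expn <= 10, so splitting stops
```

Because the string is evaluated with the bin as its environment, any name that is not a bin feature is looked up in the calling environment instead, which is why such references need care.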

Note that it is necessary to specify a stop criterion of expn <= 2*k to restrict bin size to k, as splitting a bin with expn < 2*k will necessarily produce at least one bin with expn < k. Of course, more complicated logical expressions are also possible. For example, one could implement a splitting procedure that stops splitting any bin which achieves some threshold for the \(\chi^2\) residual in the bin to create a greedy algorithm which preserves any large departures it encounters.

greedyCrits <- makeCriteria(abs(expn - n)/sqrt(expn) > 4,
                            expn <= 10,
                            n < 1)
greedyCrits

## [1] "abs(expn - n)/sqrt(expn) > 4 | expn <= 10 | n < 1 | stopped"

Splitting functions

If splitting behaviour more complex than the random splits is desired, provided functionals can be used to construct custom splitting functions. While any splitting function could be specified so long as it accepts a bin and returns a list of two bins that partition the original, the provided splitting functions are implemented under a specific framework of optimization.

It can be proven that any convex objective function which compares \(o_i\) and \(e_i\) (the observed and expected counts within a bin) will be maximized by a split at one of the observations within the bin. Therefore, scoring functions need only consider splits at observation coordinates for many common scores like the mutual information (implemented as miScores) or the \(\chi^2\) statistic (implemented as chiScores). In each bin, the scoring functions assess the score resulting from splits at each observation (and some ‘pseudo-observations’ to allow the creation of empty bins) and identify which coordinate creates a split which optimizes the score. For this reason, the included scoring functions accept three arguments: bounds, nbelow, and n, as these alone can be used to determine the maximum for many bin-scoring functions.¹
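The idea of scoring every candidate split at an observation and keeping the best one can be illustrated along a single margin. The sketch below assumes that, within a bin, the expected share of points below a cut is the share of the bin's extent below that cut; the function chiScoreSketch and its cuts argument are hypothetical and do not reproduce the signature or internals of chiScores.

```r
# hypothetical scorer: chi-squared score of splitting one margin at `cuts`
# bounds: c(lower, upper) extent of the bin on this margin
# nbelow: number of points at or below each cut; n: points in the bin
chiScoreSketch <- function(bounds, nbelow, n, cuts) {
  pBelow <- (cuts - bounds[1]) / diff(bounds)
  eBelow <- n * pBelow
  eAbove <- n * (1 - pBelow)
  (nbelow - eBelow)^2 / eBelow + ((n - nbelow) - eAbove)^2 / eAbove
}

pts <- c(1, 2, 3, 9)      # sorted coordinates of the points in the bin
bounds <- c(0, 10)        # extent of the bin on this margin
nbelow <- seq_along(pts)  # points at or below each candidate cut

scores <- chiScoreSketch(bounds, nbelow, length(pts), cuts = pts)
bestCut <- pts[which.max(scores)]
bestCut
```

Here the best split falls at the third point, separating the cluster of three low values from the single high value.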

In practice, these facts are not relevant to the user when setting up scoring. To set up the algorithm to maximize the \(\chi^2\) statistic, for example, we use the following lines.

conConChi <- function(bn) maxScoreSplit(bin = bn, scorer = chiScores)
# the univariate splitter requires an additional argument specifying which
# margin should be split
catConChi <- function(bn, on) uniMaxScoreSplit(bin = bn, scorer = chiScores,
                                               on = on)

Then, we pass them to the DepSearch call, maybe alongside our greedy stop criteria.

heartAssociations_greedy <- DepSearch(heartClean,
                                      stopCriteria=greedyCrits,
                                      catCon=catConChi,
                                      conCon=conConChi)

Plotting this greedy version of the algorithm, the top associations do not change much:

plot(heartAssociations_greedy)

Indeed, a key finding of Salahub and Oldford, 2025 is that binning algorithms based on maximization do not perform much better than random splits in the identification of patterns, and that maximization introduces systematic bias to pattern detection. An additional downside to maximization can be seen in the considerably inflated significance of the top association between study and chol in this greedy algorithm compared to the random one. By actively seeking large residual values, maximization prevents the computation of correct, or approximately correct, \(p\)-values through typical distributional approximations. Large simulations must be used instead.

Maximizing in a greedy way is not all bad, however. For one, it makes the algorithm deterministic for a given sample, while the random algorithm is inherently somewhat noisy. Additionally, and evidently in the case of the top association, it produces sharper departure displays which better highlight the areas of low and high point concentration.

Customizing plots

Aside from control over how binning is performed, plots of binnings can be customized in AssocBin. In the simplest case, this works by using the usual graphical parameters as shown previously.

# a final way to use depDisplay is on a DepSearch object
depDisplay(heartAssociations, pair="thalach:oldpeak",
           xlab = "Maximum heart rate during exercise",
           ylab = "ST wave depression during exercise",
           pch = "+", col = adjustcolor('purple', alpha.f=0.5),
           border = "black")

Finer control is obtained using the lower-level plotBinning function and the different bin fill helper functions. Let’s start by saving these particular bins so we can display them in different ways.

thalachOldpeak <- heartAssociations$binnings[["thalach:oldpeak"]]

To use plotBinning, these bins must be passed in alongside a fill function to colour the bins. Fill functions must accept a list of bins and return a vector of colours that can be interpreted by R’s plotting functions. While custom fill functions can be defined to encode any aspect of a bin, the three included fill functions depthFill, residualFill, and importanceFill saturate bins based on their depth, the magnitude and sign of their residuals (based on a provided residual function), and the threshold on standardized Pearson residuals defined above. All three options lead to very different displays.

# note that plotBinning does not have access to the marginal information to plot
# quantiles and so the marginal labels give the ranks
plotBinning(thalachOldpeak, pch = 20, 
            xlab = "Maximum heart rate during exercise",
            ylab = "ST wave depression during exercise",
            showXax = TRUE, showYax = TRUE,
            fill=depthFill(thalachOldpeak))

The depth fill, for example, lets us see the path of the algorithm. Less shaded areas indicate points where splitting was stopped earlier than areas with deeper saturation. For this particular pair, it is not so striking, but a very different pattern results from strong linear structures. Consider the following example which accesses low-level functions to perform binning manually.

x <- rnorm(1000)
y <- 2*x + rnorm(1000, sd = 0.3)
rankx <- rank(x, ties.method = "random")
ranky <- rank(y, ties.method = "random")

# set up splitting criteria: depth stop limits run time (not necessary here)
criteria <- makeCriteria(expn <= 10, n == 0, depth >= 10)
# define the stop function using these criteria
stopFn <- function(bns) stopper(bns, criteria)
# use binner to run the algorithm
xyBins <- binner(x = rankx, y = ranky, stopper = stopFn, splitter = rIntSplit)

# plot with depthfill
set.seed(2119)
plotBinning(xyBins, fill=depthFill(xyBins), pch = 20)

The advantage of the recursive splitting is obvious when viewed with this plot. In contrast to regular grids, the adaptive two-dimensional histogram generated by recursive binning with stop criteria places a greater density of bins, and therefore more focus, in areas of high density than those of low density. Even when these bins are chosen randomly, this creates a more efficient use of the same number of bins.

A more typical fill can be gleaned from the residualFill function.

plotBinning(thalachOldpeak, pch = 20, 
            xlab = "Maximum heart rate during exercise",
            ylab = "ST wave depression during exercise",
            showXax = TRUE, showYax = TRUE,
            fill=residualFill(thalachOldpeak, nbr = 10))

The fill from this function simply represents the residuals. By default, blue indicates a negative residual while red indicates a positive one. A great deal of customization is possible with this function: custom colour breaks can be specified using breaks, or alternatively the number of breaks can be specified using nbr to increase or decrease the resolution. Should we want to use a different residual function to generate the saturation, the resFun can be specified. We can also change the colour range using the colrng argument.

plotBinning(thalachOldpeak, pch = 20, 
            xlab = "Maximum heart rate during exercise",
            ylab = "ST wave depression during exercise",
            showXax = TRUE, showYax = TRUE,
            fill=residualFill(thalachOldpeak, nbr = 50,
                              resFun=binMI,
                              colrng = c("orange", "pink", "blue")))

Finally, the importanceFill function implements the default fill described earlier for depDisplay. It standardizes the \(\chi^2\) residuals and applies a Bonferroni correction before shading only those bins with standardized residuals that are significant when multiple testing is accounted for.
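The effect of a Bonferroni correction on which bins get shaded can be seen in a small base R sketch. The residual values below are made up for illustration, and the two-sided 5% level is an assumption of this example rather than a documented default of importanceFill.

```r
# made-up standardized residuals for five bins (illustration only)
stdRes <- c(8.2, -4.3, 2.5, -1.6, 0.4)
K <- length(stdRes)
alpha <- 0.05 # assumed overall level, not taken from the package

# unadjusted and Bonferroni-adjusted two-sided normal thresholds
thrPlain <- qnorm(1 - alpha / 2)
thrBonf <- qnorm(1 - alpha / (2 * K))

# bins whose residuals remain significant after the correction
shaded <- abs(stdRes) > thrBonf
shaded
```

The third bin illustrates the point of the correction: its residual of 2.5 exceeds the unadjusted threshold but not the adjusted one, so it would be left unshaded.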

Drawing conclusions about data from AssocBin

Let’s return to the associations in the cleaned heart data.

summary(heartAssociations)

## All 66 pairs in heartClean recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              21              35              10 
## 
## 52 pairs are significant at 5% and 42 pairs are significant at 1%
## 
## Most significant  10  pairs:
## study:chol  (1.7e-70) 
## study:restecg  (4.7e-57) 
## study:num  (1.3e-38) 
## num:exang  (1.5e-38) 
## num:cp  (2.3e-38) 
## exang:cp  (1.8e-35) 
## study:thalach  (2.7e-24) 
## exang:oldpeak  (1.4e-22) 
## study:age  (9.6e-21) 
## num:oldpeak  (1.6e-20)

The summary indicates that most of the pairs show some sign of dependence while the top ten breakdown indicates that many of these are very strongly dependent. The categorical study variable is present in five of the top ten pairs, suggesting that the patients included at each hospital were not drawn from similar populations. In particular, study and num are highly related, suggesting that the different studies performed at different hospitals targeted patients with different severity of heart disease.

The top 5 variable plots indicate that data quality questions may also be to blame for some of this.

plot(heartAssociations)

The top association between study and chol suggests that the patients selected for one study all had cholesterol values close to zero. Indeed, inspecting closer:

depDisplay(heartAssociations, pair = 1, pch = 20, 
           xlab="Study", ylab="Serum cholesterol")

We can see the shorthand name of the offending study. Viewing the raw data confirms that the cholesterol values have either been lost or recorded improperly, as all are zero.

heartClean$chol[heartClean$study == "switzerland"]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0

Noting this unreliability, we may wish to repeat our association analysis for a subset of the data corresponding to a single study where more standardization can be assumed in the data (and where there are no indications of additional missing data). The largest study is the cleveland study.

heartCleveland <- heartClean[heartClean$study == "cleveland",]
heartCleveland$study <- NULL
clevelandAssociations <- DepSearch(heartCleveland)
set.seed(90192)
summary(clevelandAssociations)

## All 55 pairs in heartCleveland recursively binned with type distribution: 
## 
##   factor:factor  factor:numeric numeric:numeric 
##              15              30              10 
## 
## 18 pairs are significant at 5% and 15 pairs are significant at 1%
## 
## Most significant  10  pairs:
## exang:cp  (3.1e-14) 
## num:cp  (7.8e-14) 
## num:exang  (1.3e-12) 
## age:thalach  (4.4e-08) 
## exang:thalach  (1.2e-07) 
## num:oldpeak  (6e-07) 
## thalach:oldpeak  (8.7e-07) 
## num:thalach  (5.4e-06) 
## cp:thalach  (3e-05) 
## cp:oldpeak  (4e-05)

With the impact of study removed, a much smaller proportion of pairs is identified as significant. Some of the top pairs remain the same, however: num:cp, num:exang, and exang:cp, the three strongest pairs not involving study in the full data, are the top three pairs once study is removed.

plot(clevelandAssociations, which = 1:3)

Exercise-induced angina, coronary artery disease (CAD), and clinical chest pain are all strongly related. Inspecting the plots, we can see that both exercise-induced angina and clinical chest pain increase with the number of blocked arteries in a patient. As exercise-induced angina is an induced chest pain, this supports the simple and unsurprising conclusion that patients with worse CAD have more severe chest discomfort.

plot(clevelandAssociations, which = 4:9)

The following relationships largely center around thalach: the maximum patient heart rate achieved during exercise. It seems to decrease with age, the severity of CAD, the presence of exercise-induced angina, and the ST wave depression from exercise. These pairs, then, seem to reflect how CAD and age impact the maximum exertion a patient is capable of during exercise. Older patients with hearts more impacted by CAD cannot exert their hearts as much as younger, healthier patients.

plot(clevelandAssociations, which = 10:16)

The final significant relationships at 5% seem to echo these earlier themes but also highlight important differences between male and female patients. Generally, female patients seem to have lower incidence of CAD, less exercise-induced angina, and lower cholesterol.

¹ More complex splitting logic based on arbitrary bin features is supported by sandboxMaxSplit, which applies the scoring function directly to the list of bins at each step.