New sccomp framework

Stefano Mangiola

2024-01-21

Lifecycle:maturing R build status

We announce the new tidy and modular interface for a sccomp, which improves modularity, and clarity. The main change is the modularisation of sccomp in functions which can be linked with the pipe operator |>.

Function Description
Estimation: sccomp_stimate() which is usually run once in the analysis (per model).
Testing: sccomp_test() which candy run multiple times, depending on how many contrasts you want to test (e.g. age, untreated vs treated).
Outlier removal: sccomp_remove_outliers() which is usually run once after sccomp_estimate() in case you want to produce estimates not influenced by outlier data points.
Unwanted variation removal: sccomp_remove_unwanted_variation() which is run after sccomp_estimate() and produces a dataset that just preserve the variability of your factor of interest.
Data replication: sccomp_replicate() which is run after sccomp_estimate() and produces a dataset representing the theoretical data distribution according to the model (from the posterior distribution).
Plotting: plot() which is run after sccomp_test and outputs a series of summary plots.

A reminder: what is sccomp

sccomp is a statistical model developed for differential variability analysis in compositional data, primarily used in cellular omics fields like single-cell genomics, proteomics, and microbiomics (Mangiola et al. 2023). It addresses limitations of existing methods in differential abundance analysis by incorporating several advanced features. sccomp effectively models compositional count data properties, which were previously not adequately addressed, and tackles cell-group-specific differential variability. This model uses a constrained Beta-binomial distribution to enable more precise analyses. Key capabilities of sccomp include improved differential abundance analyses through cross-sample information borrowing, outlier identification and exclusion, realistic data simulation, and facilitating cross-study knowledge transfer. By incorporating these features, sccomp provides a more comprehensive and accurate framework for analyzing cellular omics data, identifying crucial biological drivers such as disease progression markers in cancer and pathogen infection.

Installation

Bioconductor

if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install("sccomp")

Github

devtools::install_github("stemangiola/sccomp")

Deprecation of the function sccomp_glm()

The new framework

outlier_free_estimate = 
  seurat_obj |>
  
  # Estimate
  sccomp_estimate( 
    formula_composition = ~ type + continuous_covariate, 
    .sample =  sample, 
    .cell_group = cell_group, 
    cores = 1
  ) |>
  
  # Remove outliers
  sccomp_remove_outliers(cores = 1) 
## 
## SAMPLING FOR MODEL 'glm_multi_beta_binomial' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000467 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.67 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 4300 [  0%]  (Warmup)
## Chain 1: Iteration:  301 / 4300 [  7%]  (Sampling)
## Chain 1: Iteration: 1300 / 4300 [ 30%]  (Sampling)
## Chain 1: Iteration: 2300 / 4300 [ 53%]  (Sampling)
## Chain 1: Iteration: 3300 / 4300 [ 76%]  (Sampling)
## Chain 1: Iteration: 4300 / 4300 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 3.609 seconds (Warm-up)
## Chain 1:                26.633 seconds (Sampling)
## Chain 1:                30.242 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'glm_multi_beta_binomial' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000489 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.89 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:     1 / 20299 [  0%]  (Warmup)
## Chain 1: Iteration:   301 / 20299 [  1%]  (Sampling)
## Chain 1: Iteration:  1300 / 20299 [  6%]  (Sampling)
## Chain 1: Iteration:  2300 / 20299 [ 11%]  (Sampling)
## Chain 1: Iteration:  3300 / 20299 [ 16%]  (Sampling)
## Chain 1: Iteration:  4300 / 20299 [ 21%]  (Sampling)
## Chain 1: Iteration:  5300 / 20299 [ 26%]  (Sampling)
## Chain 1: Iteration:  6300 / 20299 [ 31%]  (Sampling)
## Chain 1: Iteration:  7300 / 20299 [ 35%]  (Sampling)
## Chain 1: Iteration:  8300 / 20299 [ 40%]  (Sampling)
## Chain 1: Iteration:  9300 / 20299 [ 45%]  (Sampling)
## Chain 1: Iteration: 10300 / 20299 [ 50%]  (Sampling)
## Chain 1: Iteration: 11300 / 20299 [ 55%]  (Sampling)
## Chain 1: Iteration: 12300 / 20299 [ 60%]  (Sampling)
## Chain 1: Iteration: 13300 / 20299 [ 65%]  (Sampling)
## Chain 1: Iteration: 14300 / 20299 [ 70%]  (Sampling)
## Chain 1: Iteration: 15300 / 20299 [ 75%]  (Sampling)
## Chain 1: Iteration: 16300 / 20299 [ 80%]  (Sampling)
## Chain 1: Iteration: 17300 / 20299 [ 85%]  (Sampling)
## Chain 1: Iteration: 18300 / 20299 [ 90%]  (Sampling)
## Chain 1: Iteration: 19300 / 20299 [ 95%]  (Sampling)
## Chain 1: Iteration: 20299 / 20299 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 3.666 seconds (Warm-up)
## Chain 1:                129.104 seconds (Sampling)
## Chain 1:                132.77 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'glm_multi_beta_binomial' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000505 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.05 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 4300 [  0%]  (Warmup)
## Chain 1: Iteration:  301 / 4300 [  7%]  (Sampling)
## Chain 1: Iteration: 1300 / 4300 [ 30%]  (Sampling)
## Chain 1: Iteration: 2300 / 4300 [ 53%]  (Sampling)
## Chain 1: Iteration: 3300 / 4300 [ 76%]  (Sampling)
## Chain 1: Iteration: 4300 / 4300 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 3.589 seconds (Warm-up)
## Chain 1:                26.351 seconds (Sampling)
## Chain 1:                29.94 seconds (Total)
## Chain 1:
# Test
outlier_free_estimate |>
  sccomp_test(contrasts = "typehealthy") 
## # A tibble: 30 × 18
##    cell_group parameter factor c_lower c_effect  c_upper   c_pH0   c_FDR c_n_eff
##    <chr>      <chr>     <chr>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
##  1 B immature typeheal… type     0.890    1.41   1.91    0       0         4296.
##  2 B mem      typeheal… type     1.09     1.72   2.34    0       0         4619.
##  3 CD4 cm S1… typeheal… type     0.589    0.995  1.43    0       0         3300.
##  4 CD4 cm hi… typeheal… type    -3.06    -1.98  -1.03    0       0         3722.
##  5 CD4 cm ri… typeheal… type    -1.79    -1.06  -0.354   0.00525 1.03e-3   4018.
##  6 CD4 em hi… typeheal… type    -2.22    -1.40  -0.621   0.00100 1.79e-4   4162.
##  7 CD4 naive  typeheal… type     0.219    0.816  1.41    0.0218  5.88e-3   4210.
##  8 CD4 ribos… typeheal… type     1.54     2.04   2.54    0       0         4627.
##  9 CD8 em 1   typeheal… type    -0.534    0.128  0.712   0.597   1.24e-1   5915.
## 10 CD8 em 2   typeheal… type    -2.15    -0.999 -0.00332 0.0625  1.62e-2   3178.
## # ℹ 20 more rows
## # ℹ 9 more variables: c_R_k_hat <dbl>, v_lower <dbl>, v_effect <dbl>,
## #   v_upper <dbl>, v_pH0 <dbl>, v_FDR <dbl>, v_n_eff <dbl>, v_R_k_hat <dbl>,
## #   count_data <list>

Replaces the old framework (that now will receive a deprecation warning)

  seurat_obj |>
  
  # Estimate
  sccomp_glm( 
    formula_composition = ~ type + continuous_covariate, 
    .sample =  sample, 
    .cell_group = cell_group, 
    check_outliers = TRUE,
    contrasts = "typehealthy", 
    cores = 1
  ) 

New functionalities

Removal of unwanted variation.

For visualisation purposes, we can select factor of interest we would like to preserve the effect for, end exclude all the rest. For example, if we want to produce a dataset with just the type effect, we can execute

outlier_free_estimate |>
  sccomp_remove_unwanted_variation(~ type)
## sccomp says: calculating residuals
## sccomp says: regressing out unwanted factors
## # A tibble: 600 × 5
##    sample       cell_group adjusted_proportion adjusted_counts logit_residuals
##    <chr>        <chr>                    <dbl>           <dbl>           <dbl>
##  1 10x_6K       B immature              0.0554           260.          -0.754 
##  2 10x_8K       B immature              0.144           1082.           0.321 
##  3 GSE115189    B immature              0.114            267.           0.0234
##  4 SCP345_580   B immature              0.0903           520.          -0.206 
##  5 SCP345_860   B immature              0.151            972.           0.377 
##  6 SCP424_pbmc1 B immature              0.113            302.          -0.0244
##  7 SCP424_pbmc2 B immature              0.202            603.           0.714 
##  8 SCP591       B immature              0.0248            14.1         -1.57  
##  9 SI-GA-E5     B immature              0.0240           100.          -0.705 
## 10 SI-GA-E7     B immature              0.0974           715.           0.784 
## # ℹ 590 more rows

Plotting

The bloating functionalities have been improved. Now, both discrete and continuous variables can be visualised overlaying the to reticle data distribution from the model. This helps the user understanding whether the model is descriptively adequate to the data.

For example, if the theoretical data distribution from the sccomp does not overlap with the observed data distribution, this is an indication that the probability distribution used by sccomp is not suitable for the data or a different model (design matrix) should be used.

outlier_free_estimate |>
  sccomp_test(contrasts = "typehealthy") |> 
  plot()
## $boxplot
## $boxplot[[1]]

plot of chunk unnamed-chunk-6

## 
## 
## $credible_intervals_1D

plot of chunk unnamed-chunk-6

Now plotting the test against the continuous covariate

outlier_free_estimate |>
  sccomp_test(contrasts = "continuous_covariate") |> 
  plot()
## $boxplot
## $boxplot[[1]]

plot of chunk unnamed-chunk-7

## 
## 
## $credible_intervals_1D

plot of chunk unnamed-chunk-7

sessionInfo()
## R Under development (unstable) (2024-01-16 r85808)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rstan_2.32.5       StanHeaders_2.32.5 tidyr_1.3.0        forcats_1.0.0     
## [5] ggplot2_3.4.4      sccomp_1.7.5       dplyr_1.1.4       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0            farver_2.1.1               
##  [3] loo_2.6.0                   bitops_1.0-7               
##  [5] SingleCellExperiment_1.25.0 RCurl_1.98-1.14            
##  [7] digest_0.6.34               dotCall64_1.1-1            
##  [9] mime_0.12                   lifecycle_1.0.4            
## [11] SeuratObject_5.0.1          magrittr_2.0.3             
## [13] compiler_4.4.0              rlang_1.1.3                
## [15] tools_4.4.0                 yaml_2.3.8                 
## [17] utf8_1.2.4                  knitr_1.45                 
## [19] labeling_0.4.3              S4Arrays_1.3.2             
## [21] sp_2.1-2                    pkgbuild_1.4.3             
## [23] curl_5.2.0                  DelayedArray_0.29.0        
## [25] RColorBrewer_1.1-3          abind_1.4-5                
## [27] withr_3.0.0                 purrr_1.0.2                
## [29] BiocGenerics_0.49.1         grid_4.4.0                 
## [31] stats4_4.4.0                fansi_1.0.6                
## [33] colorspace_2.1-0            future_1.33.1              
## [35] inline_0.3.19               progressr_0.14.0           
## [37] globals_0.16.2              scales_1.3.0               
## [39] SummarizedExperiment_1.33.2 cli_3.6.2                  
## [41] crayon_1.5.2                generics_0.1.3             
## [43] RcppParallel_5.1.7          future.apply_1.11.1        
## [45] tzdb_0.4.0                  commonmark_1.9.0           
## [47] stringr_1.5.1               splines_4.4.0              
## [49] zlibbioc_1.49.0             parallel_4.4.0             
## [51] XVector_0.43.1              matrixStats_1.2.0          
## [53] vctrs_0.6.5                 V8_4.4.1                   
## [55] boot_1.3-28.1               Matrix_1.6-5               
## [57] jsonlite_1.8.8              IRanges_2.37.1             
## [59] hms_1.1.3                   patchwork_1.2.0            
## [61] S4Vectors_0.41.3            ggrepel_0.9.5              
## [63] listenv_0.9.0               glue_1.7.0                 
## [65] parallelly_1.36.0           spam_2.10-0                
## [67] codetools_0.2-19            stringi_1.8.3              
## [69] gtable_0.3.4                GenomeInfoDb_1.39.5        
## [71] QuickJSR_1.1.0              GenomicRanges_1.55.1       
## [73] munsell_0.5.0               tibble_3.2.1               
## [75] pillar_1.9.0                GenomeInfoDbData_1.2.11    
## [77] R6_2.5.1                    evaluate_0.23              
## [79] lattice_0.22-5              Biobase_2.63.0             
## [81] markdown_1.12               highr_0.10                 
## [83] readr_2.1.5                 rstantools_2.3.1.1         
## [85] Rcpp_1.0.12                 nlme_3.1-164               
## [87] gridExtra_2.3               SparseArray_1.3.3          
## [89] mgcv_1.9-1                  xfun_0.41                  
## [91] MatrixGenerics_1.15.0       pkgconfig_2.0.3