ecocbo-guide

ecocbo: Calculating Optimum Sampling Effort in Community Ecology

ecocbo is an R package that helps scientists determine the optimal sampling effort for community ecology studies. It can be used to calculate the minimum number of samples needed to achieve a desired level of precision, or to fit within a set economic budget. Unlike previous versions, ecocbo now incorporates its own approach for precision and cost optimization through the sim_cbo() function, while the methodology proposed by Underwood (1997, ISBN 0 521 55696 1) is implemented in the auxiliary underwood_cbo() function.

ecocbo is based on the principles of ecological simulation, with the considerations of natural variability generated by the SSP package. It is designed to work with two primary experimental models: single factor and nested symmetric designs. A pilot study can be used to estimate the natural variability of the system, which in turn is employed to calculate the optimal sampling effort. ecocbo is a valuable tool for scientists who need to design efficient sampling plans, which can save time and money by ensuring the collection of the minimum amount of data necessary to achieve their research goals.

ecocbo is composed of six functions, of which four form the standard package workflow, and two are auxiliary:

Standard workflow functions:

prep_data() formats and arranges the initial data so that it can be readily used by the other functions in the package.
sim_beta() computes the statistical power at different sampling efforts.
sim_cbo() estimates the optimal number of replicated units and replicates per unit seeking the best possible precision with the lowest cost.
plot_power() allows the user to visualize the changes in statistical power given the certain number of sampling units or sites.

Auxiliary functions:

scompvar() calculates the components of variation among sites and within samples. This information is necessary for the optimization of sampling effort.
underwood_cbo() estimatesw the optimal number of replicated units and samples per unit, depending on the major constrains of budget or precision to be achieved, as established by Underwood (1997).

Functions and sequence

0. Preparing the data

To get a feeling of the functionality of ecocbo, you can follow the next single-factor example.

The provided data epiDat is a subset of the original epibionts (Guerra-Castro, et al., 2021). It was prepared by looking for communities that were as different as possible within the dataset. The function uses the following parameters:

Argument	Description
data	Data frame with species names (columns) and samples (rows) information. The first column should indicate the site to which the sample belongs, regardless of whether a single site has been sampled.
type	Nature of the data to be processed. It may be presence / absence (“P/A”), counts of individuals (“counts”), or coverage (“cover”).
Sest.method	Method for estimating species richness. The function specpool is used for this. Available methods are the incidence-based Chao “chao”, first order jackknife “jack1”, second order jackknife “jack2” and Bootstrap “boot”. By default, the “average” of the four estimates is used.
cases	Number of data sets to be simulated.
N	Total number of samples to be simulated in each site.
M	Total number of sites to be simulated in each data set.
n	Maximum number of samples to consider.
m	Maximum number of sites.
k	Number of resamples the process will take. Defaults to 50.
transformation	Mathematical function to reduce the weight of very dominant species. Options are: ‘square root’, ‘fourth root’, ‘Log (X+1)’, ‘P/A’, ‘none’
method	The appropriate distance/dissimilarity metric that will be used by `vegan::vegdist()` (e.g. Gower, Bray–Curtis, Jaccard, etc).
dummy	Logical. It is recommended to use TRUE in cases where there are observations that are empty.
useParallel	Logical. Should R call to several cores to work on parallel computing? This is done to speed up the processing time.
model	What type of model is the function implementing? Options are “single.factor” and “nested.symmetric”.
jitter.base	Standard deviation multiplier used to add Gaussian #’ jitter to and . Defaults to 0.

The first steps in the function are the simulation of two different communities: simH0 in which the data is treated as if it came from the same site thus accepting the null hypothesis, and simHa in which the opposite is true, so that the samples come from different sites and portray different characteristics.

It is required that the resulting datasets are the same size, so the parameters M and N are used by SSP::simdata(...) to that end. For this example, the values used to calculate the simulations are M=1 and N=1000 in the case in which the null hypothesis is true, and then M=10 and N=100 for the alternative hypothesis, thus resulting in 3 matrices (cases=3) of 1000 rows and 95 columns (the number of columns depends on the simulation parameters from SSP::assempar()) for both of the resulting datasets.

This functions returns an object of class ecocbo_data which is a list that contains a dataframe that lists the results of resampling the data with several sampling efforts to calculate pseudo-F and the mean squares that are needed for the other functions in the package to work.

# Load data and adjust it.
data(epiDat)

simResults <- prep_data(data = epiDat, type = "counts", Sest.method = "average",
                        cases = 5, N = 100, M = 10,
                        n = 5, k = 30,
                        transformation = "none", method = "bray",
                        dummy = FALSE, useParallel = TRUE,
                        model = "single.factor", jitter.base = 0)

1. scompvar()

scompvar() calculates the components of variation within sites and among replicates. Its arguments are:

Argument	Description
data	Object of class “ecocbo_data” that results from prep_data().
n	Site label to be used as basis for the computation. Defaults to NULL.
m	Number of samples to be considered. Defaults to NULL.

If m or n are left as NULL, the function will calculate the components of variation using the largest available values as set in the experimental design in prep_data().

compVar <- scompvar(data = simResults)
compVar
#>     Source Est.var.comp
#> 1 Residual     0.331926

2. sim_beta()

sim_beta() computes the statistical power for a combination of up to m sites and n samples per site. Its arguments are:

Argument	Description
data	An object of class “ecocbo_data” that results from applying prep_data() to a community data frame.
alpha	Level of significance for Type I error. Defaults to 0.05.

The value of k defines the number of times that the combination of sites and samples will be considered to form a statistical distribution that will establish, along with alpha, the probability of type I error from the simulated data. Higher values of k will increase the computational intensity of the process, which may result in slower execution times.

The function returns an object of class ecocbo_beta which prints to a table showing the statistical power at different sampling efforts. The object, however, is a list that contains two data frames and the value of alpha; one data frame comprises the information about power and beta, and the second one lists the results of the resampling of values indicated by k.

betaResult <- sim_beta(simResults, alpha = 0.05)
betaResult
#> Power at different sampling efforts (n):
#>       Power
#> n = 2  0.44
#> n = 3  0.70
#> n = 4  0.93
#> n = 5  0.97

3. sim_cbo()

sim_cbo() estimates the optimal number of sites and replicates per site to perform based on either available budget or desired statistical power.

Argument	Description
data	Object obtained from sim_beta().
cn	Cost per sampling unit.
cm	Cost per sampling replicate.

cboCost <- sim_cbo(betaResult, cn = 75)
cboCost
#> Sampling designs that meet the required power:
#>  n     Power Cost Suggested
#>  2 0.4366667  150          
#>  3 0.7033333  225          
#>  4 0.9266667  300          
#>  5 0.9716194  375       ***
#> 
#> The listed cost is per treatment.

Additionally…

4. plot_power()

plot_power() makes plots for either a power curve, a density plot, or both. Its parameters are:

Argument	Description
data	Object of class “ecocbo_beta” that results from sim_beta().
cbo	Optional. Object of class obtained from . If this is included, uses the optimal values that have been already calculated.
n	Defaults to NULL, and then the function computes the number of samples (n) that results in a sampling effort close to 95% in power. If provided, said number of samples will be used.
m	Site label to be used as basis for the plot.
method	The desired plot. Options are “power”, “density” or “both”.

The power curve plot shows that the power of the study increases as the sample size increases, and the density plot shows the overlapping areas where alpha and beta are significant.

The argument n is set to NULL by default, if it is left like that, the the function looks for a number of samples that allows for a power that is close to \((1-alpha)\), otherwise it uses the value stated by the user. In either case, the computed or provided value will be marked in red in the power curve plot.

plot_power(data = betaResult, cbo = cboCost, n = NULL, method = "power")

References

Anderson, M. J. (2017). Permutational Multivariate Analysis of Variance (PERMANOVA). Wiley StatsRef: Statistics Reference Online. John Wiley & Sons, Ltd.
Guerra-Castro, E. J., J. C. Cajas, F. N. Dias Marques Simoes, J. J. Cruz-Motta, and M. Mascaro. (2020). SSP: An R package to estimate sampling effort in studies of ecological communities. bioRxiv:2020.2003.2019.996991.
Underwood, A. J. (1997). Experiments in ecology: their logical design and interpretation using analysis of variance. Cambridge university press.
Underwood, A. J., & Chapman, M. G. (2003). Power, precaution, Type II error and sampling design in assessment of environmental impacts. Journal of Experimental Marine Biology and Ecology, 296(1), 49-70.