library(tidyverse)
library(seriation)
library(phyloseq)
library(microViz)
library(shiny)
Evomics Workshop part 2
Microbiome analysis with microViz
Welcome to Part 2 👋
- Bar charts and diversity were relatively straightforward topics, conceptually.
- Let’s look at some more things we can do with microbiome composition data!
Topics for part 2:
- Dissimilarity measures
- Ordination: PCoA & PCA (with interactive data exploration)
- Differential abundance testing
Further resources
Refer to the microViz website to see help pages for every function (as well as further tutorials).
Setup ⚙️
Load the R packages we will be using:
Dissimilarity 💩 ↔︎ 💩 ?
- Ecosystem Diversity is sometimes referred to as “alpha” diversity or “within-sample” diversity
- Now we’re going to look at Dissimilarity between ecosystems
- Sometimes this is confusingly referred to as “beta diversity” analysis, or “between-sample” diversity
data("shao19") # this example data is built in to microViz
- For this part we’re going to use an infant gut microbiome shotgun metagenomics dataset.
shao19
phyloseq-class experiment-level object
otu_table() OTU Table: [ 819 taxa and 1644 samples ]
sample_data() Sample Data: [ 1644 samples by 11 sample variables ]
tax_table() Taxonomy Table: [ 819 taxa by 6 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 819 tips and 818 internal nodes ]
- Do you remember how to examine a phyloseq object?
- Look at the rank names, sample data variables etc.
#
#
#
#
# in the console, for info on the data source ?shao19
Dissimilarity measures
- What are we doing?
- Calculating the dissimilarities between pairs of microbiome samples
- We talked about these commonly-used dissimilarity measures in the lecture.
- Binary Jaccard - presence-absence
- Bray-Curtis - abundance-weighted
- UniFrac distances (unweighted / weighted / generalised)
- To simplify and speed up the analyses, we’re going to take a smaller part of the dataset.
- We’ll only look at the 300 infant fecal samples from 4 days of age.
<- shao19 %>% ps_filter(family_role == "child", infant_age == 4) infants
- We’re going to filter out rare taxa quite strictly, for similar reasons
- But we won’t overwrite our smaller dataset: we’ll do the filtering per analysis
- You will find out more about the how and why of taxa filtering later
%>%
infants tax_filter(min_prevalence = 2.5 / 100) %>%
tax_agg(rank = "genus") %>%
tax_transform("binary") %>% # converts counts to absence/presence: 0/1
dist_calc(dist = "jaccard")
Proportional min_prevalence given: 0.025 --> min 8/306 samples.
psExtra object - a phyloseq object with extra slots:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 35 taxa and 306 samples ]
sample_data() Sample Data: [ 306 samples by 11 sample variables ]
tax_table() Taxonomy Table: [ 35 taxa by 5 taxonomic ranks ]
otu_get(counts = TRUE) [ 35 taxa and 306 samples ]
psExtra info:
tax_agg = "genus" tax_trans = "binary"
jaccard distance matrix of size 306
0.6666667 0.7333333 0.9375 0.8125 0.6428571 ...
- We now have our pairwise dissimilarities! 🎉
- A distance matrix is attached as an extra part on the original phyloseq object
- The object is now class “psExtra” (created by microViz)
- A psExtra also stores info about the aggregation and transformations you performed
<- infants %>%
distances tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_agg(rank = "genus") %>%
tax_transform("binary") %>%
dist_calc(dist = "jaccard") %>%
dist_get()
You can extract the distance matrix with dist_get.
as.matrix(distances)[1:5, 1:5]
B01089_ba_4 B01190_ba_4 B01194_ba_4 B01196_ba_4 B01235_ba_4
B01089_ba_4 0.0000000 0.6666667 0.7333333 0.9375000 0.8125000
B01190_ba_4 0.6666667 0.0000000 0.7500000 0.9166667 0.8461538
B01194_ba_4 0.7333333 0.7500000 0.0000000 0.4615385 0.3076923
B01196_ba_4 0.9375000 0.9166667 0.4615385 0.0000000 0.4615385
B01235_ba_4 0.8125000 0.8461538 0.3076923 0.4615385 0.0000000
The Binary Jaccard dissimilarities range between 0 (identical) and 1 (no shared genera).
range(distances)
[1] 0 1
Ordination
- What can we do with these dissimilarities? 🤔
- We can make an ordination! 💡
- Conceptually, ordination refers to a process of ordering things (in our case: samples).
- Similar samples are placed closer to each other, and dissimilar samples are placed further away.
PCoA
Principal Co-ordinates Analysis is one kind of ordination.
- PCoA takes a distance matrix and finds new dimensions (a co-ordinate system, if you like)
- The new dimensions are created with the aim to preserve the original distances between samples
- And to capture the majority of this distance information in the first dimensions
- This makes it easier to visualize the patterns in your data, in 2D or 3D plots 👀
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_transform(trans = "identity", rank = "genus") %>%
dist_calc(dist = "bray") %>%
ord_calc(method = "PCoA") %>%
ord_plot(alpha = 0.6, size = 2) +
theme_classic(12) +
coord_fixed(0.7)
To get a little insight into what has happened here, we can colour each sample according to its dominant (most abundant) genus.
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
ps_calc_dominant(rank = "genus", none = "Mixed", other = "Other") %>%
tax_transform(trans = "identity", rank = "genus") %>%
dist_calc(dist = "bray") %>%
ord_calc(method = "PCoA") %>%
ord_plot(color = "dominant_genus", alpha = 0.6, size = 2) +
scale_color_brewer(name = "Dominant Genus", palette = "Dark2") +
theme_classic(12) +
coord_fixed(0.7)
Interactive ordination!
microViz
provides a Shiny app ord_explore
to interactively create and explore PCoA plots and other ordinations. See the code below to get started. But read the instructions first.
# fire up the shiny app
# run these lines in your console (don't keep in script/notebook)
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
# calculate new sample variables with dominant taxon (optional)
ps_calc_dominant(rank = "genus", none = "Mixed", other = "Other") %>%
# launch a Shiny app in your web browser!
ord_explore()
# Again, with different options
# Run these lines in your console
%>%
shao19 ps_filter(family_role == "mother") %>%
tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
# calculate a few sample variables for interest (optional)
ps_calc_dominant(rank = "genus", none = "Mixed", other = "Other") %>%
ps_calc_diversity(rank = "genus", index = "shannon") %>%
ps_calc_richness(rank = "genus", index = "observed") %>%
# launch a Shiny app in your web browser!
ord_explore()
PERMANOVA
Permutational multivariate analysis of variance.
- ANOVA - analysis of variance (statistical modelling approach)
- Multivariate - more than one dependent variable (multiple taxa!)
- Permutational - statistical significance estimates obtained by shuffling the data many times
- TLDR: Are those groups on the PCoA actually different??
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_agg(rank = "genus") %>%
dist_calc(dist = "bray") %>%
ord_calc(method = "PCoA") %>%
ord_plot(alpha = 0.6, size = 2, color = "birth_mode") +
theme_classic(12) +
coord_fixed(0.7) +
stat_ellipse(aes(color = birth_mode)) +
scale_color_brewer(palette = "Set1")
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_agg(rank = "genus") %>%
dist_calc(dist = "bray") %>%
dist_permanova(variables = "birth_mode", n_perms = 99, seed = 123) %>%
perm_get()
2024-01-15 18:04:06.973044 - Starting PERMANOVA with 99 perms with 1 processes
2024-01-15 18:04:07.105565 - Finished PERMANOVA
Permutation test for adonis under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 99
vegan::adonis2(formula = formula, data = metadata, permutations = n_perms, by = by, parallel = parall)
Df SumOfSqs R2 F Pr(>F)
birth_mode 1 13.790 0.12366 42.898 0.01 **
Residual 304 97.727 0.87634
Total 305 111.518 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use more permutations for a more reliable p.value in your real work (slower)
# Set a random seed number for reproducibility of this stochastic method
You can see from the model output that the p value, Pr(>F) is below 0.05. So there is good statistical evidence that the bacterial gut microbiota composition of C-section delivered infants has a different composition than vaginally delivered infants at 4 days of age.
Reporting PCoA and PERMANOVA methods
- Your methodological choices matter, you should report what you did:
- any relevant rare taxon filtering thresholds
- the taxonomic rank of aggregation
- the dissimilarity measure used to compute the pairwise distances
It’s probably a good idea to decide on a couple of appropriate distance measures up front for these tests, and report both (at least in supplementary material), as the choice of distance measure can affect results and conclusions!
Covariate-adjusted PERMANOVA
You can also adjust for covariates in PERMANOVA, and often should, depending on your study design. Let’s fit a more complex model, adjusting for infant sex, birth weight, and the total number of assigned reads.
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_agg(rank = "genus") %>%
dist_calc(dist = "bray") %>%
dist_permanova(
variables = c("birth_mode", "sex", "birth_weight", "number_reads"),
n_perms = 99, seed = 111
%>%
) perm_get()
Dropping samples with missings: 15
2024-01-15 18:04:07.470369 - Starting PERMANOVA with 99 perms with 1 processes
2024-01-15 18:04:08.192815 - Finished PERMANOVA
Permutation test for adonis under reduced model
Marginal effects of terms
Permutation: free
Number of permutations: 99
vegan::adonis2(formula = formula, data = metadata, permutations = n_perms, by = by, parallel = parall)
Df SumOfSqs R2 F Pr(>F)
birth_mode 1 10.794 0.10163 34.8045 0.01 **
sex 1 0.280 0.00264 0.9031 0.43
birth_weight 1 0.565 0.00532 1.8215 0.06 .
number_reads 1 2.873 0.02705 9.2656 0.01 **
Residual 286 88.696 0.83509
Total 290 106.211 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use more permutations for a more reliable p.value in your real work (slower)
# Set a random seed number for reproducibility of this stochastic method
PCA
- Principal Components Analysis.
- For practical purposes, PCA is quite similar to Principal Co-ordinates Analysis.
- In fact, PCA produces equivalent results to PCoA with Euclidean distances.
Code
%>%
infants tax_agg(rank = "genus") %>%
dist_calc(dist = "euclidean") %>%
ord_calc(method = "PCoA") %>%
ord_plot(alpha = 0.6, size = 2) +
geom_rug(alpha = 0.1)
Code
%>%
infants tax_agg(rank = "genus") %>%
ord_calc(method = "PCA") %>%
ord_plot(alpha = 0.6, size = 2) +
geom_rug(alpha = 0.1)
Problems with PCA (or PCoA with Euclidean Distances) on microbiome data
- These plots look weird! most samples bunch in the middle, with spindly projections..
- Sensitive to sparsity (double-zero problem) –> filter rare taxa
- Excessive emphasis on high-abundance taxa –> log transform features first
Log transformations, and CLR
- First let’s look at the abundance again, this time with heatmaps.
- Each column is a sample (from an infant), and each row is a taxon.
%>%
infants tax_sort(by = sum, at = "genus", trans = "compositional", tree_warn = FALSE) %>%
tax_transform(trans = "compositional", rank = "genus") %>%
comp_heatmap(samples = 1:20, taxa = 1:20, name = "Proportions", tax_seriation = "Identity")
- Even though we have picked the top 20 most abundant genera, there are still a lot of zeros
- Problem: We need to deal with the zeros, because
log(0)
is undefined. - Solution: add a small amount to every value (or just every zero), before applying the log transformation.
- This small value is often called a pseudo-count.
Centered Log Ratio transformation:
Remember, Microbiome Datasets Are Compositional: And This Is Not Optional.
TL;DR - the CLR transformation is useful for compositional microbiome data.
- Practically, the CLR transformation involves finding the geometric mean of each sample
- Then dividing abundance of each taxon in that sample by this geometric mean
- Finally, you take the natural log of these ratios
%>%
infants tax_sort(by = sum, at = "genus", trans = "compositional", tree_warn = FALSE) %>%
tax_agg(rank = "genus") %>%
tax_transform(trans = "clr", zero_replace = "halfmin", chain = TRUE) %>%
comp_heatmap(
samples = 1:20, taxa = 1:20, grid_lwd = 2, name = "CLR",
colors = heat_palette(sym = TRUE),
tax_seriation = "Identity"
)
PCA on CLR-transformed taxa
%>%
infants tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_transform(rank = "genus", trans = "clr", zero_replace = "halfmin") %>%
ord_calc(method = "PCA") %>%
ord_plot(alpha = 0.6, size = 2, color = "birth_mode") +
theme_classic(12) +
coord_fixed(0.7)
- After the CLR transformation, the plot looks better
- We can see a pattern where the gut microbiomes of infants cluster by birth mode
So why is PCA interesting for us?
Principal components are built directly from a (linear) combination of the original features.
That means we know how much each taxon contributes to each PC dimension
We can plot this information (loadings) as arrows, alongside the sample points
<- infants %>%
pca tax_filter(min_prevalence = 2.5 / 100, verbose = FALSE) %>%
tax_transform(rank = "genus", trans = "clr", zero_replace = "halfmin") %>%
ord_calc(method = "PCA") %>%
ord_plot(
alpha = 0.6, size = 2, color = "birth_mode",
plot_taxa = 1:6, tax_vec_length = 0.275,
tax_lab_style = tax_lab_style(
type = "text", max_angle = 90, aspect_ratio = 0.7,
size = 3, fontface = "bold"
),+
) theme_classic(12) +
coord_fixed(0.7, clip = "off")
pca
Interestingly, samples on the right of the plot (which tend to be vaginally-delivered infants) seem to have relatively more Bifidobacterium, Bacteroides and Escherichia, whilst the C-section born infants have relatively more Veillonella.
Taxon stats
From the PCA loadings and barplots below, we have some strong suspicions about which taxa have a higher relative abundance in vaginally delivered infants than in c-section delivered infants, and vice versa, but we can also statistically test this. This is often called “differential abundance” (DA) testing, in the style of “differential expression” (DE) testing from the transcriptomics field.
%>%
infants comp_barplot(
tax_level = "genus", n_taxa = 12, facet_by = "birth_mode",
label = NULL, bar_outline_colour = NA
+
) coord_flip() +
theme(axis.ticks.y = element_blank())
Model one taxon
- We will start by creating a linear regression model for one genus, Bacteroides.
- We will transform the count data by first making it proportions, and then taking a base 2 logarithm, log2, after adding a pseudocount.
<- infants %>%
bacteroidesRegression1 tax_transform("compositional", rank = "genus") %>%
tax_model(
type = "lm", rank = "genus",
trans = "log2", trans_args = list(zero_replace = "halfmin"),
taxa = "Bacteroides", variables = "birth_mode",
return_psx = FALSE
%>%
) pluck(1)
Modelling: Bacteroides
- Looking at the regression results
summary(bacteroidesRegression1)
Call:
Bacteroides ~ birth_mode
Residuals:
Min 1Q Median 3Q Max
-7.7492 -0.6172 -0.6172 2.6421 18.0804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.3756 0.4863 -39.84 <2e-16 ***
birth_modevaginal 7.1320 0.6812 10.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.957 on 304 degrees of freedom
Multiple R-squared: 0.265, Adjusted R-squared: 0.2626
F-statistic: 109.6 on 1 and 304 DF, p-value: < 2.2e-16
::tidy(bacteroidesRegression1, conf.int = TRUE) broom
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -19.4 0.486 -39.8 1.08e-122 -20.3 -18.4
2 birth_modevaginal 7.13 0.681 10.5 4.13e- 22 5.79 8.47
Covariate-adjusted model
We can fit a model with covariates, as we did for PERMANOVA
- We will convert the categorical variables into indicator (dummy) variables
- We will scale the continuous covariates to 0 mean and SD 1 (z-scores)
- You’ll see this will make our subsequent plots easier to interpret later
<- infants %>%
infants ps_mutate(
C_section = if_else(birth_mode == "c_section", true = 1, false = 0),
Female = if_else(sex == "female", true = 1, false = 0),
Birth_weight_Z = scale(birth_weight, center = TRUE, scale = TRUE),
Reads_Z = scale(number_reads, center = TRUE, scale = TRUE)
)
<- infants %>%
bacteroidesRegression2 tax_transform("compositional", rank = "genus") %>%
tax_model(
type = "lm", rank = "genus", taxa = "Bacteroides",
trans = "log2", trans_args = list(zero_replace = "halfmin"),
variables = c("C_section", "Female", "Birth_weight_Z", "Reads_Z"),
return_psx = FALSE
%>%
) pluck(1)
Warning in do.call(fun, list(txt)): 15 / 306 values are NA in Female
Warning in do.call(fun, list(txt)): 14 / 306 values are NA in Birth_weight_Z
Modelling: Bacteroides
- Looking at the regression results
summary(bacteroidesRegression2)
Call:
Bacteroides ~ C_section + Female + Birth_weight_Z + Reads_Z
Residuals:
Min 1Q Median 3Q Max
-9.4271 -2.1555 -0.4115 2.8176 18.1784
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.7942 0.6103 -19.325 <2e-16 ***
C_section -7.5696 0.7206 -10.505 <2e-16 ***
Female -0.3809 0.7101 -0.536 0.592
Birth_weight_Z 0.3277 0.3514 0.932 0.352
Reads_Z 0.5361 0.3620 1.481 0.140
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.934 on 286 degrees of freedom
(15 observations deleted due to missingness)
Multiple R-squared: 0.2854, Adjusted R-squared: 0.2754
F-statistic: 28.55 on 4 and 286 DF, p-value: < 2.2e-16
::tidy(bacteroidesRegression2, conf.int = TRUE) broom
# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -11.8 0.610 -19.3 8.15e-54 -13.0 -10.6
2 C_section -7.57 0.721 -10.5 4.81e-22 -8.99 -6.15
3 Female -0.381 0.710 -0.536 5.92e- 1 -1.78 1.02
4 Birth_weight_Z 0.328 0.351 0.932 3.52e- 1 -0.364 1.02
5 Reads_Z 0.536 0.362 1.48 1.40e- 1 -0.176 1.25
There are many DA methods!
- This method simple method is borrowed from MaAsLin2
- Note: they call the compositional transformation “Total Sum Scaling (TSS)”)
- This is quite a straightforward method, so we will stick with this for today
- But, many statistical methods have been developed for differential abundance analyses
Microbiome abundance data are quite awkward, statistically speaking, due to their sparseness and compositionality. Each successive method claims to handle some aspect of this awkwardness “better” than previous methods.
The aim is to have a method with adequate power to detect true associations, whilst controlling the type 1 error rate, the “false positive” reporting of associations that are not “truly” present.
Results are surprisingly inconsistent across the different methods, as demonstrated this year in a fascinating analysis by Jacob Nearing and colleagues.
So, what to do?
- Filter out the noise & interpret results with caution! use multiple testing corrections
- Remember it’s all relative (abundance)
- Try 2 or 3 methods and/or use same method as a previous study if replicating
- Avoid Lefse and edgeR?
- Beware: Not all methods allow covariate adjustment & few allow random effects (for time-series)
Now model all the taxa!?
- We’re not normally interested in just one taxon!
- It’s also hard to decide which taxonomic rank we are most interested in modelling!
- Lower ranks like species or ASVs give better resolution but also more sparsity and classification uncertainty…
- Higher ranks e.g. classes, could also be more powerful if you think most taxa within that class will follow a similar pattern.
- So now we will fit a similar model for almost every taxon* at every rank from phylum to species
- *We’ll filter out species with a prevalence of less than 10%
# The code for `taxatree_models` is quite similar to tax_model.
# However, you might need to run `tax_prepend_ranks` to ensure that each taxon at each rank is always unique.
<- infants %>%
shaoModels tax_prepend_ranks() %>%
tax_transform("compositional", rank = "species", keep_counts = TRUE) %>%
tax_filter(min_prevalence = 0.1, undetected = 0, use_counts = TRUE) %>%
taxatree_models(
type = lm,
trans = "log2", trans_args = list(zero_replace = "halfmin"),
ranks = c("phylum", "class", "order", "family", "genus", "species"),
variables = c("C_section", "Female", "Birth_weight_Z", "Reads_Z")
)
shaoModels
psExtra object - a phyloseq object with extra slots:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 39 taxa and 306 samples ]
sample_data() Sample Data: [ 306 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 39 taxa by 6 taxonomic ranks ]
otu_get(counts = TRUE) [ 39 taxa and 306 samples ]
psExtra info:
tax_agg = "species" tax_trans = "compositional"
taxatree_models list:
Ranks: phylum/class/order/family/genus/species
Why filter the taxa? It’s less likely that we are interested in rare taxa, and models of rare taxon abundances are more likely to be unreliable. Reducing the the number of taxa modelled also makes the process faster and makes visualizing the results easier!
Getting stats from the models
Next we will get a data.frame containing the regression coefficient estimates, test statistics and corresponding p values from all these regression models.
<- taxatree_models2stats(shaoModels)
shaoStats shaoStats
psExtra object - a phyloseq object with extra slots:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 39 taxa and 306 samples ]
sample_data() Sample Data: [ 306 samples by 15 sample variables ]
tax_table() Taxonomy Table: [ 39 taxa by 6 taxonomic ranks ]
otu_get(counts = TRUE) [ 39 taxa and 306 samples ]
psExtra info:
tax_agg = "species" tax_trans = "compositional"
taxatree_stats dataframe:
89 taxa at 6 ranks: phylum, class, order, family, genus, species
4 terms: C_section, Female, Birth_weight_Z, Reads_Z
%>% taxatree_stats_get() shaoStats
# A tibble: 356 × 8
term taxon rank formula estimate std.error statistic p.value
<fct> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 C_section p: Proteo… phyl… `p: Pr… -3.72 0.812 -4.58 7.01e- 6
2 Female p: Proteo… phyl… `p: Pr… -0.0684 0.800 -0.0855 9.32e- 1
3 Birth_weight_Z p: Proteo… phyl… `p: Pr… -0.253 0.396 -0.638 5.24e- 1
4 Reads_Z p: Proteo… phyl… `p: Pr… -0.369 0.408 -0.905 3.66e- 1
5 C_section p: Actino… phyl… `p: Ac… -4.25 0.739 -5.75 2.25e- 8
6 Female p: Actino… phyl… `p: Ac… -0.439 0.728 -0.603 5.47e- 1
7 Birth_weight_Z p: Actino… phyl… `p: Ac… 0.293 0.361 0.813 4.17e- 1
8 Reads_Z p: Actino… phyl… `p: Ac… -0.876 0.371 -2.36 1.90e- 2
9 C_section p: Firmic… phyl… `p: Fi… 2.43 0.342 7.12 8.74e-12
10 Female p: Firmic… phyl… `p: Fi… 0.697 0.337 2.07 3.95e- 2
# ℹ 346 more rows
Adjusting p values
We have performed a lot of statistical tests here!
It is likely that we could find some significant p-values by chance alone.
We should correct for multiple testing / control the false discovery rate or family-wise error rate.
Instead of applying these adjustment methods across all 86 taxa models at all ranks, the default behaviour is to control the family-wise error rate per taxonomic rank.
<- shaoStats %>% taxatree_stats_p_adjust(method = "BH", grouping = "rank")
shaoStats # notice the new variable
%>% taxatree_stats_get() shaoStats
# A tibble: 356 × 9
# Groups: rank [6]
term taxon rank formula estimate std.error statistic p.value p.adj.BH.rank
<fct> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C_se… p: P… phyl… `p: Pr… -3.72 0.812 -4.58 7.01e- 6 2.81e- 5
2 Fema… p: P… phyl… `p: Pr… -0.0684 0.800 -0.0855 9.32e- 1 9.32e- 1
3 Birt… p: P… phyl… `p: Pr… -0.253 0.396 -0.638 5.24e- 1 6.45e- 1
4 Read… p: P… phyl… `p: Pr… -0.369 0.408 -0.905 3.66e- 1 6.39e- 1
5 C_se… p: A… phyl… `p: Ac… -4.25 0.739 -5.75 2.25e- 8 1.20e- 7
6 Fema… p: A… phyl… `p: Ac… -0.439 0.728 -0.603 5.47e- 1 6.45e- 1
7 Birt… p: A… phyl… `p: Ac… 0.293 0.361 0.813 4.17e- 1 6.39e- 1
8 Read… p: A… phyl… `p: Ac… -0.876 0.371 -2.36 1.90e- 2 5.07e- 2
9 C_se… p: F… phyl… `p: Fi… 2.43 0.342 7.12 8.74e-12 6.99e-11
10 Fema… p: F… phyl… `p: Fi… 0.697 0.337 2.07 3.95e- 2 9.03e- 2
# ℹ 346 more rows
Plot all the taxatree_stats!
taxatree_plots()
allows you to plot statistics from all of the taxa models onto a tree layout (e.g. point estimates and significance).- The taxon model results are organised by rank, radiating out from the central root node
- e.g. from Phyla around the center to Species in the outermost ring.
taxatree_plots()
itself returns a list of plots, which you can arrange into one figure with the patchwork
package for example (and/or cowplot
).
%>%
shaoStats taxatree_plots(node_size_range = c(1, 3), sig_stat = "p.adj.BH.rank") %>%
::wrap_plots(ncol = 2, guides = "collect") patchwork
Taxatree Key
But how do we know which taxa are which nodes? We can create a labelled grey tree with taxatree_plotkey()
. This labels only some of the taxa based on certain conditions that we specify.
set.seed(123) # label position
<- shaoStats %>%
key taxatree_plotkey(
taxon_renamer = function(x) stringr::str_remove(x, "[pfg]: "),
# conditions below, for filtering taxa to be labelled
== "phylum" | rank == "genus" & prevalence > 0.2
rank # all phyla are labelled, and all genera with a prevalence of over 0.2
) key
You can do more with these trees to customise them to your liking. See an extended tutorial here on the microViz website: including how to directly label taxa on the colored plots, change the layout and style of the trees, and even how to use a different regression modelling approach.
# try it out!
Session info
session_info
records your package versions etc. This is useful for debugging / reproducing analysis.