These exercises will introduce some core concepts in microbiome analysis, using example data
Follow along by writing code in your own R script (or an R Markdown notebook, or a Quarto doc)
# Code blocks have a copy button in the top right -> -> -> -> ->cat("- Copy and modify the code examples\n- Try to understand what it does, and play around!")
- Copy and modify the code examples
- Try to understand what it does, and play around!
Click on these collapsed callout blocks for more info
Topics for part 1:
phyloseq objects for microbiome data
Stacked bar charts for visualizing microbiome compositions
Ecosystem diversity indices
Getting help
The microViz website has help pages for every function (and more tutorials)
Ask your neighbour, a TA, or me
ChatGPT doesn’t know anything about microViz…
⚙️ Setup
Open your RStudio Server on your instance
Find the microbiome analysis folder
Click on the file ending .Rproj to open an RStudio project
Sidenote: what is an RStudio project?
A convenient aid for organising multiple R analysis projects. When activated it sets the working directory to the location of the .Rproj file. You can read more here: https://rstats.wtf/projects
Go to the RStudio “Git” tab, and click on Pull (this will ensure you have this year’s example data)
We’ll use two example datasets in this evening’s exercises
Both contain gut microbiome taxonomic composition data, from stool samples
The first is from a 2019 study of antibiotic administration in mice
The mouse antibiotics study used 16S rRNA gene amplicon sequencing
They used Illumina MiSeq and processed the data into ASVs using DADA2
# Check data_directory is correct path on AMI to data folderdata_directory <- here::here("data") # A manageable small subset of the mouse data to start practicing withmice <-readRDS(file.path(data_directory, "mice.rds"))# And the full dataset, we will use it later - ignore it for nowallMice <-readRDS(file.path(data_directory, "allMice.rds"))
The second is from a large birth cohort study, containing human infants born by C-section or vaginal delivery
The birth cohort study data is shotgun metagenomics data
They used Illumina HiSeq and inferred the abundance of species-like features with metaphlan3.
data("shao19", package ="microViz")
All analysis approaches we’ll learn this evening can be applied to both kinds of data.
These methods are also appropriate for microbiome data from other environments, not just the gut microbiota!
microViz has been used by groups analysing microbiomes associated with Antarctic sponges, Kiwi birds, and wastewater samples, amongst other things.
🦠 Intro to phyloseq
We will start with some processed microbiota data from the mouse study.
So far we looked just at one day of measurements, but stool samples were collected on multiple days.
Let’s grab the full dataset for more practice arranging our bar charts.
allMice <-tax_fix(allMice, verbose =FALSE)allMice
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3229 taxa and 520 samples ]
sample_data() Sample Data: [ 520 samples by 11 sample variables ]
tax_table() Taxonomy Table: [ 3229 taxa by 6 taxonomic ranks ]
Code to set up a grid layout using cage numbers and letters
# Convert sample timings to a factor variable, for correct temporal ordering. allMice <- allMice %>%ps_mutate(Days =factor(x = treatment_days,levels =paste0("D", c(".14", "0", "3", "7", "13", "16", "18", "20")),labels =c("-14", "0", "3", "7", "13", "16", "18", "20") ))# Separate the cage numbers and letters to allow a grid layout.allMice <- allMice %>%ps_mutate(cage_number = stringr::str_extract(cage, "[1-9]"),cage_letter = stringr::str_extract(cage, "[A-Z]") )# Check the grid layout: plot the virus infected, vehicle control groupallMice %>%ps_filter(virus =="WNV2000", treatment =="Vehicle") %>%comp_barplot(tax_level ="Family", n_taxa =13, x ='Days',bar_width =0.8, merge_other =FALSE ) +facet_grid(cols =vars(cage_number), rows =vars(cage_letter))
Registered S3 method overwritten by 'seriation':
method from
reorder.hclust vegan
Code to create a big plot of all samples from WNV-exposed mice
# Create a list of identically themed and coloured plots with the group_by argument.plotList <- allMice %>%ps_mutate(group =paste(treatment, virus)) %>%comp_barplot(tax_level ="Family", n_taxa =13, x ="Days", group_by ="group")# Arrange the WNV-exposed mice plots with patchwork, in 4 rowsplotsWNV <-wrap_plots(plotList[grep(x =names(plotList), pattern ="WNV2000")]) plotsWNV <- plotsWNV +plot_layout(guides ="collect", nrow =4, heights =c(2,2,3,2))# Add faceting to all the plots in the list, with patchwork's `&` operatorplotsWNV <- plotsWNV &facet_grid(cols =vars(cage_number), rows =vars(cage_letter))# Write these plots to a large image file!ggsave(plot = plotsWNV, filename ="WNVplots.pdf", height =12, width =8)
Exercise - write code for a plot of all control mice (no virus)
A true measure of ecosystem diversity will also consider the evenness of the ecosystem.
A rich ecosystem predominated by one taxon is still intuitively a less diverse ecosystem than one with an even distribution of the same number of taxa.
The exponent of the Shannon index \(e^H\) represents the number of taxa that would be present in an evenly abundant ecosystem with the same Shannon index.
Now each sample is labelled with \(e^H\) - do you see a more obvious relationship to sample composition?
Explanation of the Effective Shannon Diversity
The numeric value of the Shannon index itself has no intuitive meaning.
You can compare them, but can’t easily interpret any one number.
So, the concept of “effective numbers” of taxa is useful here.
If your original ecosystem was actually perfectly even, then \(e^H = N\)
Where N is the observed richness
The more uneven an ecosystem, the further \(e^H\) will be from \(N\)
Comparing Diversity
Let us now compare the diversity of our samples.
Perhaps we hypothesise that the mouse gut diversity differs by type of antibiotic treatment.
The mice exposed to Ampicillin (± Metronidazole) appear to have higher diversity at this day 7 sample!
# A simple statistical test supports this assertionmiceShannonDf %>%mutate(Ampicillin =if_else(grepl(pattern ="Amp", x = treatment), 1, 0)) %>%wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = .)
Wilcoxon rank sum exact test
data: Exp_Shannon by Ampicillin
W = 59, p-value = 2.919e-06
alternative hypothesis: true location shift is not equal to 0
Notes on R pipe syntax
As you probably know by now, %>% pipes the result from one function into the first argument of the next function.
This works fine for most functions, but sometimes you need to pipe the result into the second or third argument, not the first!
This is possible, you must include a . as a placeholder to tell the pipe where to put the result.
miceShannonDf %>%mutate(Ampicillin =if_else(grepl(pattern ="Amp", x = treatment), 1, 0)) %>%wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = .)# this dot is important! ^
Since R version 4, you can also do this with the “native” pipe |>, but you must use a _ as the placeholder, instead of a .
miceShannonDf |>mutate(Ampicillin =if_else(grepl(pattern ="Amp", x = treatment), 1, 0)) |>wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = _)# this _ is important! ^
If you don’t like pipes, you must make intermediate objects, like this.
dataForWilcoxTest <-mutate(.data = miceShannonDf, Ampicillin =if_else(grepl(pattern ="Amp", x = treatment), 1, 0)) wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = dataForWilcoxTest)
Wilcoxon rank sum exact test
data: Exp_Shannon by Ampicillin
W = 59, p-value = 2.919e-06
alternative hypothesis: true location shift is not equal to 0
Use whichever you prefer and feel is most organised and readable! 😎
You can apply more complex statistical tests if you like
e.g. adjusting for covariates with linear regression, using lm()
Call:
lm(formula = Exp_Shannon ~ Ampicillin + virus, data = .)
Residuals:
Min 1Q Median 3Q Max
-5.0571 -0.5624 0.1993 1.0744 4.2906
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9133 0.4977 3.844 0.000287 ***
Ampicillin 2.9685 0.4844 6.128 6.76e-08 ***
virusWNV2000 1.1967 0.5233 2.287 0.025624 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.944 on 62 degrees of freedom
Multiple R-squared: 0.417, Adjusted R-squared: 0.3982
F-statistic: 22.17 on 2 and 62 DF, p-value: 5.454e-08
Final activities for part 1:
Try making other diversity comparisons for yourself?
Consider comparing ASV-level diversity?
Does virus exposure influence diversity?
What about comparing richness?
Practice making plots and doing simple statistical tests.
###
⌚ Break Time! ☕️ ️
You want more about diversity?
Click here for additional notes on richness and readcount
Simple measures like Observed Richness are sensitive to what ecologists call “sampling effort”.
For macroecologists, this is actually how much time/effort you spent trying to count all the organisms present in an ecosystem.
In our case, the amount of total reads obtained represents the sampling effort: more reads, more effort.
Indeed we can see that the samples with a much lower readcount have lower observed richness.
Furthermore, as this richness estimate is based on a sample, and not the actual ecosystem, the richness estimate actually has quantifiable uncertainty too.
Simple solution: Ignore the problem. Whilst you can’t interpret the richness of any individual sample as being correct, it is still usually valid to compare richness across groups of samples, as the readcount variation is only random noise, and should be uncorrelated with your grouping variable (but do check this).