```
# Code blocks have a copy button in the top right -> -> -> -> ->
cat("- Copy and modify the code examples\n- Try to understand what it does, and play around!")
```

```
- Copy and modify the code examples
- Try to understand what it does, and play around!
```

Microbiome analysis with microViz

- These exercises will introduce some core concepts in microbiome analysis, using example data
- Follow along by writing code in your own R script (or an R Markdown notebook, or a Quarto doc)

```
# Code blocks have a copy button in the top right -> -> -> -> ->
cat("- Copy and modify the code examples\n- Try to understand what it does, and play around!")
```

```
- Copy and modify the code examples
- Try to understand what it does, and play around!
```

- Open your RStudio server instance
- Find the microbiome analysis folder
- Click on the file ending
**.Rproj**to open an RStudio project - Go to the RStudio
**βGitβ**tab, and click on**Pull**(this will ensure you have this yearβs example data)

- Open a new R script / notebook / quarto doc
- Load the R packages we will be using:

```
library(tidyverse)
library(phyloseq)
library(microViz)
library(shiny)
```

- Explore some microbiomes!

- Weβll use two example datasets in this eveningβs exercises
- Both contain gut microbiome taxonomic composition data, from stool samples
- The first is from a 2019 study of antibiotic administration in mice
- The mouse antibiotics study used 16S rRNA gene amplicon sequencing
- They used Illumina MiSeq and processed the data into ASVs using DADA2

```
# Check data_directory is correct path on AMI to data folder
<- here::here("data")
data_directory
# A manageable small subset of the mouse data to start practicing with
<- readRDS(file.path(data_directory, "mice.rds"))
mice
# And the full dataset, we will use it later - ignore it for now
<- readRDS(file.path(data_directory, "allMice.rds")) allMice
```

- The second is from a large birth cohort study, containing human infants born by C-section or vaginal delivery
- The birth cohort study data is shotgun metagenomics data
- They used Illumina HiSeq and inferred the abundance of species-like features with metaphlan3.

`data("shao19", package = "microViz")`

- All analysis approaches weβll learn this evening can be applied to both kinds of data.
- These methods are also appropriate for microbiome data from other environments, not just the gut microbiota!

*microViz has been used by groups analysing microbiomes associated with Antarctic sponges, Kiwi birds, and wastewater samples, amongst other things.*

We will start with some processed microbiota data from the mouse study.

`mice`

is a `phyloseq`

S4 object, containing a small subset of these data.

- Printing the object shows you some basic info.
- It also names three functions you can use to access the data inside.

` mice`

```
phyloseq-class experiment-level object
otu_table() OTU Table: [ 591 taxa and 45 samples ]
sample_data() Sample Data: [ 45 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 591 taxa by 6 taxonomic ranks ]
```

- Have a look around inside the different parts of this object using accessor functions.
- phyloseq and microViz both provide accessor functions, some to try are included below.

```
samdat_tbl()
sample_variables()
otu_get()
nsamples()
ntaxa()
sample_names()
taxa_names()
# don't forget you can use head() to truncate long output!
```

- You might have noticed the the taxonomy table has some
`NA`

values. - This often occurs when an ASV cannot be uniquely classified at the Genus level.
- The short 16S region sequenced may only allow a unique classification at Family rank or above.

`tax_table(mice)[1:5, ]`

```
Taxonomy Table: [5 taxa by 6 taxonomic ranks]:
Kingdom Phylum Class Order
ASV0001 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
ASV0002 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Xanthomonadales"
ASV0003 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
ASV0004 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
ASV0005 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales"
Family Genus
ASV0001 "Porphyromonadaceae" NA
ASV0002 "Xanthomonadaceae" "Stenotrophomonas"
ASV0003 "Porphyromonadaceae" NA
ASV0004 "Porphyromonadaceae" NA
ASV0005 "Porphyromonadaceae" NA
```

We need to fill those gaps! and we can do this with

`tax_fix()`

`tax_fix()`

copies info down from a higher rank, to fill the gaps.First, letβs look at the taxonomy table interactively, with

`tax_fix_interactive()`

```
# tax_fix_interactive(mice) # run this in the R Console
# You may need to allow popups on your browser!
```

- Letβs update our
`mice`

phyloseq object with this fix.

`<- tax_fix(mice, verbose = FALSE) mice `

- Check the first few taxa now look correct?

`# Hint: use tax_table and head`

- Okay, so how do we look at the microbiota abundance data?
- Weβre first going to make stacked compositional bar charts using the
`microViz`

package,`comp_barplot()`

function.

- Lets start with a smaller subset of the data.
- Just the control group (Vehicle treatment) at day 13.

```
# We can filter the samples like this, using the sample_data information
%>% ps_filter(treatment == "Vehicle") # similar to a dplyr filter! mice
```

```
phyloseq-class experiment-level object
otu_table() OTU Table: [ 394 taxa and 10 samples ]
sample_data() Sample Data: [ 10 samples by 8 sample variables ]
tax_table() Taxonomy Table: [ 394 taxa by 6 taxonomic ranks ]
```

```
%>%
mice ps_filter(treatment == "Vehicle") %>%
tax_names2rank("ASV") %>%
comp_barplot(
tax_level = "ASV", n_taxa = 12,
bar_width = 0.7, sample_order = "asis"
)
```

- The ASVs have uninformative numeric IDs, but we can fix that with
`tax_rename()`

```
%>%
mice ps_filter(treatment == "Vehicle") %>%
tax_rename(rank = "Family", pad_digits = 2) %>% # this changes the taxa_names
tax_names2rank("ASV") %>%
comp_barplot(
tax_level = "ASV", n_taxa = 12,
bar_width = 0.7, sample_order = "asis"
)
```

- Sadly we donβt have enough distinct colours to show all the unique ASVs.
- Instead we can
**aggregate**the counts at a higher taxonomic rank, e.g. Families.

```
%>%
mice ps_filter(treatment == "Vehicle") %>%
comp_barplot(
tax_level = "Family", n_taxa = 9,
sample_order = "asis", bar_width = 0.7,
merge_other = FALSE
)
```

- Try aggregating at other ranks, e.g. Class.

- Letβs look at all of the data from day 7, from all treatment groups.
- We can add the ggplot2
`facet_wrap()`

to our plot, to separate the treatment groups.

```
%>%
mice comp_barplot(
tax_level = "Family", n_taxa = 10, merge_other = FALSE,
sample_order = "asis", label = "cage"
+
) facet_wrap(facets = vars(treatment), scales = "free") +
labs(
title = "Day 7 mouse fecal microbiota compositions",
x = "Mouse ID", y = "Relative Abundance"
+
) coord_flip()
```

`# coord_flip exchanges x and y axes, which can be more readable`

How diverse is the bacterial microbiome of each sample?

- The simplest richness measure is just counting, a.k.a. βObserved Richnessβ.
- Letβs compute the observed richness and sort and label some samples.
`ps_calc_richness()`

computes the index for each sample and adds it to your sample_data

```
%>%
mice ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
ps_calc_richness(rank = "Family", index = "observed", varname = "N_Families") %>%
ps_arrange(N_Families) %>%
comp_barplot(
tax_level = "Family", n_taxa = 14, label = "N_Families", bar_width = 0.8,
sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
+
) coord_flip()
```

- A true measure of ecosystem diversity will also consider the evenness of the ecosystem.
- A rich ecosystem predominated by one taxon is still intuitively a less diverse ecosystem than one with an even distribution of the same number of taxa.

```
%>%
mice ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
ps_calc_diversity(rank = "Family", index = "shannon", varname = "Shannon_Family") %>%
ps_arrange(Shannon_Family) %>%
ps_mutate(Shannon = formatC(Shannon_Family, digits = 2, format = "f")) %>%
comp_barplot(
tax_level = "Family", n_taxa = 14, label = "Shannon", bar_width = 0.8,
sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
+
) coord_flip()
```

- The Shannon index is a commonly used diversity measure, with this formula: \(H = -\sum_{i=1}^Np_i\ln p_i\)
- Shannon index is often is denoted with \(H\), and here \(p\) denotes proportion of the \(i\)βth taxon.

```
%>%
mice ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
ps_arrange(Exp_Shannon) %>%
ps_mutate(Exp_Shannon = formatC(Exp_Shannon, digits = 2, format = "f")) %>%
comp_barplot(
tax_level = "Family", n_taxa = 14, label = "Exp_Shannon", bar_width = 0.8,
sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
+
) coord_flip()
```

- The exponent of the Shannon index \(e^H\) represents the number of taxa that would be present in an evenly abundant ecosystem with the same Shannon index.

- Let us now compare the diversity of our samples.
- Perhaps we hypothesise that the mouse gut diversity differs after antibiotic treatment.

```
<- mice %>%
miceShannonDf ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
samdat_tbl()
```

```
%>%
miceShannonDf ggplot(aes(y = Exp_Shannon, x = treatment, color = treatment)) +
geom_boxplot(alpha = 0.5, width = 0.5) +
geom_point(position = position_jitter(width = 0.1), alpha = 0.5, size = 2) +
scale_color_brewer(palette = "Set1", guide = NULL) +
labs(y = "Effective Shannon diversity (Family)", x = NULL) +
theme_bw()
```

- The mice exposed to Ampicillin (Β± Metronidazole) appear to have higher diversity at this day 7 sample!

```
# A simple statistical test supports this assertion
%>%
miceShannonDf mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) %>%
wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = .)
```

```
Wilcoxon rank sum exact test
data: Exp_Shannon by Ampicillin
W = 59, p-value = 2.919e-06
alternative hypothesis: true location shift is not equal to 0
```

- You can apply more complex statistical tests if you like
- e.g. adjusting for covariates with linear regression, using
`lm()`

```
%>%
allMice ps_filter(treatment_days == "D7") %>%
ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
samdat_tbl() %>%
mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) %>%
lm(formula = Exp_Shannon ~ Ampicillin + virus, data = .) %>%
summary()
```

```
Call:
lm(formula = Exp_Shannon ~ Ampicillin + virus, data = .)
Residuals:
Min 1Q Median 3Q Max
-5.0571 -0.5624 0.1993 1.0744 4.2906
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9133 0.4977 3.844 0.000287 ***
Ampicillin 2.9685 0.4844 6.128 6.76e-08 ***
virusWNV2000 1.1967 0.5233 2.287 0.025624 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.944 on 62 degrees of freedom
Multiple R-squared: 0.417, Adjusted R-squared: 0.3982
F-statistic: 22.17 on 2 and 62 DF, p-value: 5.454e-08
```

**Final activities for part 1:**

- Try making other diversity comparisons for yourself?
- Consider comparing ASV-level diversity?
- Does virus exposure influence diversity?
- What about comparing richness?
- Practice making plots and doing simple statistical tests.

`session_info`

records your package versions etc. This is useful for debugging / reproducing analysis.