Evomics Workshop part 1

Microbiome analysis with microViz

Author

David Barnett

Published

December 17, 2024

👋 Instructions

These exercises will introduce some core concepts in microbiome analysis, using example data
Follow along by writing code in your own R script (or an R Markdown notebook, or a Quarto doc)

# Code blocks have a copy button in the top right -> -> -> -> ->
cat("- Copy and modify the code examples\n- Try to understand what it does, and play around!")

- Copy and modify the code examples
- Try to understand what it does, and play around!

Click on these collapsed callout blocks for more info

Topics for part 1:

phyloseq objects for microbiome data
Stacked bar charts for visualizing microbiome compositions
Ecosystem diversity indices

Getting help

The microViz website has help pages for every function (and more tutorials)
Ask your neighbour, a TA, or me
ChatGPT doesn’t know anything about microViz…

⚙️ Setup

Open RStudio on your instance
Find the microbiome analysis folder
Click on the file ending .Rproj to open an RStudio project

Sidenote: what is an RStudio project?

A convenient aid for organising multiple R analysis projects. When activated it sets the working directory to the location of the .Rproj file. You can read more here: https://rstats.wtf/projects

Go to the RStudio “Git” tab, and click on Pull (this will ensure you have this year’s example data)
Open a new R script / notebook / quarto doc
Load the R packages we will be using:

library(tidyverse)
library(phyloseq)
library(microViz)
library(shiny)

Explore some microbiomes!

We’ll use two example datasets in this evening’s exercises
Both contain gut microbiome taxonomic composition data, from stool samples
The first is from a 2019 study of antibiotic administration in mice
The mouse antibiotics study used 16S rRNA gene amplicon sequencing
- They used Illumina MiSeq and processed the data into ASVs using DADA2

# Check data_directory is correct path on AMI to data folder
data_directory <- here::here("data") 

# A manageable small subset of the mouse data to start practicing with
mice <- readRDS(file.path(data_directory, "mice.rds"))

# And the full dataset, we will use it later - ignore it for now
allMice <- readRDS(file.path(data_directory, "allMice.rds"))

The second is from a large birth cohort study, containing human infants born by C-section or vaginal delivery
The birth cohort study data is shotgun metagenomics data
- They used Illumina HiSeq and inferred the abundance of species-like features with metaphlan3.

data("shao19", package = "microViz")

All analysis approaches we’ll learn this evening can be applied to both kinds of data.
These methods are also appropriate for microbiome data from other environments, not just the gut microbiota!

microViz has been used by groups analysing microbiomes associated with Antarctic sponges, Kiwi birds, and wastewater samples, amongst other things.

🦠 Intro to phyloseq

We will start with some processed microbiota data from the mouse study.

Click here for details about the mice dataset:

The data originate from a study on the effects of oral antibiotics on West Nile Virus (WNV) infection
Sequence data was generated from extracted DNA from mouse stool samples.
DNA was amplified using primers specific for the 16S rRNA gene V4 region using primers 515F/806R.
Each mouse was caged separately.

The study followed WNV infection after the following treatments:

Ampicillin (Amp): https://en.wikipedia.org/wiki/Ampicillin
Metronidazole (Metro): https://en.wikipedia.org/wiki/Metronidazole
Ampicillin + Metronidazole (AmpMetro)
Koolaid Control (Vehicle): Antibiotics taste bad, so Koolaid is added as a sweetener!

Lab mouse enjoying Koolaid. Imagined by Lexica. — *Yum, Koolaid!*

Treatments were supplied ad libitum for 2 weeks prior to viral infection and maintained for 2 weeks post-infection.

The primary outcome of the study was mouse survival.
Each treatment group had two subgroups:
1. mice infected with WNV, via a subcutaneous foot pad injection
2. mice left uninfected, as another set of controls (we will not use these samples at the start)

mice is a phyloseq S4 object, containing a small subset of these data.

Printing the object shows you some basic info.
It also names three functions you can use to access the data inside.

mice

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 591 taxa and 45 samples ]
sample_data() Sample Data:       [ 45 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 591 taxa by 6 taxonomic ranks ]

Have a look around inside the different parts of this object using accessor functions.
phyloseq and microViz both provide accessor functions, some to try are included below.

samdat_tbl()
sample_variables()
otu_get()
nsamples()
ntaxa()
sample_names()
taxa_names() 
# don't forget you can use head() to truncate long output!

Tip: get an interactive View()

You can also try using the View() function on each part
But remember not to leave interactive functions like View in your script!

The @ symbol, a risky shortcut?

You can also use the @ symbol to access the slots of S4 objects e.g. phyloseq
This can be a quick shortcut when exploring a data object interactively

mice@otu_table[1:5, 1:5]

OTU Table:          [5 taxa and 5 samples]
                     taxa are columns
      ASV0001 ASV0002 ASV0003 ASV0004 ASV0005
D7.B1     709      33    1715     224     555
D7.B2    3896      22    7808    1201    4306
D7.B3     494      13    2657    4841     175
D7.B4    3147      80    4359    6269    3795
D7.B5    1663      11    3577     283    1814

mice@sam_data %>% View()

But, it is slightly risky to directly access the internal structure of S4 objects!
Package authors may change the object’s internal structure when they release a new version, breaking your code.
So in your scripts, prefer accessor functions over @

Filling gaps in the taxonomy table

You might have noticed that the taxonomy table has some NA values.
This often occurs when an ASV cannot be uniquely classified at the Genus level.
The short 16S region sequenced may only allow a unique classification at Family rank or above.

tax_table(mice)[1:5, ]

Taxonomy Table:     [5 taxa by 6 taxonomic ranks]:
        Kingdom    Phylum           Class                 Order            
ASV0001 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"  
ASV0002 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Xanthomonadales"
ASV0003 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"  
ASV0004 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"  
ASV0005 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"  
        Family               Genus             
ASV0001 "Porphyromonadaceae" NA                
ASV0002 "Xanthomonadaceae"   "Stenotrophomonas"
ASV0003 "Porphyromonadaceae" NA                
ASV0004 "Porphyromonadaceae" NA                
ASV0005 "Porphyromonadaceae" NA

We need to fill those gaps! and we can do this with tax_fix()
tax_fix() copies info down from a higher rank, to fill the gaps.
First, let’s look at the taxonomy table interactively, with tax_fix_interactive()

# tax_fix_interactive(mice) # run this in the R Console 
# You may need to allow popups on your browser!

Press Stop!

Running tax_fix_interactive(mice) will open a new web browser tab.
When you are done looking, click the red STOP button in the R console!

Let’s update our mice phyloseq object with this fix.

mice <- tax_fix(mice, verbose = FALSE)

Check the first few taxa now look correct?

# Hint: use tax_table and head

📊 Plot some microbiome compositions!

Okay, so how do we look at the microbiota abundance data?
We’re first going to make stacked compositional bar charts using the microViz package, comp_barplot() function.

Stacked compositional barplots

Lets start with a smaller subset of the data.
Just the control group (Vehicle treatment) at day 13.

# We can filter the samples like this, using the sample_data information
mice %>% ps_filter(treatment == "Vehicle") # similar to a dplyr filter!

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 394 taxa and 10 samples ]
sample_data() Sample Data:       [ 10 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 394 taxa by 6 taxonomic ranks ]

Tip: Forgot the variable’s levels?

# Use table()
mice@sam_data$treatment %>% table()

mice %>%
  ps_filter(treatment == "Vehicle") %>%
  tax_names2rank("ASV") %>% 
  comp_barplot(
    tax_level = "ASV", n_taxa = 12, 
    bar_width = 0.7, sample_order = "asis"
  )

The ASVs have uninformative numeric IDs, but we can fix that with tax_rename()

mice %>% 
  ps_filter(treatment == "Vehicle") %>% 
  tax_rename(rank = "Family", pad_digits = 2) %>% # this changes the taxa_names
  tax_names2rank("ASV") %>%
  comp_barplot(
    tax_level = "ASV", n_taxa = 12, 
    bar_width = 0.7, sample_order = "asis"
  )

Tip: What is that function?

When you see a function you are unfamiliar with, e.g. tax_names2rank(), look for documentation:

Run ?tax_names2rank in the console, or click on the function and press the F1 key (F2 shows the source code!)
Many packages have nice websites: search for the function reference page, e.g. https://david-barnett.github.io/microViz/reference/index.html
Test out the function in isolation, does it do what you expected?

mice %>% 
  tax_rename(rank = "Family", pad_digits = 2) %>% 
  taxa_names() %>% 
  head(12)

Aggregating taxa

Sadly we don’t have enough distinct colours to show all the unique ASVs.
Instead we can aggregate the counts at a higher taxonomic rank, e.g. Families.

mice %>%
  ps_filter(treatment == "Vehicle") %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 9, 
    sample_order = "asis", bar_width = 0.7,
    merge_other = FALSE
  )

Tip: Use ggsave to save your plots?

Viewing your plots in the RStudio Plots window is okay for practice
But don’t copy paste them to save them, as there is a much better way!

# Assign your plot to an R object
miceBarsD7koolaid <- mice %>%
  ps_filter(treatment == "Vehicle") %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 9, 
    sample_order = "asis", bar_width = 0.7,
    merge_other = FALSE
  )

# Write the plot to a file, with ggsave
ggsave(
  filename = "mice-barchart-day7-WNV.pdf",
  plot = miceBarsD7koolaid, width = 6.5, height = 3.5
)

Be sure to carefully adjust the sizing and resolution of your plots for your paper or presentation!

Try aggregating at other ranks, e.g. Class.

#
#
#

Fickle Phyla, beware!

There were major changes to bacterial taxonomy last year
Many of the phylum names in our dataset are now officially outdated.
Currently, some research is published with the old names and some with the new names.
Here are the key changes to be aware of for the human gut microbiota
- Actinobacteria is now Actinomycetota
- Bacteroidetes is now Bacteroidota
- Proteobacteria is now Pseudomonadota (!)
- Firmicutes is now Bacillota (!?!)

Organising your bar charts

Let’s look at all of the data from day 7, from all treatment groups.
We can add the ggplot2 facet_wrap() to our plot, to separate the treatment groups.

mice %>% 
  comp_barplot(
    tax_level = "Family", n_taxa = 10, merge_other = FALSE,
    sample_order = "asis", label = "cage"
  ) +
  facet_wrap(facets = vars(treatment), scales = "free") +
  labs(
    title = "Day 7 mouse fecal microbiota compositions",
    x = "Mouse ID", y = "Relative Abundance"
  ) +
  coord_flip() # coord_flip exchanges x and y axes, which can be more readable

Optional extension exercise: More mice data!

Not recommended for R or ggplot2 beginners.

This optional plotting exercise is more complex, and aims to exemplify the kind of complex manual arrangements enabled by comp_barplot.

# Additional packages needed
library(stringr)
library(patchwork)

So far we looked just at one day of measurements, but stool samples were collected on multiple days.
Let’s grab the full dataset for more practice arranging our bar charts.

allMice <- tax_fix(allMice, verbose = FALSE)
allMice

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3229 taxa and 520 samples ]
sample_data() Sample Data:       [ 520 samples by 11 sample variables ]
tax_table()   Taxonomy Table:    [ 3229 taxa by 6 taxonomic ranks ]

Code to set up a grid layout using cage numbers and letters

# Convert sample timings to a factor variable, for correct temporal ordering. 
allMice <- allMice %>% 
  ps_mutate(Days = factor(
    x = treatment_days,
    levels = paste0("D", c(".14", "0", "3", "7", "13", "16", "18", "20")),
    labels = c("-14", "0", "3", "7", "13", "16", "18", "20")
  ))

# Separate the cage numbers and letters to allow a grid layout.
allMice <- allMice %>% 
  ps_mutate(
    cage_number = stringr::str_extract(cage, "[1-9]"),
    cage_letter = stringr::str_extract(cage, "[A-Z]")
  )

# Check the grid layout: plot the virus infected, vehicle control group
allMice %>% 
  ps_filter(virus == "WNV2000", treatment == "Vehicle") %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 13, x = 'Days',
    bar_width = 0.8, merge_other = FALSE
  ) +
  facet_grid(cols = vars(cage_number), rows = vars(cage_letter))

Registered S3 method overwritten by 'seriation':
  method         from 
  reorder.hclust vegan

Code to create a big plot of all samples from WNV-exposed mice

# Create a list of identically themed and coloured plots with the group_by argument.
plotList <- allMice %>% 
  ps_mutate(group = paste(treatment, virus)) %>% 
  comp_barplot(tax_level = "Family", n_taxa = 13, x = "Days", group_by = "group")

# Arrange the WNV-exposed mice plots with patchwork, in 4 rows
plotsWNV <- wrap_plots(plotList[grep(x = names(plotList), pattern = "WNV2000")]) 
plotsWNV <- plotsWNV + plot_layout(guides = "collect", nrow = 4, heights = c(2,2,3,2))

# Add faceting to all the plots in the list, with patchwork's `&` operator
plotsWNV <- plotsWNV & facet_grid(cols = vars(cage_number), rows = vars(cage_letter))

# Write these plots to a large image file!
ggsave(plot = plotsWNV, filename = "WNVplots.pdf", height = 12, width = 8)

Exercise - write code for a plot of all control mice (no virus)

# 
# 
#

More barchart resources:

More examples/tutorial of visualizing microbiome data using stacked barcharts can be found here: https://david-barnett.github.io/microViz/articles/web-only/compositions.html

Note: Bar charts often look better when you sort the samples by similarity.

The webpage mentions using Bray-Curtis distances and hierarchical clustering to sort samples.
We haven’t discussed dissimilarity or distances yet, but we will in the next set of exercises!
For now, just appreciate that it can make it easier to see patterns in your compositions!

⚖️ Ecosystem diversity

How diverse is the bacterial microbiome of each sample?

Why is diversity interesting?

Biologically

Higher diversity ecosystems are probably more resilient to perturbations
Lower gut microbiome diversity is related to worse health in adult humans
BUT: diverse == healthy is not TRUE for all ecosystems (e.g. early infant gut microbiome)
So, consider your own data and diversity hypotheses carefully

Practically

Diversity indices provide a simple one number summary of each ecosystem
This makes it relatively easy to compare samples, and do statistical testing

Richness

The simplest richness measure is just counting, a.k.a. “Observed Richness”.
Let’s compute the observed richness and sort and label some samples.
ps_calc_richness() computes the index for each sample and adds it to your sample_data

mice %>%
  ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
  ps_calc_richness(rank = "Family", index = "observed", varname = "N_Families") %>%
  ps_arrange(N_Families) %>% 
  comp_barplot(
    tax_level = "Family", n_taxa = 14, label = "N_Families", bar_width = 0.8,
    sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
  ) +
  coord_flip()

Diversity

A true measure of ecosystem diversity will also consider the evenness of the ecosystem.
A rich ecosystem predominated by one taxon is still intuitively a less diverse ecosystem than one with an even distribution of the same number of taxa.

Code to compute Shannon Index and make a barplot

mice %>%
  ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
  ps_calc_diversity(rank = "Family", index = "shannon", varname = "Shannon_Family") %>%
  ps_arrange(Shannon_Family) %>% 
  ps_mutate(Shannon = formatC(Shannon_Family, digits = 2, format = "f")) %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 14, label = "Shannon", bar_width = 0.8,
    sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
  ) +
  coord_flip()

The Shannon index is a commonly used diversity measure, with this formula: \(H = -\sum_{i=1}^Np_i\ln p_i\)
Shannon index is often is denoted with \(H\), and here \(p\) denotes proportion of the \(i\)’th taxon.
The Shannon index for each of these samples is shown to the left of each bar

Explanation of the Shannon index formula

For each taxon \(i\), you multiply its proportional abundance \(p_i\) by the natural log of that proportion \(\ln p_i\), and sum those values.
Try it out for yourself to convince yourself you get larger (negative) values for higher proportions.
The highest value you can achieve with \(N\) taxa occurs with equal proportions
e.g. with 20 taxa, maximum diversity occurs if each has a proportion of 5%
Lastly, you change the sign of the result to a positive number, for ease of interpretation
This just makes more intuitive sense: as higher positive numbers indicates higher diversity.

Code to compute Exponential Shannon and make a barplot

mice %>%
  ps_filter(treatment %in% c("Vehicle", "AmpMetro")) %>%
  ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
  ps_arrange(Exp_Shannon) %>% 
  ps_mutate(Exp_Shannon = formatC(Exp_Shannon, digits = 2, format = "f")) %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 14, label = "Exp_Shannon", bar_width = 0.8,
    sample_order = "asis", merge_other = FALSE, facet_by = "treatment"
  ) +
  coord_flip()

The exponent of the Shannon index \(e^H\) represents the number of taxa that would be present in an evenly abundant ecosystem with the same Shannon index.
Now each sample is labelled with \(e^H\) - do you see a more obvious relationship to sample composition?

Explanation of the Effective Shannon Diversity

The numeric value of the Shannon index itself has no intuitive meaning.
You can compare them, but can’t easily interpret any one number.
So, the concept of “effective numbers” of taxa is useful here.
If your original ecosystem was actually perfectly even, then \(e^H = N\)
Where N is the observed richness
The more uneven an ecosystem, the further \(e^H\) will be from \(N\)

Comparing Diversity

Let us now compare the diversity of our samples.
Perhaps we hypothesise that the mouse gut diversity differs by type of antibiotic treatment.

miceShannonDf <- mice %>%
  ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
  samdat_tbl()

miceShannonDf %>%
  ggplot(aes(y = Exp_Shannon, x = treatment, color = treatment)) +
  geom_boxplot(alpha = 0.5, width = 0.5) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.5, size = 2) +
  scale_color_brewer(palette = "Set1", guide = NULL) +
  labs(y = "Effective Shannon diversity (Family)", x = NULL) +
  theme_bw()

The mice exposed to Ampicillin (± Metronidazole) appear to have higher diversity at this day 7 sample!

# A simple statistical test supports this assertion
miceShannonDf %>% 
  mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) %>%  
  wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = .)


    Wilcoxon rank sum exact test

data:  Exp_Shannon by Ampicillin
W = 59, p-value = 2.919e-06
alternative hypothesis: true location shift is not equal to 0

Notes on R pipe syntax

As you probably know by now, %>% pipes the result from one function into the first argument of the next function.
This works fine for most functions, but sometimes you need to pipe the result into the second or third argument, not the first!
This is possible, you must include a . as a placeholder to tell the pipe where to put the result.

miceShannonDf %>% 
  mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) %>%  
  wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = .)
#                                 this dot is important! ^

Since R version 4, you can also do this with the “native” pipe |>, but you must use a _ as the placeholder, instead of a .

miceShannonDf |>
  mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) |>  
  wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = _)
#                                   this _ is important! ^

If you don’t like pipes, you must make intermediate objects, like this.

dataForWilcoxTest <- mutate(
  .data = miceShannonDf, 
  Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)
)  

wilcox.test(formula = Exp_Shannon ~ Ampicillin, data = dataForWilcoxTest)


    Wilcoxon rank sum exact test

data:  Exp_Shannon by Ampicillin
W = 59, p-value = 2.919e-06
alternative hypothesis: true location shift is not equal to 0

Use whichever you prefer and feel is most organised and readable! 😎

You can apply more complex statistical tests if you like
e.g. adjusting for covariates with linear regression, using lm()

allMice %>%
  tax_fix(verbose = FALSE) %>% 
  ps_filter(treatment_days == "D7") %>%
  ps_calc_diversity(rank = "Family", index = "shannon", exp = TRUE, varname = "Exp_Shannon") %>%
  samdat_tbl() %>%
  mutate(Ampicillin = if_else(grepl(pattern = "Amp", x = treatment), 1, 0)) %>% 
  lm(formula = Exp_Shannon ~ Ampicillin + virus, data = .) %>%
  summary()


Call:
lm(formula = Exp_Shannon ~ Ampicillin + virus, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0571 -0.5624  0.1993  1.0744  4.2906 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.9133     0.4977   3.844 0.000287 ***
Ampicillin     2.9685     0.4844   6.128 6.76e-08 ***
virusWNV2000   1.1967     0.5233   2.287 0.025624 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.944 on 62 degrees of freedom
Multiple R-squared:  0.417, Adjusted R-squared:  0.3982 
F-statistic: 22.17 on 2 and 62 DF,  p-value: 5.454e-08

Final activities for part 1:

Try making other diversity comparisons for yourself?
Consider comparing ASV-level diversity?
Does virus exposure influence diversity?
What about comparing richness?
Practice making plots and doing simple statistical tests.

#
#
#

⌚ Break Time! ☕️ ️

You want more about diversity?

Click here for additional notes on richness and readcount

Simple measures like Observed Richness are sensitive to what ecologists call “sampling effort”.
For macroecologists, this is actually how much time/effort you spent trying to count all the organisms present in an ecosystem.
In our case, the amount of total reads obtained represents the sampling effort: more reads, more effort.
Indeed we can see that the samples with a much lower readcount have lower observed richness.
Furthermore, as this richness estimate is based on a sample, and not the actual ecosystem, the richness estimate actually has quantifiable uncertainty too.

mice %>%
  ps_filter(treatment == "Amp") %>%
  ps_calc_richness(rank = "Family", index = "observed", varname = "N families") %>%
  comp_barplot(
    tax_level = "Family", n_taxa = 12, label = "N families", bar_width = 0.7,
    sample_order = "asis", merge_other = FALSE, tax_transform_for_plot = "identity"
  )

mice %>%
  tax_names2rank("ASV") %>% 
  ps_calc_richness(rank = "ASV", index = "observed", varname = "ASVs") %>%
  ps_mutate(readcount = sample_sums(mice)) %>%
  samdat_tbl() %>%
  ggplot(aes(readcount, ASVs)) +
  geom_point(alpha = 0.4, size = 2.5) +
  theme_bw(14)

What to do?

Simple solution: Ignore the problem. Whilst you can’t interpret the richness of any individual sample as being correct, it is still usually valid to compare richness across groups of samples, as the readcount variation is only random noise, and should be uncorrelated with your grouping variable (but do check this).
Harder solution: Explore more rigorous methods like breakaway by Amy Willis and team. https://www.frontiersin.org/articles/10.3389/fmicb.2019.02407/full

Click here for an extension activity with an IBD dataset.

This is an extension exercise, for anyone who is moving very quickly.

Inflammatory Bowel Disease study

ibd <- ibd %>% 
  tax_mutate(Species = NULL) %>% # Species column is blank -> deleted it
  ps_mutate(disease = ibd == "ibd", ibd = NULL) # adds disease state indicator variable

ibd is another phyloseq object containing 16S rRNA gene amplicon sequencing data
It is from a 2012 study of Inflammatory Bowel Disease in children and young adults
It’s “old” data: they used 454 Pyrosequencing, and clustered the raw sequences into “OTUs”
Have a look at the data, like we did before for the mice dataset

ibd

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 36349 taxa and 91 samples ]
sample_data() Sample Data:       [ 91 samples by 15 sample variables ]
tax_table()   Taxonomy Table:    [ 36349 taxa by 6 taxonomic ranks ]

#
#
#

You can perform alpha diversity analysis:
Try comparing the alpha diversity of the IBD patients against the healthy controls.

#
#
#

Session info

session_info records your package versions etc. This is useful for debugging / reproducing analysis.

Long session_info output

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS Sequoia 15.1.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-12-17
 pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package          * version  date (UTC) lib source
 ade4               1.7-22   2023-02-06 [1] CRAN (R 4.4.0)
 ape                5.8      2024-04-11 [1] CRAN (R 4.4.0)
 Biobase            2.64.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 BiocGenerics       0.50.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 biomformat         1.32.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 Biostrings         2.72.1   2024-06-02 [1] RSPM (R 4.4.0)
 ca                 0.71.1   2020-01-24 [1] CRAN (R 4.4.0)
 cli                3.6.3    2024-06-21 [1] CRAN (R 4.4.0)
 cluster            2.1.8    2024-12-11 [1] CRAN (R 4.4.1)
 codetools          0.2-20   2024-03-31 [1] CRAN (R 4.4.2)
 colorspace         2.1-1    2024-07-26 [1] RSPM (R 4.4.0)
 crayon             1.5.3    2024-06-20 [1] CRAN (R 4.4.0)
 data.table         1.16.4   2024-12-06 [1] CRAN (R 4.4.1)
 digest             0.6.37   2024-08-19 [1] CRAN (R 4.4.1)
 dplyr            * 1.1.4    2023-11-17 [1] CRAN (R 4.4.0)
 evaluate           1.0.1    2024-10-10 [1] CRAN (R 4.4.1)
 fansi              1.0.6    2023-12-08 [1] CRAN (R 4.4.0)
 farver             2.1.2    2024-05-13 [1] CRAN (R 4.4.0)
 fastmap            1.2.0    2024-05-15 [1] CRAN (R 4.4.0)
 forcats          * 1.0.0    2023-01-29 [1] CRAN (R 4.4.0)
 foreach            1.5.2    2022-02-02 [1] CRAN (R 4.4.0)
 generics           0.1.3    2022-07-05 [1] CRAN (R 4.4.0)
 GenomeInfoDb       1.40.1   2024-05-24 [1] Bioconductor 3.19 (R 4.4.0)
 GenomeInfoDbData   1.2.12   2024-05-26 [1] Bioconductor
 ggplot2          * 3.5.1    2024-04-23 [1] CRAN (R 4.4.0)
 glue               1.8.0    2024-09-30 [1] RSPM (R 4.4.0)
 gtable             0.3.6    2024-10-25 [1] CRAN (R 4.4.1)
 here               1.0.1    2020-12-13 [1] CRAN (R 4.4.0)
 hms                1.1.3    2023-03-21 [1] CRAN (R 4.4.0)
 htmltools          0.5.8.1  2024-04-04 [1] CRAN (R 4.4.0)
 htmlwidgets        1.6.4    2023-12-06 [1] CRAN (R 4.4.0)
 httpuv             1.6.15   2024-03-26 [1] CRAN (R 4.4.0)
 httr               1.4.7    2023-08-15 [1] CRAN (R 4.4.0)
 igraph             2.1.2    2024-12-07 [1] CRAN (R 4.4.1)
 IRanges            2.38.1   2024-07-03 [1] RSPM (R 4.4.0)
 iterators          1.0.14   2022-02-05 [1] CRAN (R 4.4.0)
 jsonlite           1.8.9    2024-09-20 [1] CRAN (R 4.4.1)
 knitr              1.49     2024-11-08 [1] CRAN (R 4.4.1)
 labeling           0.4.3    2023-08-29 [1] CRAN (R 4.4.0)
 later              1.4.1    2024-11-27 [1] CRAN (R 4.4.1)
 lattice            0.22-6   2024-03-20 [1] CRAN (R 4.4.2)
 lifecycle          1.0.4    2023-11-07 [1] CRAN (R 4.4.0)
 lubridate        * 1.9.4    2024-12-08 [1] CRAN (R 4.4.1)
 magrittr           2.0.3    2022-03-30 [1] CRAN (R 4.4.0)
 MASS               7.3-61   2024-06-13 [1] CRAN (R 4.4.2)
 Matrix             1.7-1    2024-10-18 [1] CRAN (R 4.4.2)
 mgcv               1.9-1    2023-12-21 [1] CRAN (R 4.4.2)
 microbiome         1.26.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 microViz         * 0.12.6   2024-12-16 [1] https://david-barnett.r-universe.dev (R 4.4.2)
 mime               0.12     2021-09-28 [1] CRAN (R 4.4.0)
 multtest           2.60.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 munsell            0.5.1    2024-04-01 [1] CRAN (R 4.4.0)
 nlme               3.1-166  2024-08-14 [1] CRAN (R 4.4.2)
 patchwork        * 1.3.0    2024-09-16 [1] CRAN (R 4.4.1)
 permute            0.9-7    2022-01-27 [1] CRAN (R 4.4.0)
 phyloseq         * 1.48.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 pillar             1.9.0    2023-03-22 [1] CRAN (R 4.4.0)
 pkgconfig          2.0.3    2019-09-22 [1] CRAN (R 4.4.0)
 plyr               1.8.9    2023-10-02 [1] CRAN (R 4.4.0)
 promises           1.3.2    2024-11-28 [1] CRAN (R 4.4.1)
 purrr            * 1.0.2    2023-08-10 [1] CRAN (R 4.4.0)
 R6                 2.5.1    2021-08-19 [1] CRAN (R 4.4.0)
 RColorBrewer       1.1-3    2022-04-03 [1] CRAN (R 4.4.0)
 Rcpp               1.0.13-1 2024-11-02 [1] CRAN (R 4.4.1)
 readr            * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
 registry           0.5-1    2019-03-05 [1] CRAN (R 4.4.0)
 reshape2           1.4.4    2020-04-09 [1] CRAN (R 4.4.0)
 rhdf5              2.48.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 rhdf5filters       1.16.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 Rhdf5lib           1.26.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 rlang              1.1.4    2024-06-04 [1] CRAN (R 4.4.0)
 rmarkdown          2.29     2024-11-04 [1] CRAN (R 4.4.1)
 rprojroot          2.0.4    2023-11-05 [1] CRAN (R 4.4.0)
 rstudioapi         0.17.1   2024-10-22 [1] CRAN (R 4.4.1)
 Rtsne              0.17     2023-12-07 [1] CRAN (R 4.4.0)
 S4Vectors          0.42.1   2024-07-03 [1] RSPM (R 4.4.0)
 scales             1.3.0    2023-11-28 [1] CRAN (R 4.4.0)
 seriation          1.5.7    2024-12-05 [1] CRAN (R 4.4.1)
 sessioninfo        1.2.2    2021-12-06 [1] CRAN (R 4.4.0)
 shiny            * 1.10.0   2024-12-14 [1] CRAN (R 4.4.1)
 stringi            1.8.4    2024-05-06 [1] CRAN (R 4.4.0)
 stringr          * 1.5.1    2023-11-14 [1] CRAN (R 4.4.0)
 survival           3.7-0    2024-06-05 [1] CRAN (R 4.4.2)
 tibble           * 3.2.1    2023-03-20 [1] CRAN (R 4.4.0)
 tidyr            * 1.3.1    2024-01-24 [1] CRAN (R 4.4.0)
 tidyselect         1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
 tidyverse        * 2.0.0    2023-02-22 [1] CRAN (R 4.4.0)
 timechange         0.3.0    2024-01-18 [1] CRAN (R 4.4.0)
 TSP                1.2-4    2023-04-04 [1] CRAN (R 4.4.0)
 tzdb               0.4.0    2023-05-12 [1] CRAN (R 4.4.0)
 UCSC.utils         1.0.0    2024-05-06 [1] Bioconductor 3.19 (R 4.4.0)
 utf8               1.2.4    2023-10-22 [1] CRAN (R 4.4.0)
 vctrs              0.6.5    2023-12-01 [1] CRAN (R 4.4.0)
 vegan              2.7-0    2024-12-01 [1] https://vegandevs.r-universe.dev (R 4.4.2)
 withr              3.0.2    2024-10-28 [1] CRAN (R 4.4.1)
 xfun               0.49     2024-10-31 [1] CRAN (R 4.4.1)
 xtable             1.8-4    2019-04-21 [1] CRAN (R 4.4.0)
 XVector            0.44.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 yaml               2.3.10   2024-07-26 [1] RSPM (R 4.4.0)
 zlibbioc           1.50.0   2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)

 [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────