This tutorial will show you how microViz makes working with phyloseq objects easier.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(phyloseq)
library(microViz)
#> microViz version 0.12.6 - Copyright (C) 2021-2024 David Barnett
#> ! Website: https://david-barnett.github.io/microViz
#> ✔ Useful? For citation details, run: `citation("microViz")`
#> ✖ Silence? `suppressPackageStartupMessages(library(microViz))`
phyloseq objects are probably the most commonly used data format for working with microbiome data in R.
The creator of phyloseq, Paul J. McMurdie, explains the structure of phyloseq objects and how to construct them on the phyloseq website.
biom format files can be
imported to phyloseq with the import_biom
function. Such biom files are generated (or can be) from many processing
tools including QIIME 1 / 2, MetaPhlAn,
and NG-Tax.
mothur output is also directly
supported via phyloseq’s import_mothur
function.
DADA2 output can converted to phyloseq according to these DADA2 handoff instructions
Don’t worry too much about getting all of your sample metadata into
your biom file or phyloseq object at the start, as
ps_join()
makes it easy to add sample data later.
# This tutorial will just use some example data
# It is already available as a phyloseq object from the corncob package
example_ps <- microViz::ibd
example_ps
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 36349 taxa and 91 samples ]
#> sample_data() Sample Data: [ 91 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 36349 taxa by 7 taxonomic ranks ]
phyloseq checks that your sample and taxa names are consistent across
the different slots of the phyloseq object. microViz provides
phyloseq_validate()
to check for and fix other possible
problems with your phyloseq that might cause problems in later analyses.
It is recommended to run this at the start of your analyses, and fix any
problems identified.
example_ps <- phyloseq_validate(example_ps, remove_undetected = TRUE)
#> Short values detected in phyloseq tax_table (nchar<4) :
#> Consider using tax_fix() to make taxa uniquely identifiable
example_ps <- tax_fix(example_ps)
Once you have a valid phyloseq object, microViz provides several helpful functions for manipulating that object. The names and syntax of some functions will be familiar to users of dplyr, and reading the dplyr help pages may be useful for getting the most out of these functions.
Add a data.frame of metadata to the sample_data slot with
ps_join()
Compute new sample_data variables with
ps_mutate()
Subset or reorder variables in sample_data with
ps_select()
Subset samples based on sample_data variables with
ps_filter()
Reorder samples (can be useful for examining sample data or plotting) with:
ps_reorder()
: manually set sample order
ps_arrange()
: order samples using sample_data
variables
ps_seriate()
: order samples according to microbiome
similarity
Remove duplicated/repeated samples with
ps_dedupe()
Remove samples with missing values in sample_data with
ps_drop_incomplete()
Check out the examples section on each function’s reference page for extensive usage examples.
Lets look at the sample data already in our example phyloseq.
# 91 samples, with 15 sample_data variables
example_ps
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 36349 taxa and 91 samples ]
#> sample_data() Sample Data: [ 91 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 36349 taxa by 7 taxonomic ranks ]
# return sample data as a tibble for pretty printing
samdat_tbl(example_ps)
#> # A tibble: 91 × 16
#> .sample_name sample gender age DiseaseState steroids imsp abx mesalamine
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 099A 099-AX female 17 UC steroids imsp noabx nomes
#> 2 199A 199-AX female 5 nonIBD nostero… noim… noabx nomes
#> 3 062B 062-BZ male 16 CD steroids imsp noabx nomes
#> 4 194A 194-AZ male 15 UC steroids noim… noabx nomes
#> 5 166A 166-AX female 20 UC nostero… noim… abx nomes
#> 6 219A 219-AX female 18 UC nostero… imsp abx nomes
#> 7 132A 132-AX female 12 CD steroids imsp noabx nomes
#> 8 026A 026-AX male 10 UC steroids imsp abx nomes
#> 9 102A 102-AZ male 3 nonIBD nostero… noim… noabx nomes
#> 10 140A 140-AX female 5 nonIBD nostero… noim… noabx nomes
#> # ℹ 81 more rows
#> # ℹ 7 more variables: ibd <chr>, activity <chr>, active <chr>, race <chr>,
#> # fhx <chr>, imspLEVEL <chr>, SampleType <chr>
Maybe you want to only select participants who have IBD (not controls). You can do that by filtering samples based on the values of the sample data variable: ibd
example_ps %>% ps_filter(ibd == "ibd") # it is essential to use `==` , not just one `=`
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 24490 taxa and 67 samples ]
#> sample_data() Sample Data: [ 67 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 24490 taxa by 7 taxonomic ranks ]
# notice that taxa that no longer appear in the remaining 67 samples have been removed!
More complicated filtering rules can be applied. Let’s say you want female IBD patients with “mild” or “severe” activity, who are at least 13 years old.
partial_ps <- example_ps %>%
ps_filter(
gender == "female",
activity %in% c("mild", "severe"),
age >= 13
)
partial_ps
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 2600 taxa and 9 samples ]
#> sample_data() Sample Data: [ 9 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 2600 taxa by 7 taxonomic ranks ]
Let’s have a look at the sample data of these participants. We will also arrange the samples grouped by disease and in descending age order, and select only a few interesting variables to show.
partial_ps %>%
ps_arrange(DiseaseState, desc(age)) %>%
ps_select(DiseaseState, age, matches("activ"), abx) %>% # selection order is respected
samdat_tbl() # this adds the .sample_name variable
#> # A tibble: 9 × 6
#> .sample_name DiseaseState age activity active abx
#> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 195A CD 13 mild active noabx
#> 2 164A UC 21 severe active abx
#> 3 166A UC 20 severe active abx
#> 4 038A UC 19 mild active noabx
#> 5 114C UC 19 mild active noabx
#> 6 099A UC 17 severe active noabx
#> 7 009A UC 15 severe active noabx
#> 8 120D UC 14 mild active noabx
#> 9 186A UC 14 severe active abx
You can also sort sample by microbiome similarity with
ps_seriate()
.
partial_ps %>%
tax_agg("Genus") %>%
ps_seriate(dist = "bray", method = "OLO_ward") %>% # these are the defaults
comp_barplot(tax_level = "Genus", sample_order = "asis", n_taxa = 10)
#> Registered S3 method overwritten by 'seriation':
#> method from
#> reorder.hclust vegan
# note that comp_barplot with sample_order = "bray" will run
# this ps_seriate call internally, so you don't have to!
You can also arrange samples by abundance of one of more microbes
using ps_arrange()
with .target = "otu_table"
.
Arranging by taxon can only be done at the current taxonomic rank, so we
will aggregate to Genus level first.
# Arranging by decreasing Bacteroides abundance
partial_ps %>%
tax_agg("Genus") %>%
ps_arrange(desc(Bacteroides), .target = "otu_table") %>%
otu_get() %>% # get the otu table
.[, 1:6] # show only a subset of the otu_table
#> OTU Table: [6 taxa and 9 samples]
#> taxa are columns
#> Prevotella Clostridium_XlVa Hallella Veillonella Bacteroides Blautia
#> 009A 4480 462 4 250 1542 0
#> 195A 0 40 0 1 514 5
#> 166A 549 7 0 17 170 27
#> 099A 0 60 0 0 132 48
#> 186A 0 7 0 0 69 38
#> 038A 0 0 0 192 0 0
#> 114C 0 44 0 6 0 236
#> 120D 0 0 0 0 0 0
#> 164A 0 0 0 0 0 0
# Plot samples' compositions in this order
partial_ps %>%
tax_agg("Genus") %>%
ps_arrange(desc(Bacteroides), .target = "otu_table") %>%
comp_barplot(tax_level = "Genus", sample_order = "asis", n_taxa = 10)
# Notice this is sorted by bacteroides counts
# (this doesn't quite match relative abundance % due to sequencing depth variation)
ps_filter()
vs. phyloseq::subset_samples()
As well as filtering your samples, ps_filter()
might
also modify the otu_table and tax_table of the phyloseq object (unlike
phyloseq::subset_samples()
, which never does this).
Why does it do this?
If you remove many samples from your dataset, often your phyloseq object
will be left with taxa that never occur in any of the remaining samples
(i.e. total counts of zero). ps_filter()
removes those
absent taxa by default.
If you don’t want this, you can set the .keep_all_taxa
argument to TRUE
in ps_filter
.
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2024-12-16
#> pandoc 3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ade4 1.7-22 2023-02-06 [1] RSPM
#> ape 5.8 2024-04-11 [1] RSPM
#> Biobase 2.66.0 2024-10-29 [1] Bioconduc~
#> BiocGenerics 0.52.0 2024-10-29 [1] Bioconduc~
#> biomformat 1.34.0 2024-10-29 [1] Bioconduc~
#> Biostrings 2.74.0 2024-10-29 [1] Bioconduc~
#> bslib 0.8.0 2024-07-29 [1] RSPM
#> ca 0.71.1 2020-01-24 [1] RSPM
#> cachem 1.1.0 2024-05-16 [1] RSPM
#> cli 3.6.3 2024-06-21 [1] RSPM
#> cluster 2.1.6 2023-12-01 [3] CRAN (R 4.4.2)
#> codetools 0.2-20 2024-03-31 [3] CRAN (R 4.4.2)
#> colorspace 2.1-1 2024-07-26 [1] RSPM
#> crayon 1.5.3 2024-06-20 [1] RSPM
#> data.table 1.16.4 2024-12-06 [1] RSPM
#> desc 1.4.3 2023-12-10 [1] RSPM
#> devtools 2.4.5 2022-10-11 [1] RSPM
#> digest 0.6.37 2024-08-19 [1] RSPM
#> dplyr * 1.1.4 2023-11-17 [1] RSPM
#> ellipsis 0.3.2 2021-04-29 [1] RSPM
#> evaluate 1.0.1 2024-10-10 [1] RSPM
#> fansi 1.0.6 2023-12-08 [1] RSPM
#> farver 2.1.2 2024-05-13 [1] RSPM
#> fastmap 1.2.0 2024-05-15 [1] RSPM
#> foreach 1.5.2 2022-02-02 [1] RSPM
#> fs 1.6.5 2024-10-30 [1] RSPM
#> generics 0.1.3 2022-07-05 [1] RSPM
#> GenomeInfoDb 1.42.1 2024-11-28 [1] Bioconduc~
#> GenomeInfoDbData 1.2.13 2024-12-16 [1] Bioconductor
#> ggplot2 3.5.1 2024-04-23 [1] RSPM
#> glue 1.8.0 2024-09-30 [1] RSPM
#> gtable 0.3.6 2024-10-25 [1] RSPM
#> htmltools 0.5.8.1 2024-04-04 [1] RSPM
#> htmlwidgets 1.6.4 2023-12-06 [1] RSPM
#> httpuv 1.6.15 2024-03-26 [1] RSPM
#> httr 1.4.7 2023-08-15 [1] RSPM
#> igraph 2.1.2 2024-12-07 [1] RSPM
#> IRanges 2.40.1 2024-12-05 [1] Bioconduc~
#> iterators 1.0.14 2022-02-05 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 1.8.9 2024-09-20 [1] RSPM
#> knitr 1.49 2024-11-08 [1] RSPM
#> labeling 0.4.3 2023-08-29 [1] RSPM
#> later 1.4.1 2024-11-27 [1] RSPM
#> lattice 0.22-6 2024-03-20 [3] CRAN (R 4.4.2)
#> lifecycle 1.0.4 2023-11-07 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> MASS 7.3-61 2024-06-13 [3] CRAN (R 4.4.2)
#> Matrix 1.7-1 2024-10-18 [3] CRAN (R 4.4.2)
#> memoise 2.0.1 2021-11-26 [1] RSPM
#> mgcv 1.9-1 2023-12-21 [3] CRAN (R 4.4.2)
#> microbiome 1.28.0 2024-10-29 [1] Bioconduc~
#> microViz * 0.12.6 2024-12-16 [1] local
#> mime 0.12 2021-09-28 [1] RSPM
#> miniUI 0.1.1.1 2018-05-18 [1] RSPM
#> multtest 2.62.0 2024-10-29 [1] Bioconduc~
#> munsell 0.5.1 2024-04-01 [1] RSPM
#> nlme 3.1-166 2024-08-14 [3] CRAN (R 4.4.2)
#> permute 0.9-7 2022-01-27 [1] RSPM
#> phyloseq * 1.50.0 2024-10-29 [1] Bioconduc~
#> pillar 1.9.0 2023-03-22 [1] RSPM
#> pkgbuild 1.4.5 2024-10-28 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.1.1 2024-09-17 [1] RSPM
#> pkgload 1.4.0 2024-06-28 [1] RSPM
#> plyr 1.8.9 2023-10-02 [1] RSPM
#> profvis 0.4.0 2024-09-20 [1] RSPM
#> promises 1.3.2 2024-11-28 [1] RSPM
#> purrr 1.0.2 2023-08-10 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> ragg 1.3.3 2024-09-11 [1] RSPM
#> Rcpp 1.0.13-1 2024-11-02 [1] RSPM
#> registry 0.5-1 2019-03-05 [1] RSPM
#> remotes 2.5.0 2024-03-17 [1] RSPM
#> reshape2 1.4.4 2020-04-09 [1] RSPM
#> rhdf5 2.50.1 2024-12-09 [1] Bioconduc~
#> rhdf5filters 1.18.0 2024-10-29 [1] Bioconduc~
#> Rhdf5lib 1.28.0 2024-10-29 [1] Bioconduc~
#> rlang 1.1.4 2024-06-04 [1] RSPM
#> rmarkdown 2.29 2024-11-04 [1] RSPM
#> Rtsne 0.17 2023-12-07 [1] RSPM
#> S4Vectors 0.44.0 2024-10-29 [1] Bioconduc~
#> sass 0.4.9 2024-03-15 [1] RSPM
#> scales 1.3.0 2023-11-28 [1] RSPM
#> seriation 1.5.7 2024-12-05 [1] RSPM
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM
#> shiny 1.10.0 2024-12-14 [1] RSPM
#> stringi 1.8.4 2024-05-06 [1] RSPM
#> stringr 1.5.1 2023-11-14 [1] RSPM
#> survival 3.7-0 2024-06-05 [3] CRAN (R 4.4.2)
#> systemfonts 1.1.0 2024-05-15 [1] RSPM
#> textshaping 0.4.1 2024-12-06 [1] RSPM
#> tibble 3.2.1 2023-03-20 [1] RSPM
#> tidyr 1.3.1 2024-01-24 [1] RSPM
#> tidyselect 1.2.1 2024-03-11 [1] RSPM
#> TSP 1.2-4 2023-04-04 [1] RSPM
#> UCSC.utils 1.2.0 2024-10-29 [1] Bioconduc~
#> urlchecker 1.0.1 2021-11-30 [1] RSPM
#> usethis 3.1.0 2024-11-26 [1] RSPM
#> utf8 1.2.4 2023-10-22 [1] RSPM
#> vctrs 0.6.5 2023-12-01 [1] RSPM
#> vegan 2.6-8 2024-08-28 [1] RSPM
#> withr 3.0.2 2024-10-28 [1] RSPM
#> xfun 0.49 2024-10-31 [1] RSPM
#> xtable 1.8-4 2019-04-21 [1] RSPM
#> XVector 0.46.0 2024-10-29 [1] Bioconduc~
#> yaml 2.3.10 2024-07-26 [1] RSPM
#> zlibbioc 1.52.0 2024-10-29 [1] Bioconduc~
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.4.2/lib/R/site-library
#> [3] /opt/R/4.4.2/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────