Use tax_fix() on your phyloseq data with default arguments to repair most tax_table problems (missing or uninformative values). If you still encounter errors using e.g. tax_agg, try using the Shiny app tax_fix_interactive() to help you generate tax_fix code that will fix your particular tax_table problems.
This article explains some of the common problems that can occur in your phyloseq tax_table, and that can cause functions such as tax_agg to fail. You can fix these problems with the help of tax_fix and tax_fix_interactive.
Let’s look at some example data from the corncob package:
pseq <- microViz::ibd
pseq
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 36349 taxa and 91 samples ]
#> sample_data() Sample Data: [ 91 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 36349 taxa by 7 taxonomic ranks ]
The Species rank appears to be blank for many entries. This is a problem you may well encounter in your data: unique sequences or OTUs often cannot be annotated at lower taxonomic ranks.
tax_table(pseq)[40:54, 4:7] # highest 3 ranks not shown, to save space
#> Taxonomy Table: [15 taxa by 4 taxonomic ranks]:
#> Order Family Genus Species
#> OTU.40 "Enterobacteriales" "Enterobacteriaceae" "Escherichia/Shigella" ""
#> OTU.41 "Coriobacteriales" "Coriobacteriaceae" "Gordonibacter" ""
#> OTU.42 "Clostridiales" "Ruminococcaceae" "Faecalibacterium" ""
#> OTU.43 "Clostridiales" "Ruminococcaceae" "" ""
#> OTU.44 "Bacteroidales" "Prevotellaceae" "Prevotella" ""
#> OTU.45 "Bacteroidales" "Bacteroidaceae" "Bacteroides" ""
#> OTU.46 "Enterobacteriales" "Enterobacteriaceae" "Klebsiella" ""
#> OTU.47 "Bacteroidales" "Prevotellaceae" "Prevotella" ""
#> OTU.48 "Clostridiales" "Lachnospiraceae" "Blautia" ""
#> OTU.49 "Clostridiales" "Ruminococcaceae" "Faecalibacterium" ""
#> OTU.50 "Enterobacteriales" "Enterobacteriaceae" "Escherichia/Shigella" ""
#> OTU.51 "Bacteroidales" "Prevotellaceae" "Prevotella" ""
#> OTU.52 "Clostridiales" "Lachnospiraceae" "Clostridium_XlVa" ""
#> OTU.53 "Bacteroidales" "Prevotellaceae" "Prevotella" ""
#> OTU.54 "Clostridiales" "Lachnospiraceae" "" ""
If we try to aggregate at the Genus or Family rank, we discover that blank values at these ranks prevent taxonomic aggregation. This is because, for example, OTU.43 and OTU.54 appear to share the same (empty) Genus name, “”, despite being different at a higher rank, Family.
# tax_agg(pseq, rank = "Family") # this fails, and sends (helpful) messages about taxa problems
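To make the conflict concrete, here is a minimal base R sketch. It uses a small hand-made character matrix `tt` as a stand-in for a tax_table (an assumption for illustration; this is not how tax_agg itself performs its checks) and looks for Genus values that occur under more than one Family.

```r
# Toy stand-in for a tax_table, using four of the rows shown above
tt <- matrix(
  c(
    "Ruminococcaceae", "Faecalibacterium",
    "Ruminococcaceae", "",
    "Prevotellaceae",  "Prevotella",
    "Lachnospiraceae", ""
  ),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("OTU.42", "OTU.43", "OTU.44", "OTU.54"),
    c("Family", "Genus")
  )
)

# Keep one row per distinct Family + Genus combination
pairs <- unique(tt[, c("Family", "Genus")])

# A Genus value that still appears more than once sits under several Families,
# so it cannot be aggregated unambiguously
ambiguous <- unique(pairs[duplicated(pairs[, "Genus"]), "Genus"])
print(ambiguous)
```

In this toy table the only ambiguous Genus value is the empty string, which is the kind of entry tax_fix is designed to replace.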
So we should run tax_fix first, which will fix most problems with default settings, allowing taxa to be aggregated successfully (at any rank). If you still have errors when using tax_agg after tax_fix, carefully read the error and accompanying messages. Often you can copy suggested tax_fix code from the tax_agg error. You should generally also have a look around your tax_table for other uninformative values, using tax_fix_interactive.
pseq %>%
  tax_fix() %>%
  tax_agg(rank = "Family")
#> psExtra object - a phyloseq object with extra slots:
#>
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 59 taxa and 91 samples ]
#> sample_data() Sample Data: [ 91 samples by 15 sample variables ]
#> tax_table() Taxonomy Table: [ 59 taxa by 5 taxonomic ranks ]
#>
#> psExtra info:
#> tax_agg = "Family"
tax_fix searches all the ranks of the phyloseq object tax_table for:
- short values, like “g__”, “”, “ ”, etc. (any with fewer characters than min_length)
- common, longer but uninformative values like “unknown” (see full list at ?tax_fix)
- NAs
tax_fix replaces these values with the name of the next higher taxonomic rank, followed by that rank's name, e.g. an “unknown” Family within the Order Clostridiales will be renamed “Clostridiales Order”, as seen below.
pseq %>%
  tax_fix(min_length = 4) %>%
  tax_agg("Family") %>%
  # ps_get() %>% # needed in older versions of microViz (< 0.10.0)
  tax_table() %>%
  .[1:8, 3:5] # removes the first 2 ranks and shows only first 8 rows for nice printing
#> Taxonomy Table: [8 taxa by 3 taxonomic ranks]:
#> Class Order Family
#> Ruminococcaceae "Clostridia" "Clostridiales" "Ruminococcaceae"
#> Bacteroidaceae "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
#> Prevotellaceae "Bacteroidia" "Bacteroidales" "Prevotellaceae"
#> Lachnospiraceae "Clostridia" "Clostridiales" "Lachnospiraceae"
#> Veillonellaceae "Negativicutes" "Selenomonadales" "Veillonellaceae"
#> Clostridiales Order "Clostridia" "Clostridiales" "Clostridiales Order"
#> Peptostreptococcaceae "Clostridia" "Clostridiales" "Peptostreptococcaceae"
#> Porphyromonadaceae "Bacteroidia" "Bacteroidales" "Porphyromonadaceae"
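The replacement rule can be sketched in a few lines of base R. The helper `fill_unknown_ranks()` and its shortened `unknowns` list are hypothetical, written only to illustrate the idea for a single taxon; this is not the microViz implementation, and the real list of uninformative values is documented at ?tax_fix.

```r
# Hypothetical helper (not microViz code): replace uninformative entries in one
# taxon's ranks with "<nearest classified higher rank value> <that rank's name>"
fill_unknown_ranks <- function(row, min_length = 4,
                               unknowns = c("unknown", "unclassified")) {
  last_good <- NULL # name of the most recent rank with an informative value
  for (rank in names(row)) {
    value <- row[[rank]]
    bad <- is.na(value) || nchar(value) < min_length || tolower(value) %in% unknowns
    if (bad && !is.null(last_good)) {
      row[[rank]] <- paste(row[[last_good]], last_good)
    } else if (!bad) {
      last_good <- rank
    }
  }
  row
}

# One taxon, ordered from higher to lower rank (made-up example values)
taxon <- c(Class = "Clostridia", Order = "Clostridiales", Family = "unknown", Genus = "g__")
fixed <- fill_unknown_ranks(taxon)
print(fixed)
```

Here both the “unknown” Family and the too-short Genus “g__” are filled from the nearest informative rank above them, Order. An uninformative value at the very top rank is left untouched in this sketch, because there is nothing above it to borrow from.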
You can use tax_fix_interactive() to explore your data’s tax_table visually, and interactively find and fix problematic entries. You can then copy your automagically personalised tax_fix code from the output of tax_fix_interactive, to paste into your script. Below is a screen capture video of tax_fix_interactive in action, using some other artificially mangled example data (see details at ?tax_fix_interactive).
tax_fix_interactive(example_data)
1. Completely unclassified taxa (taxa where all values in the tax_table row are either too short or listed in the unknowns argument) will be replaced at all ranks with their unique row name by default (or alternatively with a generic name of “unclassified [highest rank]”, which is useful if you want to aggregate all the unclassified sequences together with tax_agg()). You can also rename a taxon manually, e.g.
taxa_names(enterotype)[1] <- "unclassified taxon"
or give all taxa completely different names with tax_name().
2. The same tax_table entry repeated across multiple ranks: This is a problem for functions like taxatree_plots(), which need distinct entries at each rank to build the tree structure for plotting. This might happen after you tax_fix data with problem 1 of this list, or in data from e.g. microarray methods like HITChip. The solution is to use tax_prepend_ranks() (after tax_fix) to add the first character of the rank to all tax_table entries (you will also need to set the tax_fix argument suffix_rank = “current”).
3. Informative but non-unique tax_table entries: e.g. you don’t want to delete/replace a genus name completely, but it is shared by two families and is thus blocking tax_agg. The solution is to rename (one of) these values manually to make them distinct.
tax_table(yourPhyloseq)["targetTaxonName", "targetRank"] <- "newBetterGenusName"
See tax_name() for an easy way to rename all your taxa.
Sequences that are unclassified at fairly high ranks, e.g. Class, are often very low abundance (or possibly represent sequencing errors/chimeras) if your data come from an environment that is typically well represented in reference databases. So if you are struggling with what to do with unclassified taxa, consider whether you can simply remove them first using tax_filter() (perhaps with fairly relaxed filtering criteria, such as a min_prevalence of 2 samples or a min_total_abundance of 1000 reads, keeping the tax_level argument as NA so that no aggregation is attempted before filtering).
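To show what such relaxed criteria mean in practice, here is a minimal base R sketch on a made-up count matrix. It illustrates the prevalence and total-abundance thresholds themselves, not the internals of tax_filter(), and applying both thresholds together is a choice made for this example.

```r
# Toy count matrix: taxa in rows, samples in columns (made-up numbers)
counts <- matrix(
  c(
    500, 800,   0, 300, # taxonA: prevalent and abundant
      0,   3,   0,   0, # taxonB: detected in one sample only
     10,   0,  20,   5  # taxonC: prevalent but rare
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), paste0("sample", 1:4))
)

prevalence <- rowSums(counts > 0)  # number of samples in which each taxon is detected
total_abundance <- rowSums(counts) # total reads per taxon across all samples

# Keep taxa detected in at least 2 samples AND with at least 1000 reads in total
keep <- prevalence >= 2 & total_abundance >= 1000
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

Only taxonA passes both thresholds here. With real data you would pass the equivalent thresholds to tax_filter() rather than filtering the count matrix by hand, so that the otu_table and tax_table stay in sync.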
microbiome::aggregate_taxa() also solves some tax_table problems, e.g. where multiple distinct genera converge again to the same species name like “” or “s__”, it will make unique taxa names by pasting together all of the rank names. However, this can produce some very long names, which need to be manually shortened before use in plots. It also doesn’t replace names like “s__” if they only occur once. Moreover, when creating ordination plots with microViz, only tax_agg() will record the aggregation level for provenance tracking and automated plot captioning.
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2024-12-16
#> pandoc 3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ade4 1.7-22 2023-02-06 [1] RSPM
#> ape 5.8 2024-04-11 [1] RSPM
#> Biobase 2.66.0 2024-10-29 [1] Bioconduc~
#> BiocGenerics 0.52.0 2024-10-29 [1] Bioconduc~
#> biomformat 1.34.0 2024-10-29 [1] Bioconduc~
#> Biostrings 2.74.0 2024-10-29 [1] Bioconduc~
#> bslib 0.8.0 2024-07-29 [1] RSPM
#> cachem 1.1.0 2024-05-16 [1] RSPM
#> cli 3.6.3 2024-06-21 [1] RSPM
#> cluster 2.1.6 2023-12-01 [3] CRAN (R 4.4.2)
#> codetools 0.2-20 2024-03-31 [3] CRAN (R 4.4.2)
#> colorspace 2.1-1 2024-07-26 [1] RSPM
#> crayon 1.5.3 2024-06-20 [1] RSPM
#> data.table 1.16.4 2024-12-06 [1] RSPM
#> desc 1.4.3 2023-12-10 [1] RSPM
#> devtools 2.4.5 2022-10-11 [1] RSPM
#> digest 0.6.37 2024-08-19 [1] RSPM
#> dplyr 1.1.4 2023-11-17 [1] RSPM
#> ellipsis 0.3.2 2021-04-29 [1] RSPM
#> evaluate 1.0.1 2024-10-10 [1] RSPM
#> fansi 1.0.6 2023-12-08 [1] RSPM
#> fastmap 1.2.0 2024-05-15 [1] RSPM
#> foreach 1.5.2 2022-02-02 [1] RSPM
#> fs 1.6.5 2024-10-30 [1] RSPM
#> generics 0.1.3 2022-07-05 [1] RSPM
#> GenomeInfoDb 1.42.1 2024-11-28 [1] Bioconduc~
#> GenomeInfoDbData 1.2.13 2024-12-16 [1] Bioconductor
#> ggplot2 3.5.1 2024-04-23 [1] RSPM
#> glue 1.8.0 2024-09-30 [1] RSPM
#> gtable 0.3.6 2024-10-25 [1] RSPM
#> htmltools 0.5.8.1 2024-04-04 [1] RSPM
#> htmlwidgets 1.6.4 2023-12-06 [1] RSPM
#> httpuv 1.6.15 2024-03-26 [1] RSPM
#> httr 1.4.7 2023-08-15 [1] RSPM
#> igraph 2.1.2 2024-12-07 [1] RSPM
#> IRanges 2.40.1 2024-12-05 [1] Bioconduc~
#> iterators 1.0.14 2022-02-05 [1] RSPM
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 1.8.9 2024-09-20 [1] RSPM
#> knitr 1.49 2024-11-08 [1] RSPM
#> later 1.4.1 2024-11-27 [1] RSPM
#> lattice 0.22-6 2024-03-20 [3] CRAN (R 4.4.2)
#> lifecycle 1.0.4 2023-11-07 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> MASS 7.3-61 2024-06-13 [3] CRAN (R 4.4.2)
#> Matrix 1.7-1 2024-10-18 [3] CRAN (R 4.4.2)
#> memoise 2.0.1 2021-11-26 [1] RSPM
#> mgcv 1.9-1 2023-12-21 [3] CRAN (R 4.4.2)
#> microViz * 0.12.6 2024-12-16 [1] local
#> mime 0.12 2021-09-28 [1] RSPM
#> miniUI 0.1.1.1 2018-05-18 [1] RSPM
#> multtest 2.62.0 2024-10-29 [1] Bioconduc~
#> munsell 0.5.1 2024-04-01 [1] RSPM
#> nlme 3.1-166 2024-08-14 [3] CRAN (R 4.4.2)
#> permute 0.9-7 2022-01-27 [1] RSPM
#> phyloseq * 1.50.0 2024-10-29 [1] Bioconduc~
#> pillar 1.9.0 2023-03-22 [1] RSPM
#> pkgbuild 1.4.5 2024-10-28 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.1.1 2024-09-17 [1] RSPM
#> pkgload 1.4.0 2024-06-28 [1] RSPM
#> plyr 1.8.9 2023-10-02 [1] RSPM
#> profvis 0.4.0 2024-09-20 [1] RSPM
#> promises 1.3.2 2024-11-28 [1] RSPM
#> purrr 1.0.2 2023-08-10 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> ragg 1.3.3 2024-09-11 [1] RSPM
#> Rcpp 1.0.13-1 2024-11-02 [1] RSPM
#> remotes 2.5.0 2024-03-17 [1] RSPM
#> reshape2 1.4.4 2020-04-09 [1] RSPM
#> rhdf5 2.50.1 2024-12-09 [1] Bioconduc~
#> rhdf5filters 1.18.0 2024-10-29 [1] Bioconduc~
#> Rhdf5lib 1.28.0 2024-10-29 [1] Bioconduc~
#> rlang 1.1.4 2024-06-04 [1] RSPM
#> rmarkdown 2.29 2024-11-04 [1] RSPM
#> S4Vectors 0.44.0 2024-10-29 [1] Bioconduc~
#> sass 0.4.9 2024-03-15 [1] RSPM
#> scales 1.3.0 2023-11-28 [1] RSPM
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM
#> shiny 1.10.0 2024-12-14 [1] RSPM
#> stringi 1.8.4 2024-05-06 [1] RSPM
#> stringr 1.5.1 2023-11-14 [1] RSPM
#> survival 3.7-0 2024-06-05 [3] CRAN (R 4.4.2)
#> systemfonts 1.1.0 2024-05-15 [1] RSPM
#> textshaping 0.4.1 2024-12-06 [1] RSPM
#> tibble 3.2.1 2023-03-20 [1] RSPM
#> tidyselect 1.2.1 2024-03-11 [1] RSPM
#> UCSC.utils 1.2.0 2024-10-29 [1] Bioconduc~
#> urlchecker 1.0.1 2021-11-30 [1] RSPM
#> usethis 3.1.0 2024-11-26 [1] RSPM
#> utf8 1.2.4 2023-10-22 [1] RSPM
#> vctrs 0.6.5 2023-12-01 [1] RSPM
#> vegan 2.6-8 2024-08-28 [1] RSPM
#> withr 3.0.2 2024-10-28 [1] RSPM
#> xfun 0.49 2024-10-31 [1] RSPM
#> xtable 1.8-4 2019-04-21 [1] RSPM
#> XVector 0.46.0 2024-10-29 [1] Bioconduc~
#> yaml 2.3.10 2024-07-26 [1] RSPM
#> zlibbioc 1.52.0 2024-10-29 [1] Bioconduc~
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/4.4.2/lib/R/site-library
#> [3] /opt/R/4.4.2/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────