Quick start / TLDR

Use tax_fix() on your phyloseq data with default arguments to repair most tax_table problems (missing or uninformative values). If you still encounter errors using e.g. tax_agg, try using the Shiny app tax_fix_interactive() to help you generate tax_fix code that will fix your particular tax_table problems.


Intro

This article explains some common problems in a phyloseq tax_table that can cause errors in functions such as tax_agg. You can fix them with the help of tax_fix and tax_fix_interactive.

Fixing problems

Let’s look at some example data, originally from the corncob package and bundled with microViz as ibd:

pseq <- microViz::ibd
pseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 36349 taxa and 91 samples ]
#> sample_data() Sample Data:       [ 91 samples by 15 sample variables ]
#> tax_table()   Taxonomy Table:    [ 36349 taxa by 7 taxonomic ranks ]

The Species rank appears to be blank for many entries. This is a problem you may well encounter in your data: unique sequences or OTUs often cannot be annotated at lower taxonomic ranks.

tax_table(pseq)[40:54, 4:7] # highest 3 ranks not shown, to save space
#> Taxonomy Table:     [15 taxa by 4 taxonomic ranks]:
#>        Order               Family               Genus                  Species
#> OTU.40 "Enterobacteriales" "Enterobacteriaceae" "Escherichia/Shigella" ""     
#> OTU.41 "Coriobacteriales"  "Coriobacteriaceae"  "Gordonibacter"        ""     
#> OTU.42 "Clostridiales"     "Ruminococcaceae"    "Faecalibacterium"     ""     
#> OTU.43 "Clostridiales"     "Ruminococcaceae"    ""                     ""     
#> OTU.44 "Bacteroidales"     "Prevotellaceae"     "Prevotella"           ""     
#> OTU.45 "Bacteroidales"     "Bacteroidaceae"     "Bacteroides"          ""     
#> OTU.46 "Enterobacteriales" "Enterobacteriaceae" "Klebsiella"           ""     
#> OTU.47 "Bacteroidales"     "Prevotellaceae"     "Prevotella"           ""     
#> OTU.48 "Clostridiales"     "Lachnospiraceae"    "Blautia"              ""     
#> OTU.49 "Clostridiales"     "Ruminococcaceae"    "Faecalibacterium"     ""     
#> OTU.50 "Enterobacteriales" "Enterobacteriaceae" "Escherichia/Shigella" ""     
#> OTU.51 "Bacteroidales"     "Prevotellaceae"     "Prevotella"           ""     
#> OTU.52 "Clostridiales"     "Lachnospiraceae"    "Clostridium_XlVa"     ""     
#> OTU.53 "Bacteroidales"     "Prevotellaceae"     "Prevotella"           ""     
#> OTU.54 "Clostridiales"     "Lachnospiraceae"    ""                     ""
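
To quantify how widespread the blanks are, you could count the empty or NA entries at each rank. The following is a minimal sketch, assuming phyloseq and microViz are installed; the object names tt and blank_counts are made up for this illustration.

```r
# Sketch only: runs when phyloseq and microViz are installed
if (requireNamespace("phyloseq", quietly = TRUE) &&
    requireNamespace("microViz", quietly = TRUE)) {
  pseq <- microViz::ibd

  # Convert the taxonomyTable to a plain character matrix
  tt <- methods::as(phyloseq::tax_table(pseq), "matrix")

  # Number of blank ("") or NA entries at each rank
  blank_counts <- colSums(tt == "" | is.na(tt))
  print(blank_counts)
}
```

Ranks with a high count are the ones most likely to cause trouble when aggregating.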

If we try to aggregate at the Genus or Family rank, we discover that blank values at these ranks prevent taxonomic aggregation. This is because, for example, OTU.43 and OTU.54 appear to share the same (empty) Genus name, "", despite differing at a higher rank, Family.

# tax_agg(pseq, rank = "Family") # this fails, and sends (helpful) messages about taxa problems
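
To see why the blanks are ambiguous, here is a minimal base R sketch on a toy character matrix (four rows copied from the printout above, not the real tax_table object). It flags any Genus value that sits under more than one Family. This only illustrates the problem and is not how tax_agg itself performs its checks.

```r
# Toy taxonomy matrix: rows are taxa, columns are ranks
tt <- matrix(
  c(
    "Clostridiales", "Ruminococcaceae", "Faecalibacterium",
    "Clostridiales", "Ruminococcaceae", "",
    "Clostridiales", "Lachnospiraceae", "Blautia",
    "Clostridiales", "Lachnospiraceae", ""
  ),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("OTU.42", "OTU.43", "OTU.48", "OTU.54"),
    c("Order", "Family", "Genus")
  )
)

# For each Genus value, count how many distinct Family values it sits under
families_per_genus <- tapply(
  tt[, "Family"], tt[, "Genus"],
  function(x) length(unique(x))
)

# Genus values shared by more than one Family cannot be aggregated unambiguously
ambiguous_genera <- names(families_per_genus)[families_per_genus > 1]
print(ambiguous_genera)
#> [1] ""
```

The empty string is the only Genus value flagged, because it occurs under both Ruminococcaceae and Lachnospiraceae.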

So we should run tax_fix first, which will fix most problems with default settings, allowing taxa to be aggregated successfully (at any rank). If you still have errors when using tax_agg after tax_fix, carefully read the error and accompanying messages. Often you can copy suggested tax_fix code from the tax_agg error. You should generally also have a look around your tax_table for other uninformative values, using tax_fix_interactive.

pseq %>%
  tax_fix() %>%
  tax_agg(rank = "Family")
#> psExtra object - a phyloseq object with extra slots:
#> 
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 59 taxa and 91 samples ]
#> sample_data() Sample Data:       [ 91 samples by 15 sample variables ]
#> tax_table()   Taxonomy Table:    [ 59 taxa by 5 taxonomic ranks ]
#> 
#> psExtra info:
#> tax_agg = "Family"

What does tax_fix do?

tax_fix searches all the ranks of the phyloseq object tax_table for:

  • short values, like "g__", "", " ", etc. (any with fewer characters than min_length)

  • common, longer but uninformative values like “unknown” (see full list at ?tax_fix)

  • NAs

tax_fix replaces these values with the entry from the next higher taxonomic rank, suffixed with that rank’s name, e.g. an “unknown” Family within the Order Clostridiales will be renamed “Clostridiales Order”, as seen below.

pseq %>%
  tax_fix(min_length = 4) %>%
  tax_agg("Family") %>%
  # ps_get() %>% # needed in older versions of microViz (< 0.10.0)
  tax_table() %>%
  .[1:8, 3:5] # removes the first 2 ranks and shows only first 8 rows for nice printing
#> Taxonomy Table:     [8 taxa by 3 taxonomic ranks]:
#>                       Class           Order             Family                 
#> Ruminococcaceae       "Clostridia"    "Clostridiales"   "Ruminococcaceae"      
#> Bacteroidaceae        "Bacteroidia"   "Bacteroidales"   "Bacteroidaceae"       
#> Prevotellaceae        "Bacteroidia"   "Bacteroidales"   "Prevotellaceae"       
#> Lachnospiraceae       "Clostridia"    "Clostridiales"   "Lachnospiraceae"      
#> Veillonellaceae       "Negativicutes" "Selenomonadales" "Veillonellaceae"      
#> Clostridiales Order   "Clostridia"    "Clostridiales"   "Clostridiales Order"  
#> Peptostreptococcaceae "Clostridia"    "Clostridiales"   "Peptostreptococcaceae"
#> Porphyromonadaceae    "Bacteroidia"   "Bacteroidales"   "Porphyromonadaceae"
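
The replacement rule can also be sketched in base R. The function below, toy_tax_fix, is a hypothetical toy re-implementation written for this illustration and is not microViz’s actual code. It assumes a plain character matrix with ranks ordered from highest to lowest, and it leaves completely unclassified rows untouched.

```r
# Toy re-implementation of the replacement rule (NOT microViz's actual code)
toy_tax_fix <- function(tt, min_length = 4, unknowns = c("unknown")) {
  ranks <- colnames(tt)
  for (i in seq_len(nrow(tt))) {
    last_good <- NA_character_ # most recent informative value in this row
    last_rank <- NA_character_ # rank at which that value was found
    for (j in seq_along(ranks)) {
      value <- tt[i, j]
      bad <- is.na(value) || nchar(value) < min_length || value %in% unknowns
      if (!bad) {
        last_good <- value
        last_rank <- ranks[j]
      } else if (!is.na(last_good)) {
        # Fill with the nearest informative higher rank plus its rank name
        tt[i, j] <- paste(last_good, last_rank)
      }
    }
  }
  tt
}

tt <- matrix(
  c(
    "Clostridiales", "Ruminococcaceae", "Faecalibacterium",
    "Clostridiales", "unknown", "",
    "Bacteroidales", "Prevotellaceae", NA
  ),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("OTU.a", "OTU.b", "OTU.c"), c("Order", "Family", "Genus"))
)

fixed <- toy_tax_fix(tt)
print(fixed)
```

In this toy data, the “unknown” Family and blank Genus of OTU.b both become “Clostridiales Order”, and the NA Genus of OTU.c becomes “Prevotellaceae Family”.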

Interactive solutions

You can use tax_fix_interactive() to explore your data’s tax_table visually, and interactively find and fix problematic entries. You can then copy your automagically personalised tax_fix code from tax_fix_interactive’s output, to paste into your script. Below is a screen capture video of tax_fix_interactive in action, using some other artificially mangled example data (see details at ?tax_fix_interactive).

tax_fix_interactive(example_data)

Other possible problems

  1. Completely unclassified taxa (i.e. taxa where all values in their tax_table row are either too short or listed in the unknowns argument) will be replaced at all ranks with their unique row name by default (or alternatively with a generic name of “unclassified [highest rank]”, which is useful if you want to aggregate all the unclassified sequences together with tax_agg()).
  2. Unclassified taxa that also have short / unknown row names, e.g. the unclassified taxon called “-1” in the example “enterotype” dataset from phyloseq. If something like this happens in your data, rename the taxa manually (e.g. taxa_names(enterotype)[1] <- "unclassified taxon"), or give them all completely different names with tax_name().
  3. Taxa with the same tax_table entry repeated across multiple ranks: This is a problem for functions like taxatree_plots(), which need distinct entries at each rank to build the tree structure for plotting. This might happen after you tax_fix data with problem 1. of this list, or in data from e.g. microarray methods like HITchip. The solution is to use tax_prepend_ranks() (after tax_fix) to add the first character of the rank to all tax_table entries (you will also need to set the tax_fix argument suffix_rank = "current").
  4. Informative but duplicated tax_table entries: e.g. you don’t want to delete/replace a genus name completely, but it is shared by two families and thus blocking tax_agg. The solution is to rename (one of) these values manually to make them distinct, e.g. tax_table(yourPhyloseq)["targetTaxonName", "targetRank"] <- "newBetterGenusName"
  5. Really long taxa_names(): e.g. you have DNA sequences as names. See tax_name() for an easy way to rename all your taxa.
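
Some of the fixes in the list above might look like the following sketch. It assumes microViz is installed, and that the tax_fix argument anon_unique controls the naming of completely unclassified taxa, which is my reading of ?tax_fix and should be checked against your installed version. The object names are hypothetical.

```r
# Sketch only: runs when microViz is installed
if (requireNamespace("microViz", quietly = TRUE)) {
  library(microViz)
  pseq <- microViz::ibd

  # Problem 1: give completely unclassified taxa one shared generic name,
  # so that tax_agg() can pool them (assumes the anon_unique argument)
  pseq_pooled <- tax_fix(pseq, anon_unique = FALSE)

  # Problem 3: make entries distinct across ranks before plotting taxonomic trees
  pseq_tree_ready <- tax_prepend_ranks(tax_fix(pseq, suffix_rank = "current"))

  # Problem 5: replace taxa_names() with short generated names (default settings)
  pseq_renamed <- tax_name(pseq)
  print(head(phyloseq::taxa_names(pseq_renamed)))
}
```

Each call returns a modified copy, so the original pseq object is left unchanged.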

Abundance filtering as a solution

Sequences that are unclassified at fairly high ranks (e.g. Class) are often very low in abundance, or may represent sequencing errors or chimeras, provided your data come from an environment that is typically well represented in reference databases. So if you are unsure what to do with unclassified taxa, consider whether you can simply remove them first with tax_filter(). Fairly relaxed filtering criteria may be enough, such as a min_prevalence of 2 samples or a min_total_abundance of 1000 reads. Keep the tax_level argument as NA, so that no aggregation is attempted before filtering.
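
The idea behind a relaxed prevalence filter can be sketched in base R on a toy count matrix (invented numbers, taxa in rows and samples in columns). This only illustrates the filtering logic and is not tax_filter’s implementation. With microViz, the corresponding call would be along the lines of the commented line at the end.

```r
# Toy count matrix: 3 taxa (rows) by 4 samples (columns), invented values
counts <- matrix(
  c(
    120, 0, 340, 15,
      0, 0,   3,  0,
     50, 7,   0, 22
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU.1", "OTU.2", "OTU.3"), paste0("sample", 1:4))
)

# Prevalence: number of samples in which each taxon is detected
prevalence <- rowSums(counts > 0)

# Keep taxa detected in at least 2 samples
filtered <- counts[prevalence >= 2, , drop = FALSE]
print(rownames(filtered))
#> [1] "OTU.1" "OTU.3"

# With microViz (not run here):
# pseq %>% tax_filter(min_prevalence = 2, tax_level = NA)
```

OTU.2 is detected in only one sample, so it is dropped, while no aggregation step is involved.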

Alternatives

microbiome::aggregate_taxa() also solves some tax_table problems. For example, where multiple distinct genera converge on the same species name, like "" or "s__", it makes unique taxa names by pasting together all of the rank names. However, this can produce some very long names, which need to be manually shortened before use in plots. It also doesn’t replace names like "s__" if they only occur once. Moreover, when creating ordination plots with microViz, only tax_agg() will record the aggregation level for provenance tracking and automated plot captioning.

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2024-04-03
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version    date (UTC) lib source
#>  ade4               1.7-22     2023-02-06 [1] RSPM
#>  ape                5.7-1      2023-03-13 [1] RSPM
#>  Biobase            2.62.0     2023-10-24 [1] Bioconductor
#>  BiocGenerics       0.48.1     2023-11-01 [1] Bioconductor
#>  biomformat         1.30.0     2023-10-24 [1] Bioconductor
#>  Biostrings         2.70.3     2024-03-13 [1] Bioconduc~
#>  bitops             1.0-7      2021-04-24 [1] RSPM
#>  bslib              0.7.0      2024-03-29 [1] RSPM
#>  cachem             1.0.8      2023-05-01 [1] RSPM
#>  cli                3.6.2      2023-12-11 [1] RSPM
#>  cluster            2.1.6      2023-12-01 [3] CRAN (R 4.3.3)
#>  codetools          0.2-19     2023-02-01 [3] CRAN (R 4.3.3)
#>  colorspace         2.1-0      2023-01-23 [1] RSPM
#>  crayon             1.5.2      2022-09-29 [1] RSPM
#>  data.table         1.15.4     2024-03-30 [1] RSPM
#>  desc               1.4.3      2023-12-10 [1] RSPM
#>  devtools           2.4.5      2022-10-11 [1] RSPM
#>  digest             0.6.35     2024-03-11 [1] RSPM
#>  dplyr              1.1.4      2023-11-17 [1] RSPM
#>  ellipsis           0.3.2      2021-04-29 [1] RSPM
#>  evaluate           0.23       2023-11-01 [1] RSPM
#>  fansi              1.0.6      2023-12-08 [1] RSPM
#>  fastmap            1.1.1      2023-02-24 [1] RSPM
#>  foreach            1.5.2      2022-02-02 [1] RSPM
#>  fs                 1.6.3      2023-07-20 [1] RSPM
#>  generics           0.1.3      2022-07-05 [1] RSPM
#>  GenomeInfoDb       1.38.8     2024-03-15 [1] Bioconduc~
#>  GenomeInfoDbData   1.2.11     2024-04-03 [1] Bioconductor
#>  ggplot2            3.5.0      2024-02-23 [1] RSPM
#>  glue               1.7.0      2024-01-09 [1] RSPM
#>  gtable             0.3.4      2023-08-21 [1] RSPM
#>  htmltools          0.5.8      2024-03-25 [1] RSPM
#>  htmlwidgets        1.6.4      2023-12-06 [1] RSPM
#>  httpuv             1.6.15     2024-03-26 [1] RSPM
#>  igraph             2.0.3      2024-03-13 [1] RSPM
#>  IRanges            2.36.0     2023-10-24 [1] Bioconductor
#>  iterators          1.0.14     2022-02-05 [1] RSPM
#>  jquerylib          0.1.4      2021-04-26 [1] RSPM
#>  jsonlite           1.8.8      2023-12-04 [1] RSPM
#>  knitr              1.45       2023-10-30 [1] RSPM
#>  later              1.3.2      2023-12-06 [1] RSPM
#>  lattice            0.22-5     2023-10-24 [3] CRAN (R 4.3.3)
#>  lifecycle          1.0.4      2023-11-07 [1] RSPM
#>  magrittr           2.0.3      2022-03-30 [1] RSPM
#>  MASS               7.3-60.0.1 2024-01-13 [3] CRAN (R 4.3.3)
#>  Matrix             1.6-5      2024-01-11 [3] CRAN (R 4.3.3)
#>  memoise            2.0.1      2021-11-26 [1] RSPM
#>  mgcv               1.9-1      2023-12-21 [3] CRAN (R 4.3.3)
#>  microViz         * 0.12.1     2024-04-03 [1] local
#>  mime               0.12       2021-09-28 [1] RSPM
#>  miniUI             0.1.1.1    2018-05-18 [1] RSPM
#>  multtest           2.58.0     2023-10-24 [1] Bioconductor
#>  munsell            0.5.1      2024-04-01 [1] RSPM
#>  nlme               3.1-164    2023-11-27 [3] CRAN (R 4.3.3)
#>  permute            0.9-7      2022-01-27 [1] RSPM
#>  phyloseq         * 1.46.0     2023-10-24 [1] Bioconductor
#>  pillar             1.9.0      2023-03-22 [1] RSPM
#>  pkgbuild           1.4.4      2024-03-17 [1] RSPM
#>  pkgconfig          2.0.3      2019-09-22 [1] RSPM
#>  pkgdown            2.0.7      2022-12-14 [1] RSPM
#>  pkgload            1.3.4      2024-01-16 [1] RSPM
#>  plyr               1.8.9      2023-10-02 [1] RSPM
#>  profvis            0.3.8      2023-05-02 [1] RSPM
#>  promises           1.2.1      2023-08-10 [1] RSPM
#>  purrr              1.0.2      2023-08-10 [1] RSPM
#>  R6                 2.5.1      2021-08-19 [1] RSPM
#>  ragg               1.3.0      2024-03-13 [1] RSPM
#>  Rcpp               1.0.12     2024-01-09 [1] RSPM
#>  RCurl              1.98-1.14  2024-01-09 [1] RSPM
#>  remotes            2.5.0      2024-03-17 [1] RSPM
#>  reshape2           1.4.4      2020-04-09 [1] RSPM
#>  rhdf5              2.46.1     2023-11-29 [1] Bioconduc~
#>  rhdf5filters       1.14.1     2023-11-06 [1] Bioconductor
#>  Rhdf5lib           1.24.2     2024-02-07 [1] Bioconduc~
#>  rlang              1.1.3      2024-01-10 [1] RSPM
#>  rmarkdown          2.26       2024-03-05 [1] RSPM
#>  S4Vectors          0.40.2     2023-11-23 [1] Bioconduc~
#>  sass               0.4.9      2024-03-15 [1] RSPM
#>  scales             1.3.0      2023-11-28 [1] RSPM
#>  sessioninfo        1.2.2      2021-12-06 [1] RSPM
#>  shiny              1.8.1.1    2024-04-02 [1] RSPM
#>  stringi            1.8.3      2023-12-11 [1] RSPM
#>  stringr            1.5.1      2023-11-14 [1] RSPM
#>  survival           3.5-8      2024-02-14 [3] CRAN (R 4.3.3)
#>  systemfonts        1.0.6      2024-03-07 [1] RSPM
#>  textshaping        0.3.7      2023-10-09 [1] RSPM
#>  tibble             3.2.1      2023-03-20 [1] RSPM
#>  tidyselect         1.2.1      2024-03-11 [1] RSPM
#>  urlchecker         1.0.1      2021-11-30 [1] RSPM
#>  usethis            2.2.3      2024-02-19 [1] RSPM
#>  utf8               1.2.4      2023-10-22 [1] RSPM
#>  vctrs              0.6.5      2023-12-01 [1] RSPM
#>  vegan              2.6-4      2022-10-11 [1] RSPM
#>  withr              3.0.0      2024-01-16 [1] RSPM
#>  xfun               0.43       2024-03-25 [1] RSPM
#>  xtable             1.8-4      2019-04-21 [1] RSPM
#>  XVector            0.42.0     2023-10-24 [1] Bioconductor
#>  yaml               2.3.8      2023-12-11 [1] RSPM
#>  zlibbioc           1.48.2     2024-03-13 [1] Bioconduc~
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.3.3/lib/R/site-library
#>  [3] /opt/R/4.3.3/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────