Join a dataframe to phyloseq sample data

You can use most types of join from the dplyr::*_join function family, including e.g. "inner", "left", "semi", "anti" (see details below). Defaults to type = "left" which calls left_join(), this supports x as a phyloseq and y as a dataframe. Most of the time you'll want "left" (adds variables with no sample filtering), or "inner" (adds variables and filters samples). This function simply:

extracts the sample_data from the phyloseq as a dataframe
performs the chosen type of join (with the given arguments)
filters the phyloseq if type = inner, semi or anti
reattaches the modified sample_data to the phyloseq and returns the phyloseq

Usage

ps_join(
  x,
  y,
  by = NULL,
  match_sample_names = NULL,
  keep_sample_name_col = TRUE,
  sample_name_natural_join = FALSE,
  type = "left",
  .keep_all_taxa = FALSE
)

Arguments

x: phyloseq (or dataframe)
y: dataframe (or phyloseq for e.g. type = "right")
by: A character vector of variables to join by (col must be present in both x and y or paired via a named vector like c("xname" = "yname", etc.))
match_sample_names: match against the phyloseq sample_names by naming a variable in the additional dataframe (this is in addition to any variables named in by)
keep_sample_name_col: should the column named in match_sample_names be kept in the returned phyloseq's sample_data? (only relevant if match_sample_names is not NULL)
sample_name_natural_join: if TRUE, use sample_name AND all shared colnames to match rows (only relevant if match_sample_names is not NULL, this arg takes precedence over anything also entered in by arg)
type: name of type of join e.g. "left", "right", "inner", "semi" (see dplyr help pages)
.keep_all_taxa: if FALSE (the default), remove taxa which are no longer present in the dataset after filtering

Value

phyloseq with modified sample_data (and possibly filtered)

Details

Mutating joins, which will add columns from a dataframe to phyloseq sample data, matching rows based on the key columns named in the by argument:

"inner": includes all rows in present in both x and y.
"left": includes all rows in x. (so x must be the phyloseq)
"right": includes all rows in y. (so y must be the phyloseq)
"full": includes all rows present in x or y. (will likely NOT work, as additional rows cannot be added to sample_data!)

If a row in x matches multiple rows in y (based on variables named in the by argument), all the rows in y will be added once for each matching row in x. This will cause this function to fail, as additional rows cannot be added to the phyloseq sample_data!

Filtering joins filter rows from x based on the presence or absence of matches in y:

"semi": return all rows from x with a match in y.
"anti": return all rows from x without a match in y.

Examples

library(phyloseq)
data("enterotype", package = "phyloseq")

x <- enterotype
y <- data.frame(
  ID_var = sample_names(enterotype)[c(1:50, 101:150)],
  SeqTech = sample_data(enterotype)[c(1:50, 101:150), "SeqTech"],
  arbitrary_info = rep(c("A", "B"), 50)
)

# simply match the new data to samples that exist in x, as default is a left_join
# where some sample names of x are expected to match variable ID_var in dataframe y
out1A <- ps_join(x = x, y = y, match_sample_names = "ID_var")
out1A
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 553 taxa and 280 samples ]
#> sample_data() Sample Data:       [ 280 samples by 12 sample variables ]
#> tax_table()   Taxonomy Table:    [ 553 taxa by 1 taxonomic ranks ]
sample_data(out1A)[1:6, ]
#>              ID_var Enterotype Sample_ID SeqTech.x  SampleID     Project
#> AM.AD.1     AM.AD.1       <NA>   AM.AD.1    Sanger   AM.AD.1      gill06
#> AM.AD.2     AM.AD.2       <NA>   AM.AD.2    Sanger   AM.AD.2      gill06
#> AM.F10.T1 AM.F10.T1       <NA> AM.F10.T1    Sanger AM.F10.T1 turnbaugh09
#> AM.F10.T2 AM.F10.T2          3 AM.F10.T2    Sanger AM.F10.T2 turnbaugh09
#> DA.AD.1     DA.AD.1          2   DA.AD.1    Sanger   DA.AD.1     MetaHIT
#> DA.AD.1T   DA.AD.1T       <NA>  DA.AD.1T    Sanger      <NA>        <NA>
#>           Nationality Gender Age ClinicalStatus SeqTech.y arbitrary_info
#> AM.AD.1      american      F  28        healthy    Sanger              A
#> AM.AD.2      american      M  37        healthy    Sanger              B
#> AM.F10.T1    american      F  NA          obese    Sanger              A
#> AM.F10.T2    american      F  NA          obese    Sanger              B
#> DA.AD.1        danish      F  59        healthy    Sanger              A
#> DA.AD.1T         <NA>   <NA>  NA           <NA>    Sanger              B


# use sample_name and all shared variables to join
# (a natural join is not a type of join per se,
# but it indicates that all shared variables should be used for matching)
out1B <- ps_join(
  x = x, y = y, match_sample_names = "ID_var",
  sample_name_natural_join = TRUE, keep_sample_name_col = FALSE
)
out1B
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 553 taxa and 280 samples ]
#> sample_data() Sample Data:       [ 280 samples by 10 sample variables ]
#> tax_table()   Taxonomy Table:    [ 553 taxa by 1 taxonomic ranks ]
sample_data(out1B)[1:6, ]
#>           Enterotype Sample_ID SeqTech  SampleID     Project Nationality Gender
#> AM.AD.1         <NA>   AM.AD.1  Sanger   AM.AD.1      gill06    american      F
#> AM.AD.2         <NA>   AM.AD.2  Sanger   AM.AD.2      gill06    american      M
#> AM.F10.T1       <NA> AM.F10.T1  Sanger AM.F10.T1 turnbaugh09    american      F
#> AM.F10.T2          3 AM.F10.T2  Sanger AM.F10.T2 turnbaugh09    american      F
#> DA.AD.1            2   DA.AD.1  Sanger   DA.AD.1     MetaHIT      danish      F
#> DA.AD.1T        <NA>  DA.AD.1T  Sanger      <NA>        <NA>        <NA>   <NA>
#>           Age ClinicalStatus arbitrary_info
#> AM.AD.1    28        healthy              A
#> AM.AD.2    37        healthy              B
#> AM.F10.T1  NA          obese              A
#> AM.F10.T2  NA          obese              B
#> DA.AD.1    59        healthy              A
#> DA.AD.1T   NA           <NA>              B

# if you only want to keep phyloseq samples that exist in the new data, try an inner join
# this will add the new variables AND filter the phyloseq
# this example matches sample names to ID_var and by matching the shared SeqTech variable
out1C <- ps_join(x = x, y = y, type = "inner", by = "SeqTech", match_sample_names = "ID_var")
out1C
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 533 taxa and 100 samples ]
#> sample_data() Sample Data:       [ 100 samples by 11 sample variables ]
#> tax_table()   Taxonomy Table:    [ 533 taxa by 1 taxonomic ranks ]
sample_data(out1C)[1:6, ]
#>              ID_var Enterotype Sample_ID SeqTech  SampleID     Project
#> AM.AD.1     AM.AD.1       <NA>   AM.AD.1  Sanger   AM.AD.1      gill06
#> AM.AD.2     AM.AD.2       <NA>   AM.AD.2  Sanger   AM.AD.2      gill06
#> AM.F10.T1 AM.F10.T1       <NA> AM.F10.T1  Sanger AM.F10.T1 turnbaugh09
#> AM.F10.T2 AM.F10.T2          3 AM.F10.T2  Sanger AM.F10.T2 turnbaugh09
#> DA.AD.1     DA.AD.1          2   DA.AD.1  Sanger   DA.AD.1     MetaHIT
#> DA.AD.1T   DA.AD.1T       <NA>  DA.AD.1T  Sanger      <NA>        <NA>
#>           Nationality Gender Age ClinicalStatus arbitrary_info
#> AM.AD.1      american      F  28        healthy              A
#> AM.AD.2      american      M  37        healthy              B
#> AM.F10.T1    american      F  NA          obese              A
#> AM.F10.T2    american      F  NA          obese              B
#> DA.AD.1        danish      F  59        healthy              A
#> DA.AD.1T         <NA>   <NA>  NA           <NA>              B

# the id variable is named Sample_ID in x and ID_var in y
# semi_join is only a filtering join (doesn't add new variables but just filters samples in x)
out2A <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "semi")
out2A
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 533 taxa and 100 samples ]
#> sample_data() Sample Data:       [ 100 samples by 9 sample variables ]
#> tax_table()   Taxonomy Table:    [ 533 taxa by 1 taxonomic ranks ]
sample_data(out2A)[1:6, ]
#>           Enterotype Sample_ID SeqTech  SampleID     Project Nationality Gender
#> AM.AD.1         <NA>   AM.AD.1  Sanger   AM.AD.1      gill06    american      F
#> AM.AD.2         <NA>   AM.AD.2  Sanger   AM.AD.2      gill06    american      M
#> AM.F10.T1       <NA> AM.F10.T1  Sanger AM.F10.T1 turnbaugh09    american      F
#> AM.F10.T2          3 AM.F10.T2  Sanger AM.F10.T2 turnbaugh09    american      F
#> DA.AD.1            2   DA.AD.1  Sanger   DA.AD.1     MetaHIT      danish      F
#> DA.AD.1T        <NA>  DA.AD.1T  Sanger      <NA>        <NA>        <NA>   <NA>
#>           Age ClinicalStatus
#> AM.AD.1    28        healthy
#> AM.AD.2    37        healthy
#> AM.F10.T1  NA          obese
#> AM.F10.T2  NA          obese
#> DA.AD.1    59        healthy
#> DA.AD.1T   NA           <NA>

# anti_join is another type of filtering join
out2B <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var"), type = "anti")
out2B
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 549 taxa and 180 samples ]
#> sample_data() Sample Data:       [ 180 samples by 9 sample variables ]
#> tax_table()   Taxonomy Table:    [ 549 taxa by 1 taxonomic ranks ]
sample_data(out2B)[1:6, ]
#>        Enterotype Sample_ID  SeqTech SampleID Project Nationality Gender Age
#> MH0010          1    MH0010 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#> MH0011          1    MH0011 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#> MH0012          1    MH0012 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#> MH0013          1    MH0013 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#> MH0014          1    MH0014 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#> MH0015          1    MH0015 Illumina     <NA>    <NA>        <NA>   <NA>  NA
#>        ClinicalStatus
#> MH0010           <NA>
#> MH0011           <NA>
#> MH0012           <NA>
#> MH0013           <NA>
#> MH0014           <NA>
#> MH0015           <NA>

# semi and anti joins keep opposite sets of samples
intersect(sample_names(out2A), sample_names(out2B))
#> character(0)

# you can mix and match named and unnamed values in the `by` vector
# inner is like a combination of left join and semi join
out3 <- ps_join(x = x, y = y, by = c("Sample_ID" = "ID_var", "SeqTech"), type = "inner")
out3
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 533 taxa and 100 samples ]
#> sample_data() Sample Data:       [ 100 samples by 10 sample variables ]
#> tax_table()   Taxonomy Table:    [ 533 taxa by 1 taxonomic ranks ]
sample_data(out3)[1:6, ]
#>           Enterotype Sample_ID SeqTech  SampleID     Project Nationality Gender
#> AM.AD.1         <NA>   AM.AD.1  Sanger   AM.AD.1      gill06    american      F
#> AM.AD.2         <NA>   AM.AD.2  Sanger   AM.AD.2      gill06    american      M
#> AM.F10.T1       <NA> AM.F10.T1  Sanger AM.F10.T1 turnbaugh09    american      F
#> AM.F10.T2          3 AM.F10.T2  Sanger AM.F10.T2 turnbaugh09    american      F
#> DA.AD.1            2   DA.AD.1  Sanger   DA.AD.1     MetaHIT      danish      F
#> DA.AD.1T        <NA>  DA.AD.1T  Sanger      <NA>        <NA>        <NA>   <NA>
#>           Age ClinicalStatus arbitrary_info
#> AM.AD.1    28        healthy              A
#> AM.AD.2    37        healthy              B
#> AM.F10.T1  NA          obese              A
#> AM.F10.T2  NA          obese              B
#> DA.AD.1    59        healthy              A
#> DA.AD.1T   NA           <NA>              B