This notebook gives context for what strains are in our strain bank and shows that standard analyses don’t work for capturing genomic strain variation.
# read in databiobank =readh5mu(joinpath(bbdir, "BB669.h5mu"))biobank.obs[!,:kept_species] =convert.(Bool,biobank.obs[!,:kept_species]);biobank
┌ Warning: Cannot join columns with the same name because var_names are intersecting.
└ @ Muon /Users/bend/.julia/packages/Muon/UKjAF/src/mudata.jl:367
# we have taxanomic information each of these strains# GTDB_* are the annotations from https://gtdb.ecogenomic.org/# NCBI_* are the annotation post-submission to NCBI# * annotations are pre-submission to NCBI# donor specifies which human microbiome each strain was sampled from# <hospital>.<donor number>bbobs = biobank.obs;first(bbobs, 5)
5×22 DataFrame
Row
Strain_ID
Accession
Phylum
Class
Order
Family
Genus
Species
Donor
NCBI_Phylum
NCBI_Class
NCBI_Order
NCBI_Family
NCBI_Genus
NCBI_Species
GTDB_Phylum
GTDB_Class
GTDB_Order
GTDB_Family
GTDB_Genus
GTDB_Species
kept_species
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
Bool
1
MSK.16.19
JAHOLG000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
MSK.16
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
false
2
MSK.10.5
JAHOMY000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
MSK.10
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis_A
false
3
MSK.13.23
JAHOMK000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
MSK.13
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis_A
false
4
MSK.16.61
JAHOKD000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
MSK.16
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
true
5
MSK.18.56
JAHOHX000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
MSK.18
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
false
# concentrated for lachno's and bacteroidaceae # (matches distibutions normally found in these types of strain bank)family_counts =sort(countmap(bbobs.NCBI_Family), byvalue=true, rev=true)
proportion CSB strains from different NCBI families
species_counts =collect(values(countmap(bbobs.NCBI_Species)));histogram( species_counts, bins=100, xlabel="# of replicates", ylabel="# of species", size=(500,500),)
each bar represents the number of measured species (y axis) with a certain number of replicates (x axis). Most species we classify only have 1 strain representative, 11 species have at least 20 strain representatives.
Phyml on 16S alignment of Biobank strains
16S is a ribosomal protein (involved with translating RNA to Proteins) that is present across all bacteria. For that reason, it is commonly used as a gene marker for constructing ancestral trees of bacteria and taxanomic classification.
What we show is that 16S annotations are only good up to species level classification. At that point all strains are completely indistinguishable from each other.
Bac120 aimed to be an improvement over 16S annotation. It takes 120 genes that are reasonably well conserved across bacteria, and takes ~50 genomic positions from each of those genes, creating an annotation of around 6000 features.
We find that these annotations equally only work down to species level classification, with strain level replicates all being identically annotated.
metmtx = biobank["metabolites_foldchange"].X[:, :]# too close to zero for detectable measurment, so assume no changemetmtx[isinf.(metmtx)] .=0.0; @showsize(metmtx)# # filter to metabolites that have at least 10% detectable shifts in measurement# metabolite_names_full = biobank["metabolites_foldchange"].var_names.vals;# keepmetabolites_mask = mapslices(c->mean(c .== 0.0) < 0.9, metmtx, dims=1) |> vec;# metabolite_names = metabolite_names_full[keepmetabolites_mask]# metmtx = metmtx[:, keepmetabolites_mask];# @show size(metmtx);
However, these strains are not functionally identical. Strains of the same species, digest/produce different amounts of metabolite compounds. And it is not simply individual variation. In the plot below we see bi-modal distribution indicating that there are groups of strains that behave differently.
# load metabolites matrixmetmtx = biobank["metabolites_foldchange"].X[:, :][biobank.obs.kept_species,:];# too close to zero for detectable measurment, so assume no changemetmtx[isinf.(metmtx)] .=0.0; @showsize(metmtx)# filter to metabolites that have at least 10% detectable shifts in measurementmetabolite_names_full = biobank["metabolites_foldchange"].var_names.vals;keepmetabolites_mask =mapslices(c->mean(c .==0.0) <0.9, metmtx, dims=1) |> vec;metabolite_names = metabolite_names_full[keepmetabolites_mask]metmtx = metmtx[:, keepmetabolites_mask];@showsize(metmtx);
show log-2 fold change from base media across strains from 9 species with more than 20 replicates. multi-modal distributions for some distributions indicates sub-species phylogentic variation in phenotype.