Main - Projecting in strains from the commensal strain bank
Know that we have a latent space of bacterial evolutionary diversity, see previous notebook. We want to know if it informs us about the relatedness of our newly sequenced gut commensal strains. Have we learned something general about biology or just the idiosyncrasies of the UniProt dataset?
To do this we leverage the fact that the UniProt latest spaces is built using SVD/PCA, which means that we can linearly project new data into that latest space.
# read in datauniprot =readh5ad(datadir("exp_raw", "UP7047", "2020_02_UP7047.h5ad"))biobank =readh5mu(datadir("exp_raw", "BB669", "BB669.h5mu"))
┌ Warning: Cannot join columns with the same name because var_names are intersecting.
└ @ Muon /Users/bend/.julia/packages/Muon/UKjAF/src/mudata.jl:367
# remake UniProt latent spaceupmtx = uniprot.X[:, :]upusv =svd(upmtx)# get strain bank strains annotated in same way as UniProtbbmtx = biobank["UPorder_oggs"].X[:,:]size(bbmtx)
(669, 10177)
# project into uniprot spectral latent space# uses ̂u = newdata * V * S_inversebbuhat =projectinLSV(bbmtx, upusv)
┌ Warning: RCall.jl: Scale for y is already present.
│ Adding another scale for y, which will replace the existing scale.
│ Coordinate system already present. Adding new coordinate system, which will
│ replace the existing one.
└ @ RCall /Users/bend/.julia/packages/RCall/0ggIQ/src/io.jl:172
┌ Warning: RCall.jl: Scale for y is already present.
│ Adding another scale for y, which will replace the existing scale.
│ Coordinate system already present. Adding new coordinate system, which will
│ replace the existing one.
└ @ RCall /Users/bend/.julia/packages/RCall/0ggIQ/src/io.jl:172
Sub-trees
Let’s check some local neighbors around certain taxa. Expanding out from individual taxa shows the other strain replicates of that species. And, we see sub-species groupings!
nthparent(n, i) = i <1 ? n :nthparent(parent(n), i-1)rename_treeleaves!(tree, idmapping) =beginfor node ingetleaves(tree) NewickTree.setname!(node, idmapping[name(node)])end treeend
rename_treeleaves! (generic function with 1 method)
One thing to notice as we view these sub trees is that they are grouping by NCBI species, but there are further sub-groupings within each species. The ID convention for our strains is “<sampling hospital>.<donor>.<strain>”. Knowing this naming convention we realized that the sub-groups were corresponding to the donor these strains were sampled from
This becomes more obvious as we color by donor id
donorcolormap =Dict(k=>v for (k,v) inzip(unique(biobank.obs.Donor), palette(:glasbey_bw_n256, length(unique(biobank.obs.Donor)))));
What components were needed to resolve these clusters? If we plot the projections of the taxa onto each of the components, we see that differences between these sub-species clusters only really shows up after PC 10, which we can see by the beginning of banding in the heatmap.
Again we wan’t to look systematically over our tree, to verify that the patterns we saw in the above two examples are consistent with the general patterns in the tree.
Thus we compare grouping in our tree to both NCBI taxonomy and GTDB taxonomy.
Comparing the Spectral tree to the 16S or bac120 trees. We find that we are able to represent genomic strain level variation where neither 16S nor Bac120 annotations can.
# strain ids for 669 samples in CSBobs_names = biobank.obs.Strain_ID;# re-read in biobank tree (same as what is plotted above)bbtree =readnw(readline(joinpath(bbtreedir, "BB669-tree.nw")))as_polytomy!(n->NewickTree.support(n)<0.5, bbtree)ladderize!(bbtree)leafnames_spitree =getleafnames(bbtree); # same order as returned by patristic_distancesorder_treespitree =indexin(leafnames_spitree, obs_names);# read in tree built from bac120 annotationstreebac120 =readnw(readline(projectdir("_research","BB669_bac120_phyml","BB669_bac120.phy-supporttree.txt")))as_polytomy!(n->NewickTree.support(n)<0.5, treebac120)ladderize!(treebac120)leafnames_bac120 =getleafnames(treebac120); # same order as returned by patristic_distancesorder_treebac120 =indexin(leafnames_bac120, obs_names);# read in tree built from 16S sequencetree16s =readnw(readline(projectdir("_research","BB669_16S_phyml","BB669_16S.phy-supporttree.txt")))as_polytomy!(n->NewickTree.support(n)<0.5, tree16s)ladderize!(tree16s)leafnames_16S =getleafnames(tree16s); # same order as returned by patristic_distancesorder_tree16S =indexin(leafnames_16S, obs_names);# 1 taxa from the CSB got dropped because we couldn't get a good annotated 16S read# so I am collecting the sample mapping here between the taxa in the tree and# the taxa in CSBorder_spi_tree16s =indexin(leafnames_16S, leafnames_spitree);# likewise 5 taxa from CSB got dropped when annotating with bac120order_spi_treebac120 =indexin(leafnames_bac120, leafnames_spitree);
bbtaxa = biobank.obs[:,[:NCBI_Phylum, :NCBI_Class, :NCBI_Order, :NCBI_Family, :NCBI_Genus, :NCBI_Species]];taxanomicdistance =pairwise(eachrow(Matrix(bbtaxa)[:,6:-1:1])) do a, b x =findfirst(a .==replace(b, "<unclassified>"=>"missing")) x ===nothing ? 6: x-1end;
# get patristic distances between each leaf of the spectral treesixPij =patristic_distances(tree16s);bacPij =patristic_distances(treebac120);spiPij =patristic_distances(bbtree);# mask to collect just upper triangle of distance matrix for each datasetuppertriangle_six =triu(trues(size(sixPij)), 1);uppertriangle_bac =triu(trues(size(bacPij)), 1);uppertriangle_spi =triu(trues(size(spiPij)), 1);# get upper trianglesixPij = sixPij[uppertriangle_six];bacPij = bacPij[uppertriangle_bac];spiPij = spiPij[uppertriangle_spi];# reorder distance matrix to same order as metadata filesixTij = taxanomicdistance[order_tree16S, order_tree16S][uppertriangle_six];bacTij = taxanomicdistance[order_treebac120, order_treebac120][uppertriangle_bac];spiTij = taxanomicdistance[order_treespitree, order_treespitree][uppertriangle_spi];# scaled distances to between 0 and 1spiPij_rel = (spiPij ./maximum(spiPij));bacPij_rel = (bacPij ./maximum(bacPij));sixPij_rel = (sixPij ./maximum(sixPij));
plot relative density distributions
p1 =plot( xlims=(0,1), widen=1, xticks=(0:.25:1, ["0.0", "", "0.5", "", "1.0"]),# ylabel="same species",# size=(400, 250),)# patristic distance of pairs that are the same genusviolin!(["same\n genus"], spiPij_rel[spiTij .==1], # normed dist trim=true, side=:right, c=:blue, fill=0, linecolor=:black, lw=1, label="Spectral tree", permute=(:x, :y),)violin!(["same\n genus"], bacPij_rel[bacTij .==1], # normed dist trim=true, side=:right, c=:yellow, fill=0, linecolor=:black, lw=1, label="Bac120 tree", permute=(:x, :y),)# patristic distance of pairs that are the same speciesviolin!(["same\n species"], spiPij_rel[spiTij .==0], # normed dist trim=true, side=:right, c=:blue, fill=0, linecolor=:black, lw=1,# label="Spectral Tree", permute=(:x, :y),)violin!(["same\n species"], bacPij_rel[bacTij .==0], # normed dist trim=true, side=:right, c=:yellow, fill=0, linecolor=:black, lw=1,# label="16S", permute=(:x, :y),)p2 =plot( xlims=(0,1), widen=1, xticks=(0:.25:1, ["0.0", "", "0.5", "", "1.0"]),# ylabel="same species",# size=(400, 250),)# patristic distance of pairs that are the same genusviolin!(["same\n genus"], spiPij_rel[spiTij .==1], # normed dist trim=true, side=:right, c=:blue, fill=0, linecolor=:black, lw=1, label="Spectral tree", permute=(:x, :y),)violin!(["same\n genus"], sixPij_rel[sixTij .==1], # normed dist trim=true, side=:right, c=:orange, fill=0, linecolor=:black, lw=1, label="16S tree", permute=(:x, :y),)# patristic distance of pairs that are the same speciesviolin!(["same\n species"], spiPij_rel[spiTij .==0], # normed dist trim=true, side=:right, c=:blue, fill=0, linecolor=:black, lw=1,# label="Spectral Tree", permute=(:x, :y),)violin!(["same\n species"], sixPij_rel[sixTij .==0], # normed dist trim=true, side=:right, c=:orange, fill=0, linecolor=:black, lw=1,# label="16S", permute=(:x, :y),)plot(p1, p2, layout=grid(1,2), size=(800, 250), margin=3Plots.mm)
# difference in distributions is significant to below floating point precision# same species distributions@show pval =pvalue(MannWhitneyUTest(spiPij_rel[spiTij.==0], bacPij_rel[bacTij.==0]); tail=:both);@show pval =pvalue(MannWhitneyUTest(spiPij_rel[spiTij.==0], sixPij_rel[sixTij.==0]); tail=:both);# same genus but different species distributions@show pval =pvalue(MannWhitneyUTest(spiPij_rel[spiTij.==1], bacPij_rel[bacTij.==0]); tail=:both);@show pval =pvalue(MannWhitneyUTest(spiPij_rel[spiTij.==1], sixPij_rel[sixTij.==0]); tail=:both);
# taxonomic distance: number of levels before two strains share a taxonomic classificationbbtaxa = biobank.obs[:,[:NCBI_Phylum, :NCBI_Class, :NCBI_Order, :NCBI_Family, :NCBI_Genus, :NCBI_Species]];# bbtaxa = biobank.obs[:,[:GTDB_Phylum, :GTDB_Class, :GTDB_Order, :GTDB_Family, :GTDB_Genus, :GTDB_Species]];taxanomicdistance =pairwise(eachrow(Matrix(bbtaxa)[:,6:-1:1])) do a, b x =findfirst(a .==replace(b, "<unclassified>"=>"missing")) x ===nothing ? 6: x-1end;
There are no pairs of taxa that are from different families that are closer together than any pair of strains that belong to the same species, when we contextualize by the entire distribution of bacteria in UniProt.
BBUPisbelowthreshold = dij .< speciesthreshold;any(issameorder .& BBUPisbelowthreshold)
false
If we constrain ourselves to just our strain bank there are pairs of taxa that are from different families that are closer together than any pair of strains that belong to the same species.