We had the hypothesis that by leveraging the vast diversity of sequenced strains procured from many different environments, we could ‘fill in’ evolutionary distance within our strain bank in a more complete way.
To test this hypothesis, we used the 7047 bacterial proteomes within UniProt’s ‘2020_02’ release as our reference, as it contained strains collected across ocean, soil, human and mouse microbiomes, amongst other environments.
# what's the matrix look like# 7047 bacterial strains x 10177 orthologs # each element is the number of sequences in the strain that map to that ortholog (emapper v2)upmtx = uniprot.X[:, :]
rectangle(x, y, w, h) =Shape(x .+ [0,w,w,0], y .+ [0,0,h,h])plot( xlabel="spectral depth", ylabel="singular value", ticks=[exp10(i) for i in0:4], scale=:log10, )plot!(usv.S, label="singular values")pltpars = @. rectangle(first(partitions[1:end-1]), # x usv.S[first(partitions[1:end-1])], # ylength(partitions[1:end-1]), # width usv.S[first(partitions[2:end])] - usv.S[first(partitions[1:end-1])] # height)plot!(pltpars, c=:blue, alpha=.2)
Then we compute the spectral distances using these partitions
@time dij =spectraldistances(usv.U, usv.S, partitions);
243.653901 seconds (639.25 k allocations: 652.763 GiB, 46.67% gc time, 0.05% compilation time: 100% of which was recompilation)
calculate a tree from the SPI distance matrix.
So far I have been using basic hierarchical clustering functions; however, basic hierarchical clustering assumes a constant rate of evolution and because this is a dataset of natural systems (i.e. bacteria across natural evolution), this core assumption is likely broken. Thus we will use neighbor joining to compute our Spectral Tree.
# write out treeUP7047_treedir =projectdir("_research", "UP7047_neighborjoined_spitree") |> mkpathopen(joinpath(UP7047_treedir, "2020_02_UP7047-tree.nw"), "w") do ioprintln(io, basetreestring)end
So far we have compute the reference SPI tree, we may also want some measure of how statistically confident we are on each of the branches and merges we have predicted in the tree. That is best performed through a boot strap analysis. Where we sample with replacement the features and recompute the SPI distance matrix and tree.
usingDistributedrmprocs(workers()) # remove any current workersaddprocs() # startup default number (8) workers @everywhereusingDrWatson@everywhere@quickactivateprojectdir()@everywhereusingSpectralInference@everywhereusingNeighborJoining@everywhereusingMuon@everywhere uniprot =readh5ad(datadir("exp_raw", "UP7047", "2020_02_UP7047.h5ad"))@everywhere upmtx = uniprot.X[:,:]@everywhere upids = uniprot.obs_names@everywherefunctionrunbootstrap(feature_ids, mtx=upmtx, ids=upids) M = mtx[:, feature_ids] # matrix bootstrapped on columns vals, vecs =eigen(Matrix(M * M')) # more efficient than SVD as it only computes the row-wise factorization# floating point impercision gives some -1e-16 vals so set them to zero# and take the square root to get singular values rather than eigen values S =sqrt.(max.(vals, zero(eltype(vals)))) dij =spectraldistances(vecs, S; alpha=1.5, q=.75) # use same params as reference tree treestring = NeighborJoining.newickstring(fastNJ(dij), ids) # get tree stringend
┌ Warning: RCall.jl: Scale for y is already present.
│ Adding another scale for y, which will replace the existing scale.
│ Coordinate system already present. Adding new coordinate system, which will
│ replace the existing one.
└ @ RCall /Users/bend/.julia/packages/RCall/YrsKg/src/io.jl:172
Sub-clades of Spectral Tree
Look at local surrounding of a few taxa, they are with known closely taxonomically related neighbors
nthparent(n, i) = i <1 ? n :nthparent(parent(n), i-1)
To verify that our tree is systematically capturing known phylogeny we compare the grouping of the tree at varying distances from the ‘root’ to known NCBI taxonomic groupings and calculate the mutual information.
we see that as we move from the root to the leaves we get spiking information related Phylum, Class, Order, Family, Genus, and Species in discrete layers and in that order.
Fig. S6 - Plot GTDB taxonomy on Spectral Tree
Due to size constraints from github, I can’t show the images embedded into the notebook, but here is the code to make the tree plots from Figure S5.