Spectral decompositions such as SVD and PCA are based on covariation, so they incorporate information about the contextual dependence between features and taxa. This means that Spectral Inference can infer the correct ancestral tree even in cases of convergent evolution. Below we create an alignment of 18 taxa described by 9 features in which there are multiple evolutionary paths to each feature; that is, each feature marks a true ancestral divergence only in the context of the other features.
Spectral Inference recapitulates this ancestral tree perfectly, whereas standard phylogenetic tools fail.
M = Float64.([
    1 0 1 0 0 0 1 0 1;
    0 1 1 0 0 0 1 0 1;
    1 1 0 0 0 0 1 0 1;
    1 0 1 0 0 0 0 1 1;
    0 1 1 0 0 0 0 1 1;
    1 1 0 0 0 0 0 1 1;
    1 0 1 0 0 0 1 1 0;
    0 1 1 0 0 0 1 1 0;
    1 1 0 0 0 0 1 1 0;
    1 0 1 1 0 1 0 0 0;
    0 1 1 1 0 1 0 0 0;
    1 1 0 1 0 1 0 0 0;
    1 0 1 0 1 1 0 0 0;
    0 1 1 0 1 1 0 0 0;
    1 1 0 0 1 1 0 0 0;
    1 0 1 1 1 0 0 0 0;
    0 1 1 1 1 0 0 0 0;
    1 1 0 1 1 0 0 0 0
]);
heatmap(
    M,
    c=[:white, :black],
    ratio=1,
    xticks=(1:9, ["Pos $i" for i in 1:9]),
    xrotation=45,
    yticks=(1:18, 'a':'r'),
    yflip=true,
    size=(300, 600),
    margin=5Plots.mm,
)
How is information encoded about convergent processes?
We show that subsets of principal components hold information that projects taxa to different positions based on the ancestral path they took to obtain a particular genomic feature.
Taxa a-i and j-r split at the first generation. Using position 5 as a gene marker would place j-l with a-i, because both groups lost this feature through convergent processes.
Recreating the alignment using components 5-8 transforms position 5 in such a way that taxa j-l are distinct, because they lost this feature in a different context than taxa a-i.
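The reconstruction described above can be checked by hand. The following is a minimal sketch, assuming `usv` is the plain SVD of the uncentered alignment `M` (the text does not show how `usv` is defined, so this is an assumption); the names `taxa`, `raw5`, `M58`, and `recon5` are introduced here for illustration only.

```julia
using LinearAlgebra

# Alignment from the text: 18 taxa (rows a-r) by 9 positions (columns).
M = Float64[
    1 0 1 0 0 0 1 0 1
    0 1 1 0 0 0 1 0 1
    1 1 0 0 0 0 1 0 1
    1 0 1 0 0 0 0 1 1
    0 1 1 0 0 0 0 1 1
    1 1 0 0 0 0 0 1 1
    1 0 1 0 0 0 1 1 0
    0 1 1 0 0 0 1 1 0
    1 1 0 0 0 0 1 1 0
    1 0 1 1 0 1 0 0 0
    0 1 1 1 0 1 0 0 0
    1 1 0 1 0 1 0 0 0
    1 0 1 0 1 1 0 0 0
    0 1 1 0 1 1 0 0 0
    1 1 0 0 1 1 0 0 0
    1 0 1 1 1 0 0 0 0
    0 1 1 1 1 0 0 0 0
    1 1 0 1 1 0 0 0 0
]
taxa = collect('a':'r')

# Raw marker: position 5 is absent in a-i and also in j-l.
raw5 = M[:, 5]
absent_raw = taxa[raw5 .== 0]

# Assumption: usv is the SVD of the uncentered alignment.
usv = svd(M)

# Rebuild the alignment from components 5-8 only.
window = 5:8
M58 = usv.U[:, window] * Diagonal(usv.S[window]) * usv.Vt[window, :]
recon5 = round.(M58[:, 5], digits=3)

for (t, r, c) in zip(taxa, raw5, recon5)
    println(t, "  raw=", Int(r), "  components 5-8=", c)
end
```

Under these assumptions, the raw column groups a-l together (all zeros), while the reconstructed column takes three values: roughly 0 for a-i, about -0.667 for j-l, and about 0.333 for m-r. Taxa j-l therefore no longer coincide with a-i at position 5.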
We also show that, in this case, information about more recent generational differences correlates with deeper principal components.
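A simple hand check of this claim is to look at where each split appears in the singular vectors. This sketch again assumes `usv = svd(M)` on the uncentered alignment, which is not confirmed by the text; `f1_signs`, `deep`, `same_triplet`, and `other_triplet` are illustrative names.

```julia
using LinearAlgebra

# Alignment from the text: 18 taxa (rows a-r) by 9 positions (columns).
M = Float64[
    1 0 1 0 0 0 1 0 1
    0 1 1 0 0 0 1 0 1
    1 1 0 0 0 0 1 0 1
    1 0 1 0 0 0 0 1 1
    0 1 1 0 0 0 0 1 1
    1 1 0 0 0 0 0 1 1
    1 0 1 0 0 0 1 1 0
    0 1 1 0 0 0 1 1 0
    1 1 0 0 0 0 1 1 0
    1 0 1 1 0 1 0 0 0
    0 1 1 1 0 1 0 0 0
    1 1 0 1 0 1 0 0 0
    1 0 1 0 1 1 0 0 0
    0 1 1 0 1 1 0 0 0
    1 1 0 0 1 1 0 0 0
    1 0 1 1 1 0 0 0 0
    0 1 1 1 1 0 0 0 0
    1 1 0 1 1 0 0 0 0
]

usv = svd(M)  # assumption: plain SVD, no centering
println("singular values: ", round.(usv.S, digits=3))

# First-generation split (a-i vs j-r): inspect the sign of component 2.
f1_signs = sign.(usv.U[:, 2])
println("component 2 signs: ", Int.(f1_signs))

# Second-generation split (triplets a-c, d-f, ...): compare taxa in the
# subspace spanned by components 5-8.
deep = usv.U[:, 5:8]
same_triplet = norm(deep[1, :] - deep[2, :])   # a vs b
other_triplet = norm(deep[1, :] - deep[4, :])  # a vs d
println("a-b distance: ", round(same_triplet, digits=3))
println("a-d distance: ", round(other_triplet, digits=3))
```

In this sketch the sign of component 2 separates a-i from j-r, whereas components 5-8 place taxa of the same triplet at the same point and taxa of different triplets apart. The overall sign of a singular vector is arbitrary, so only the contrast between the two clades is meaningful.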
spectralcorrs = map([i:(i+2) for i in 1:(size(M,2)-2)]) do window
    spectralcorrelations(usv.U, window)
end;
F1mask = kron([1 0; 0 1], ones(3,3), ones(3,3))
F2mask = kron(Diagonal(ones(6)), ones(3,3));
uppertriangle = triu(trues(18, 18), 1);
f1mi = map(spectralcorrs) do spcorr
    empiricalMI(spcorr[uppertriangle], (F1mask .== 1)[uppertriangle]) # edges=-1:0.001:1
end
f2mi = map(spectralcorrs) do spcorr
    empiricalMI(spcorr[uppertriangle], (F2mask .== 1)[uppertriangle]) # edges=-1:0.001:1
end
plot(
    ylabel="Cumulative MI\n (density)",
    xlabel="Principal component Window\n [Principal component start to end]",
    yticks=[0.0, .5, 1.0],
    xticks=(2:8, ["[$i to $(i+2)]" for i in 1:7]),
    xrotation=45,
    margin=5Plots.mm,
)
plot!(scaledcumsum(vcat(0, f1mi)), c=:red, marker=true, label="F1", lw=2,)
plot!(scaledcumsum(vcat(0, f2mi)), c=:orange, marker=true, label="F2", lw=2,)
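To make the two masks concrete, the sketch below rebuilds them and counts how many taxon pairs each one marks as belonging to the same clade. It uses only `LinearAlgebra`; the counting variables `npairs`, `nF1`, and `nF2` are illustrative and not part of the analysis above.

```julia
using LinearAlgebra

# F1: two clades of nine taxa (a-i and j-r), the first-generation split.
F1mask = kron([1 0; 0 1], ones(3, 3), ones(3, 3))
# F2: six clades of three taxa, the second-generation split.
F2mask = kron(Diagonal(ones(6)), ones(3, 3))
# Each unordered pair of distinct taxa appears once above the diagonal.
uppertriangle = triu(trues(18, 18), 1)

npairs = count(uppertriangle)
nF1 = count((F1mask .== 1)[uppertriangle])
nF2 = count((F2mask .== 1)[uppertriangle])

println("taxon pairs: ", npairs)
println("pairs in the same F1 clade: ", nF1)
println("pairs in the same F2 clade: ", nF2)
```

Of the 153 taxon pairs, 72 share an F1 clade and 18 share an F2 clade, so the mutual information in each window is measured against two quite differently sized label sets.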
Fig. S5 - Timing benchmark
Spectral Inference is fast because it is based on PCA, so it has the potential to scale to large datasets: thousands of taxa on a laptop, and more using clusters and distributed computing.
Here we only show that the runtime of Spectral Inference (SPI) does not grow exponentially with the number of taxa.
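As a rough illustration of how such scaling can be checked, the sketch below times only the SVD step on random binary matrices and fits a log-log slope. It is not the SPI benchmark itself: the matrix sizes, the random data, and the names `ntaxa`, `nfeatures`, `times`, and `slope` are all assumptions made for this example.

```julia
using LinearAlgebra, Random

Random.seed!(1)
nfeatures = 200                    # hypothetical number of features
ntaxa = [200, 400, 800, 1600]      # hypothetical numbers of taxa

svd(rand(10, 10))                  # warm-up so compilation is not timed

# Best of three runs per size to reduce timing noise.
times = map(ntaxa) do n
    A = Float64.(rand(Bool, n, nfeatures))
    minimum(@elapsed(svd(A)) for _ in 1:3)
end

# Least-squares fit of log(time) against log(n). A modest fitted exponent
# is consistent with polynomial rather than exponential growth.
X = [ones(length(ntaxa)) log.(ntaxa)]
intercept, slope = X \ log.(times)
println("empirical scaling exponent ≈ ", round(slope, digits=2))
```

Timings depend on hardware and BLAS threading, so the fitted exponent should be read as indicative only.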
## submit job to cluster to run all phylogenetic inference tasks in parallel
# run(`sbatch $(projectdir("scripts", "slurm-run-PI-on-UPsubsets.sbatch"))`)