SPI

Documentation for SPI.

SPI.UPGMA_treeMethod
UPGMA_tree(Dij::AbstractMatrix{<:Number})

shorthand for Clustering.hclust(Dij, linkage=:average, branchorder=:optimal)

source
SPI.adjustedrandindexMethod
adjustedrandindex(a::AbstractVector{<:Number}, b::AbstractVector{<:Number}; nbins=50)

Args:

  • a, vector of numbers
  • b, vector of numbers
  • nbins, for continuous approximates discrete, for discrete choose nbins>maxnumberof_classes
source
SPI.calc_spcorr_mtxMethod
calc_spcorr_mtx(vecs::AbstractMatrix{<:Number}, window)
calc_spcorr_mtx(vecs::AbstractMatrix{<:Number}, vals::AbstractVector{<:Number}, window)

Calculates pairwise spectral (pearson) correlations for a set of observations.

Args:

  • vecs, set of left singular vectors or principal components with observations on rows and components on columns
  • vals, vector of singular values
  • window, set of indices of vecs columns to compute correlations across

Returns:

  • correlation matrix where each pixel is the correlation between a pair of observations
source
SPI.calc_spi_mtxMethod
calc_spi_mtx(A::AbstractMatrix; [Nsmps=size(A,1), Nfeats=size(A,2), alpha=1.0, q=0.5])
calc_spi_mtx(usv::SVD; [Nsmps=size(A,1), Nfeats=size(A,2), alpha=1.0, q=0.5])
calc_spi_mtx(usv::SVD[, SPI.LSVs(); Nsmps=size(A,1), Nfeats=size(A,2), alpha=1.0, q=0.5])
calc_spi_mtx(usv::SVD[, SPI.RSVs(); Nsmps=size(A,1), Nfeats=size(A,2), alpha=1.0, q=0.5])

computes the cumulative spectral residual distance for spectral phylogenetic inference

(∑_{p ∈ P} ||UₚΣₚ||₂)²

where $P$ are the spectral partitions found with getintervals.

Args:

  • A,usv = AbstractMatrix or SVD factorization (AbstractMatrix is just passed to svd() before calculation)
  • SPI.Left() computes SPI matrix for LSVs; SPI.Right() computes SPI matrix for RSVs
  • alpha, q are passed to getintervals() see its documentation

Returns:

  • distance matrix
source
SPI.calc_spi_traceMethod
calc_spi_trace(usv::SVD; onrows=true, groups=nothing, alpha=1.0, q=0.5)
calc_spi_trace(vecs, vals, groups)

calculates spectral residual within each partition of spectrum and each pair of taxa

returns matrix where rows are spectral partitions and columns are taxa:taxa pairs ordered as the upper triangle in rowwise order, or lower triangle in colwise order.

Args:

  • method: calc_spi_trace(vecs, vals, groups)
    • vecs: either usv.U or usv.V matrix
    • vals: usv.S singular values vector
    • groups: usually calculated with getintervals(usv.S; alpha=alpha, q=q)
  • method: calc_spi_trace(usv::SVD; onrows=true, groups=nothing, alpha=1.0, q=0.5)
    • usv: SVD object
    • onrows: switch to calculate SPI on rows (U matrix) or columns (V matrix).
    • groups: if nothing groups are calculated with getintervals(usv.S; alpha=alpha, q=q), otherwise they assume a vector of index ranges [1:1, 2:3, ...] to group usv.S with.
    • alpha: passed to getintervals
    • q: passed to getintervals
source
SPI.calc_spi_treeMethod
calc_spi_tree(A[, ids; labelinternalnodes=true])

helper function that immediately returns the newick tree string inferred by SPI

source
SPI.empiricalMIMethod
empiricalMI(a::AbstractVector{<:Float}, b::AbstractVector{<:Float}[, nbins=50, normalize=false])

computes empirical MI from identity of $H(a) + H(b) - H(a,b)$. where $H := -sum(p(x)*log(p(x))) + log(Δ)$ the $+ log(Δ)$ corresponds to the log binwidth and unbiases the entropy estimate from binwidth choice. estimates are roughly stable from $32$ ($32^2 ≈ 1000$ total bins) to size of sample. going from a small undersestimate to a small overestimate across that range. We recommend choosing the sqrt(mean(1000, samplesize)) for nbins argument, or taking a few estimates across that range and averaging.

Args:

  • a, vecter of length N
  • b, AbstractVector of length N
  • nbins, number of bins per side, use 1000 < nbins^2 < length(a) for best results
  • base, base unit of MI (defaults to nats with base=ℯ)
  • normalize, bool, whether to normalize with mi / mean(ha, hb)

Returns:

  • MI
source
SPI.getintervalsMethod
getintervals(S::AbstractVector{<:Number}; alpha=1.0, q=0.5)

finds spectral partitions. Computes log difference between each subsequent singular value and by default selects the differences that are larger than 1.0 * Q2(differences)

i.e. finds breaks in the spectrum that explain smaller scales of variance

Args:

  • S = singular values of a SVD factorization
  • alpha = scalar multiple of q
  • q = which quantile of log differences to use; by default Q2

Returns:

  • AbstractVector{UnitRange} indices into S corresponding to the spectral partitions
source
SPI.getintervalsIQRMethod
getintervalsIQR(S::AbstractVector{<:Number}; alpha=1.5, ql=.25, qh=.75)

finds spectral partitions. Computes log difference between each subsequent singular value and by default selects the differences that are larger than ALPHA * Q3(differences)

i.e. finds breaks in the spectrum that explain smaller scales of variance

Args:

  • S = singular values of a SVD factorization
  • alpha = scalar multiple of q
  • q = which quantile of log differences to use; by default Q3

Returns:

  • AbstractVector{UnitRange} indices into S corresponding to the spectral partitions
source
SPI.ij2kMethod
ij2k(i,j,n)

with pair (i,j) give index k to the pairs produced by combinations(vec), where vec is length n

source
SPI.k2ijMethod
k2ij(k, n)

which pair (i,j) produces the kth element of combinations(vec), where vec is length n

source
SPI.minspaceneededMethod
minspaceneeded(n, p; bits=64) = Base.format_bytes(binomial(n,2) * p * bits)

how much memory is needed to store spectral residual trace

source
SPI.nwstrMethod
nwstr(hc::Hclust[, tiplabels::AbstractVector[<:String]])

convert Hclust to newick tree string Args:

  • hc, Hclust object from Clustering package
  • tiplabels, AbstractVector{<:String} names in same order as distance matrix
source
SPI.pairwiseMethod
pairwise(func::Function, m::M) where M<:AbstractMatrix

returns upper offdiagonals of res[k] = func(i, j) where (k, (i,j)) are calculated from enumerate(((i, j) for j in axes(m, 2) for i in (j+1):lastindex(m, 2)))

source
SPI.projectinLSVMethod
projectinLSV(data::AbstractArray{T}, usv::SVD{T}, [window])

returns estimated left singular vectors (aka: LSV or Û) for new data based on already calculated SVD factorization

source
SPI.projectinRSVMethod
projectinRSV(data::AbstractArray{T}, usv::SVD{T}, [window])

returns estimated transposed right singular vectors (RSV or V̂ᵗ) for new data based on already calculated SVD factorization

source
SPI.projectoutMethod
projectout(usv::SVD, [window])

recreates original matrix i.e. calculates $UΣV'$ or if window is included creates a spectrally filtered version of the original matrix off of the provided components in window.

source
SPI.readphylipMethod
readphylip(fn::String)

Read phylip alignment file, return dataframe of IDs and Sequences

source
SPI.spimtx_spaceneededMethod
spimtx_spaceneeded(n, p; bits=64) = Base.format_bytes(binomial(n,2) * p * bits)

how much memory is needed to store SPI distance matrix

source
SPI.spiresultMethod
spiresult(A::AbstractMatrix{<:Number})

convenience function for optaining SVD, SPImtx, and SPItree

source
SPI.squareformFunction
squareform(d::AbstractVector, fillvalue=zero(eltype(d)))
squareform(d::AbstractVector, fillvalue=zero(eltype(d)))

If d is a vector, squareform checks if it of n choose 2 length for integer n, then fills the values of a symetric square matrix with the values of d.

If d is a matrix, squareform checks if it is square then fills the values of vector with the lower offdiagonal of matrix d in column order form.

fillvalue is the initial value of the produced vector or matrix. Only really apparant in a produced matrix where it will be the values on the diagonal.

source
SPI.vmeasure_homogeneity_completenessMethod
vmeasure_homogeneity_completeness(labels_true, labels_pred; β=1.)

calculates and returns v-measure, homogeneity, completeness; similar to f-score, precision, and recall respectively

Args:

  • β, weighting term for v-measure, if β is greater than 1 completeness

is weighted more strongly in the calculation, if β is less than 1, homogeneity is weighted more strongly

Citation:

  • A. Rosenberg, J. Hirschberg, in Proceedings of the 2007 Joint Conference

on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (Association for Computational Linguistics, Prague, Czech Republic, 2007; https://aclanthology.org/D07-1043), pp. 410–420.

source