LATENT TAXONOMIC SIGNATURES


Quick tutorial

To avoid false positive results, we suggest that you use single-proteome data (for example from: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes), or protein data that you know comes from a single organism.

Search for the semantically most similar taxa among 147,058 currently indexed organisms
Please limit your FASTA protein sequence input to less than 20MB

Examples of analyses

c_difficile_T14.fasta
coronavirus_2.fasta
igniococcus_hospitalis.fasta
caloramator_fervidus.fasta
ustilago_maydis.fasta

Description

In a sense, LTSs can be regarded as a linguistic approach to the concept of species, based on latent semantic analysis (LSA). LSA is a natural language processing (NLP) method developed to improve the accuracy of information retrieval. It relies on a technique called truncated singular value decomposition (t-SVD) to process the unstructured 3-peptide data of a species and identify relationships between the semantic concepts these 3-peptides stand for.

This particular implementation models species (or, more generally, taxa) as "documents". It uses all naturally occurring 3-peptides (i.e. the "words") extracted from entire sets of proteins as terms, in order to transform taxon proteomes into multi-dimensional LSA vector space representations.
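The "3-peptides as words" idea can be illustrated with a minimal sketch. It assumes that 3-peptides are read as overlapping windows that advance one residue at a time, which the text above does not state explicitly; the function names and toy sequences are hypothetical.

```python
from collections import Counter

def three_peptides(sequence):
    """Yield every overlapping 3-residue window of a protein sequence.

    Assumption: windows slide by one residue; the original pipeline
    may tokenize differently.
    """
    seq = sequence.upper()
    for i in range(len(seq) - 2):
        yield seq[i:i + 3]

def proteome_to_counts(proteins):
    """Count 3-peptide 'words' over all proteins of one taxon (one 'document')."""
    counts = Counter()
    for protein in proteins:
        counts.update(three_peptides(protein))
    return counts

# Hypothetical toy proteome made of two very short sequences.
toy_proteome = ["MKVLA", "MKV"]
counts = proteome_to_counts(toy_proteome)
```

Each taxon then becomes a bag of 3-peptide counts, which is the same shape of input that LSA expects for a text document.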

This current version of Latent Taxonomic Signatures (LTSs) contains information from 147,058 taxa proteomes. We embedded distributional information on their constituent 3-peptide motifs in 400-dimensional vectors and explored their distributional semantic similarity in terms of cosine vector similarity, which we correlated with the established NCBI Taxonomy classification (taxonomy benchmarking was performed as a method of validation). This led us to the discovery of a novel feature that we named Latent Taxonomic Signatures. The name reflects the fact that this feature is both distributed and conserved among completely unrelated proteins (unrelated in terms of alignment-based homology, but related in terms of sharing a proteome). In plain English: any randomly sampled protein set from a given species' proteome can be used as a query against the Latent Taxonomic Signatures of other species, and the closest matching taxon vector from the species matrix will likely share a taxonomic lineage with the query. It is important that the query protein set contains at least 30 proteins (although good taxonomic correlations are possible even with smaller sets), all coming from a single organism and preferably randomly sampled.
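As a rough illustration of the cosine-similarity lookup described above, the sketch below ranks a handful of taxon vectors against a query vector. The 3-dimensional vectors, the taxon names and the helper functions are invented for the example; the real signatures are 400-dimensional.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def closest_taxa(query, taxa_vectors, top_n=3):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in taxa_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical low-dimensional stand-ins for taxon signatures.
taxa_vectors = {
    "taxon_A": [0.9, 0.1, 0.0],
    "taxon_B": [0.1, 0.8, 0.3],
    "taxon_C": [-0.5, 0.2, 0.7],
}
query = [0.8, 0.2, 0.1]
ranking = closest_taxa(query, taxa_vectors)
```

The first entry of `ranking` is the closest matching taxon vector, which is the candidate expected to share lineage with the query.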

To post a query against the 147,058 taxa vectors that currently inhabit the LSA vector space, just upload a set of randomly sampled proteins (more than 30) belonging to one organism in simple multi-FASTA format.
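Preparing such a query could look like the sketch below, which reads multi-FASTA text and draws a random subset of proteins. The parser is deliberately minimal, and the function names, the example records and the default of 30 proteins are illustrative assumptions rather than part of the service.

```python
import random

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def sample_proteins(records, k=30, seed=None):
    """Randomly pick k records; return all of them if fewer than k exist."""
    rng = random.Random(seed)
    if len(records) <= k:
        return list(records)
    return rng.sample(records, k)

# Hypothetical two-protein input.
example = """>protA hypothetical protein
MKVLAAGIV
GGS
>protB hypothetical protein
MSTNPKPQR
"""
records = parse_fasta(example)
subset = sample_proteins(records, k=30, seed=0)
```

Passing a fixed `seed` makes the sample reproducible, which helps when comparing results across repeated queries.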

Biologists and taxonomists have made many attempts to define species, beginning from morphology and moving towards genetics. Early taxonomists such as Linnaeus had no option but to describe what they saw: this was later formalised as the typological or morphological species concept. Ernst Mayr emphasised reproductive isolation, but this, like other species concepts, is hard or even impossible to test. Later biologists have tried to refine Mayr's definition with the recognition and cohesion concepts, among others. Many of the concepts are quite similar or overlap, so they are not easy to count: the biologist R. L. Mayden recorded about 24 concepts, and the philosopher of science John Wilkins counted 26. Wilkins further grouped the species concepts into seven basic kinds of concepts:

  1. agamospecies for asexual organisms
  2. biospecies for reproductively isolated sexual organisms
  3. ecospecies based on ecological niches
  4. evolutionary species based on lineage
  5. genetic species based on gene pool
  6. morphospecies based on form or phenotype and
  7. taxonomic species, a species as determined by a taxonomist.

LTSs represent a new species concept, based on the distributional hypothesis of semantics and latent semantic analysis (LSA, a.k.a. latent semantic indexing, LSI). LSA is a mathematical method developed to improve the accuracy of information retrieval (Deerwester et al. 1990, "Indexing by Latent Semantic Analysis"). It relies on a technique called singular value decomposition to process unstructured data in documents and identify relationships between the concepts contained within. Essentially, it finds hidden (latent) relationships between words (semantic) in order to improve information understanding (analysis).

LSA relies on a term-document matrix, which describes the occurrences of terms in documents. This matrix is usually very sparse, with rows corresponding to terms and columns corresponding to documents. In this case, documents are replaced with species proteomes and terms with all occurring 3-peptides. These 3-peptides are naturally occurring combinations of 3 successive amino acids within proteins, which together make up complete or partial proteomes. All protein sequence data used herein comes from the NCBI "nr" database (https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/).

In this current version of Latent Taxonomic Signatures, LSA was applied to proteomes coming from 147,058 different taxa, in particular from: 67,022 bacteria, 74,241 viruses, 3,876 eukaryotes and 1,919 archaea. Words were substituted with 3-peptides and latent relationships identified, which allows randomly selected sets of proteins from any given organism to be used to reconstruct taxonomic relations. In a way, we could say that this represents an eighth concept of species: a linguistic species, determined by comparison of 3-peptide signature proteome vectors in an n-dimensional space. Essentially, by comparing randomly selected sets of proteins representing different species proteomes, LTSs enable the establishment of taxonomic relationships (although as few as 5 randomly chosen proteins can give meaningful results, it is advisable to use more than 30). This is achieved by transforming protein sequence data into vector representations (a process called "embedding", carried out by a procedure called "folding-in"), which can be compared pairwise by cosine similarity.
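The pipeline described above (term-document matrix, truncated SVD, folding-in, cosine comparison) can be sketched on a toy scale with NumPy. This is an illustration under stated assumptions, not the site's implementation, which relies on Gensim: the 5 x 3 count matrix, the taxon names, the choice of k = 2 latent dimensions and the helper names are all invented, and folding-in is written in its textbook form, multiplying the query counts by the truncated term matrix and dividing by the singular values.

```python
import numpy as np

# Toy term-document matrix: rows are 3-peptides (terms),
# columns are taxa (documents). All counts are hypothetical.
terms = ["MKV", "KVL", "VLA", "GGS", "GSG"]
taxa = ["taxon_A", "taxon_B", "taxon_C"]
X = np.array([
    [5, 4, 0],
    [3, 4, 0],
    [4, 3, 1],
    [0, 1, 6],
    [0, 0, 5],
], dtype=float)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
taxa_vectors = Vt[:k, :].T  # one k-dimensional vector per taxon

def fold_in(term_counts):
    """Project a new term-count vector into the existing latent space."""
    return (term_counts @ U_k) / s_k

def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A query built from a few proteins, expressed as counts over the same terms.
query_counts = np.array([2, 1, 2, 0, 0], dtype=float)
query_vec = fold_in(query_counts)
scores = {name: cosine(query_vec, vec) for name, vec in zip(taxa, taxa_vectors)}
best = max(scores, key=scores.get)
```

Folding-in places the query in the latent space without recomputing the decomposition, which is what makes querying a large, precomputed set of taxa vectors practical. Folding in a column of `X` itself reproduces that taxon's own vector.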


*Source: https://en.wikipedia.org/wiki/Species

*This website relies on Gensim for LSA - https://radimrehurek.com/gensim/