LATENT TAXONOMIC SIGNATURES
This is a taxonomic information retrieval system built on a newly discovered feature of proteomes that we call Latent Taxonomic Signatures (LTSs). An LTS is taxonomy-related information shared across the proteins of an organism; it can be extracted with natural language processing methods from any combination of randomly sampled proteins belonging to that organism.
- Upload only proteins that belong to the same species.
- Upload only "randomly selected proteins", meaning proteins that belong to various protein families.
- The system extracts all 3-peptides from the uploaded proteins and uses them to construct a query vector.
- To get the most out of the system, run several queries with different random proteome samples of your organism.
- Upload protein sequences of a single organism only, in multi-FASTA format (example).
- After the similarity search, the 100 most similar taxa (based on LTSs) are displayed.
- In the results table, you can select any number of subject taxa and visually compare their signatures with your query signature. Signatures are displayed based on the underlying 3-peptide frequencies weighted by TF-IDF.
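As a rough illustration of the 3-peptide extraction step in the list above, the sketch below parses a multi-FASTA string and counts every overlapping window of three residues. It assumes that "3-peptides" means overlapping windows and that sequences are pooled across all uploaded proteins; the function names `parse_fasta` and `tripeptide_counts` and the two toy sequences are hypothetical and not taken from the system itself.

```python
from collections import Counter


def parse_fasta(text):
    """Return a dict mapping each FASTA header to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def tripeptide_counts(sequences):
    """Count every overlapping 3-residue window (assumed definition of a 3-peptide)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    return counts


# Toy input: two made-up sequences, for illustration only.
example = """>protein_A hypothetical
MKTAYIAK
>protein_B hypothetical
MKTLLV
"""
proteins = parse_fasta(example)
query_counts = tripeptide_counts(proteins.values())
```

The resulting counter can be read as the raw (unweighted) term frequencies of a query; any weighting, such as TF-IDF, would be applied afterwards.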
Because LTSs may be sensitive to "noise" coming from the proteomes of other species, we suggest using single-proteome data (for example from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes).
Search for the most similar organism among 78,643 taxa signatures
Examples of analyses
This is a new, more mathematical concept of species, based on Latent semantic analysis – LSA. LSA is a natural language processing method developed to improve the accuracy of information retrieval. It relies on a technique called singular value decomposition in order to process unstructured data in documents and identify relationships between the concepts contained within.
This mathematical model of species (or, more generally, taxa) uses all naturally occurring 3-peptides extracted from the proteins within proteomes as terms, in order to transform taxa proteomes into LSA vector space representations.
Latent Taxonomic Signatures contain information from 78,643 taxa proteomes. We have embedded distributional information on their constituent 3-peptide motifs in 400-dimensional vectors and explored their distributional semantic similarity in terms of cosine vector similarity, which we correlated with established taxonomic classification (benchmarking). This has led us to the discovery of a novel feature that we named Latent Taxonomic Signatures. The name reflects the fact that the feature is both distributed and conserved among completely unrelated proteins (unrelated in terms of alignment-based homology, but related in terms of sharing a proteome). In plain English: any randomly sampled protein set of sufficient size, from any given species proteome, can be used as a query against the Latent Taxonomic Signatures of other species, and the closest matching taxon vector from the species matrix will likely share taxonomic lineage with the query. It is important that the query protein set contains at least 30 proteins, all coming from a single organism and preferably randomly sampled.
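The cosine-similarity lookup described above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the three-dimensional toy vectors stand in for the 400-dimensional taxa vectors, and the names `cosine_similarity`, `rank_taxa` and `toy_taxa` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_taxa(query, taxa_vectors, top_n=100):
    """Return the top_n (name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in taxa_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy stand-ins for taxa vectors; real vectors would have 400 dimensions.
toy_taxa = {
    "taxon_A": [1.0, 0.0, 0.0],
    "taxon_B": [0.0, 1.0, 0.0],
    "taxon_C": [1.0, 1.0, 0.0],
}
ranking = rank_taxa([2.0, 0.1, 0.0], toy_taxa, top_n=2)
```

Because cosine similarity ignores vector length, a query built from a small protein sample can still be compared with a vector built from a whole proteome; only the direction of the vectors matters.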
To submit a query against the 78,643 taxa vectors that currently inhabit the LSA vector space, upload a set of randomly sampled proteins (more than 30) belonging to one organism in simple multi-FASTA format.
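One way to prepare such a query file is to draw proteins at random, without replacement, from a downloaded proteome and write them back out as multi-FASTA. The sketch below assumes the proteome has already been read into a list of (header, sequence) pairs; `sample_proteins`, `to_multi_fasta` and the synthetic `proteome` list are hypothetical names and data used only for illustration.

```python
import random


def sample_proteins(records, sample_size=31, seed=None):
    """Draw sample_size distinct (header, sequence) records without replacement."""
    if len(records) < sample_size:
        raise ValueError("proteome has fewer records than the requested sample size")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(records, sample_size)


def to_multi_fasta(records, width=60):
    """Serialise (header, sequence) pairs as multi-FASTA text with wrapped lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Synthetic stand-in for a parsed proteome: 200 made-up records of 70 residues each.
proteome = [("protein_%d" % i, "MKTAYIAKQR" * 7) for i in range(200)]
subset = sample_proteins(proteome, sample_size=31, seed=1)
fasta_text = to_multi_fasta(subset)
```

Repeating the draw with different seeds gives the several independent random samples recommended for querying the same organism.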
Biologists and taxonomists have made many attempts to define species, beginning with morphology and moving towards genetics. Early taxonomists such as Linnaeus had no option but to describe what they saw: this was later formalised as the typological or morphological species concept. Ernst Mayr emphasised reproductive isolation, but this, like other species concepts, is hard or even impossible to test. Later biologists have tried to refine Mayr's definition with the recognition and cohesion concepts, among others. Many of the concepts are quite similar or overlap, so they are not easy to count: the biologist R. L. Mayden recorded about 24 concepts, and the philosopher of science John Wilkins counted 26. Wilkins further grouped the species concepts into seven basic kinds of concepts:
- agamospecies for asexual organisms
- biospecies for reproductively isolated sexual organisms
- ecospecies based on ecological niches
- evolutionary species based on lineage
- genetic species based on gene pool
- morphospecies based on form or phenotype and
- taxonomic species, a species as determined by a taxonomist.
This is a new species concept, based on the distributional hypothesis of semantics and latent semantic analysis (LSA, also known as latent semantic indexing, LSI). LSA is a mathematical method developed to improve the accuracy of information retrieval (Deerwester et al. 1990, "Indexing by Latent Semantic Analysis"). It relies on a technique called singular value decomposition to process unstructured data in documents and identify relationships between the concepts they contain. Essentially, it finds hidden (latent) relationships between words (semantic) in order to improve the understanding of information (analysis).
LSA relies on a term-document matrix, which describes the occurrences of terms in documents. This matrix is usually very sparse, with rows corresponding to terms and columns corresponding to documents. In our case, documents were replaced with taxa and terms with all occurring 3-peptides. These 3-peptides are naturally occurring combinations of 3 successive amino acids in the proteins that make up complete or partial proteomes. All protein sequence data comes from the NCBI “nr” database (https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/). In the current version of Latent Taxonomic Signatures, LSA was applied to proteomes coming from 78,643 species. With 3-peptides as terms, LSA identified latent relationships that allow us to use randomly selected subsets of proteins from any given organism to reconstruct taxonomic relations. In a way, we could say that this represents an eighth concept of species: a mathematical species, determined by comparison of proteome vectors in n-dimensional space. Essentially, by comparing randomly selected sets of proteins representing different species proteomes, we can establish taxonomic relationships (although as few as 5 randomly chosen proteins can give meaningful results, it is advisable to use more than 30). This is achieved by transforming protein sequence data into vector representations (a process called “embedding”), which can then be compared pairwise by cosine similarity.
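The pipeline described above (term-document matrix, singular value decomposition, embedding, cosine comparison) can be sketched on a toy matrix with NumPy. This is an illustration under stated assumptions rather than the system's actual code: the smoothed TF-IDF variant, the "folding-in" projection of a query onto the left singular vectors, the two latent dimensions (the real vectors have 400), and all function names and counts are hypothetical.

```python
import numpy as np


def tfidf(counts):
    """Weight a terms-by-taxa count matrix by a smoothed IDF (assumed variant)."""
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    return counts * idf[:, np.newaxis], idf


def fit_lsa(weighted, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    u_k, s_k = u[:, :k], s[:k]
    doc_vectors = vt[:k, :].T * s_k  # one k-dimensional row per taxon
    return u_k, doc_vectors


def embed_query(query_counts, idf, u_k):
    """Project a new 3-peptide count vector into the same k-dimensional space."""
    return (query_counts * idf) @ u_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy counts: 5 hypothetical 3-peptides (rows) in 3 hypothetical taxa (columns).
counts = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 4.0, 0.0],
    [0.0, 1.0, 6.0],
    [0.0, 0.0, 5.0],
    [2.0, 2.0, 2.0],
])
weighted, idf = tfidf(counts)
u_k, taxa_vectors = fit_lsa(weighted, k=2)

# A query whose 3-peptide profile resembles the first two taxa.
query_vec = embed_query(np.array([4.0, 3.0, 0.0, 0.0, 2.0]), idf, u_k)
similarities = [cosine(query_vec, t) for t in taxa_vectors]
```

Projecting the query with the same left singular vectors that define the taxa vectors keeps both in one coordinate system, so a query identical to a taxon's column lands exactly on that taxon's vector.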