LATENT TAXONOMIC SIGNATURES
Intro
This is a taxonomic information retrieval system based on a newly discovered semantic feature of species proteomes. Latent Taxonomic Signatures (LTSs) represent a semantic similarity between the 3-peptides that have a greater weight for a given species proteome, in comparison to all other species proteomes.
Quick tutorial
- You should upload only proteins which belong to the same species
- LTSs are best identified from a random selection of proteins, that is, proteins of one species that belong to various protein families and share little or no alignment-based homology.
- The app extracts all 3-peptides from the uploaded proteins (a process called "tokenization") and uses them to construct a query vector by a process called "folding-in".
- To get the most out of the app, perform several queries with different randomly selected protein sets from your query taxon.
- Use only protein sequences that are part of the same species, in a multi FASTA format (example)
- After the similarity search, the 100 most similar taxon LTSs are displayed.
- In the results table, you can select any number of subject taxa and visually compare their signatures with your query signature. Signatures are displayed based on the underlying 3-peptide frequencies, weighted by TF-IDF.
- You can also create an interactive display of the retrieved taxa to better explore semantic links to other taxa. Krona is used for this purpose (Ondov BD, Bergman NH, and Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011 Sep 30; 12(1):385.)
To reduce false positive results, we suggest using single-proteome data (for example from: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes), or protein data that you know comes from the same organism.
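The tokenization step mentioned in the tutorial can be illustrated with a minimal sketch. It assumes that 3-peptides are read with an overlapping sliding window (step of one residue) and pooled over all uploaded proteins; the function names `read_fasta`, `tripeptides` and `tokenize` are hypothetical and are not taken from the app itself.

```python
from collections import Counter


def read_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line.upper())
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def tripeptides(sequence):
    """Return all overlapping 3-peptides (assumed sliding window, step 1)."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def tokenize(fasta_text):
    """Count 3-peptides pooled over every protein in the upload."""
    counts = Counter()
    for _header, seq in read_fasta(fasta_text):
        counts.update(tripeptides(seq))
    return counts


# Two made-up toy sequences, for illustration only.
example = """>protein_1 hypothetical
MKVLA
AG
>protein_2 hypothetical
MKVL
"""
counts = tokenize(example)
print(counts.most_common(2))
```

The resulting counts are the raw term frequencies of the query; in an LSA pipeline they would typically be weighted (for example by TF-IDF) before being folded into the vector space.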
Search for the semantically most similar taxa among 147,058 currently indexed organisms
Examples of analyses
c_difficile_T14.fasta | Show results |
coronavirus_2.fasta | Show results |
igniococcus_hospitalis.fasta | Show results |
caloramator_fervidus.fasta | Show results |
ustilago_maydis.fasta | Show results |
Description
In a sense, LTSs can be regarded as a linguistic approach to the concept of species, based on latent semantic analysis (LSA). LSA is a natural language processing (NLP) method developed to improve the accuracy of information retrieval. It relies on a technique called truncated singular value decomposition (t-SVD) to process unstructured 3-peptide data in species and identify relationships between the semantic concepts these 3-peptides stand for.
This particular implementation models species (or more generally taxa) as "documents", using all naturally occurring 3-peptides (i.e. the "words") extracted from entire sets of proteins as terms, in order to transform taxa proteomes into multi-dimensional LSA vector space representations.
The current version of Latent Taxonomic Signatures (LTSs) contains information from 147,058 taxa proteomes. We embedded distributional information on their constituent 3-peptide motifs in 400-dimensional vectors and measured their distributional semantic similarity as cosine similarity between vectors, which we then correlated with the established NCBI Taxonomy classification (taxonomy benchmarking served as the method of validation). This led us to the discovery of a novel feature that we named Latent Taxonomic Signatures. The name reflects the fact that the feature is both distributed and conserved among completely unrelated proteins (unrelated in terms of alignment-based homology, but related in that they share a proteome). In plain English: any randomly sampled protein set from a given species proteome can be used as a query against the Latent Taxonomic Signatures of other species, and the closest matching taxon vector in the species matrix will likely share taxonomic lineage with the query. The query protein set should contain at least 30 proteins (although good taxonomic correlations are possible even with smaller sets), all coming from a single organism and preferably randomly sampled.
To post a query against the 147,058 taxa vectors that currently populate the LSA vector space, upload a set of randomly sampled proteins (more than 30) belonging to one organism in a simple multi-FASTA format.
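The retrieval step described above, ranking taxa vectors by cosine similarity to a query vector, can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional vectors and the taxon names are made up (the real vectors are 400-dimensional), and `cosine` and `rank_taxa` are hypothetical helper names, not part of the app.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_taxa(query, taxa_vectors, top_n=100):
    """Return the top_n (taxon, similarity) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in taxa_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for illustration only.
taxa_vectors = {
    "taxon_A": [0.9, 0.1, 0.0],
    "taxon_B": [0.0, 1.0, 0.2],
    "taxon_C": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_taxa(query, taxa_vectors, top_n=2)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, a small query set and a full proteome can be compared even though their raw 3-peptide counts differ greatly in magnitude.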
Biologists and taxonomists have made many attempts to define species, beginning with morphology and moving towards genetics. Early taxonomists such as Linnaeus had no option but to describe what they saw: this was later formalised as the typological or morphological species concept. Ernst Mayr emphasised reproductive isolation, but this, like other species concepts, is hard or even impossible to test. Later biologists have tried to refine Mayr's definition with the recognition and cohesion concepts, among others. Many of the concepts are quite similar or overlap, so they are not easy to count: the biologist R. L. Mayden recorded about 24 concepts, and the philosopher of science John Wilkins counted 26. Wilkins further grouped the species concepts into seven basic kinds:
- agamospecies for asexual organisms
- biospecies for reproductively isolated sexual organisms
- ecospecies based on ecological niches
- evolutionary species based on lineage
- genetic species based on gene pool
- morphospecies based on form or phenotype and
- taxonomic species, a species as determined by a taxonomist.
LTSs offer a new species concept, based on the distributional semantics hypothesis and latent semantic analysis (LSA, also known as latent semantic indexing, LSI). LSA is a mathematical method developed to improve the accuracy of information retrieval (Deerwester et al. 1990, "Indexing by Latent Semantic Analysis"). It relies on a technique called singular value decomposition to process unstructured data in documents and identify relationships between the concepts contained within. Essentially, it finds hidden (latent) relationships between words (semantic) in order to improve information understanding (analysis).
LSA relies on a term-document matrix, which describes the occurrences of terms in documents. This matrix is usually very sparse, with rows corresponding to terms and columns corresponding to documents. Here, documents are replaced with species proteomes and terms with all occurring 3-peptides. These 3-peptides are naturally occurring combinations of 3 successive amino acids in proteins, which together make up complete or partial proteomes. All protein sequence data used here come from the NCBI “nr” database (https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/). In the current version of Latent Taxonomic Signatures, LSA was applied to proteomes from 147,058 different taxa: 67,022 bacteria, 74,241 viruses, 3,876 eukaryotes and 1,919 archaea. With words substituted by 3-peptides and latent relationships identified, randomly selected sets of proteins from any given organism can be used to reconstruct taxonomic relations. In a way, this represents an eighth concept of species: a linguistic species, determined by comparing 3-peptide signature proteome vectors in an n-dimensional space. Essentially, by comparing randomly selected sets of proteins representing different species proteomes, LTSs enable taxonomic relationships to be established (as few as 5 randomly chosen proteins can give meaningful results, but more than 30 are advisable). This is achieved by transforming protein sequence data into vector representations (a process called “embedding”, using the "folding-in" procedure), which can be compared pairwise by cosine similarity.
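For readers who prefer notation, the pipeline described above can be written in the textbook LSA form. This is the standard formulation from the LSA literature, not a statement of this site's exact implementation: in particular, the symbols are introduced here for illustration, the weighting of the matrix before decomposition is an assumption, and software packages differ on whether the folded-in query is scaled by the inverse singular values.

```latex
% A: m x n term-document matrix (rows: 3-peptides, columns: taxa proteomes),
%    assumed here to be weighted, e.g. by TF-IDF
% k: number of retained dimensions (400 in the current version)
\begin{align}
  A &\approx A_k = U_k \Sigma_k V_k^{\top}
    && \text{truncated SVD} \\
  \hat{q} &= \Sigma_k^{-1} U_k^{\top} q
    && \text{folding-in of a query term vector } q \in \mathbb{R}^{m} \\
  \operatorname{sim}(\hat{q}, \hat{d}_j) &=
    \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
    && \text{cosine similarity to taxon vector } \hat{d}_j
\end{align}
```

With this convention, folding in the j-th column of A returns the j-th row of V_k, so indexed taxa and new queries live in the same k-dimensional space and can be compared directly.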
*Source: https://en.wikipedia.org/wiki/Species
*This website relies on Gensim for LSA - https://radimrehurek.com/gensim/