Low-dimensional representation of genomic sequences
Numerous data analysis and data mining techniques require that data be embedded in a Euclidean space. When faced with symbolic datasets, particularly biological sequence data produced by high-throughput sequencing assays, conventional embedding approaches such as binary and k-mer count vectors may be too high-dimensional or too coarse-grained to learn from the data effectively. Other representation techniques, such as Multidimensional Scaling (MDS) and Node2Vec, may be inadequate for large datasets because they require recomputing the full embedding from scratch when faced with new, unclassified data. To overcome these issues we amend the graph-theoretic notion of “metric dimension” to that of “multilateration.” Much as trilateration can be used to represent points in the Euclidean plane by their distances to three non-collinear points, multilateration allows us to represent any node in a graph by its distances to a subset of nodes. Unfortunately, the problem of determining a minimal such subset, and hence the lowest-dimensional embedding, is NP-complete for general graphs. However, by specializing to Hamming graphs, which are particularly well suited to representing biological sequences, we can readily generate low-dimensional embeddings that map sequences of arbitrary length to a real space. As a proof of concept, we use MDS, Node2Vec, and multilateration-based embeddings to classify DNA 20-mers centered at intron–exon boundaries. Although these techniques perform comparably, MDS and Node2Vec potentially suffer from scalability issues with increasing sequence length, whereas multilateration provides an efficient means of mapping long genomic sequences.
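To make the idea of multilateration concrete, the following minimal Python sketch embeds DNA k-mers as vectors of Hamming distances to a set of landmark sequences and checks by brute force whether that set distinguishes every k-mer. It is an illustration under stated assumptions, not the authors' construction: the function names (`hamming`, `multilaterate`, `is_resolving`, `greedy_resolving_set`) are hypothetical, and the greedy search is a simple stand-in that returns a valid but not necessarily minimal landmark set, and is only practical for small k.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def multilaterate(seq, landmarks):
    """Embed seq as the tuple of its Hamming distances to each landmark."""
    return tuple(hamming(seq, r) for r in landmarks)


def is_resolving(landmarks, k, alphabet="ACGT"):
    """Brute-force check: does every k-mer get a distinct distance vector?"""
    vectors = {
        multilaterate("".join(p), landmarks)
        for p in product(alphabet, repeat=k)
    }
    return len(vectors) == len(alphabet) ** k


def _refine(classes, candidate):
    """Split each class of still-indistinguishable k-mers by distance to candidate."""
    refined = []
    for cls in classes:
        buckets = {}
        for v in cls:
            buckets.setdefault(hamming(v, candidate), []).append(v)
        refined.extend(buckets.values())
    return refined


def greedy_resolving_set(k, alphabet="ACGT"):
    """Illustrative greedy heuristic (an assumption, not the paper's method).

    Repeatedly adds the k-mer that separates the most classes until all
    k-mers are distinguished. The result is resolving but may not be minimal.
    """
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    classes, landmarks = [nodes], []
    while len(classes) < len(nodes):
        best = max(nodes, key=lambda c: len(_refine(classes, c)))
        classes = _refine(classes, best)
        landmarks.append(best)
    return landmarks


if __name__ == "__main__":
    # For k = 1 the Hamming graph is complete on {A, C, G, T}:
    # three landmarks suffice, two do not.
    print(multilaterate("T", ["A", "C", "G"]))  # (1, 1, 1)
    print(is_resolving(["A", "C"], 1))          # False

    landmarks = greedy_resolving_set(3)
    print(is_resolving(landmarks, 3))           # True
```

Once a landmark set is fixed, embedding a previously unseen sequence only requires computing its distances to the landmarks, which is the property that distinguishes this approach from methods that must recompute the whole embedding.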
Keywords: Feature extraction, Graph embeddings, Hamming graph, Metric dimension, Reads, Resolving set
The authors thank the reviewers for their very insightful comments on the original version of this manuscript. This research was partially funded by NSF IGERT Grant 1144807 and NSF IIS Grant 1836914. The authors acknowledge the BioFrontiers Computing Core at the University of Colorado–Boulder for providing High-Performance Computing resources (funded by National Institutes of Health 1S10OD012300), supported by the BioFrontiers IT group.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.