Implementing a Transcription Factor Interaction Prediction System Using the GenoMetric Query Language

Perna, Stefano; Canakoglu, Arif; Pinoli, Pietro; Ceri, Stefano; Wong, Limsoon

doi:10.1007/978-1-4939-8561-6_6

Protocol
First Online: 21 July 2018

Part of the book series: Methods in Molecular Biology (MIMB, volume 1807)

Abstract

Novel technologies and growing interest have greatly increased the volume and variety of data available for genomics and transcriptomics studies. Biology increasingly relies on computational methods to process, investigate, and extract knowledge from these data. In this work, we present the TICA web server (available at http://www.gmql.eu/tica/), a fast and compact tool developed to support data-driven knowledge discovery in transcription factor interaction prediction. To infer physically interacting transcription factors, TICA leverages both the GenoMetric Query Language, a novel query tool (based on the Apache Hadoop and Spark technologies) specialized in the integration and management of large, heterogeneous genomic datasets, and a statistical method for robust detection of co-locations across interval-based data. Notably, TICA allows investigators to upload and analyze their own ChIP-seq datasets, comparing them either against ENCODE data or against each other, with computation time that increases linearly with dataset size and density. Using ENCODE data from three well-studied cell lines as reference, we show that TICA predictions are supported by existing biological knowledge, making the web server a reliable and efficient tool for interaction screening and data-driven hypothesis generation.

Notes

1. The schema for ENCODE narrowPeak data files is defined at https://genome.ucsc.edu/FAQ/FAQformat.html#format12.
2. The full description of the GMQL language, in its latest version (2.1 at the time of writing), can be found at http://www.bioinformatics.deib.polimi.it/geco/?try.
3. ENCODE narrowPeak files are also provided for ChIP-seq experiments targeting histone modifications. We remove them from the dataset by means of NOT clauses, omitted here for brevity.
4. These are nominal values for promoter and exon length, chosen for our experiments. Other investigators can use their own values for the extension of regulatory regions, depending on their biological assumptions.
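
As an illustration of the narrowPeak schema referenced in Note 1, the following is a minimal Python sketch of a reader for the ten-column layout documented by UCSC. It is not part of TICA or GMQL: the class name, the function name, and the sample records are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class NarrowPeak:
    """One record of the ten-column narrowPeak (BED6+4) layout."""
    chrom: str
    start: int           # 0-based start of the region
    end: int             # end of the region (exclusive)
    name: str
    score: int           # 0-1000
    strand: str          # "+", "-" or "."
    signal_value: float
    p_value: float       # -log10 scale, -1 if not assigned
    q_value: float       # -log10 scale, -1 if not assigned
    peak: int            # summit offset from start, -1 if not called

    @property
    def summit(self) -> Optional[int]:
        """Absolute summit coordinate, or None when no summit was called."""
        return None if self.peak < 0 else self.start + self.peak


def parse_narrowpeak(lines: Iterable[str]) -> Iterator[NarrowPeak]:
    """Yield NarrowPeak records, skipping blank, comment and header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        f = line.split()
        if len(f) != 10:
            raise ValueError(f"expected 10 fields, got {len(f)}: {line!r}")
        yield NarrowPeak(
            f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
            float(f[6]), float(f[7]), float(f[8]), int(f[9]),
        )


# Hypothetical sample records, invented for this example.
SAMPLE = [
    "track type=narrowPeak",
    "chr1\t1000\t1200\tpeak_a\t500\t.\t12.5\t4.2\t-1\t80",
    "chr2\t5000\t5300\tpeak_b\t900\t+\t30.0\t9.9\t7.1\t-1",
]
peaks = list(parse_narrowpeak(SAMPLE))
for p in peaks:
    print(p.chrom, p.start, p.end, p.summit)
```

To read a file on disk, the same function can be applied to an open file handle, since it accepts any iterable of text lines.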

Acknowledgements

This work was supported by the ERC Advanced Grant GeCo (Data-Driven Genomic Computing) (Grant No. 693174) awarded to Prof. Stefano Ceri. We would like to thank members of the GeCo project for helpful insights. Prof. Limsoon Wong was supported in part by a Kwan-Im-Thong-Hood-Cho-Temple chair professorship.

Author information

Authors and Affiliations

DEIB, Politecnico di Milano, Milano, Italy
Stefano Perna, Arif Canakoglu, Pietro Pinoli & Stefano Ceri
School of Computing, National University of Singapore, Singapore, Singapore
Limsoon Wong

Corresponding author

Correspondence to Stefano Perna.

Editor information

Editors and Affiliations

Bioinformatics Center, Kyoto University, Uji, Kyoto, Japan
Hiroshi Mamitsuka

About this protocol

Cite this protocol

Perna, S., Canakoglu, A., Pinoli, P., Ceri, S., Wong, L. (2018). Implementing a Transcription Factor Interaction Prediction System Using the GenoMetric Query Language. In: Mamitsuka, H. (eds) Data Mining for Systems Biology. Methods in Molecular Biology, vol 1807. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8561-6_6

DOI: https://doi.org/10.1007/978-1-4939-8561-6_6
Published: 21 July 2018
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8560-9
Online ISBN: 978-1-4939-8561-6
eBook Packages: Springer Protocols
