ReadFast: Structural Information Retrieval from Biomedical Big Text by Natural Language Processing

Gubanov, Michael; Shapiro, Linda; Pyayt, Anna

doi:10.1007/978-3-7091-1538-1_9


Abstract

While the problem of finding needed information on the Web is being solved by the major search engines, access to information in Big Text, that is, large-scale text datasets and documents (biomedical literature, e-books, conference proceedings, etc.), is still very rudimentary (Lin and Cohen 2010). Keyword search is therefore often the only way to find the needle in the haystack. The Semantic Web research community has produced an abundance of relevant results that offer more robust access interfaces than keyword search. Here we describe a new information retrieval engine that offers an advanced user experience by combining keyword search with navigation over an automatically inferred hierarchical document index. The internal representation of the browsing index as a collection of unified famous objects (UFOs) (Gubanov et al. 2009; Gubanov et al. 2011) yields more relevant search results and improves the user experience.
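
The combination of keyword search and index navigation described in the abstract can be illustrated with a toy sketch. This is not the ReadFast implementation: the `IndexNode` class, the `search` and `navigate` functions, and the hand-built index are all hypothetical, and plain substring matching stands in for whatever ranking the engine actually uses. The sketch assumes only that each index entry has a title, some text, and child entries.

```python
from dataclasses import dataclass, field


@dataclass
class IndexNode:
    """One entry of a hypothetical hierarchical document index."""
    title: str
    text: str = ""
    children: list = field(default_factory=list)


def search(node, keyword, path=()):
    """Return the index paths of all nodes under `node` that mention `keyword`."""
    here = path + (node.title,)
    hits = []
    if keyword.lower() in (node.title + " " + node.text).lower():
        hits.append(here)
    for child in node.children:
        hits.extend(search(child, keyword, here))
    return hits


def navigate(node, titles):
    """Follow a sequence of child titles down from `node`; return the node reached."""
    for title in titles:
        node = next(c for c in node.children if c.title == title)
    return node


# Toy index built by hand; the chapter's engine infers its index automatically.
index = IndexNode("Corpus", children=[
    IndexNode("Neurology", children=[
        IndexNode("Alzheimer's disease", "Diagnosis relies on memory tests and imaging."),
        IndexNode("Epilepsy", "Seizure diagnosis relies on EEG."),
    ]),
    IndexNode("Cardiology", children=[
        IndexNode("Arrhythmia", "Diagnosis relies on ECG."),
    ]),
])

# Keyword search over the whole index, then restricted to one branch.
everywhere = search(index, "diagnosis")
in_neurology = search(navigate(index, ["Neurology"]), "diagnosis")
```

Searching the whole toy index for "diagnosis" matches three entries, while navigating to the "Neurology" branch first narrows the same query to two. Each hit is returned with its path in the hierarchy, so a reader can see where in the index a match sits.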

References

Adelberg B (1998) NoDoSE – a tool for semi-automatically extracting structured and semistructured data from text documents. In: SIGMOD record, Seattle
Agichtein E, Gravano L (2000) Snowball: extracting relations from large plain-text collections. In: ACM DL, San Antonio
Agichtein E, Ipeirotis P, Gravano L (2003) Modeling query-based access to text databases. In: WebDB, San Diego
Agichtein E, Brill E, Dumais S (2006) Improving web search ranking by incorporating user behavior information. In: SIGIR, Seattle
Agrawal S, Chaudhuri S, Das G (2002) DBXplorer: a system for keyword-based search over relational databases. In: ICDE, San Jose
Anyanwu K, Maduko A, Sheth A (2007) SPARQ2L: towards support for subgraph extraction queries in RDF databases. In: WWW, Banff
Arocena GO, Mendelzon AO (1998) WebOQL: restructuring documents, databases, and webs. In: ICDE, Orlando
Banko M, Brill E, Dumais S, Lin J (2002) AskMSR: question answering using the worldwide web. In: EMNLP, Philadelphia
Brin S (1998) Extracting patterns and relations from the world wide web. In: EDBT, Valencia
Cai Y, Dong XL, Halevy A, Liu JM, Madhavan J (2005) Personal information management with SEMEX. In: SIGMOD, Baltimore
Califf ME, Mooney RJ (1998) Relational learning of pattern-match rules for information extraction. In: AAAI, Madison
Chakrabarti S (2007) Dynamic personalized PageRank in entity-relation graphs. In: WWW, Banff
Cheng T, Yan X, Chang KCC (2007) EntityRank: searching entities directly and holistically. In: VLDB, Vienna
Crescenzi V, Mecca G (1998) Grammars have exceptions. J Inf Syst (Special issue on Semistructured Data) 23(9):539–565
Crescenzi V, Mecca G, Merialdo P (2001) RoadRunner: towards automatic data extraction from large web sites. In: VLDB, Roma
Crestani F (1997) Application of spreading activation techniques in information retrieval. Artif Intell Rev 11:453
Diederich J, Balke WT, Thaden U (2007) Demonstrating the Semantic GrowBag: automatically creating topic facets for FacetedDBLP. In: JCDL, Vancouver
Dong X, Halevy A (2007) Indexing dataspaces. In: SIGMOD, Beijing
Downey D, Etzioni O, Soderland S, Weld D (2004) Learning text patterns for web information extraction and assessment. In: AAAI, San Jose
Embley DW, Campbell DM, Jiang YS, Liddle SW, Ng YK, Quass D, Smith RD (1999) Conceptual-model-based data extraction from multiple-record web pages. Data Knowl Eng 31:227–251
Etzioni O, Cafarella M, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld D, Yates A (2004) Web-scale information extraction in KnowItAll. In: WWW, Manhattan
Freitag D (1998) Machine learning for information extraction in informal domains. Ph.D. thesis, Carnegie Mellon University
Gubanov M, Shapiro L (2011) Using unified famous objects (UFO) to automate Alzheimer’s disease diagnosis. In: IEEE BIBM, Atlanta
Gubanov MN, Popa L, Ho H, Pirahesh H, Chang P, Chen L (2009) IBM UFO repository. In: VLDB, Lyon
Gubanov M, Shapiro L, Pyayt A (2011) Learning unified famous objects (UFO) to bootstrap information integration. In: IEEE IRI, Las Vegas
Hammer J, McHugh J, Garcia-Molina H (1997) Semistructured data: the TSIMMIS experience. In: Proceedings of the East-European workshop on advances in databases and information systems, St. Petersburg
He H, Wang H, Yang J, Yu PS (2007) BLINKS: ranked keyword searches on graphs. In: SIGMOD, Beijing
Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. Technical report S2K-92-09
Hristidis V, Papakonstantinou Y (2002) DISCOVER: keyword search in relational databases. In: VLDB, Hong Kong
Hsu CN, Dung MT (1998) Generating finite-state transducers for semi-structured data extraction from the web. J Inf Syst (Special issue on Semistructured Data) 23(9):521–538
http://www.infocious.com
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: SIGIR, Athens
Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 60:493–502
Klein D, Manning C (2007) Fast exact inference with a factored model for natural language parsing. In: NIPS, Vancouver
Kushmerick N (2000) Wrapper induction: efficiency and expressiveness. Artif Intell 118:15–68
Laender A, Ribeiro-Neto B, Silva A, Teixeira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, Madison
Laender AHF, Ribeiro-Neto B, da Silva AS (2002) DEByE – data extraction by example. Data Knowl Eng 40(2):121–154
Lin F, Cohen WW (2010) A very fast method for clustering big text datasets. In: ECAI, Lisbon
Liu L, Pu C, Han W (2000) XWRAP: an XML-enabled wrapper construction system for web information sources. In: ICDE, San Diego
Madhavan J, Cohen S, Dong X, Halevy A, Jeffery S, Ko D, Yu C (2007) Navigating the seas of structured web data. In: CIDR, Asilomar
Nie Z, Ma Y, Shi S, Wen JR, Ma WY (2007) Web object retrieval. In: WWW, Banff
Ribeiro-Neto BA, Laender AHF, da Silva AS (1999) Extracting semi-structured data through examples. In: CIKM, Kansas City
Sahuguet A, Azavant F (2001) Building intelligent web applications using lightweight wrappers. Data Knowl Eng 36:283–316
Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18:613–620
Sayyadian M, LeKhac H, Doan A, Gravano L (2007) Efficient keyword search across heterogeneous relational databases. In: ICDE, Istanbul
Sekine S (2006) On-demand information extraction. In: COLING/ACL, Sydney
Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34:233
Udrea O, Getoor L, Miller RJ (2007) Leveraging data and structure in ontology integration. In: SIGMOD, Beijing
Vanderwende L, Kacmarcik G, Suzuki H, Menezes A (2005) MindNet: an automatically-created lexical resource. In: HLT/EMNLP, Vancouver

Author information

Authors and Affiliations

University of Washington, Seattle, WA, USA
Michael Gubanov & Linda Shapiro
Stanford University, Stanford, CA, USA
Anna Pyayt

Corresponding author

Correspondence to Michael Gubanov.

Editor information

Editors and Affiliations

TOBB University Department of Computer Engineering, Sogutozu Ankara, Turkey
Tansel Özyer
Department of Electrical Engineering Thompson Engineering, University of West Ontario, London, Ontario, Canada
Keivan Kianmehr
Tobb Etü Economics and Technology Univer, Ankara, Ankara, Turkey
Mehmet Tan
Baylor College of Medicine, Houston, Texas, USA
Jia Zeng

About this chapter

Cite this chapter

Gubanov, M., Shapiro, L., Pyayt, A. (2013). ReadFast: Structural Information Retrieval from Biomedical Big Text by Natural Language Processing. In: Özyer, T., Kianmehr, K., Tan, M., Zeng, J. (eds) Information Reuse and Integration in Academia and Industry. Springer, Vienna. https://doi.org/10.1007/978-3-7091-1538-1_9

DOI: https://doi.org/10.1007/978-3-7091-1538-1_9
Published: 22 August 2013
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-1537-4
Online ISBN: 978-3-7091-1538-1
eBook Packages: Computer Science, Computer Science (R0)