Textractor: A Framework for Extracting Relevant Domain Concepts from Irregular Corporate Textual Datasets

Ittoo, Ashwin; Maruster, Laura; Wortmann, Hans; Bouma, Gosse

doi:10.1007/978-3-642-12814-1_7

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 47)

Included in the following conference series:

International Conference on Business Information Systems

Abstract

Various information extraction (IE) systems for corporate usage exist. However, none of them targets the product development and/or customer service domain, despite its significant application potential and benefits. This domain also poses new scientific challenges, such as the lack of external knowledge resources and irregularities like ungrammatical constructs in textual data, which compromise successful information extraction. To address these issues, we describe the development of Textractor, an application for accurately extracting relevant concepts from irregular textual narratives in datasets of product development and/or customer service organizations. The extracted information can subsequently be fed to a host of business intelligence activities. We present novel algorithms, combining statistical and linguistic approaches, for the accurate discovery of relevant domain concepts from highly irregular or ungrammatical texts. Evaluations on real-life corporate data revealed that Textractor extracts domain concepts, realized as single- or multi-word terms in ungrammatical texts, with high precision.

Author information

Authors and Affiliations

Faculty of Economics and Business, University of Groningen, 9747 AE Groningen, The Netherlands
Ashwin Ittoo, Laura Maruster, Hans Wortmann & Gosse Bouma

Editor information

Editors and Affiliations

Department of Information Systems, Poznań University of Economics, Al. Niepodległości 10, 61-875, Poznań, Poland
Witold Abramowicz
Institut für Informatik, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
Robert Tolksdorf

Cite this paper

Ittoo, A., Maruster, L., Wortmann, H., Bouma, G. (2010). Textractor: A Framework for Extracting Relevant Domain Concepts from Irregular Corporate Textual Datasets. In: Abramowicz, W., Tolksdorf, R. (eds) Business Information Systems. BIS 2010. Lecture Notes in Business Information Processing, vol 47. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12814-1_7

DOI: https://doi.org/10.1007/978-3-642-12814-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12813-4
Online ISBN: 978-3-642-12814-1
eBook Packages: Computer Science, Computer Science (R0)
