An Approach for Text Mining Based on Noun Phrases

Pinheiro, Marcello Sandi; do Prado, Hercules Antonio; Ferneda, Edilson; Ladeira, Marcelo

doi:10.1007/978-3-319-19857-6_45

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 39)

Included in the following conference series: International Conference on Intelligent Decision Technologies

Abstract

The use of noun phrases as descriptors for text mining vectors has been proposed to overcome the poor semantics of the traditional bag-of-words (BOW) representation. However, the solutions found in the literature are unsatisfactory, mainly because they rely on static definitions of noun phrases and because noun phrases by themselves do not support an adequate representation of relevance, since a given phrase rarely repeats. We present an approach that addresses these problems by (i) introducing a process that enables noun phrases to be defined interactively and (ii) treating similar noun phrases as a single term. A case study compares the approach proposed in this paper with one based on BOW. The main contribution of this paper is an improved preprocessing phase for text mining, which leads to better results in the overall process.
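
The idea in point (ii), treating similar noun phrases as a single term so that their occurrences accumulate, can be illustrated with a minimal Python sketch. The abstract does not say how similarity is measured or how phrases are grouped, so everything here is an assumption made for illustration: the similarity measure (the ratio from the standard-library `difflib.SequenceMatcher`), the 0.8 threshold, the greedy grouping strategy, and the function names `similarity`, `group_phrases`, and `term_counts`.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two phrases (case-insensitive).

    Assumption: difflib's ratio stands in for whatever measure the paper uses.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_phrases(phrases, threshold=0.8):
    """Greedily map each phrase to the first earlier representative it resembles.

    A phrase that resembles no existing representative becomes a new one.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    representatives = []
    mapping = {}
    for phrase in phrases:
        for rep in representatives:
            if similarity(phrase, rep) >= threshold:
                mapping[phrase] = rep
                break
        else:
            representatives.append(phrase)
            mapping[phrase] = phrase
    return mapping


def term_counts(phrases, threshold=0.8):
    """Count occurrences per representative, so similar phrases share one term."""
    mapping = group_phrases(phrases, threshold)
    counts = {}
    for phrase in phrases:
        rep = mapping[phrase]
        counts[rep] = counts.get(rep, 0) + 1
    return counts


if __name__ == "__main__":
    extracted = ["text mining", "text minings", "noun phrase",
                 "noun phrases", "bag of words"]
    print(term_counts(extracted))
```

In this sketch the near-duplicate phrases collapse onto one representative each, so a term that would otherwise appear once per surface form receives a higher count. A greedy pass is order-dependent; a clustering step over pairwise similarities would be a reasonable alternative under the same assumptions.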

Author information

Authors and Affiliations

Brazilian Army CDS – QGEx Setor Militar Urbano, Brasília, DF, 70.630-904, Brazil
Marcello Sandi Pinheiro
Graduate Program on Knowledge and IT Management, Catholic University of Brasilia, SGAN 916 Av. W5, Brasília, DF, 70.790-160, Brazil
Hercules Antonio do Prado & Edilson Ferneda
Embrapa - Management and Strategy Secretariat Parque Estação Biológica – PqEB S/N°, Brasília, DF, 70.770-90, Brazil
Hercules Antonio do Prado
Edifício CIC/EST, Universidade de Brasília – IE - CIC Campus Universitário Darcy Ribeiro, Brasilia, DF, 70.910-900, Brazil
Marcelo Ladeira

Corresponding author

Correspondence to Hercules Antonio do Prado.

Editor information

Editors and Affiliations

FCT, Universidade Nova de Lisboa, Caparica, Portugal
Rui Neves-Silva
Faculty of Education, Science, Technology and Mathematics, University of Canberra, Canberra, Australia
Lakhmi C. Jain
KES International, Shoreham-by-sea, United Kingdom
Robert J. Howlett

About this paper

Cite this paper

Pinheiro, M.S., do Prado, H.A., Ferneda, E., Ladeira, M. (2015). An Approach for Text Mining Based on Noun Phrases. In: Neves-Silva, R., Jain, L., Howlett, R. (eds) Intelligent Decision Technologies. IDT 2015. Smart Innovation, Systems and Technologies, vol 39. Springer, Cham. https://doi.org/10.1007/978-3-319-19857-6_45

DOI: https://doi.org/10.1007/978-3-319-19857-6_45
Published: 27 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19856-9
Online ISBN: 978-3-319-19857-6
eBook Packages: Engineering, Engineering (R0)
