Abstract
Noun phrases have been proposed as descriptors for text mining vectors to overcome the poor semantics of the traditional bag-of-words (BOW) representation. However, the solutions found in the literature are unsatisfactory, mainly for two reasons: they rely on static definitions of noun phrases, and noun phrases on their own do not provide an adequate representation of relevance, since they are expressions that rarely repeat. We present an approach that addresses these problems by (i) introducing a process that allows noun phrases to be defined interactively and (ii) treating similar noun phrases as a single term. A case study compares the proposed approach with one based on BOW. The main contribution of this paper is an improvement to the preprocessing phase of text mining, which leads to better results in the overall process.
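The idea of treating similar noun phrases as a single term can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold the authors use; the normalized edit distance, the 0.8 threshold, and the function names `levenshtein`, `similar`, and `group_noun_phrases` below are assumptions made for illustration only.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed similarity test: 1 - distance / longest length >= threshold."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - levenshtein(a, b) / longest >= threshold


def group_noun_phrases(phrases, threshold: float = 0.8) -> Counter:
    """Count each phrase under the first earlier phrase it is similar to."""
    representatives = []
    counts = Counter()
    for phrase in phrases:
        key = phrase.lower().strip()
        for rep in representatives:
            if similar(key, rep, threshold):
                counts[rep] += 1  # merge into the existing term
                break
        else:
            representatives.append(key)  # first occurrence becomes the term
            counts[key] += 1
    return counts


phrases = [
    "text mining approach",
    "text mining approaches",
    "noun phrase",
    "noun phrases",
    "bag of words",
]
term_counts = group_noun_phrases(phrases)
```

Under these assumptions, near-duplicate phrases such as singular and plural variants are counted under one term, so a descriptor that would otherwise occur once per variant accumulates a frequency that a weighting scheme can use. The greedy first-match grouping is order dependent; a clustering step would be a more robust alternative.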
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Pinheiro, M.S., do Prado, H.A., Ferneda, E., Ladeira, M. (2015). An Approach for Text Mining Based on Noun Phrases. In: Neves-Silva, R., Jain, L., Howlett, R. (eds) Intelligent Decision Technologies. IDT 2017. Smart Innovation, Systems and Technologies, vol 39. Springer, Cham. https://doi.org/10.1007/978-3-319-19857-6_45
DOI: https://doi.org/10.1007/978-3-319-19857-6_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19856-9
Online ISBN: 978-3-319-19857-6
eBook Packages: Engineering, Engineering (R0)