Abstract
Automatic extraction of a domain-specific stopword list from a large labeled corpus is discussed. Most studies remove stopwords using a standard stopword list together with high and low document frequency thresholds. In this paper, a new approach to stopword extraction is proposed, based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, a new method for building general and domain-specific stopword lists is proposed. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, which can be estimated from classifier performance. The proposed approach is extensively compared with other methods, including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, which guarantee minimum information loss by filtering out most stopwords.
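As a rough illustration of the two baseline criteria named in the abstract, the sketch below scores each term of a labeled corpus by inverse document frequency and by information gain with respect to the class labels. It is not the proposed backward filter-level method, whose details the abstract does not give. The function names and the toy corpus are hypothetical, and the natural-log IDF and presence/absence form of information gain are assumed conventions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def rank_stopword_candidates(docs, labels):
    """Return {term: (idf, information_gain)} for a tokenized, labeled corpus.

    Hypothetical helper: terms with low IDF and low information gain carry
    little class information and are natural stopword candidates.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    h_class = entropy(list(Counter(labels).values()))
    scores = {}
    for term in vocab:
        # Class counts among documents that contain / do not contain the term.
        with_t = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        without_t = Counter(l for s, l in zip(doc_sets, labels) if term not in s)
        df = sum(with_t.values())
        idf = math.log(n / df)
        # Conditional entropy H(class | term present or absent).
        h_cond = 0.0
        for part in (with_t, without_t):
            size = sum(part.values())
            if size:
                h_cond += (size / n) * entropy(list(part.values()))
        scores[term] = (idf, h_class - h_cond)
    return scores


# Toy corpus (invented for illustration only).
docs = [
    ["the", "match", "goal"],
    ["the", "team", "goal"],
    ["the", "market", "stock"],
    ["the", "stock", "bank"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = rank_stopword_candidates(docs, labels)
# Terms sorted from least to most informative about the class label.
candidates = sorted(scores, key=lambda t: scores[t][1])
```

In this toy corpus, "the" occurs in every document, so both scores are zero and it ranks first as a stopword candidate, whereas "goal" and "stock" separate the two classes perfectly and rank last.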
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Makrehchi, M., Kamel, M.S. (2008). Automatic Extraction of Domain-Specific Stopwords from Labeled Documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_22
DOI: https://doi.org/10.1007/978-3-540-78646-7_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78645-0
Online ISBN: 978-3-540-78646-7
eBook Packages: Computer Science, Computer Science (R0)