Abstract
Selecting features from documents that describe user information needs is challenging because of the nature of text, where redundancy, synonymy, polysemy, noise and high dimensionality are common problems. The assumption that clustered documents describe only one topic can be too simple, given that most long documents discuss multiple topics. LDA-based models show significant improvement over cluster-based models in information retrieval (IR). However, the integration of both techniques for feature selection (FS) is still limited. In this paper, we propose an innovative and effective cluster- and LDA-based model for relevance FS. The model also integrates a new extended random set theory to generalise the LDA local weights for document terms. It can assign a more discriminative weight to terms based on their appearance in LDA topics and the clustered documents. The experimental results, based on the RCV1 dataset and TREC topics for information filtering (IF), show that our model significantly outperforms eight state-of-the-art baseline models in five standard performance measures.
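The idea of weighting a term by its appearance in both LDA topics and clustered documents can be illustrated with a minimal sketch. This is not the CBTM-ERS weighting function, which the abstract does not specify; the scoring rule, the function name `term_weights` and the toy distributions below are all assumptions made for illustration. Each term's local LDA-style weight (topic proportion of the document times the term's probability in the topic) is scaled by the relative size of the document's cluster and summed over documents.

```python
from collections import defaultdict


def term_weights(docs, clusters, theta, phi):
    """Hypothetical term scoring that mixes topic and cluster evidence.

    docs:     list of token lists
    clusters: cluster id for each document
    theta:    per-document topic distribution, theta[d][k]
    phi:      per-topic term distribution, phi[k][term]
    """
    n_docs = len(docs)
    cluster_sizes = defaultdict(int)
    for c in clusters:
        cluster_sizes[c] += 1

    weights = defaultdict(float)
    for d, tokens in enumerate(docs):
        # Assumed cluster factor: fraction of documents in this cluster.
        share = cluster_sizes[clusters[d]] / n_docs
        for term in set(tokens):
            # Local LDA-style weight: sum over topics of P(topic|doc) * P(term|topic).
            local = sum(theta[d][k] * phi[k].get(term, 0.0) for k in range(len(phi)))
            weights[term] += share * local
    return dict(weights)


# Toy, hand-made inputs (not learned from any corpus).
docs = [["oil", "price"], ["oil", "market"], ["football", "match"]]
clusters = [0, 0, 1]
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
phi = [
    {"oil": 0.5, "price": 0.3, "market": 0.2},
    {"football": 0.6, "match": 0.4},
]

weights = term_weights(docs, clusters, theta, phi)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.4f}")
```

In this toy setting, "oil" ranks highest because it is probable in the dominant topic of two documents that also share the larger cluster. In practice the topic distributions would come from a fitted LDA model and the cluster labels from a clustering algorithm.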
Notes
1. In this paper, terms, words, keywords or unigrams are used interchangeably.
2. We will refer to the proposed model from now on as CBTM-ERS, a Cluster-Based Topic Model using Extended Random Set.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Alharbi, A.S., Li, Y., Xu, Y. (2017). Integrating LDA with Clustering Technique for Relevance Feature Selection. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science, vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_22
DOI: https://doi.org/10.1007/978-3-319-63004-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5
eBook Packages: Computer Science, Computer Science (R0)