Integrating LDA with Clustering Technique for Relevance Feature Selection

Alharbi, Abdullah Semran; Li, Yuefeng; Xu, Yue

doi:10.1007/978-3-319-63004-5_22

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10400))

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

1506 Accesses
4 Citations

Abstract

Selecting features from documents that describe user information needs is challenging due to the nature of text, where redundancy, synonymy, polysemy, noise and high dimensionality are common problems. The assumption that clustered documents describe only one topic can be too simple knowing that most long documents discuss multiple topics. LDA-based models show significant improvement over the cluster-based in information retrieval (IR). However, the integration of both techniques for feature selection (FS) is still limited. In this paper, we propose an innovative and effective cluster- and LDA-based model for relevance FS. The model also integrates a new extended random set theory to generalise the LDA local weights for document terms. It can assign a more discriminative weight to terms based on their appearance in LDA topics and the clustered documents. The experimental results, based on the RCV1 dataset and TREC topics for information filtering (IF), show that our model significantly outperforms eight state-of-the-art baseline models in five standard performance measures.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
In this paper, terms, words, keywords or unigrams are used interchangeably.
2.
We will refer to the proposed model from now on as CBTM-ERS, a Cluster-Based Topic Model using Extended Random Set.
3.
http://trec.nist.gov/.
4.
https://www.lemurproject.org/.

References

Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C.C., Zhai, C.X. (eds.) Mining Text Data, pp. 77–128. Springer, New York (2012)
Chapter Google Scholar
Albathan, M., Li, Y., Algarni, A.: Enhanced N-gram extraction using relevance feature discovery. In: Cranefield, S., Nayak, A. (eds.) AI 2013. LNCS, vol. 8272, pp. 453–465. Springer, Cham (2013). doi:10.1007/978-3-319-03680-9_46
Chapter Google Scholar
Albathan, M., Li, Y., Xu, Y.: Using extended random set to find specific patterns. In: WI 2014, vol. 2, pp. 30–37. IEEE (2014)
Google Scholar
Beil, F., Ester, M., Xu, X.: Frequent term-based text clustering. In: KDD 2002, pp. 436–442. ACM (2002)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
MATH Google Scholar
Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: SIGIR 2000, pp. 33–40. ACM (2000)
Google Scholar
Chao, S., Cai, J., Yang, S., Wang, S.: A clustering based feature selection method using feature information distance for text data. In: Huang, D.-S., Bevilacqua, V., Premaratne, P. (eds.) ICIC 2016. LNCS, vol. 9771, pp. 122–132. Springer, Cham (2016). doi:10.1007/978-3-319-42291-6_12
Chapter Google Scholar
Das, S., Abraham, A., Konar, A.: Automatic clustering using an improved differential evolution algorithm. IEEE Trans. Syst. Man Cybern.-Part A: Syst. Hum. 38(1), 218–237 (2008)
Article Google Scholar
Ferreira, C.H., de Medeiros, D.M., Santana, F.: Fcfilter: feature selection based on clustering and genetic algorithms. In: CEC 2016, pp. 2106–2113. IEEE (2016)
Google Scholar
Gao, Y., Xu, Y., Li, Y.: Pattern-based topics for document modelling in information filtering. IEEE TKDE 27(6), 1629–1642 (2015)
Google Scholar
Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)
Article MATH Google Scholar
Huang, A.: Similarity measures for text document clustering. In: NZCSRSC 2008, pp. 49–56 (2008)
Google Scholar
Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)
Article Google Scholar
Krikon, E., Kurland, O.: A study of the integration of passage-, document-, and cluster-based information for re-ranking search results. Inf. Retr. 14(6), 593 (2011)
Article MATH Google Scholar
Kruse, R., Schwecke, E., Heinsohn, J.: Uncertainty and Vagueness in Knowledge Based Systems: Numerical Methods. Springer Science & Business Media, Heidelberg (2012)
MATH Google Scholar
Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI 31(4), 721–735 (2009)
Article Google Scholar
Li, Y.: Extended random sets for knowledge discovery in information systems. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 524–532. Springer, Heidelberg (2003). doi:10.1007/3-540-39205-X_87
Chapter Google Scholar
Li, Y., Algarni, A., Albathan, M., Shen, Y., Bijaksana, M.A.: Relevance feature discovery for text mining. IEEE TKDE 27(6), 1656–1669 (2015)
Google Scholar
Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: SIGIR 2004, pp. 186–193. ACM (2004)
Google Scholar
Macdonald, C., Ounis, I.: Global statistics in proximity weighting models. In: Web N-Gram Workshop, p. 30. Citeseer (2010)
Google Scholar
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Book MATH Google Scholar
Maxwell, K.T., Croft, W.B.: Compact query term selection using topically related text. In: SIGIR 2013, pp. 583–592. ACM (2013)
Google Scholar
McCallum, A.K.: Mallet: A machine learning for language toolkit (2002)
Google Scholar
Molchanov, I.: Theory of Random Sets. Springer Science & Business Media, London (2006)
Google Scholar
Rasmussen, M., Karypis, G.: gCLUTO: an interactive clustering, visualization, and analysis system. UMN-CS TR-04 21(7) (2004)
Google Scholar
Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc., Breda (2009)
Google Scholar
Robertson, S.E., Soboroff, I.: The TREC 2002 filtering track report. In: TREC, vol. 2002, p. 5 (2002)
Google Scholar
Savaresi, S.M., Boley, D.L.: On the performance of bisecting k-means and PDDP. In: ICDM 2001, pp. 1–14. SIAM (2001)
Google Scholar
Soboroff, I., Robertson, S.: Building a filtering test collection for TREC 2002. In: SIGIR 2003, pp. 243–250. ACM (2003)
Google Scholar
Steinbach, M., Karypis, G., Kumar, V., et al.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining, vol. 400, Boston, pp. 525–526 (2000)
Google Scholar
Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant. Anal. 427(7), 424–440 (2007)
Google Scholar
Tagarelli, A., Karypis, G.: Document clustering: the next frontier. In: Data Clustering: Algorithms and Applications, p. 305. CRC Press (2013)
Google Scholar
Tasci, S., Gungor, T.: LDA-based keyword selection in text categorization. In: ISCIS 2009, pp. 230–235. IEEE (2009)
Google Scholar
Wang, X., McCallum, A., Wei, X.: Topical n-grams: phrase and topic discovery, with an application to information retrieval. In: ICDM 2007, pp. 697–702. IEEE (2007)
Google Scholar
Wu, Q., Ye, Y., Ng, M., Su, H., Huang, J.: Exploiting word cluster information for unsupervised feature selection. In: Zhang, B.-T., Orgun, M.A. (eds.) PRICAI 2010. LNCS (LNAI), vol. 6230, pp. 292–303. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15246-7_28
Chapter Google Scholar
Zhang, Z., Phan, X.H., Horiguchi, S.: An efficient feature selection using hidden topic in text categorization. In: AINAW 2008, pp. 1223–1228. IEEE (2008)
Google Scholar
Zhong, N., Li, Y., Wu, S.T.: Effective pattern discovery for text mining. IEEE TKDE 24(1), 30–44 (2012)
Google Scholar

Download references

Author information

Authors and Affiliations

School of EECS, Queensland University of Technology, Brisbane, QLD, Australia
Abdullah Semran Alharbi, Yuefeng Li & Yue Xu
Department of CS, Umm Al-Qura University, Makkah, Saudi Arabia
Abdullah Semran Alharbi

Authors

Abdullah Semran Alharbi
View author publications
You can also search for this author in PubMed Google Scholar
Yuefeng Li
View author publications
You can also search for this author in PubMed Google Scholar
Yue Xu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Abdullah Semran Alharbi .

Editor information

Editors and Affiliations

La Trobe University, Melbourne, Australia
Wei Peng
La Trobe Business School, La Trobe University, Bundoora, Victoria, Australia
Damminda Alahakoon
RMIT University, Melbourne, Australia
Xiaodong Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Alharbi, A.S., Li, Y., Xu, Y. (2017). Integrating LDA with Clustering Technique for Relevance Feature Selection. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science(), vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_22

Download citation

DOI: https://doi.org/10.1007/978-3-319-63004-5_22
Published: 09 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Integrating LDA with Clustering Technique for Relevance Feature Selection