Topic Modeling for Unsupervised Concept Extraction and Document Ranking

Anoop, V. S.; Asharaf, S.; Deepak, P.

doi:10.1007/978-3-319-68385-0_11

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 683)

Included in the following conference series:

The International Symposium on Intelligent Systems Technologies and Applications

Abstract

This paper proposes a framework that induces semantically rich concepts from topics generated probabilistically by a topic modeling algorithm. The method uses an off-the-shelf tool to extract noun phrases as word bi-grams and tri-grams from a static document corpus and then models the topics using the Latent Dirichlet Allocation (LDA) algorithm. Additionally, we show that a small extension to the proposed framework can better rank documents in a large collection, a well-studied problem in information retrieval. Experiments conducted on three real-world datasets show that the proposed framework outperforms the state-of-the-art methods used for extracting concepts and ranking documents. Compared with the chosen baselines, the proposed concept extraction method increased F-measure by 16.65% to 22.04%, and the proposed topic-modeling-guided document retrieval method increased F-measure by 7.6% to 16.61%.
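The two stages named in the abstract (phrase extraction, then LDA over the phrases) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the stopword filter is a crude stand-in for the off-the-shelf noun-phrase extractor, the collapsed Gibbs sampler is one common way to fit LDA, and the names `candidate_phrases`, `lda_gibbs`, the toy corpus, and the hyperparameter values are all assumptions made for illustration.

```python
import random
from collections import Counter

# Tiny stopword list; an assumed stand-in for a real noun-phrase chunker.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "is", "are", "with", "by"}

def candidate_phrases(text):
    """Return stopword-free word bi-grams and tri-grams from one document."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    phrases = []
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                phrases.append(" ".join(gram))
    return phrases

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; each doc is a list of phrase tokens."""
    rng = random.Random(seed)
    vocab_size = len({p for doc in docs for p in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_term = [Counter() for _ in range(num_topics)]   # count of phrase p in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assignments = []
    # Random initial topic assignment for every phrase token.
    for d, doc in enumerate(docs):
        zs = []
        for p in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_term[k][p] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, p in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_term[k][p] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][p] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][p] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    theta = [
        [(c + alpha) / (sum(row) + num_topics * alpha) for c in row]
        for row in doc_topic
    ]
    return theta, topic_term

corpus = [
    "Topic models extract latent topics from large document collections.",
    "Latent Dirichlet Allocation is a probabilistic topic model for text corpora.",
    "Document ranking orders search results by relevance to the user query.",
    "Information retrieval systems rank documents for a user query.",
]
phrase_docs = [candidate_phrases(doc) for doc in corpus]
theta, topic_term = lda_gibbs(phrase_docs, num_topics=2)
for k, counts in enumerate(topic_term):
    top = [p for p, c in counts.most_common(3) if c > 0]
    print(f"topic {k}: {top}")
```

The highest-count phrases per topic serve here as rough concept candidates, and the rows of `theta` give per-document topic proportions that a ranking step could compare against a query. How the paper actually induces concepts and ranks documents is not described in the abstract, so both uses are only plausible readings.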

Author information

Authors and Affiliations

Data Engineering Lab, Indian Institute of Information Technology and Management - Kerala, Thiruvananthapuram, India
V. S. Anoop
Indian Institute of Information Technology and Management - Kerala, Thiruvananthapuram, India
S. Asharaf
Queen's University, Belfast, UK
P. Deepak

Corresponding author

Correspondence to V. S. Anoop.

Editor information

Editors and Affiliations

School of CS/IT, Indian Institute of Information Technology, Trivandrum, Kerala, India
Sabu M. Thampi
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Sushmita Mitra
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Jayanta Mukhopadhyay
Xiamen University, Xiamen, China
Kuan-Ching Li
Department of Electrical and Electronic, Nazarbayev University, Astana, Kazakhstan
Alex Pappachen James
Dipartimento di Ingegneria, Università degli Studi di Firenze, Firenze, Italy
Stefano Berretti

About this paper

Cite this paper

Anoop, V.S., Asharaf, S., Deepak, P. (2018). Topic Modeling for Unsupervised Concept Extraction and Document Ranking. In: Thampi, S., Mitra, S., Mukhopadhyay, J., Li, KC., James, A., Berretti, S. (eds) Intelligent Systems Technologies and Applications. ISTA 2017. Advances in Intelligent Systems and Computing, vol 683. Springer, Cham. https://doi.org/10.1007/978-3-319-68385-0_11

DOI: https://doi.org/10.1007/978-3-319-68385-0_11
Published: 21 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68384-3
Online ISBN: 978-3-319-68385-0
eBook Packages: Engineering, Engineering (R0)
