Topic Modeling for Unsupervised Concept Extraction and Document Ranking

  • Conference paper
Intelligent Systems Technologies and Applications (ISTA 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 683)

Abstract

This paper proposes a framework that induces semantically rich concepts from the topics generated probabilistically by a topic modeling algorithm. The method uses an off-the-shelf tool to extract noun phrases as word bi-grams and tri-grams from a static document corpus and then models the topics with the Latent Dirichlet Allocation (LDA) algorithm. Additionally, we show that a small extension to the proposed framework can better rank documents in a large collection, a well-studied problem in information retrieval. Experiments on three real-world datasets show that the proposed framework outperforms state-of-the-art methods for extracting concepts and ranking documents. Compared with the chosen baselines, the proposed concept extraction method increased F-measure by 16.65% to 22.04%, and the proposed topic-modeling-guided document retrieval method increased F-measure by 7.6% to 16.61%.
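The two stages named in the abstract, phrase extraction followed by Latent Dirichlet Allocation, can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword filter in `extract_ngrams` is an assumed stand-in for the off-the-shelf noun-phrase extractor, `lda_gibbs` is a small collapsed Gibbs sampler written for illustration, and the corpus, function names, and hyperparameters (`alpha`, `beta`, `iters`) are hypothetical.

```python
import random
from collections import Counter

# Assumption: a tiny stopword list stands in for a real noun-phrase chunker.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "by", "with"}


def extract_ngrams(text, sizes=(2, 3)):
    """Return word bi-grams and tri-grams that contain no stopword."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    phrases = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(word in STOPWORDS for word in gram):
                phrases.append(" ".join(gram))
    return phrases


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling, treating each phrase as one term.

    Returns (theta, topic_term): per-document topic proportions and
    per-topic phrase counts.
    """
    rng = random.Random(seed)
    vocab_size = len({term for doc in docs for term in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_term = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every phrase occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for term in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_term[z][term] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_term[z][term] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this occurrence.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][term] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_term[z][term] += 1
                topic_total[z] += 1

    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_term


# Hypothetical four-document corpus, for illustration only.
corpus = [
    "Topic modeling discovers latent topics in large document collections.",
    "Latent Dirichlet Allocation is a probabilistic topic modeling algorithm.",
    "Document ranking orders search results for information retrieval.",
    "Information retrieval systems rank documents against user queries.",
]
docs = [extract_ngrams(text) for text in corpus]
theta, topic_term = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_term):
    print(k, [phrase for phrase, _ in counts.most_common(3)])
```

Because whole phrases rather than single words are the terms fed to the sampler, the highest-count entries of each topic are already multi-word candidates for concepts. A production pipeline would more likely pair a proper noun-phrase chunker with an established LDA implementation.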

Notes

  1. http://mlg.ucd.ie/datasets/bbc.html

  2. https://archive.org/details/stackexchange

  3. http://www.daviddlewis.com/resources/testcollections/reuters21578/

  4. http://mallet.cs.umass.edu/

Author information

Corresponding author

Correspondence to V. S. Anoop.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Anoop, V.S., Asharaf, S., Deepak, P. (2018). Topic Modeling for Unsupervised Concept Extraction and Document Ranking. In: Thampi, S., Mitra, S., Mukhopadhyay, J., Li, KC., James, A., Berretti, S. (eds) Intelligent Systems Technologies and Applications. ISTA 2017. Advances in Intelligent Systems and Computing, vol 683. Springer, Cham. https://doi.org/10.1007/978-3-319-68385-0_11

  • DOI: https://doi.org/10.1007/978-3-319-68385-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68384-3

  • Online ISBN: 978-3-319-68385-0

  • eBook Packages: Engineering, Engineering (R0)
