Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian

  • Conference paper
Text, Speech and Dialogue (TSD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6836)

Abstract

Labeling documents with keyphrases is a tedious and expensive task. Most approaches to automatic keyphrase extraction rely on supervised learning and require manually labeled training data. In this paper we propose a fully unsupervised keyphrase extraction method, which differs from the usual generic keyphrase extractors in the way the keyphrases are formed. Our method begins by building topically related word clusters from which document keywords are selected, and then expands the selected keywords into syntactically valid keyphrases. We evaluate our approach on a Croatian document collection annotated by eight human experts, taking into account the high subjectivity of the keyphrase extraction task. The performance of the proposed method reaches up to F1 = 44.5%, which is below that of the human annotators but comparable to that of a supervised approach.
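For readers unfamiliar with the reported measure, the following is a minimal sketch of how an F1 score can be computed for one document by comparing extracted keyphrases with a gold-standard set. It is not the authors' evaluation code: the function name `keyphrase_f1`, the exact-match comparison, and the lowercasing step are assumptions made for illustration, and the abstract does not say how the judgments of the eight annotators were combined.

```python
def keyphrase_f1(extracted, gold):
    """Return (precision, recall, F1) for two collections of keyphrases.

    Assumption: phrases match only if they are identical after trimming
    whitespace and lowercasing. The paper's matching rule may differ.
    """
    extracted_set = {phrase.strip().lower() for phrase in extracted}
    gold_set = {phrase.strip().lower() for phrase in gold}

    # With an empty side there is nothing to match, so all scores are zero.
    if not extracted_set or not gold_set:
        return 0.0, 0.0, 0.0

    matched = len(extracted_set & gold_set)
    precision = matched / len(extracted_set)
    recall = matched / len(gold_set)

    # F1 is the harmonic mean of precision and recall.
    if matched == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical phrases, invented purely for this example.
    system_output = ["keyphrase extraction", "word clusters", "news", "Croatian"]
    annotator_set = ["Keyphrase extraction", "croatian", "unsupervised learning"]
    p, r, f = keyphrase_f1(system_output, annotator_set)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Under this exact-match assumption, a system that returns four phrases of which two appear in a three-phrase gold set has precision 1/2 and recall 2/3, giving F1 = 4/7.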


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Saratlija, J., Šnajder, J., Dalbelo Bašić, B. (2011). Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_43

  • DOI: https://doi.org/10.1007/978-3-642-23538-2_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23537-5

  • Online ISBN: 978-3-642-23538-2

  • eBook Packages: Computer Science, Computer Science (R0)
