
An Under-Sampling Approach to Imbalanced Automatic Keyphrase Extraction

Conference paper
Web-Age Information Management (WAIM 2012)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7418)

Abstract

The task of automatic keyphrase extraction is usually formalized as a supervised learning problem, and various learning algorithms have been applied to it. However, most existing approaches assume that samples are uniformly distributed between the positive (keyphrase) and negative (non-keyphrase) classes, an assumption that may not hold in real keyphrase extraction settings. In this paper, we investigate supervised keyphrase extraction in the more common case where candidate phrases are highly imbalanced between the two classes. Motivated by the observation that the saliency of a candidate phrase can be described from the perspectives of both morphology and occurrence, we propose a multi-view under-sampling approach named co-sampling. In co-sampling, two classifiers are learned separately on two disjoint sets of features, and the redundant candidate phrases reliably predicted by one classifier are removed from the training set of the peer classifier. Through this iterative and interactive under-sampling process, useless samples are continuously identified and removed while the performance of the classifiers is boosted. Experimental results show that co-sampling outperforms several existing under-sampling approaches on a keyphrase extraction dataset.
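
As a rough illustration of the co-sampling procedure described in the abstract, the following Python sketch pairs a morphology-view classifier with an occurrence-view classifier and lets each one prune, from the peer's training set, the non-keyphrase candidates it already classifies as negative with high confidence. The Gaussian Naive Bayes base learners, the confidence threshold, the fixed number of rounds, and the synthetic feature views are assumptions made for illustration only; the paper's actual features, learners, and stopping criterion may differ.

```python
# Illustrative sketch of multi-view under-sampling ("co-sampling") on an
# imbalanced keyphrase dataset. Assumptions (not from the paper): GaussianNB
# base learners, a fixed confidence threshold, and a fixed number of rounds.
import numpy as np
from sklearn.naive_bayes import GaussianNB


def co_sampling(X_morph, X_occ, y, n_rounds=5, drop_per_round=50, conf=0.9):
    """Iteratively under-sample negatives; return the index sets kept per view."""
    idx_a = np.arange(len(y))  # training indices for the morphology-view classifier
    idx_b = np.arange(len(y))  # training indices for the occurrence-view classifier
    clf_a, clf_b = GaussianNB(), GaussianNB()

    def prune(clf, X_view, peer_idx):
        # Negative (non-keyphrase) samples still in the peer's training set.
        neg = peer_idx[y[peer_idx] == 0]
        if len(neg) == 0:
            return peer_idx
        p_neg = clf.predict_proba(X_view[neg])[:, 0]  # P(non-keyphrase)
        confident = neg[p_neg >= conf]
        scores = p_neg[p_neg >= conf]
        # Drop the most confidently redundant negatives from the peer's set.
        drop = confident[np.argsort(-scores)][:drop_per_round]
        return np.setdiff1d(peer_idx, drop)

    for _ in range(n_rounds):
        clf_a.fit(X_morph[idx_a], y[idx_a])
        clf_b.fit(X_occ[idx_b], y[idx_b])
        idx_b = prune(clf_a, X_morph, idx_b)  # A prunes B's training set
        idx_a = prune(clf_b, X_occ, idx_a)    # B prunes A's training set

    return idx_a, idx_b


# Toy usage on synthetic, imbalanced data (hypothetical numbers):
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)          # ~5% keyphrases
X_morph = rng.normal(size=(2000, 4)) + y[:, None]  # morphology-view features
X_occ = rng.normal(size=(2000, 3)) + y[:, None]    # occurrence-view features
kept_a, kept_b = co_sampling(X_morph, X_occ, y)
```

At prediction time the two view-specific classifiers could be combined, for example by averaging their posterior probabilities, though the abstract does not specify how the final keyphrase decision is made.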

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ni, W., Liu, T., Zeng, Q. (2012). An Under-Sampling Approach to Imbalanced Automatic Keyphrase Extraction. In: Gao, H., Lim, L., Wang, W., Li, C., Chen, L. (eds) Web-Age Information Management. WAIM 2012. Lecture Notes in Computer Science, vol 7418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32281-5_38

  • DOI: https://doi.org/10.1007/978-3-642-32281-5_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32280-8

  • Online ISBN: 978-3-642-32281-5

  • eBook Packages: Computer Science, Computer Science (R0)
