Abstract
Semi-supervised learning can be applied to datasets that contain both labeled and unlabeled instances, and it can yield more accurate predictions than fully supervised or unsupervised learning when only limited labeled data is available. A subclass of these problems, called Positive-Unlabeled (PU) learning, focuses on cases in which the labeled instances contain only positive examples. Given the lack of negatively labeled data, estimating the general performance of a classifier is difficult. In this paper, we propose a new approach to approximate the \(F_1\) score for PU learning. It requires an estimate of the fraction of all positive instances that appear in the labeled set. We derive theoretical properties of the approach and apply it to several datasets to study its empirical behavior and to compare it to the most widely used score in the field, the LL score. Results show that even when this estimate deviates considerably from the true fraction of labeled positives, our approximation of the \(F_1\) score remains significantly more accurate than the LL score.
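To make the quantities in the abstract concrete, the sketch below shows one way such scores can be computed from PU predictions. It is a minimal sketch, not the paper's exact formulation: it assumes recall is estimated on the labeled positives (treated as a random sample of all positives), that the total number of positives is inferred from the labeled-fraction estimate, and that the LL score follows Lee and Liu's recall\(^2\)/Pr(f(x)=1) criterion. Function and variable names are illustrative and are not taken from the paper or its code repository.

```python
import numpy as np

def estimate_scores(y_pred_labeled, y_pred_all, alpha):
    """Approximate the F1 score and the LL score from PU predictions.

    y_pred_labeled : 0/1 predictions on the labeled (all-positive) instances
    y_pred_all     : 0/1 predictions on the full dataset (labeled + unlabeled)
    alpha          : estimated fraction of all positives that are labeled
    """
    n_labeled = len(y_pred_labeled)

    # Recall: the hit rate on the labeled positives estimates recall on the
    # positive class, assuming labeled positives are selected at random.
    recall = np.mean(y_pred_labeled)

    # Total number of positives implied by the labeled-fraction estimate.
    est_n_pos = n_labeled / alpha

    # Estimated true positives and precision over all predicted positives.
    n_pred_pos = np.sum(y_pred_all)
    est_tp = recall * est_n_pos
    precision = min(1.0, est_tp / max(n_pred_pos, 1))

    f1_approx = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)

    # LL score (Lee & Liu, 2003): recall^2 / Pr(f(x) = 1).
    pr_pred_pos = np.mean(y_pred_all)
    ll_score = recall ** 2 / pr_pred_pos if pr_pred_pos > 0 else 0.0

    return f1_approx, ll_score
```

Both estimates use only the classifier's predictions on the labeled positives and on the full dataset; the quality of the labeled-fraction estimate `alpha` is therefore the main source of error, which is exactly the sensitivity the paper studies empirically.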
Notes
1. All code is available on GitHub: https://github.com/SEYED7037/PU-Learning-Estimating-F1-LOD2020-.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Tabatabaei, S.A., Klein, J., Hoogendoorn, M. (2020). Estimating the \(F_1\) Score for Learning from Positive and Unlabeled Examples. In: Nicosia, G., et al. (eds) Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol. 12565. Springer, Cham. https://doi.org/10.1007/978-3-030-64583-0_15
DOI: https://doi.org/10.1007/978-3-030-64583-0_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64582-3
Online ISBN: 978-3-030-64583-0
eBook Packages: Computer Science, Computer Science (R0)