Abstract
The performance of classification models depends heavily on the quality of the training data. However, label imperfection is inherent in training data and cannot be handled manually in a big data environment. Various methods have been proposed to remove label noise in order to improve classification quality, with the side effect of reducing the amount of data. In this paper, we propose a knowledge-based approach for tackling mislabeled multi-class big data, in which a knowledge graph technique is combined with other data correction methods to detect and correct erroneous labels. The knowledge graph is built from medical concepts extracted from online health consultations and medical guidance. Experimental results show that our knowledge-graph-based approach can effectively improve data quality and classification accuracy. Furthermore, the approach can be applied to other data mining tasks that require deep understanding.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Guo, M., Liu, Y., Li, J., Li, H., Xu, B. (2014). A Knowledge Based Approach for Tackling Mislabeled Multi-class Big Social Data. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds) The Semantic Web: Trends and Challenges. ESWC 2014. Lecture Notes in Computer Science, vol 8465. Springer, Cham. https://doi.org/10.1007/978-3-319-07443-6_24
DOI: https://doi.org/10.1007/978-3-319-07443-6_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07442-9
Online ISBN: 978-3-319-07443-6
eBook Packages: Computer Science, Computer Science (R0)