A New Approach for Semi-supervised Online News Classification
Due to the dramatic increase of information on the Web, text categorization has become a useful tool for organizing that information. The traditional text categorization problem uses a training set of text documents from online sources with pre-defined class labels. Typically, a large amount of labeled online training news must be provided in order to learn a satisfactory categorization scheme. We investigate an innovative way to alleviate this labeling burden. For each category, only a small number of positive training examples, covering a set of the major concepts associated with the category, is needed. We develop a technique that makes use of unlabeled documents, since such documents, for example online news from the Web, can be collected easily. Our technique exploits the inherent structure in the set of positive training documents, guided by the provided concepts of the category. An algorithm for training document adaptation is developed to automatically seek representative training examples from the unlabeled data collected from the new online source. Preliminary experiments on a real-world news collection have been conducted to demonstrate the effectiveness of our approach.
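The abstract does not spell out the training document adaptation algorithm, so the following is only a minimal sketch of the general idea of seeking representative examples in unlabeled data from a few positive documents. It assumes a Rocchio-style centroid with cosine similarity over raw term frequencies, which is a swapped-in stand-in and not the authors' method. The names `adapt_training_set`, `threshold`, and `rounds` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Lowercased bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average of a non-empty list of term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return Counter({t: c / n for t, c in total.items()})


def adapt_training_set(positive_docs, unlabeled_docs, threshold=0.3, rounds=2):
    """Hypothetical helper: iteratively pull unlabeled documents that
    resemble the positive prototype into the training set.

    Assumption: similarity to the centroid of the current training
    documents stands in for "representativeness"; the threshold and the
    number of rounds are illustrative values, not taken from the paper.
    """
    train = [tf_vector(d) for d in positive_docs]
    remaining = list(unlabeled_docs)
    selected = []
    for _ in range(rounds):
        proto = centroid(train)
        newly = [d for d in remaining if cosine(tf_vector(d), proto) >= threshold]
        if not newly:
            break  # nothing new is similar enough, so stop early
        selected.extend(newly)
        remaining = [d for d in remaining if d not in newly]
        train.extend(tf_vector(d) for d in newly)
    return selected


positive = ["stock market shares fall", "market shares rise on earnings"]
unlabeled = ["shares fall as market reacts", "football team wins final match"]
picked = adapt_training_set(positive, unlabeled)
```

In this toy run, only the unlabeled document that shares vocabulary with the positive examples is added to the training set. A real system would likely use TF-IDF weighting and a held-out check to limit drift as more documents are pulled in.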
Keywords: Text Categorization, Relevance Feedback, Unlabeled Data, Voter Turnout, Positive Training