Likelihood-Based Sampling from Databases for Rule Induction Methods
This paper introduces the log-likelihood ratio as a measure of similarity between generated training samples and the original training samples. The ratio is used as a test statistic to determine whether the statistical information of the generated training samples (S_k) is almost equivalent to that of the original training samples (S_0), denoted by S_0 ≃ S_k. If the test statistic rejects the hypothesis S_0 ≃ S_k, the generated samples are abandoned. Otherwise, they are accepted, and rule induction methods or statistical methods are applied to them. The method was evaluated on three medical domains. The results show that the proposed method selects training samples that reflect the statistical characteristics of the original training samples, although its performance with small samples is not as good.
Keywords: Training Sample, Acceptance Rate, Medical Domain, Acceptance Ratio, Probabilistic Situation
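
The accept/reject step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the statistic, so the sketch assumes a standard log-likelihood ratio (G) statistic over class-label frequencies, G = 2 Σ O_i ln(O_i / E_i), where O_i are the counts in the generated sample S_k and E_i are the counts expected from the class proportions in S_0. The function names `g_statistic` and `draw_accepted_sample`, the `critical_value` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def g_statistic(original, sample):
    """Log-likelihood ratio (G) statistic of `sample` against the
    class proportions observed in `original` (assumed form, see lead-in)."""
    n0, nk = len(original), len(sample)
    # Class proportions estimated from the original samples S_0.
    p0 = {label: count / n0 for label, count in Counter(original).items()}
    observed = Counter(sample)
    # Classes absent from the sample contribute nothing to the sum.
    return 2.0 * sum(
        o * math.log(o / (nk * p0[label])) for label, o in observed.items()
    )


def draw_accepted_sample(original, size, critical_value, max_tries=1000, seed=0):
    """Draw subsamples without replacement until one is not rejected.

    A candidate S_k is accepted when its G statistic does not exceed
    `critical_value` (for example, a chi-square quantile chosen by the caller).
    Returns None if no candidate is accepted within `max_tries`.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = rng.sample(original, size)
        if g_statistic(original, candidate) <= critical_value:
            return candidate  # hypothesis S_0 ≃ S_k not rejected
    return None


# Toy data: 60 cases of class "a" and 40 of class "b" (hypothetical).
original = ["a"] * 60 + ["b"] * 40
# 3.841 is the 0.95 chi-square quantile for 1 degree of freedom.
accepted = draw_accepted_sample(original, size=10, critical_value=3.841)
```

An accepted sample would then be passed to a rule induction or statistical method. With very small sample sizes the counts are coarse and the chi-square approximation to G is rough, so the choice of `critical_value` in this sketch should be treated with caution.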