Mining Rare Events Data by Sampling and Boosting: A Case Study

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 54)

Abstract

In data mining, popular model ensemble techniques such as boosting are often used to improve the performance of predictive models. When mining data with rare events (far less than 5%), boosting may improve a model’s overall predictive power, but the accuracy and efficiency of model estimation suffer when simple random sampling is employed. In this study we investigate the performance of applying boosting to an imbalanced sampling procedure called case-based sampling. We demonstrate the performance of the combined procedure in predicting customer attrition using actual telecommunications data. Our results show that the combination of boosting and case-based sampling is very effective at alleviating the problem of rare events.
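The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one common reading of case-based sampling: keep every rare event and retain a random fraction of the non-events, then rescale the probabilities estimated on that sample back to the population. The function names, the `nonevent_fraction` parameter, and the toy data are assumptions made for illustration, not the authors' implementation, and the boosting step itself is left out.

```python
import random


def case_based_sample(rows, label_of, nonevent_fraction, seed=0):
    """Keep every event row; keep each non-event row with probability
    nonevent_fraction (an assumed, illustrative sampling scheme)."""
    rng = random.Random(seed)
    sample = []
    for row in rows:
        if label_of(row) == 1 or rng.random() < nonevent_fraction:
            sample.append(row)
    return sample


def correct_probability(p_sample, nonevent_fraction):
    """Map a probability estimated on the sample back to the population.

    Events are kept with probability 1 and non-events with probability
    nonevent_fraction, so the sample odds equal the population odds
    divided by nonevent_fraction.
    """
    odds = p_sample / (1.0 - p_sample) * nonevent_fraction
    return odds / (1.0 + odds)


# Toy population: 10,000 customers, 2% of whom are events (label 1).
population = [(i, 1 if i < 200 else 0) for i in range(10_000)]
sample = case_based_sample(population, lambda row: row[1], nonevent_fraction=0.05)

events = sum(1 for _, label in sample if label == 1)
nonevents = len(sample) - events
print(events, nonevents)                # all 200 events are retained
print(correct_probability(0.5, 0.05))   # a 50% sample score on the population scale
```

A model fitted on such a sample, boosted or not, sees a much higher event rate than the population, which is why its raw scores need a correction of this kind before they are read as population-level probabilities.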

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Au, T., Chin, ML.I., Ma, G. (2010). Mining Rare Events Data by Sampling and Boosting: A Case Study. In: Prasad, S.K., Vin, H.M., Sahni, S., Jaiswal, M.P., Thipakorn, B. (eds) Information Systems, Technology and Management. ICISTM 2010. Communications in Computer and Information Science, vol 54. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12035-0_38

  • DOI: https://doi.org/10.1007/978-3-642-12035-0_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12034-3

  • Online ISBN: 978-3-642-12035-0

  • eBook Packages: Computer Science, Computer Science (R0)
