Abstract

The paper presents research on how the level of outliers in the data (training and test data considered separately) influences the quality of model prediction in a classification task. A set of 100 semi-artificial time series was considered, whose independent variables were close to real ones observed in an underground coal mining environment and whose dependent variable was generated with a decision tree. For every considered method (decision trees, naive Bayes, logistic regression and kNN), a reference model was built (no outliers in the data), and its quality was compared with the quality of two models: Out-Out (outliers in training and test data) and Non-out-Out (outliers only in test data). Fifty levels of outliers in the data were considered, from 1% to 50%. The models were compared statistically using the sign test.
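As an illustration of the statistical comparison mentioned in the abstract, the sketch below implements a two-sided exact sign test on paired quality scores (one pair per time series). It is a minimal sketch and not the authors' code: the function name `sign_test`, the choice of the exact binomial form, the dropping of ties, and the sample accuracy values are all assumptions made for this example.

```python
from math import comb


def sign_test(reference, contaminated):
    """Two-sided exact sign test for paired scores (hypothetical helper).

    Returns (wins, losses, p_value), where a "win" means the reference
    model scored strictly higher on that pair. Ties are dropped.
    """
    if len(reference) != len(contaminated):
        raise ValueError("paired samples must have equal length")

    wins = sum(r > c for r, c in zip(reference, contaminated))
    losses = sum(r < c for r, c in zip(reference, contaminated))
    n = wins + losses
    if n == 0:
        # Every pair tied: no evidence of a difference.
        return wins, losses, 1.0

    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    # and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Illustrative accuracies only (not results from the paper):
# the reference model is better on 8 of 10 series.
reference_acc = [0.90] * 8 + [0.80] * 2
out_out_acc = [0.85] * 10
wins, losses, p = sign_test(reference_acc, out_out_acc)
print(wins, losses, p)  # 8 2 0.109375
```

In a setting like the one described, such a test would be run once per classifier, per outlier level, and per model variant (Out-Out or Non-out-Out), with the 100 series supplying the pairs.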



Acknowledgments

This work was partially supported by the Polish National Centre for Research and Development (NCBiR) grant PBS2/B9/20/2013 within the framework of the Applied Research Programmes. The infrastructure was supported by the “PL-LAB2020” project, contract POIG.02.03.01-00-104/13-00.

Author information

Corresponding author

Correspondence to Marcin Michalak.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kalisch, M., Michalak, M., Sikora, M., Wróbel, Ł., Przystałka, P. (2016). Influence of Outliers Introduction on Predictive Models Quality. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_5

  • DOI: https://doi.org/10.1007/978-3-319-34099-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34098-2

  • Online ISBN: 978-3-319-34099-9

  • eBook Packages: Computer Science, Computer Science (R0)
