Research on Wine Analysis Based on Data Preprocessing

  • Xinfei Meng
  • Xiaolan Zhu (corresponding author)
  • Shenghao Yang
  • Lu Wang
  • Jun Qi
  • Pei Yang
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1075)


In an era of explosive data growth, data preprocessing is particularly important for extracting information from massive data sets. In this paper, data preprocessing was implemented by building models for missing data imputation, duplicate value removal, outlier detection, data standardization, and data reduction on the wine data set from the UCI repository. The preprocessed data were then compared with the raw data using the K-means algorithm, a linear regression model, and a decision tree classification algorithm. The experimental results show that, after preprocessing, the clustering error was significantly reduced, the fit of the linear regression model improved, and the classification accuracy of the decision tree increased. These results demonstrate the importance of data preprocessing and may serve as a reference for optimizing data processing.
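The preprocessing stages named in the abstract can be illustrated with a minimal sketch. The abstract does not say which concrete techniques were used, so everything below is an assumption chosen for illustration: mean imputation, exact-match duplicate removal, the 1.5 × IQR rule for outliers, and z-score standardization. The two-column toy data and all function names are hypothetical, and the data reduction step is omitted.

```python
from statistics import mean, pstdev, quantiles

def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    cols = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def drop_iqr_outliers(rows, k=1.5):
    """Drop rows with any value outside [Q1 - k*IQR, Q3 + k*IQR] for its column."""
    bounds = []
    for col in zip(*rows):
        q1, _, q3 = quantiles(col, n=4)
        iqr = q3 - q1
        bounds.append((q1 - k * iqr, q3 + k * iqr))
    return [r for r in rows if all(lo <= v <= hi for v, (lo, hi) in zip(r, bounds))]

def standardize(rows):
    """Z-score each column: (x - mean) / population standard deviation."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # guard constant columns
    return [[(v - m) / s for v, (m, s) in zip(r, stats)] for r in rows]

def preprocess(rows):
    """Chain the assumed stages: impute, deduplicate, filter outliers, standardize."""
    return standardize(drop_iqr_outliers(drop_duplicates(impute_mean(rows))))

# Hypothetical toy rows (alcohol, pH); not taken from the UCI wine data.
raw = [
    [12.8, 3.2],
    [13.2, None],   # missing pH value
    [12.8, 3.2],    # exact duplicate of the first row
    [13.0, 3.3],
    [12.9, 3.1],
    [13.1, 3.4],
    [25.0, 3.2],    # implausible alcohol value
]

clean = preprocess(raw)
print(len(raw), "rows in,", len(clean), "rows out")
```

The order of the stages matters in this sketch: imputing first means the outlier bounds and the z-scores are computed on complete rows, and filtering outliers before standardizing keeps extreme values from inflating the column standard deviation.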


Keywords: Data preprocessing · Missing data imputation · Duplicate value removal · Outlier detection · Data standardization · Data reduction



This paper is partially supported by the National Natural Science Foundation of China (No. 61563044, 61866031); the National Natural Science Foundation of Qinghai Province (No. 2017-ZJ-902); the Applied Basic Research Programs of the Science and Technology Department of Sichuan Province (No. 2019YJ0110); the Youth Foundation of Qinghai University (No. 2017-QGY-4, 2018-QGY-7); the Teaching Research Project of Qinghai University (KC18038, SZ18015, JY201805); and the Open Research Fund Program of the State Key Laboratory of Hydroscience and Engineering (No. sklhse-2017-A-05).



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Xinfei Meng (1)
  • Xiaolan Zhu (1)
  • Shenghao Yang (1)
  • Lu Wang (1)
  • Jun Qi (1)
  • Pei Yang (1)
  1. Department of Computer Technology and Applications, Qinghai University, Xining, China
