Efficient Subject-Oriented Evaluating and Mining Methods for Data with Schema Uncertainty

  • Conference paper
Advanced Data Mining and Applications (ADMA 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7120)

Abstract

As data collection methods have advanced, large volumes of data have been gathered in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another, so analyzing the integrated data may yield ambiguous results because of the non-uniform schemas. This paper targets mining this kind of data. Its main contributions are: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate, based on the analysis subject, the existence probabilities of records in data with schema uncertainty; (2) it designs a data structure, the "B-correlation tree", that organizes uncertain data and their existence probabilities hierarchically, and discusses how selecting nodes at different levels of the B-correlation tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), based on the B-correlation tree for data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports extensive experiments testing the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty in data efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.
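
The general idea of Monte Carlo evaluation over records that carry existence probabilities can be illustrated with a minimal sketch. This is not the paper's MCE algorithm and it does not use a B-correlation tree or the Cor measure, none of which are specified in the abstract. It assumes that each record exists independently with a known probability, and the function name `monte_carlo_sum`, the `(value, probability)` record layout, and the sample data are all hypothetical.

```python
import random


def monte_carlo_sum(records, num_samples=10_000, seed=0):
    """Estimate the expected SUM over records that each exist with some probability.

    records: list of (value, existence_probability) pairs (hypothetical layout).
    Assumes records exist independently of one another.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Sample one possible world: keep each record with its own probability.
        world_sum = sum(value for value, prob in records if rng.random() < prob)
        total += world_sum
    # The average over sampled worlds approximates the expected aggregate.
    return total / num_samples


# Made-up example data: three records with different existence probabilities.
records = [(10.0, 0.9), (20.0, 0.5), (30.0, 0.1)]
estimate = monte_carlo_sum(records)

# Under independence the expectation has a closed form, useful as a sanity check.
exact = sum(value * prob for value, prob in records)
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

For a simple aggregate like this the closed form makes sampling unnecessary; sampling becomes useful when the analysis has no closed-form expectation. The estimate's standard error shrinks in proportion to the inverse square root of `num_samples`.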

This work is supported by the National Key Technology R&D Program of China (No. 2009BAK63B08), the National High Technology Research and Development Program of China (’863’ Program) (No. 2009AA01Z150), the National Science & Technology Pillar Program of China (No. 2009BAH44B03), and the China Postdoctoral Science Foundation.

References

  1. Aggarwal, C.C.: Managing and Mining Uncertain Data. In: Advances In Database Systems (2009)

  2. Sarma, A.D., Benjelloun, O., Halevy, A., Widom, J.: Working Models for Uncertain Data. In: ICDE 2006 (2006)

  3. Cavallo, R., Pittarelli, M.: The Theory of Probabilistic Databases. In: Proceedings of the 13th VLDB Conference, Brighton (1987)

  4. Barbará, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering 4(5), 487–502 (1992)

  5. Chatfield, C.: Model Uncertainty, Data Mining and Statistical Inference. Journal of the Royal Statistical Society. Series A (Statistics in Society) 158(3), 419–466 (1995)

  6. Dong, X.L., Halevy, A., Yu, C.: Data integration with uncertainty. The VLDB Journal 18(2), 469–500 (2009)

  7. Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic Frequent Itemset Mining in Uncertain Databases. In: Proc. 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD 2009), Paris, France (2009)

  8. http://infolab.stanford.edu/trio

  9. http://www.almaden.ibm.com/cs/projects/avatar

  10. http://www.math.ups.edu/~anierman/umich/prodb

  11. Khoussainova, N., Balazinska, M., Suciu, D.: Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints. In: MobiDE 2006, June 25 (2006)

  12. Jayram, T.S., McGregor, A.: Estimating Statistical Aggregates on Probabilistic Data Streams. In: PODS 2007, June 11-14 (2007)

  13. Metropolis, N., Ulam, S.: The Monte Carlo Method. Journal of the American Statistical Association 44(247), 335–341 (1949)

  14. Stigler, S.M.: A Historical View of Statistical Concepts in Psychology and Educational Research. American Journal of Education 101(1), 60–70 (1992)

  15. Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C., Haas, P.J.: MCDB: a monte carlo approach to managing uncertain data. In: Proceedings of the 2008 ACM SIGMOD (2008)

  16. Xu, F., Beyer, K., Ercegovac, V., Haas, P.J., Shekita, E.J.: E = MC3: managing uncertain enterprise data in a cluster-computing environment. In: Proceedings of the 2009 ACM SIGMOD (2009)

  17. Li, Y., Algarni, A., Zhong, N.: Mining positive and negative patterns for relevance feature discovery. In: KDD 2010 Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)

  18. Renyi, A.: Probability theory. North-Holland, Amsterdam (1970)

  19. Karp, R., Luby, M.: Monte-Carlo Algorithms for Enumeration and Reliability Problems. In: 24th STOC, pp. 56–64 (1983)

  20. http://www.tpc.org/tpch/

  21. Yang, H., Cai, H.: Clinicopatholog analysis on 46 inborn anencephaluses. Chinese Journal of Birth Health and Heredity (2008)

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Tang, C., Wang, T., Yang, D., Zhu, J. (2011). Efficient Subject-Oriented Evaluating and Mining Methods for Data with Schema Uncertainty. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_25

  • DOI: https://doi.org/10.1007/978-3-642-25853-4_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25852-7

  • Online ISBN: 978-3-642-25853-4

  • eBook Packages: Computer Science, Computer Science (R0)
