Abstract
Data quality problems are caused by dirty data, and massive data sets are more likely to contain it. Because cloud databases are an important platform for massive data management, they need to manage dirty data. Traditional data-cleaning-based methods cannot clean dirty data entirely and are costly for massive data sets, so this chapter presents a method for managing massive dirty data that returns query results with quality assurance. To achieve this goal, a dirty-data storage structure for cloud databases and a multi-level index structure for query processing are presented. For a query on dirty data, the index is used to select candidate nodes in the cloud, which then process the query efficiently. This chapter discusses the index structure and the index-based query processing techniques. Experimental results show the efficiency and effectiveness of the presented techniques.
Acknowledgements
This research is partially supported by the National Science Foundation of China (No. 61003046), the NSFC-RGC of China (No. 60831160525), the National Grant of High Technology 863 Program of China (No. 2009AA01Z149), the Key Program of the National Natural Science Foundation of China (No. 60933001), the National Postdoctoral Foundation of China (No. 20090450126, No. 201003447), the Doctoral Fund of the Ministry of Education of China (No. 20102302120054), the Postdoctoral Foundation of Heilongjiang Province (No. LBH-Z09109), and the Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Wang, H., Li, J., Wang, J., Gao, H. (2011). Dirty Data Management in Cloud Database. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_7
DOI: https://doi.org/10.1007/978-3-642-20045-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20044-1
Online ISBN: 978-3-642-20045-8
eBook Packages: Computer Science (R0)