Dirty Data Management in Cloud Database

Wang, Hongzhi; Li, Jianzhong; Wang, Jinbao; Gao, Hong

doi:10.1007/978-3-642-20045-8_7


Chapter
First Online: 01 January 2011

Abstract

Data quality problems are caused by dirty data, and massive data sets are more likely to contain dirty data. Because cloud databases are an important platform for massive data management, they need to manage dirty data. Traditional data-cleaning-based methods cannot clean dirty data entirely and are costly for massive data sets, so this chapter presents a method for managing massive dirty data that returns query results with quality assurance. To achieve this goal, a dirty-database storage structure for cloud databases and a multi-level index structure for query processing are presented. Using this index, a query on dirty data is routed to candidate nodes in the cloud, which process it efficiently. This chapter discusses the index structure and index-based query processing techniques. Experimental results show the efficiency and effectiveness of the presented techniques.
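
The abstract does not specify the index layout, so the following is only a minimal sketch of the general idea of index-based candidate-node selection, not the chapter's actual method. It assumes a hypothetical two-level q-gram index: a global level mapping each q-gram to the nodes that store it, and a local level on each node mapping q-grams to records. The names `Node`, `GlobalIndex`, `candidate_nodes`, the choice of q = 2, and the edit-distance threshold are all illustrative assumptions.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the set of q-grams of s (no padding, for simplicity)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def filter_is_safe(query, max_dist, q=2):
    """Count-filter bound: a match must share a q-gram only if this holds."""
    return len(query) - q + 1 - max_dist * q >= 1


class Node:
    """Hypothetical storage node with a local q-gram index over its values."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)
        self.local_index = defaultdict(set)
        for rid, v in enumerate(self.values):
            for g in qgrams(v):
                self.local_index[g].add(rid)

    def search(self, query, max_dist):
        if filter_is_safe(query, max_dist):
            cand = set()
            for g in qgrams(query):
                cand |= self.local_index.get(g, set())
        else:
            cand = set(range(len(self.values)))  # bound too weak: scan all
        # Verify every candidate record with the exact distance.
        return sorted(self.values[r] for r in cand
                      if edit_distance(self.values[r], query) <= max_dist)


class GlobalIndex:
    """Hypothetical top level: maps each q-gram to the nodes holding it."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.gram_to_nodes = defaultdict(set)
        for n in nodes:
            for g in n.local_index:
                self.gram_to_nodes[g].add(n.name)

    def candidate_nodes(self, query, max_dist):
        if not filter_is_safe(query, max_dist):
            return sorted(self.nodes)  # cannot prune safely
        names = set()
        for g in qgrams(query):
            names |= self.gram_to_nodes.get(g, set())
        return sorted(names)

    def query(self, query, max_dist):
        results = []
        for name in self.candidate_nodes(query, max_dist):
            results.extend((name, v) for v in self.nodes[name].search(query, max_dist))
        return results


# Illustrative data only; not taken from the chapter.
idx = GlobalIndex([
    Node("n1", ["Harbin", "Harbim", "Beijing"]),
    Node("n2", ["Shanghai", "Shenzhen"]),
    Node("n3", ["Harbn", "Lecce"]),
])
print(idx.candidate_nodes("Harbin", 1))
print(idx.query("Harbin", 1))
```

In this sketch the global level prunes nodes that share no q-gram with the query, so only the remaining nodes run the more expensive verification. When the query is too short for the count-filter bound to guarantee a shared q-gram, the sketch falls back to contacting every node rather than risk missing a match. It does not model the quality-assurance part of the query results.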

Acknowledgements

This research is partially supported by National Science Foundation of China (No. 61003046), the NSFC-RGC of China (No. 60831160525), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Postdoctoral Foundation of China (No. 20090450126, No. 201003447), Doctoral Fund of Ministry of Education of China (No. 20102302120054), Postdoctoral Foundation of Heilongjiang Province (No. LBH-Z09109), and Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China
Hongzhi Wang, Jianzhong Li, Jinbao Wang & Hong Gao

Corresponding author

Correspondence to Hongzhi Wang.

Editor information

Editors and Affiliations

Faculty of Engineering, Dept. of Innovation Engineering, University of Salento, Via per Monteroni, 73100, Lecce, Italy
Sandro Fiore Ph.D.
Faculty of Engineering, Dept. of Innovation Engineering, University of Salento, Via per Monteroni, 73100, Lecce, Italy
Giovanni Aloisio

About this chapter

Cite this chapter

Wang, H., Li, J., Wang, J., Gao, H. (2011). Dirty Data Management in Cloud Database. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_7

DOI: https://doi.org/10.1007/978-3-642-20045-8_7
Published: 17 May 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20044-1
Online ISBN: 978-3-642-20045-8
eBook Packages: Computer Science, Computer Science (R0)
