High-Level Rules for Integration and Analysis of Data: New Challenges

Alexe, Bogdan; Burdick, Douglas; Hernández, Mauricio A.; Koutrika, Georgia; Krishnamurthy, Rajasekar; Popa, Lucian; Stanoi, Ioana R.; Wisnesky, Ryan

doi:10.1007/978-3-642-41660-6_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8000)

Abstract

Data integration remains a perennially difficult task. The need to access, integrate, and make sense of large amounts of data has, in fact, intensified in recent years. There are now many publicly available data sources that can provide valuable information in various domains. Concrete examples include bibliographic repositories (DBLP, Cora, Citeseer), online movie databases (IMDB), knowledge bases (Wikipedia, DBpedia, Freebase), and social media data (Facebook, Twitter, blogs). Additionally, a number of more specialized public data repositories are starting to play an increasingly important role. These include, for example, U.S. federal government data, congressional and census data, and financial reports archived by the U.S. Securities and Exchange Commission (SEC).

Author information

Authors and Affiliations

IBM Almaden Research Center, USA
Bogdan Alexe, Douglas Burdick, Mauricio A. Hernández, Rajasekar Krishnamurthy, Lucian Popa & Ioana R. Stanoi
HP Labs, USA
Georgia Koutrika
School of Engineering and Applied Sciences, Harvard University, USA
Ryan Wisnesky

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Pennsylvania, 3330 Walnut St., 19104, Philadelphia, PA, USA
Val Tannen
School of Computing, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
Limsoon Wong
School of Informatics, The University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, UK
Leonid Libkin, Wenfei Fan & Michael Fourman
Department of Computer Science, University of California, 1156 High Street, 95064, Santa Cruz, CA, USA
Wang-Chiew Tan

About this chapter

Cite this chapter

Alexe, B. et al. (2013). High-Level Rules for Integration and Analysis of Data: New Challenges. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, WC., Fourman, M. (eds) In Search of Elegance in the Theory and Practice of Computation. Lecture Notes in Computer Science, vol 8000. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41660-6_3

DOI: https://doi.org/10.1007/978-3-642-41660-6_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41659-0
Online ISBN: 978-3-642-41660-6
eBook Packages: Computer Science, Computer Science (R0)
