Salt: Scalable Automated Linking Technology for Data-Intensive Computing

Abstract

One of the most complex tasks in a data processing environment is record linkage: the data integration process of accurately matching or clustering records or documents from multiple data sources that refer to the same entity, such as a person or business. The massive amount of data being collected at many organizations has led to what is now called the “Big Data” problem, which limits the capability of organizations to process and use their data effectively and makes record linkage even more challenging [3, 13]. New high-performance data-intensive computing architectures that support scalable parallel processing, such as Hadoop MapReduce and HPCC, allow government, commercial, and research organizations to process massive amounts of data and solve complex data processing problems, including record linkage. A fundamental challenge of data-intensive computing is developing new algorithms that can scale to search and process big data [17]. SALT (Scalable Automated Linking Technology) is a new tool that automatically generates code in the ECL language for the open source HPCC scalable data-intensive computing platform. Working from a simple specification, it addresses the most common data integration tasks, including data profiling, data cleansing, data ingest, and record linkage.

References

  1. Bilenko, M., & Mooney, R. J. (2003, August 24–27). Adaptive duplicate detection using learnable string similarity measures. Proceedings of the KDD ’03 Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., 39–48.

  2. Branting, L. K. (2003). A comparative evaluation of name-matching algorithms. Proceedings of the ICAIL ’03 9th International Conference on Artificial Intelligence and Law, Edinburgh, Scotland, 224–232.

  3. Christen, P. (2008). Automatic record linkage using seeded nearest neighbor and support vector machine classification. Proceedings of the KDD ’08 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, 151–159.

  4. Cochinwala, M., Dalal, S., Elmagarmid, A. K., & Verykios, V. V. (2001). Record matching: Past, present and future (No. Technical Report CSD-TR #01-013): Department of Computer Sciences, Purdue University.

  5. Cohen, W., & Richman, J. (2001). Learning to match and cluster entity names. Proceedings of the ACM SIGIR ’01 Workshop on Mathematical/Formal Methods in IR.

  6. Cohen, W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. Proceedings of the KDD ’02 Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.

  7. Cohen, W. W. (2000). Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3).

  8. Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003, August). A comparison of string distance metrics for name matching tasks. Proceedings of the IJCAI-03 Workshop on Information Integration, Acapulco, Mexico, 73–78.

  9. Dunn, H. L. (1946). Record linkage. American Journal of Public Health, 36, 1412–1415.

  10. Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.

  11. Gravano, L., Ipeirotis, P. G., Koudas, N., & Srivastava, D. (2003, May 20–24). Text joins in an RDBMS for web data integration. Proceedings of the WWW ’03 12th international conference on World Wide Web, Budapest, Hungary.

  12. Gu, L., Baxter, R., Vickers, D., & Rainsford, C. (2003). Record linkage: Current practice and future directions (No. CMIS Technical Report No. 03/83): CSIRO Mathematical and Information Sciences.

  13. Herzog, T. N., Scheuren, F. J., & Winkler, W. E. (2007). Data quality and record linkage techniques. New York: Springer Science and Business Media LLC.

  14. Jones, K. S. (1972). A statistical interpretation of term specificity and its application in information retrieval. Journal of Documentation, 28(1), 11–21.

  15. Koudas, N., Marathe, A., & Srivastava, D. (2004). Flexible string matching against large databases in practice. Proceedings of the 30th VLDB Conference, Toronto, Canada, 1078–1086.

  16. Maggi, F. (2008). A survey of probabilistic record matching models, techniques and tools (No. Advanced Topics in Information Systems B, Cycle XXII, Scientific Report TR-2008-22): DEI, Politecnico di Milano.

  17. Middleton, A. M. (2010). Data-intensive technologies for cloud computing. In B. Furht & A. Escalante (Eds.), Handbook of cloud computing (pp. 83–136). New York: Springer.

  18. Newcombe, H. B., & Kennedy, J. M. (1962). Record linkage. Communications of the ACM, 5(11), 563–566.

  19. Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130, 954–959.

  20. Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503–520.

  21. Winkler, W. E. (1989). Frequency-based matching in Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 778–783.

  22. Winkler, W. E. (1994). Advanced methods for record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 274–279.

  23. Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox, D. A. Binder, B. N. Chinnappa, M. J. Christianson, M. J. Colledge & P. S. Kott (Eds.), Business survey methods. New York: John Wiley & Sons.

  24. Winkler, W. E. (1999). The state of record linkage and current research problems: U.S. Bureau of the Census Statistical Research Division.

  25. Winkler, W. E. (2001). Record linkage software and methods for merging administrative lists (No. Statistical Research Report Series No. RR/2001/03). Washington, D.C.: US Bureau of the Census.

Author information

Correspondence to Anthony M. Middleton.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Middleton, A.M., Bayliss, D.A. (2011). Salt: Scalable Automated Linking Technology for Data-Intensive Computing. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_8

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer Science (R0)