
SEMI: A Scalable Entity Matching System Based on MapReduce

  • Pingfu Chao
  • Yuming Li
  • Zhu Gao
  • Junhua Fang
  • Xiaofeng He
  • Rong Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9093)

Abstract

The MapReduce framework provides a new platform for data integration in distributed environments. We demonstrate a MapReduce-based entity resolution framework that efficiently solves the matching problem for structured, semi-structured, and unstructured entities. We propose a random-based data representation method that reduces network transmission, implement our design on MapReduce, and design two solutions for reducing redundant comparisons. Our demo provides an easy-to-use platform for entity matching and performance analysis. We also compare the performance of our algorithm with state-of-the-art blocking-based methods.
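The abstract does not spell out the representation method or the two redundancy-reduction solutions, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a random-hyperplane (sign random projection) signature as the compact record representation, band-based blocking keys in the map phase, and a "compare only in the first shared band" rule to avoid comparing the same pair twice. All names and parameter values (`NUM_BITS`, `BAND_SIZE`, `signature`, `match`, and so on) are hypothetical, and the map and reduce phases are simulated in a single process.

```python
import random
from collections import defaultdict
from itertools import combinations

NUM_BITS = 16   # signature length in bits (illustrative value, an assumption)
BAND_SIZE = 4   # bits per blocking band (illustrative value, an assumption)


def hyperplane_weight(bit, token):
    """Deterministic Gaussian weight of `token` on hyperplane `bit`."""
    return random.Random(f"{bit}:{token}").gauss(0.0, 1.0)


def signature(text):
    """Sign-random-projection signature of a record's bag of words."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    return tuple(
        1 if sum(c * hyperplane_weight(bit, t) for t, c in counts.items()) >= 0 else 0
        for bit in range(NUM_BITS)
    )


def bands(sig):
    """Split a signature into (band index, bit slice) blocking keys."""
    return [(i, sig[i * BAND_SIZE:(i + 1) * BAND_SIZE])
            for i in range(NUM_BITS // BAND_SIZE)]


def map_phase(records):
    """Emit (blocking key, (record id, signature)); raw text is not shipped."""
    for rid, text in records.items():
        sig = signature(text)
        for key in bands(sig):
            yield key, (rid, sig)


def reduce_phase(key, members):
    """Compare pairs in one block, skipping pairs owned by an earlier band."""
    band_index = key[0]
    for (a, sig_a), (b, sig_b) in combinations(sorted(members), 2):
        first_shared = min(
            i for (i, x), (_, y) in zip(bands(sig_a), bands(sig_b)) if x == y
        )
        if first_shared != band_index:
            continue  # this pair is compared in band `first_shared` instead
        # Fraction of agreeing bits, used here as a proxy for cosine similarity.
        agree = sum(p == q for p, q in zip(sig_a, sig_b))
        yield (a, b), agree / NUM_BITS


def match(records):
    """Single-process stand-in for the shuffle between map and reduce."""
    blocks = defaultdict(list)
    for key, value in map_phase(records):
        blocks[key].append(value)
    for key, members in blocks.items():
        yield from reduce_phase(key, members)


if __name__ == "__main__":
    sample = {
        "r1": "apple iphone 6 16gb gold",
        "r2": "apple iphone 6 16gb gold",
        "r3": "samsung galaxy s5 black",
    }
    for pair, score in match(sample):
        print(pair, score)
```

In this sketch only the record identifier and its fixed-length bit signature cross the simulated shuffle, which is one plausible way a compact representation could reduce network transmission. A pair that collides in several bands is scored once, in the lowest-numbered band it shares.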

Keywords

MapReduce Framework; Huge Data; Entity Resolution; Local Sensitive Hashing; Entity Pair



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Pingfu Chao (1, 2)
  • Yuming Li (1, 2)
  • Zhu Gao (2)
  • Junhua Fang (1, 2)
  • Xiaofeng He (1, 2)
  • Rong Zhang (1, 2)
  1. Institute for Data Science and Engineering, East China Normal University, Shanghai, China
  2. Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China
