SEMI: A Scalable Entity Matching System Based on MapReduce
The MapReduce framework provides a new platform for data integration in distributed environments. We demonstrate a MapReduce-based entity resolution framework that efficiently solves the matching problem for structured, semi-structured, and unstructured entities. We propose a random-based data representation method that reduces network transmission, implement our design on MapReduce, and design two solutions for reducing redundant comparisons. Our demo provides an easy-to-use platform for entity matching and performance analysis. We also compare the performance of our algorithm with state-of-the-art blocking-based methods.
Keywords: MapReduce Framework, Huge Data, Entity Resolution, Local Sensitive Hashing, Entity Pair
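The abstract does not spell out the random-based representation or the two redundancy-reduction solutions, so the following is only a minimal single-process sketch of one plausible reading, not the system's actual design. It assumes entities are already encoded as numeric feature vectors, uses random-hyperplane locality sensitive hashing to produce short bit signatures (so that only signatures, not full records, would cross the network), blocks on signature bands, and compares each pair only in the first band the two signatures share. All names (`make_hyperplanes`, `signature`, `band_keys`, `map_phase`, `shuffle`, `reduce_phase`, `match_candidates`) and all parameter values are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def band_keys(sig, rows_per_band):
    """Split a signature into (band index, band bits) blocking keys."""
    return [
        (band, sig[start:start + rows_per_band])
        for band, start in enumerate(range(0, len(sig), rows_per_band))
    ]


def map_phase(entities, planes, rows_per_band):
    """Emit (blocking key, (id, signature)); the full vector is not emitted."""
    for entity_id, vec in entities.items():
        sig = signature(vec, planes)
        for key in band_keys(sig, rows_per_band):
            yield key, (entity_id, sig)


def shuffle(pairs):
    """Group mapper output by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values, rows_per_band):
    """Compare a pair only in the first band where its signatures agree."""
    band = key[0]
    for (id_a, sig_a), (id_b, sig_b) in combinations(sorted(values), 2):
        keys_a = band_keys(sig_a, rows_per_band)
        keys_b = band_keys(sig_b, rows_per_band)
        first_shared = next(ka[0] for ka, kb in zip(keys_a, keys_b) if ka == kb)
        if first_shared != band:
            continue  # another reducer handles this pair; skip the duplicate
        agreement = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
        yield (id_a, id_b), agreement


def match_candidates(entities, n_bits=8, rows_per_band=2, seed=0):
    """Run map, shuffle, and reduce in one process and collect scored pairs."""
    dim = len(next(iter(entities.values())))
    planes = make_hyperplanes(dim, n_bits, seed)
    groups = shuffle(map_phase(entities, planes, rows_per_band))
    results = []
    for key in sorted(groups):
        results.extend(reduce_phase(key, groups[key], rows_per_band))
    return results
```

In this sketch, two entities with identical signatures fall into every band's bucket together, yet the pair is scored once, in band 0. The fraction of agreeing bits serves only as a rough similarity estimate; a real deployment would presumably follow it with an exact comparison of the candidate records.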