Knowledge and Information Systems, Volume 55, Issue 1, pp 171–214

Active instance matching with pairwise constraints and its application to Chinese knowledge base construction

  • Weiming Lu
  • Hao Dai
  • Zhenyu Zhang
  • Chao Wu
  • Yueting Zhuang
Regular Paper

Abstract

Instance matching is the problem of determining whether two instances describe the same real-world entity. It plays a key role in data integration and data cleansing, especially when building a knowledge base. For example, we can regard each article in an encyclopedia as an instance, and a group of articles that refer to the same real-world object as an entity. Articles about Washington should therefore be distinguished and grouped into different entities such as Washington, D.C. (the capital of the USA), George Washington (the first president of the USA), Washington (a state of the USA), Washington (a village in West Sussex, England), Washington F.C. (a football club based in Washington, Tyne and Wear, England), and Washington, D.C. (a novel). In this paper, we propose a novel instance matching approach, Active Instance Matching with Pairwise Constraints, which brings the human into the loop of instance matching. The approach generates candidate pairs in advance to reduce the computational complexity, and then iteratively selects the most informative pairs according to their uncertainty, influence, connectivity and diversity. We evaluated our approach on two publicly available datasets, AMINER and WIDE-NUS, and then applied it to two large-scale real-world datasets, Baidu Baike and Hudong Baike, to build a Chinese knowledge base. The experiments and practice illustrate the effectiveness of our approach.
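
The two stages named in the abstract, candidate pair generation followed by iterative selection of informative pairs, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes token-based blocking for candidate generation, Jaccard similarity over title tokens as the matching score, and uncertainty alone (a score closest to 0.5) as the selection criterion, leaving out influence, connectivity and diversity. The function names and example titles are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(title):
    """Lowercase a title and split it into a set of word tokens."""
    cleaned = title.lower()
    for ch in ",()":
        cleaned = cleaned.replace(ch, " ")
    return set(cleaned.split())


def candidate_pairs(titles):
    """Blocking (assumed scheme): pair only instances sharing a token."""
    index = defaultdict(set)
    for i, title in enumerate(titles):
        for tok in tokens(title):
            index[tok].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


def jaccard(a, b):
    """Jaccard similarity of the token sets of two titles."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def most_uncertain(pairs, titles, labeled):
    """Return the unlabeled pair whose similarity is closest to 0.5."""
    unlabeled = [p for p in pairs if p not in labeled]
    if not unlabeled:
        return None
    return min(
        unlabeled,
        key=lambda p: abs(jaccard(titles[p[0]], titles[p[1]]) - 0.5),
    )


# Hypothetical instances; a human would answer each selected pair.
titles = [
    "George Washington",
    "George Washington (USA president)",
    "Washington (state)",
    "Hangzhou",
]
pairs = candidate_pairs(titles)  # 3 candidate pairs instead of all 6
labeled = {}                     # pair -> True (must-link) / False (cannot-link)
query = most_uncertain(pairs, titles, labeled)
```

In a full loop, the answer to each query would be stored in `labeled` as a pairwise constraint and the selection repeated until the labeling budget is used up.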

Keywords

  • Instance matching
  • Active clustering
  • Pairwise constraint
  • Knowledge base

Notes

Acknowledgements

This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020015), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and the Fundamental Research Funds for the Central Universities (No. 2017FZA5016).

Copyright information

© Springer-Verlag London Ltd. 2017

Authors and Affiliations

  • Weiming Lu (1)
  • Hao Dai (1)
  • Zhenyu Zhang (1)
  • Chao Wu (2)
  • Yueting Zhuang (1)

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou Shi, China
  2. Imperial College London, London, UK
