IFTA: Iterative filtering by using TF-AICL algorithm for Chinese encyclopedia knowledge refinement

Abstract

Inaccurate knowledge remains a problem in the Semantic Web. Knowledge inaccuracy in knowledge bases (KBs) is one of the largest obstacles to the development of Linked Open Data (LOD) and knowledge graphs (KGs). This paper addresses the semantic ambiguity and improper classification of knowledge triples that arise when constructing Chinese online encyclopedia KBs. First, a new algorithm, TF-AICL, is proposed to calculate the concentration level of predicates in each top-category. Second, the predicate that best represents the features of a top-category is selected, and the related predicate candidate set is extracted. Third, using a strategy that counts positive and negative examples, the predicate candidate set serves as the comparison group for filtering each entity. Finally, these steps are combined into a new iterative filtering method called IFTA. IFTA's predicate feature extraction method, TF-AICL, takes the hierarchical features of predicates into account, and IFTA automatically prunes, filters and refines large-scale online encyclopedia knowledge in an iterative way. Precision, recall and F-measure results on the BaiduBaike and Hudong datasets indicate that IFTA refines open-domain Chinese encyclopedia KBs more effectively than state-of-the-art methods.
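
The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not state the TF-AICL formula, so the weighting below is a simple stand-in (within-category frequency multiplied by the share of a predicate's occurrences that fall in the category) and is not the authors' algorithm. The function names, the parameter `k`, the keep rule (positives at least equal to negatives) and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def predicate_concentration(entities):
    """Score each predicate per category.

    Stand-in weighting (NOT the TF-AICL formula, which the abstract omits):
    (share of the category's entities using p) *
    (share of p's occurrences that fall in the category).
    """
    per_cat = defaultdict(Counter)   # category -> predicate -> entity count
    overall = Counter()              # predicate -> entity count over all categories
    cat_size = Counter()             # category -> number of entities
    for category, predicates in entities.values():
        cat_size[category] += 1
        for p in predicates:
            per_cat[category][p] += 1
            overall[p] += 1
    return {
        c: {p: (n / cat_size[c]) * (n / overall[p]) for p, n in counts.items()}
        for c, counts in per_cat.items()
    }


def candidate_set(scores, category, k):
    """Top-k predicates of a category; ties are broken alphabetically."""
    ranked = sorted(scores[category].items(), key=lambda kv: (-kv[1], kv[0]))
    return {p for p, _ in ranked[:k]}


def filter_once(entities, k):
    """Keep an entity if its predicates inside the candidate set (positives)
    are at least as many as those outside it (negatives). Assumed rule."""
    scores = predicate_concentration(entities)
    kept = {}
    for name, (category, predicates) in entities.items():
        candidates = candidate_set(scores, category, k)
        positives = len(predicates & candidates)
        negatives = len(predicates - candidates)
        if positives >= negatives:
            kept[name] = (category, predicates)
    return kept


def iterative_filter(entities, k=2, max_rounds=10):
    """Recompute scores and filter again until no entity is removed."""
    current = dict(entities)
    for _ in range(max_rounds):
        filtered = filter_once(current, k)
        if len(filtered) == len(current):
            break
        current = filtered
    return current


# Hypothetical toy data: "Yangtze" is deliberately filed under the wrong category.
sample = {
    "Li Bai": ("person", {"birth_date", "occupation"}),
    "Du Fu": ("person", {"birth_date", "occupation"}),
    "Su Shi": ("person", {"birth_date", "occupation", "nationality"}),
    "Yangtze": ("person", {"length", "source", "mouth"}),
    "Yellow River": ("river", {"length", "source"}),
    "Pearl River": ("river", {"length", "mouth"}),
}
refined = iterative_filter(sample, k=2)
print(sorted(refined))
```

On this toy input the misfiled "Yangtze" entry shares no predicates with the candidate set of its assigned category and is pruned in the first round; the second round removes nothing, so the loop stops.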

Notes

  1. http://baike.baidu.com
  2. http://www.baike.com
  3. https://wiki.dbpedia.org

Acknowledgements

This work was supported in part by the Scientific Research Project of Beijing Municipal Education Commission (General Social Science Project) [grant number SM201910038010]; National Social Science Fund of China [grant number 19BXW120]; Backup Academic Leaders Grant of Capital University of Economics and Business; and Special Fund of Fundamental Research Expenses of Beijing Municipal University of Capital University of Economics and Business.

Author information

Corresponding author

Correspondence to Ting Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, T., Guo, J., Wu, Z. et al. IFTA: Iterative filtering by using TF-AICL algorithm for Chinese encyclopedia knowledge refinement. Appl Intell (2021). https://doi.org/10.1007/s10489-021-02220-w

Keywords

  • Knowledge base
  • Online encyclopedia
  • Knowledge refining
  • Iterative algorithm
  • Knowledge graph