2D Correlative-Chain Conditional Random Fields for Semantic Annotation of Web Objects

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Semantic annotation of Web objects is a key problem in Web information extraction. The Web contains an abundance of useful semi-structured information about real-world objects, and an empirical study shows that Web information about objects of the same type exhibits strong two-dimensional sequence characteristics and correlative characteristics across different Web sites. Conditional Random Fields (CRFs) are the state-of-the-art approach for exploiting sequence characteristics to improve labeling. However, because correlative characteristics also exist between Web object elements, previous CRFs are limited for semantic annotation of Web objects and cannot handle long-distance dependencies between Web object elements efficiently. To better incorporate long-distance dependencies, this paper first describes them by correlative edges, which are built from structured information and the characteristics of records in external databases. It then presents two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs) for semantic annotation of Web objects. This approach extends a classic model, two-dimensional Conditional Random Fields (2DCRFs), by adding correlative edges. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach can significantly improve the semantic annotation accuracy of Web objects.
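
The idea of extending a two-dimensional grid with correlative edges can be illustrated with a minimal sketch of the graph structure alone. It is not the paper's method: the function names, the toy data, and the `correlated` predicate (a stand-in for the external-database lookup the abstract mentions) are all assumptions made for illustration, and no CRF training or inference is shown.

```python
from itertools import combinations


def grid_edges(rows, cols):
    """Edges between horizontally and vertically adjacent cells of a grid."""
    edges = set()
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                edges.add(((i, j), (i, j + 1)))  # horizontal neighbour
            if i + 1 < rows:
                edges.add(((i, j), (i + 1, j)))  # vertical neighbour
    return edges


def correlative_edges(cells, correlated, base_edges):
    """Extra edges between non-adjacent cells linked by a hypothetical predicate."""
    extra = set()
    for a, b in combinations(sorted(cells), 2):
        if (a, b) not in base_edges and correlated(cells[a], cells[b]):
            extra.add((a, b))
    return extra


# Toy 2 x 3 grid of element texts (made-up data, not from the paper).
cells = {
    (0, 0): "Title", (0, 1): "J. Smith", (0, 2): "2005",
    (1, 0): "Venue", (1, 1): "ICML", (1, 2): "J. Smith",
}
base = grid_edges(2, 3)
# Assumed stand-in for a database lookup: link elements with identical text.
extra = correlative_edges(cells, lambda s, t: s == t, base)
```

Here `base` holds the neighbour edges of a plain two-dimensional grid, while `extra` holds the long-distance links; a model over the union of both edge sets can relate the two distant "J. Smith" elements directly instead of only through a chain of neighbours.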

Author information

Corresponding author

Correspondence to Qing-Zhong Li.

Additional information

Supported by the National Natural Science Foundation of China under Grant No. 90818001 and the Natural Science Foundation of Shandong Province of China under Grant No. Y2007G24.

About this article

Cite this article

Ding, YH., Li, QZ., Dong, YQ. et al. 2D Correlative-Chain Conditional Random Fields for Semantic Annotation of Web Objects. J. Comput. Sci. Technol. 25, 761–770 (2010). https://doi.org/10.1007/s11390-010-9363-8

  • DOI: https://doi.org/10.1007/s11390-010-9363-8
