Leveraging Visual Features and Hierarchical Dependencies for Conference Information Extraction

  • Yue You
  • Guandong Xu
  • Jian Cao
  • Yanchun Zhang
  • Guangyan Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7808)

Abstract

Traditional information extraction methods rely mainly on visual features; because they do not consider the hierarchical dependencies within the paragraph structure, they miss some important information. This paper proposes an integrated approach for extracting academic information from conference Web pages. First, Web pages are segmented into text blocks by a new hybrid page segmentation algorithm that combines visual features with the DOM structure. Then, these text blocks are labeled by a Tree-structured Conditional Random Fields model, and the block functions are differentiated using visual features, semantic features and hierarchical dependencies. Finally, an additional post-processing step refines the initial annotation results. Experimental results on real-world data sets demonstrate that the proposed method extracts the needed academic information from conference Web pages effectively and accurately.
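
The role of hierarchical dependencies in the labeling step can be illustrated with a small sketch. It is not the paper's model: it performs plain max-sum dynamic programming over a tree of text blocks, with no training. The label set, the `Block` class, the `edge_score` function and all scores are hypothetical, chosen only to show how a parent block's label can change the label chosen for a child block.

```python
from dataclasses import dataclass, field

# Hypothetical label set; the paper's actual block functions are not given here.
LABELS = ("heading", "date", "other")


@dataclass
class Block:
    text: str
    node_score: dict  # label -> made-up score standing in for visual/semantic features
    children: list = field(default_factory=list)


def edge_score(parent_label, child_label):
    # Hypothetical hierarchical dependency: a date is likelier under a heading.
    return 1.0 if (parent_label, child_label) == ("heading", "date") else 0.0


def _upward(block, table):
    # For each label of this block, store the best subtree score and the
    # child labels that achieve it.
    for child in block.children:
        _upward(child, table)
    scores = {}
    for lab in LABELS:
        total = block.node_score.get(lab, 0.0)
        choices = []
        for child in block.children:
            child_scores = table[id(child)]
            best = max(LABELS, key=lambda c: child_scores[c][0] + edge_score(lab, c))
            total += child_scores[best][0] + edge_score(lab, best)
            choices.append(best)
        scores[lab] = (total, choices)
    table[id(block)] = scores


def _downward(block, label, table, out):
    # Follow the stored choices from the root to read off the best labeling.
    out.append((block.text, label))
    for child, child_label in zip(block.children, table[id(block)][label][1]):
        _downward(child, child_label, table, out)


def label_tree(root):
    table = {}
    _upward(root, table)
    root_label = max(LABELS, key=lambda lab: table[id(root)][lab][0])
    out = []
    _downward(root, root_label, table, out)
    return out


root = Block(
    "Important Dates",
    {"heading": 2.0, "other": 0.5},
    [
        Block("Submission: 1 May", {"date": 0.4, "other": 0.5}),
        Block("Notification: 1 June", {"date": 1.0, "other": 0.2}),
    ],
)
result = label_tree(root)
```

Taken in isolation, the first child block scores slightly higher as `other` (0.5 against 0.4), but the dependency on its `heading` parent tips the joint labeling to `date`. This is the kind of information a purely block-local, visual-feature classifier would not use.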

Keywords

Information Extraction; Visual Feature; DOM Structure; Tree-structured Conditional Random Fields

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yue You (1, 2)
  • Guandong Xu (3)
  • Jian Cao (1)
  • Yanchun Zhang (2, 4)
  • Guangyan Huang (2)

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
  2. Centre for Applied Informatics, Victoria University, Australia
  3. Advanced Analytics Institute, University of Technology Sydney, Australia
  4. University of Chinese Academy of Sciences, China
