Visually Extracting Data Records from Query Result Pages

  • Neil Anderson
  • Jun Hong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7808)


Web databases are now pervasive. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches.
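To make the notion of content similarity between data records concrete, the following is a minimal sketch. It assumes, purely for illustration, that similarity is measured as a Jaccard index over the word tokens of each record's visible text. The abstract does not specify the measure used by rExtractor, and the function names `tokens`, `jaccard` and `content_similarity` are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lower-cased alphanumeric tokens in a record's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; treated as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def content_similarity(record_a, record_b):
    """Assumed content-similarity measure: Jaccard index over token sets."""
    return jaccard(tokens(record_a), tokens(record_b))


# Two hypothetical data records that share field labels, plus a noise item.
record_1 = "Price: $10 Author: Smith In stock"
record_2 = "Price: $12 Author: Jones In stock"
nav_bar = "Home | About | Contact"

print(content_similarity(record_1, record_2))  # records share their labels
print(content_similarity(record_1, nav_bar))   # noise item shares nothing
```

Under this assumption, records generated from the same template score higher against each other than against noise items such as navigation bars, because they share field labels even when their values differ.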





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Neil Anderson (1)
  • Jun Hong (1)
  1. School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK