Mining Web Sites Using Wrapper Induction, Named Entities, and Post-processing

  • Georgios Sigletos
  • Georgios Paliouras
  • Constantine D. Spyropoulos
  • Michalis Hatzopoulos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3209)


This paper presents a new framework for extracting information from collections of Web pages across different sites. The framework applies a standard wrapper induction algorithm to pages in which named entities have already been identified, so that the wrapper can exploit this information. Post-processing of the extraction results is introduced to resolve ambiguous fields and improve overall extraction performance. It exploits two additional sources of information: field transition probabilities, estimated by a trained bigram model, and confidence scores, estimated for each field by the wrapper induction system. A multiplicative model based on the product of these two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of the new framework.
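
The multiplicative post-processing idea can be illustrated with a small sketch. This is not the authors' implementation: the field names, the add-one smoothing, the `<start>` marker, the greedy left-to-right decoding, and the helper names `train_bigram` and `resolve` are all assumptions made for illustration. The sketch only shows how a bigram transition probability and a wrapper confidence score might be multiplied to pick one field among ambiguous candidates.

```python
from collections import Counter, defaultdict

START = "<start>"  # hypothetical marker for the beginning of a page


def train_bigram(sequences, smoothing=1.0):
    """Estimate P(next_field | prev_field) from labelled field sequences.

    Add-k smoothing is an assumption of this sketch, not taken from the paper.
    """
    fields = sorted({f for seq in sequences for f in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip([START] + seq[:-1], seq):
            counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(counts[prev].values()) + smoothing * len(fields)
        return (counts[prev][nxt] + smoothing) / total

    return prob


def resolve(candidates, prob):
    """Greedily pick, for each extracted instance, the field that maximises
    transition probability times wrapper confidence."""
    prev, chosen = START, []
    for options in candidates:  # options maps field name -> confidence score
        best = max(options, key=lambda f: prob(prev, f) * options[f])
        chosen.append(best)
        prev = best
    return chosen


# Hypothetical training sequences of fields as they appear on laptop pages.
training = [
    ["model", "processor", "ram", "hdd", "price"],
    ["model", "processor", "ram", "hdd", "price"],
    ["model", "ram", "hdd", "price"],
]
prob = train_bigram(training)

# The third and fourth instances are ambiguous between "ram" and "hdd".
candidates = [
    {"model": 0.9},
    {"processor": 0.8},
    {"ram": 0.45, "hdd": 0.55},
    {"ram": 0.5, "hdd": 0.5},
]
print(resolve(candidates, prob))  # -> ['model', 'processor', 'ram', 'hdd']
```

In this toy input, confidence alone would label the third instance as "hdd" (0.55 against 0.45), while the product with the transition probability from "processor" favours "ram". A full system might instead search over whole field sequences rather than decide greedily.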


Keywords: Information Extraction; Confidence Score; Extraction Rule; Entity Recognition; Defense Advance Research Project Agency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Georgios Sigletos (1, 2)
  • Georgios Paliouras (1)
  • Constantine D. Spyropoulos (1)
  • Michalis Hatzopoulos (2)

  1. Institute of Informatics and Telecommunications, NCSR “Demokritos”, Athens, Greece
  2. Department of Informatics and Telecommunications, University of Athens, Athens, Greece
