Named Entity Recognition for Web Content Filtering

  • José María Gómez Hidalgo
  • Francisco Carrero García
  • Enrique Puertas Sanz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


Effective Web content filtering is a necessity in educational and workplace environments, but current approaches are far from perfect. We discuss a model for text-based intelligent Web content filtering in which shallow linguistic analysis plays a key role. To demonstrate how this model can be realized, we have developed a lexical Named Entity Recognition system and used it to improve the effectiveness of statistical Automated Text Categorization methods. Our experiments confirm this improvement and encourage the integration of other shallow linguistic processing techniques into intelligent Web content filtering.
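
The idea of feeding lexical named-entity evidence into a statistical text categorizer can be illustrated with a minimal sketch. The abstract does not specify the features or the recognizer the authors used, so everything below is an assumption made for illustration: the capitalization heuristic, the function names `candidate_entities` and `features`, and the feature keys `__NE_COUNT__` and `__NE_DENSITY__` are all hypothetical. The sketch only shows how entity-derived features might be appended to a bag-of-words representation before it is passed to a learner.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z]+")


def candidate_entities(text):
    """Return runs of capitalized tokens that are not sentence-initial.

    Assumption for illustration only: a purely lexical heuristic,
    not the recognizer described in the paper.
    """
    entities = []
    # Naive sentence split on terminal punctuation.
    for sentence in re.split(r"[.!?]+\s*", text):
        tokens = TOKEN_RE.findall(sentence)
        run = []
        for i, tok in enumerate(tokens):
            if i > 0 and tok[0].isupper():
                run.append(tok)  # extend the current capitalized run
            else:
                if run:
                    entities.append(" ".join(run))
                run = []
        if run:
            entities.append(" ".join(run))
    return entities


def features(text):
    """Bag-of-words counts plus two hypothetical entity-based features."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    feats = {}
    for tok in tokens:
        feats[tok] = feats.get(tok, 0) + 1
    entities = candidate_entities(text)
    feats["__NE_COUNT__"] = len(entities)
    feats["__NE_DENSITY__"] = len(entities) / len(tokens) if tokens else 0.0
    return feats


if __name__ == "__main__":
    sample = "The report was written in Madrid by Maria Gomez. It covers filtering."
    print(candidate_entities(sample))
    print(features(sample)["__NE_COUNT__"])
```

The resulting dictionary could be vectorized and handed to any standard classifier; whether such features help in practice would have to be established experimentally, as the authors did for their own system.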


Keywords: Entity Recognition, Linear Support Vector Machine, Uniform Resource Locator, Decision Tree Learner, Training Collection
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • José María Gómez Hidalgo (1)
  • Francisco Carrero García (1)
  • Enrique Puertas Sanz (1)
  1. Universidad Europea de Madrid, Madrid, Spain
