Automatic Discovery of Web Content Related to IT in the Mexican Internet Based on Supervised Classifiers

Martínez-Rodríguez, José-Lázaro; Sosa-Sosa, Víctor-Jesús; López-Arévalo, Iván

doi:10.1007/978-3-642-37807-2_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7629)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

General web search engines such as Google, Yahoo and Bing have been very successful information retrieval tools. However, many users with domain-specific interests are still disappointed with the responses obtained from these generic tools. This situation has motivated the creation of domain-specific search engines, which can offer increased accuracy at a lower maintenance and infrastructure cost. This paper introduces a method to discover domain-specific web content delimited by a country context. The method allows a search engine to improve its accuracy for users who are interested in domain-specific web content from a particular country. It is based on supervised classifiers and defines country bounds for the search. To delimit the country context, the web content extraction process takes information from different sources, such as Uniform Resource Locators (URLs), official government web pages, the Network Information Center (NIC) and the IP addresses reserved for the country of interest. Details of the system architecture are presented. A proof of concept was carried out using the Information and Communication Technologies (ICT) domain in the Mexican context. The testing prototype has obtained encouraging results.
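As an illustration of the country-context idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that a page belongs to the country context if its host name ends in the country-code top-level domain, or if its already-resolved IP address falls inside a block reserved for the country. The function names are hypothetical, and the CIDR blocks are placeholder documentation ranges, not real Mexican allocations; a real list would come from a source such as the NIC or a country IP block registry.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption: ".mx" is used as the country-code suffix for the Mexican context.
COUNTRY_TLD = ".mx"

# Assumption: placeholder CIDR blocks (reserved documentation ranges),
# standing in for the IP ranges actually assigned to the country.
COUNTRY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def host_in_country_tld(url, tld=COUNTRY_TLD):
    """Return True if the URL's host name ends with the country-code suffix."""
    host = urlparse(url).hostname or ""  # hostname is None when the URL has no scheme
    return host.lower().endswith(tld)


def ip_in_country(ip, networks=COUNTRY_NETWORKS):
    """Return True if the IP address lies inside any of the country's blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def in_country_context(url, resolved_ip=None):
    """Combine both signals: country-code suffix first, then the IP blocks."""
    if host_in_country_tld(url):
        return True
    return resolved_ip is not None and ip_in_country(resolved_ip)
```

The URL check is tried first because it needs no network lookup; the IP check is what would catch pages hosted in the country under a generic domain such as `.com`. Other sources named in the abstract, such as official government web pages, are not modeled here.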



Author information

Authors and Affiliations

Information Technology Laboratory at Technologic and Scientific Park TECNOTAM, CINVESTAV IPN, Km. 5.5 highway Cd. Victoria-Soto La Marina, zip code 87130, Cd. Victoria, Tamps., México
José-Lázaro Martínez-Rodríguez, Víctor-Jesús Sosa-Sosa & Iván López-Arévalo

Editor information

Editors and Affiliations

Eje Central Lazaro Cardenas Norte, Mexican Petroleum Institute, 152, Col. San Bartolo Atepehuacan, CP 07730, México D.F., Mexico
Ildar Batyrshin
Tecnológico de Monterrey, Campus Estado de México, Carretera Lago de Guadalupe Km 3.5, Atizapán de Zaragoza, CP 52926, Estado de México, Mexico
Miguel González Mendoza

About this paper

Cite this paper

Martínez-Rodríguez, JL., Sosa-Sosa, VJ., López-Arévalo, I. (2013). Automatic Discovery of Web Content Related to IT in the Mexican Internet Based on Supervised Classifiers. In: Batyrshin, I., González Mendoza, M. (eds) Advances in Artificial Intelligence. MICAI 2012. Lecture Notes in Computer Science (LNAI), vol 7629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37807-2_10

DOI: https://doi.org/10.1007/978-3-642-37807-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37806-5
Online ISBN: 978-3-642-37807-2
eBook Packages: Computer Science, Computer Science (R0)
