A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8610)

Abstract

In this paper, we propose a semi-supervised learning algorithm (TPL) that extracts categorical noun phrase instances from unstructured web pages using the tolerance rough sets model (TRSM). TRSM has been successfully employed for document representation, retrieval, and classification tasks. However, instead of the vector-space model, our model represents each noun phrase by the set of contextual patterns with which it co-occurs. The categorical information that we employ is derived from the Never-Ending Language Learner (NELL) system [3]. We compare the performance of TPL with that of the Coupled Bayesian Sets (CBS) algorithm. Experimental results show that TPL achieves precision comparable to that of CBS.
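
The representation described in the abstract can be illustrated with a small sketch. This is not the TPL algorithm; it only shows the standard tolerance rough set constructions (tolerance classes and lower/upper approximations) applied to noun phrases described by sets of contextual patterns. The toy data, the function names, the threshold value, and the choice of the Sørensen-Dice overlap as the tolerance relation are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy data: each noun phrase is described by the set of
# contextual patterns it co-occurs with (not taken from the paper).
CONTEXTS: Dict[str, FrozenSet[str]] = {
    "Toronto": frozenset({"mayor of _", "lives in _", "flights to _"}),
    "Winnipeg": frozenset({"mayor of _", "lives in _", "downtown _"}),
    "Chicago": frozenset({"mayor of _", "flights to _", "downtown _"}),
    "aspirin": frozenset({"dose of _", "took _ for"}),
}


def dice(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Sørensen-Dice overlap between two pattern sets (assumed similarity)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def tolerance_class(phrase: str, contexts: Dict[str, FrozenSet[str]],
                    threshold: float) -> Set[str]:
    """All noun phrases whose pattern sets overlap `phrase` by >= threshold."""
    return {other for other, patterns in contexts.items()
            if dice(contexts[phrase], patterns) >= threshold}


def upper_approximation(seeds: Set[str], contexts: Dict[str, FrozenSet[str]],
                        threshold: float) -> Set[str]:
    """Phrases whose tolerance class intersects the seed set."""
    return {p for p in contexts
            if tolerance_class(p, contexts, threshold) & seeds}


def lower_approximation(seeds: Set[str], contexts: Dict[str, FrozenSet[str]],
                        threshold: float) -> Set[str]:
    """Phrases whose tolerance class lies entirely inside the seed set."""
    return {p for p in contexts
            if tolerance_class(p, contexts, threshold) <= seeds}


if __name__ == "__main__":
    seeds = {"Toronto"}  # hypothetical seed instance for a "city" category
    print(sorted(upper_approximation(seeds, CONTEXTS, 0.5)))
    print(sorted(lower_approximation(seeds, CONTEXTS, 0.5)))
```

In this toy setting, the upper approximation of a seed set gathers candidate instances that share enough contexts with some seed, while the lower approximation keeps only phrases whose entire tolerance class is already in the seed set. How TPL actually scores and promotes candidates is described in the paper itself and is not reproduced here.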

This research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 194376. We are very grateful to Prof. Estevam R. Hruschka Jr. and to Saurabh Verma for the NELL dataset and for discussions regarding the NELL project. Special thanks to Prof. James F. Peters for helpful suggestions.

References

  1. Callan, J., Hoy, M.: Clueweb09 data set (2009), http://lemurproject.org/clueweb09/

  2. Carlson, A., Betteridge, J., Wang, R.C., Hruschka Jr., E.R., Mitchell, T.M.: Coupled semi-supervised learning for information extraction. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 101–110 (2010)

  3. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proceedings of the Twenty-Fourth Conference on Artificial Intelligence, AAAI 2010 (2010)

  4. Carlson, A.: All-pairs data set (2010)

  5. Curran, J.R., Murphy, T., Scholz, B.: Minimising semantic drift with mutual exclusion bootstrapping. In: Proc. of PACLING, pp. 172–180 (2007)

  6. Etzioni, O., Fader, A., Christensen, J., Soderland, S., Mausam: Open information extraction: The second generation. In: International Joint Conference on Artificial Intelligence, pp. 3–10 (2011)

  7. Ghahramani, Z., Heller, K.A.: Bayesian sets. Advances in Neural Information Processing Systems 18 (2005)

  8. Ho, T.B., Nguyen, N.B.: Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems 17, 199–212 (2002)

  9. Kawasaki, S., Nguyen, N.B., Ho, T.-B.: Hierarchical document clustering based on tolerance rough set model. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 458–463. Springer, Heidelberg (2000)

  10. Ngo, C.L.: A tolerance rough set approach to clustering web search results. Master’s thesis, Warsaw University (2003)

  11. Pawlak, Z.: Rough sets. International Journal of Computer & Information Sciences 11(5), 341–356 (1982), http://dx.doi.org/10.1007/BF01001956

  12. Peters, J., Wasilewski, P.: Tolerance spaces: Origins, theoretical aspects and applications. Information Sciences 195(1-2), 211–225 (2012)

  13. Shi, L., Ma, X., Xi, L., Duan, Q., Zhao, J.: Rough set and ensemble learning based semi-supervised algorithm for text classification. Expert Syst. Appl. 38(5), 6300–6306 (2011)

  14. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundam. Inf. 27(2,3), 245–253 (1996), http://dl.acm.org/citation.cfm?id=2379560.2379571

  15. Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. Biologiske skrifter, I kommission hos E. Munksgaard (1948), http://books.google.co.in/books?id=rpS8GAAACAAJ

  16. Verma, S., Hruschka Jr., E.R.: Coupled bayesian sets algorithm for semi-supervised learning and information extraction. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012, Part II. LNCS, vol. 7524, pp. 307–322. Springer, Heidelberg (2012)

  17. Virginia, G., Nguyen, H.S.: Lexicon-based document representation. Fundam. Inform. 124(1-2), 27–46 (2013)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sengoz, C., Ramanna, S. (2014). A Semi-supervised Learning Algorithm for Web Information Extraction with Tolerance Rough Sets. In: Ślȩzak, D., Schaefer, G., Vuong, S.T., Kim, YS. (eds) Active Media Technology. AMT 2014. Lecture Notes in Computer Science, vol 8610. Springer, Cham. https://doi.org/10.1007/978-3-319-09912-5_1

  • DOI: https://doi.org/10.1007/978-3-319-09912-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09911-8

  • Online ISBN: 978-3-319-09912-5

  • eBook Packages: Computer Science, Computer Science (R0)
