Towards a System for Ontology-Based Information Extraction from PDF Documents

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2008 (OTM 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5332)

Abstract

Ontologies make it possible to encode domain knowledge directly in software applications, so ontology-based systems can exploit the meaning of information to provide advanced, intelligent functionality. One of the most interesting and promising applications of ontologies is information extraction from unstructured documents. In this area, the extraction of meaningful information from PDF documents has recently been recognized as an important and challenging problem. This paper proposes an ontology-based information extraction system for PDF documents founded on a well-suited knowledge representation approach named self-populating ontology (SPO). The SPO approach combines object-oriented, logic-based features with formal grammar capabilities and allows knowledge to be expressed in terms of ontology schemas, instances, and extraction rules (called descriptors), which can also extract information in tabular form. The novel aspect of the SPO approach is that it represents ontologies enriched by rules that enable them to populate themselves with instances extracted from unstructured PDF documents. The paper proves the tractability of the SPO approach. Moreover, the features and behavior of the prototype implementation of the SPO system are illustrated by means of a running example.
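As a rough illustration of the self-populating idea only, the following Python sketch attaches an extraction rule to an ontology class so that the class fills itself with instances found in text. It is not the SPO language or the authors' implementation: the names `OntologyClass`, `Descriptor`, `populate`, and the sample data are hypothetical, a plain regular expression stands in for the formal grammar capabilities, and the text is assumed to have already been extracted from a PDF.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """Schema element (hypothetical): a class name plus its attribute names."""
    name: str
    attributes: tuple
    instances: list = field(default_factory=list)


@dataclass
class Descriptor:
    """Hypothetical extraction rule: a regex whose named groups map to attributes."""
    target: OntologyClass
    pattern: re.Pattern

    def apply(self, text: str) -> int:
        """Add one instance per new match to the target class; return how many were added."""
        added = 0
        for match in self.pattern.finditer(text):
            instance = {attr: match.group(attr) for attr in self.target.attributes}
            if instance not in self.target.instances:
                self.target.instances.append(instance)
                added += 1
        return added


def populate(descriptors, text):
    """Run every descriptor over the text and return the total number of new instances."""
    return sum(d.apply(text) for d in descriptors)


# Assumed input: text already extracted from a PDF, with rows laid out like a simple table.
sample_text = "Item      Price\nWidget    12.50\nGadget    7.00\n"

product = OntologyClass("Product", ("label", "price"))
row_rule = Descriptor(
    product,
    re.compile(r"^(?P<label>[A-Za-z]+)[ \t]+(?P<price>\d+\.\d{2})$", re.MULTILINE),
)

populate([row_rule], sample_text)
print(product.instances)
```

The header row is skipped because it has no numeric price, and running `populate` a second time adds nothing, since instances already present are not duplicated.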

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Oro, E., Ruffolo, M. (2008). Towards a System for Ontology-Based Information Extraction from PDF Documents. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems: OTM 2008. OTM 2008. Lecture Notes in Computer Science, vol 5332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88873-4_38

  • DOI: https://doi.org/10.1007/978-3-540-88873-4_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88872-7

  • Online ISBN: 978-3-540-88873-4

  • eBook Packages: Computer Science, Computer Science (R0)
