Towards a System for Ontology-Based Information Extraction from PDF Documents

Oro, Ermelinda; Ruffolo, Massimo

doi:10.1007/978-3-540-88873-4_38

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5332)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Ontologies make it possible to encode domain knowledge directly in software applications, so ontology-based systems can exploit the meaning of information to provide advanced and intelligent functionality. One of the most interesting and promising applications of ontologies is information extraction from unstructured documents. In this area, the extraction of meaningful information from PDF documents has recently been recognized as an important and challenging problem. This paper proposes an ontology-based information extraction system for PDF documents founded on a well-suited knowledge representation approach named self-populating ontology (SPO). The SPO approach combines object-oriented, logic-based features with formal grammar capabilities and allows knowledge to be expressed in terms of ontology schemas, instances, and extraction rules (called descriptors), which can also extract information presented in tabular form. The novel aspect of the SPO approach is that it allows ontologies to be enriched with rules that enable them to populate themselves with instances extracted from unstructured PDF documents. The paper proves the tractability of the SPO approach. Moreover, the features and behavior of the prototypical implementation of the SPO system are illustrated by means of a running example.
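To make the idea of an ontology that populates itself more concrete, the following is a minimal Python sketch, not the SPO language or the authors' implementation. It assumes that a descriptor can be approximated by a regular expression whose named groups map to the attributes of an ontology class. The names `OntologyClass`, `Descriptor`, `populate`, and the sample text are all hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """Toy ontology class: a name, attribute names, and extracted instances."""
    name: str
    attributes: tuple
    instances: list = field(default_factory=list)


@dataclass
class Descriptor:
    """Hypothetical extraction rule: a regex whose named groups are attributes."""
    target: OntologyClass
    pattern: str

    def apply(self, text: str) -> int:
        """Add one instance per new match and return how many were added."""
        added = 0
        for match in re.finditer(self.pattern, text):
            instance = {attr: match.group(attr) for attr in self.target.attributes}
            if instance not in self.target.instances:  # skip duplicates
                self.target.instances.append(instance)
                added += 1
        return added


def populate(descriptors, text):
    """Run every descriptor over the text so the ontology fills itself."""
    return sum(d.apply(text) for d in descriptors)


# Made-up sample: text as it might look after being read from a PDF table.
SAMPLE_TEXT = """Item        Qty   Price
Widget      3     4.50
Gadget      12    19.99
"""

line_item = OntologyClass("LineItem", ("item", "qty", "price"))
row_descriptor = Descriptor(
    line_item,
    r"(?m)^(?P<item>[A-Za-z]+)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s*$",
)

added = populate([row_descriptor], SAMPLE_TEXT)
print(added, line_item.instances)
```

In this sketch the header row is skipped because it contains no numeric columns, and running `populate` a second time adds nothing because existing instances are not duplicated. A real system of the kind the abstract describes would rely on grammar-based rules and logic-based reasoning rather than a single regular expression.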

Author information

Authors and Affiliations

Department of Computer Science and System Science (DEIS), Italy
Ermelinda Oro
Institute of High Performance Computing and Networking of CNR (ICAR-CNR), University of Calabria, 87036, Rende (CS), Italy
Massimo Ruffolo

Editor information

Editors and Affiliations

STARLab, Vrije Universiteit Brussel (VUB), Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, VIC 3001, Melbourne, Australia
Zahir Tari

About this paper

Cite this paper

Oro, E., Ruffolo, M. (2008). Towards a System for Ontology-Based Information Extraction from PDF Documents. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems: OTM 2008. OTM 2008. Lecture Notes in Computer Science, vol 5332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88873-4_38

DOI: https://doi.org/10.1007/978-3-540-88873-4_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88872-7
Online ISBN: 978-3-540-88873-4
eBook Packages: Computer Science, Computer Science (R0)
