Abstract
Ontologies make it possible to encode domain knowledge directly in software applications, so ontology-based systems can exploit the meaning of information to provide advanced, intelligent functionality. One of the most interesting and promising applications of ontologies is information extraction from unstructured documents. In this area, the extraction of meaningful information from PDF documents has recently been recognized as an important and challenging problem. This paper proposes an ontology-based information extraction system for PDF documents founded on a well-suited knowledge representation approach named self-populating ontology (SPO). The SPO approach combines object-oriented, logic-based features with formal grammar capabilities and allows knowledge to be expressed in terms of ontology schemas, instances, and extraction rules (called descriptors), which can also extract information in tabular form. The novel aspect of the SPO approach is that it allows ontologies to be enriched with rules that enable them to populate themselves with instances extracted from unstructured PDF documents. The paper proves the tractability of the SPO approach. Moreover, the features and behavior of the prototype implementation of the SPO system are illustrated by means of a running example.
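The idea of an ontology that populates itself through descriptors can be illustrated with a minimal sketch. It is not the SPO language or the authors' implementation: the names `OntologyClass`, `Descriptor`, and `populate`, the use of a plain regular expression in place of a grammar-based rule, and the sample text are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """Schema element (hypothetical): a class name, its attributes, its instances."""
    name: str
    attributes: tuple
    instances: list = field(default_factory=list)


@dataclass
class Descriptor:
    """Hypothetical extraction rule: named regex groups fill the class attributes."""
    target: OntologyClass
    pattern: re.Pattern

    def populate(self, text: str) -> int:
        """Add one instance per new match and return how many were added."""
        added = 0
        for match in self.pattern.finditer(text):
            instance = {attr: match.group(attr) for attr in self.target.attributes}
            if instance not in self.target.instances:
                self.target.instances.append(instance)
                added += 1
        return added


# Invented sample: text as it might look after a PDF-to-text step.
page_text = """Item        Price
Widget      12.50
Gadget      7.00
"""

price_row = OntologyClass(name="PriceRow", attributes=("item", "price"))
row_descriptor = Descriptor(
    target=price_row,
    # One table row per line: a word, some spaces or tabs, a decimal price.
    pattern=re.compile(
        r"^(?P<item>[A-Za-z]+)[ \t]+(?P<price>\d+\.\d{2})$", re.MULTILINE
    ),
)

row_descriptor.populate(page_text)
print(price_row.instances)
```

In this sketch the rule lives next to the schema element it fills, so running the descriptor over extracted text is what adds instances to the class. The header row is skipped because it does not match the price pattern, and a second run over the same text adds nothing.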
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Oro, E., Ruffolo, M. (2008). Towards a System for Ontology-Based Information Extraction from PDF Documents. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems: OTM 2008. OTM 2008. Lecture Notes in Computer Science, vol 5332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88873-4_38
DOI: https://doi.org/10.1007/978-3-540-88873-4_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88872-7
Online ISBN: 978-3-540-88873-4
eBook Packages: Computer Science, Computer Science (R0)