Abstract
In recent years, the number of semi-structured documents available electronically has increased dramatically. Semi-structured documents are usually difficult to reuse because they lack explicit metadata. To enable integration and retrieval over such documents, their essential aspects should be described explicitly by metadata. This metadata can be assigned to documents, and can represent part of their information content, using various information extraction (IE) techniques. This paper also provides a flexible user-interaction mechanism that achieves better performance with fewer training documents. In semantic view extraction, we improve the rule learning procedure by using similarity-based rule induction. Experimental results show that our approach can significantly outperform most existing wrapper methods. We make use of the semantics that reside in the logical structure of a document to help find relations between semantic entities. Once the documents have been semantically annotated, TIPSI allows them to be indexed with respect to the extracted text entities. To answer a query, TIPSI applies semantic restrictions over the entities in the knowledge base (KB).
Acknowledgement
We thank the anonymous reviewers for their valuable comments. This work was supported by the National High Technology Research and Development (863) Program (2011AA01A205).
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, K., Li, J., Hong, M., Yan, X., Song, Q. (2014). A Semantics Enabled Intelligent Semi-structured Document Processor. In: Yuan, Y., Wu, X., Lu, Y. (eds) Trustworthy Computing and Services. ISCTCS 2013. Communications in Computer and Information Science, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43908-1_41
DOI: https://doi.org/10.1007/978-3-662-43908-1_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43907-4
Online ISBN: 978-3-662-43908-1
eBook Packages: Computer Science, Computer Science (R0)