Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD

Winkler, Karsten; Spiliopoulou, Myra

doi:10.1007/3-540-45681-3_38

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2431)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Domain-specific documents often share an inherent, though undocumented, structure. This structure should be made explicit to facilitate efficient, structure-based search in archives as well as information integration. Inferring a semantically structured XML DTD for an archive and subsequently transforming its texts into XML documents is a promising method to reach these objectives. Based on the KDD-driven DIAsDEM framework, we propose a new method to derive an archive-specific structured XML document type definition (DTD). Our approach utilizes association rule discovery and sequence mining techniques to structure a previously derived flat, i.e., unstructured, DTD. We introduce the notion of a probabilistic DTD that is derived by discovering associations among XML tags and frequent sequences of XML tags.
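
To make the idea of tag associations and frequent tag sequences more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes each document has already been reduced to an ordered list of XML tag names, and it uses plain counting in place of a full association rule or sequence mining method. The function name `derive_tag_statistics`, the `min_support` threshold, and the tag names `Person`, `Date`, and `Place` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def derive_tag_statistics(documents, min_support=0.5):
    """Estimate tag, association, and adjacency statistics from tag sequences.

    documents: list of lists, each holding the ordered XML tag names of one
    document (an assumed input format, not taken from the paper).
    """
    if not documents:
        raise ValueError("at least one document is required")
    n = len(documents)
    tag_count = Counter()   # number of documents containing each tag
    pair_count = Counter()  # number of documents containing both tags of a pair
    seq_count = Counter()   # number of documents where tag a is directly followed by tag b

    for tags in documents:
        present = sorted(set(tags))
        tag_count.update(present)
        pair_count.update(combinations(present, 2))
        seq_count.update(set(zip(tags, tags[1:])))

    # Relative frequency of each tag across the archive.
    tag_support = {tag: count / n for tag, count in tag_count.items()}

    # Rule a -> b with confidence P(b in document | a in document),
    # kept only when the pair itself is frequent enough.
    rules = {}
    for (a, b), count in pair_count.items():
        if count / n >= min_support:
            rules[(a, b)] = count / tag_count[a]
            rules[(b, a)] = count / tag_count[b]

    # Frequent directly adjacent tag pairs with their support.
    sequences = {
        pair: count / n
        for pair, count in seq_count.items()
        if count / n >= min_support
    }
    return tag_support, rules, sequences


# Hypothetical archive of four tagged documents.
documents = [
    ["Person", "Date", "Place"],
    ["Person", "Date"],
    ["Person", "Place"],
    ["Date", "Person"],
]
tag_support, rules, sequences = derive_tag_statistics(documents)
```

In this toy archive, `rules[("Place", "Person")]` expresses how often a document that contains `Place` also contains `Person`, and `sequences` keeps only adjacent tag pairs that reach the support threshold. Statistics of this kind are one plausible way to annotate DTD elements with probabilities; the paper itself should be consulted for the actual definition of a probabilistic DTD.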

The work of this author is funded by the German Research Society (DFG grant: SP 572/4-1) within the research project DIAsDEM. The German acronym stands for “Data Integration for Legacy Data and Semi-Structured Documents by Means of Data Mining Techniques”.

Author information

Authors and Affiliations

Department of E-Business, Leipzig Graduate School of Management (HHL), Jahnallee 59, D-04109, Leipzig, Germany
Myra Spiliopoulou

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, P.O. Box 26, 00014, Helsinki, Finland
Tapio Elomaa, Heikki Mannila & Hannu Toivonen

About this paper

Cite this paper

Winkler, K., Spiliopoulou, M. (2002). Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_38

DOI: https://doi.org/10.1007/3-540-45681-3_38
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive
