Abstract
Domain-specific documents often share an inherent, though undocumented, structure. This structure should be made explicit to facilitate efficient, structure-based search in archives as well as information integration. Inferring a semantically structured XML DTD for an archive and subsequently transforming its texts into XML documents is a promising method to reach these objectives. Based on the KDD-driven DIAsDEM framework, we propose a new method to derive an archive-specific structured XML document type definition (DTD). Our approach uses association rule discovery and sequence mining techniques to structure a previously derived flat, i.e., unstructured, DTD. We introduce the notion of a probabilistic DTD that is derived by discovering associations among XML tags and frequent sequences of XML tags.
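As a rough illustration of the statistics such a probabilistic DTD could rest on, the following Python sketch counts tag support, association confidence between co-occurring tags, and the confidence that one tag directly follows another. It is a minimal sketch under stated assumptions, not the method of the paper: the function name tag_statistics, the tag names (Person, Date, Place), and the toy archive are all hypothetical, and plain counting stands in for a full association rule or sequence miner.

```python
from collections import Counter
from itertools import combinations


def tag_statistics(documents):
    """Compute simple tag statistics for a flat DTD (illustrative only).

    documents: list of tag sequences, one list of tag names per XML document.
    Returns (support, assoc_conf, seq_conf).
    """
    n = len(documents)
    tag_count = Counter()     # number of documents containing a tag
    pair_count = Counter()    # number of documents containing both tags
    follow_count = Counter()  # number of documents where b directly follows a

    for tags in documents:
        present = set(tags)
        tag_count.update(present)
        pair_count.update(frozenset(p) for p in combinations(sorted(present), 2))
        follow_count.update(set(zip(tags, tags[1:])))

    # Relative frequency of each tag across the archive.
    support = {t: c / n for t, c in tag_count.items()}

    # Confidence of the association rule a -> b (b occurs given a occurs).
    assoc_conf = {}
    for pair, c in pair_count.items():
        a, b = sorted(pair)
        assoc_conf[(a, b)] = c / tag_count[a]
        assoc_conf[(b, a)] = c / tag_count[b]

    # Confidence that b directly follows a, given that a occurs.
    seq_conf = {(a, b): c / tag_count[a] for (a, b), c in follow_count.items()}
    return support, assoc_conf, seq_conf


# Hypothetical toy archive: each inner list is the tag sequence of one document.
docs = [
    ["Person", "Date", "Place"],
    ["Person", "Place"],
    ["Date", "Person", "Place"],
    ["Person", "Date"],
]
support, assoc_conf, seq_conf = tag_statistics(docs)
```

In this toy archive, Person occurs in every document, so a DTD derived from these counts could mark it as mandatory, while Date and Place would carry a probability below one.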
The work of this author is funded by the German Research Society (DFG grant: SP 572/4-1) within the research project DIAsDEM. The German acronym stands for “Data Integration for Legacy Data and Semi-Structured Documents by Means of Data Mining Techniques”.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Winkler, K., Spiliopoulou, M. (2002). Structuring Domain-Specific Text Archives by Deriving a Probabilistic XML DTD. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_38
DOI: https://doi.org/10.1007/3-540-45681-3_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive