Abstract
Extensible Markup Language (XML) has emerged as a medium for interoperability over the Internet. XML technology, with its self-describing and extensible tags, is contributing significantly to the next-generation Semantic Web. The search techniques currently used for HTML and text documents are not efficient at retrieving relevant XML documents. In chapter four, Self Adaptive Genetic Algorithms for XML Search (SAGAXSearch) is presented to learn which tags are useful in indexing. The indices and relationship strength metrics are used to extract semantically related elements from the XML documents quickly and accurately. Experiments are conducted on the DataBase systems and Logic Programming (DBLP) XML corpus and are evaluated for precision and recall. The proposed SAGAXSearch outperforms the existing algorithms XSEarch and XRank with respect to accuracy and query execution time.
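As a reminder of how the two evaluation measures named above are usually computed, the following is a minimal sketch and not the chapter's evaluation code. The function name `precision_recall` and the element ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the elements returned by the search engine
    relevant:  ids of the elements judged relevant to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical ids, for illustration only (not DBLP data).
retrieved = ["e1", "e2", "e3", "e4"]
relevant = ["e2", "e4", "e7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision is the fraction of returned elements that are relevant, and recall is the fraction of relevant elements that were returned.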
As the number of documents published as XML increases, there is a need for selective dissemination of XML documents based on user interests. In the proposed technique, a combination of Self Adaptive Genetic Algorithms and a multi-class Support Vector Machine (SVM) is used to learn a user model. Based on feedback from the users, the system automatically adapts to the users' preferences and interests. The user model and a similarity metric are used for selective dissemination of a continuous stream of XML documents. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the selective dissemination task with respect to accuracy and efficiency.
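The idea of filtering a document stream against a user model with a similarity metric can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the metric, so cosine similarity is assumed here, and the user model is a fixed, hand-written term-weight vector rather than one learned by the genetic algorithm and SVM. The names `cosine_similarity`, `disseminate`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disseminate(doc_stream, profile, threshold=0.5):
    """Yield ids of documents whose similarity to the profile meets the threshold.

    doc_stream: iterable of (doc_id, list_of_terms) pairs
    profile:    term -> weight mapping standing in for the user model (assumption)
    """
    for doc_id, terms in doc_stream:
        if cosine_similarity(Counter(terms), profile) >= threshold:
            yield doc_id


# Hypothetical profile and documents, for illustration only.
profile = Counter({"xml": 2, "search": 1})
stream = [
    ("d1", ["xml", "search", "index"]),
    ("d2", ["protein", "gene"]),
]
print(list(disseminate(stream, profile)))
```

Because `disseminate` is a generator, it handles documents one at a time, which suits a continuous stream. In a feedback-driven system the profile weights would be updated as users respond to delivered documents.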
Similarly, there is a need to categorize XML documents into specific user interest categories. However, performing the categorization manually is not feasible, given the sheer number of XML documents available on the Internet. A machine learning approach to topic categorization, which uses a multi-class SVM to exploit the semantic content of XML documents, is also presented. The SVM is supplemented by a feature selection technique that extracts the useful features. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the topic categorization task with respect to accuracy and efficiency.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Venugopal, K.R., Srinivasa, K.G., Patnaik, L.M. (2009). Evolutionary Approach for XML Data Mining. In: Soft Computing for Data Mining Applications. Studies in Computational Intelligence, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00193-2_5
DOI: https://doi.org/10.1007/978-3-642-00193-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00192-5
Online ISBN: 978-3-642-00193-2
eBook Packages: Engineering, Engineering (R0)