Evolutionary Approach for XML Data Mining

Venugopal, K. R.; Srinivasa, K. G.; Patnaik, L. M.

doi:10.1007/978-3-642-00193-2_5

Chapter

Part of the book series: Studies in Computational Intelligence (SCI, volume 190)

Abstract

Extensible Markup Language (XML) has emerged as a medium for interoperability over the Internet. XML technology, with its self-describing and extensible tags, contributes significantly to the next-generation semantic web. The search techniques currently used for HTML and text documents are not efficient at retrieving relevant XML documents. In chapter four, Self Adaptive Genetic Algorithms for XML Search (SAGAXSearch) is presented to learn which tags are useful for indexing. The indices and relationship strength metrics are used to extract semantically related elements from the XML documents quickly and accurately. Experiments are conducted on the DataBase systems and Logic Programming (DBLP) XML corpus and are evaluated for precision and recall. The proposed SAGAXSearch outperforms the existing algorithms XSEarch and XRank with respect to accuracy and query execution time.
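
As a rough illustration of the evaluation measures named above, the following minimal sketch computes precision and recall for a single query. The function name, the document identifiers, and the relevance judgments are hypothetical and are not taken from the chapter or from the DBLP corpus.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers returned by the search system (hypothetical data)
    relevant:  identifiers judged relevant for the query (hypothetical data)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Illustrative identifiers only.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved_ids, relevant_ids)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents are retrieved, so precision is 2/4 and recall is 2/3.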

As the number of documents published as XML increases, there is a need for selective dissemination of XML documents based on user interests. In the proposed technique, a combination of Self Adaptive Genetic Algorithms and a multi-class Support Vector Machine (SVM) is used to learn a user model. Based on feedback from the users, the system automatically adapts to their preferences and interests. The user model and a similarity metric are used for selective dissemination of a continuous stream of XML documents. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the selective dissemination task with respect to accuracy and efficiency.
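
The general idea of filtering a document stream against a user model with a similarity metric can be sketched as follows. This is not the chapter's method: the learned user model (genetic algorithm plus SVM) is replaced here by a fixed tag-count profile, and cosine similarity with a threshold of 0.5 is an assumed choice. All names (`tag_vector`, `cosine`, `disseminate`) and the sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_vector(xml_text):
    """Count element tags in one XML document (an assumed, very simple feature)."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


def cosine(a, b):
    """Cosine similarity between two sparse tag-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disseminate(stream, profile, threshold=0.5):
    """Yield only the documents whose similarity to the profile meets the threshold."""
    for doc in stream:
        if cosine(tag_vector(doc), profile) >= threshold:
            yield doc


# Hypothetical user profile and document stream.
profile = Counter({"article": 1, "author": 2, "title": 1})
stream = [
    "<article><author>A</author><author>B</author><title>T</title></article>",
    "<book><publisher>P</publisher><isbn>1</isbn></book>",
]
matched = list(disseminate(stream, profile))
```

With this profile, only the first document is forwarded, because the second shares no tags with the profile. A feedback loop of the kind the abstract describes would update `profile` over time; that step is omitted here.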

Similarly, there is a need to categorize XML documents into specific user interest categories. However, performing the categorization manually is not feasible given the sheer number of XML documents available on the Internet. A machine learning approach to topic categorization, which uses a multi-class SVM to exploit the semantic content of XML documents, is also presented. The SVM is supplemented by a feature selection technique that extracts the useful features. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the topic categorization task with respect to accuracy and efficiency.

Cite this chapter

Venugopal, K.R., Srinivasa, K.G., Patnaik, L.M. (2009). Evolutionary Approach for XML Data Mining. In: Soft Computing for Data Mining Applications. Studies in Computational Intelligence, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00193-2_5

DOI: https://doi.org/10.1007/978-3-642-00193-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00192-5
Online ISBN: 978-3-642-00193-2
eBook Packages: Engineering, Engineering (R0)
