Evolutionary Approach for XML Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 190)

Abstract

Extensible Markup Language (XML) has emerged as a medium for interoperability over the Internet. XML technology, with its self-describing and extensible tags, contributes significantly to the next-generation semantic web. The search techniques currently used for HTML and text documents do not retrieve relevant XML documents efficiently. In chapter four, Self Adaptive Genetic Algorithms for XML Search (SAGAXSearch) is presented to learn which tags are useful for indexing. The indices and relationship strength metrics are used to extract semantically related elements in the XML documents quickly and accurately. Experiments are conducted on the DataBase systems and Logic Programming (DBLP) XML corpus and are evaluated for precision and recall. The proposed SAGAXSearch outperforms the existing algorithms XSEarch and XRank with respect to accuracy and query execution time.
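
For readers unfamiliar with the two evaluation measures named above, the following is a minimal sketch of how precision and recall are conventionally computed for a single query. It is not taken from the chapter: the function name `precision_recall` and the element identifiers are hypothetical, and the chapter's own evaluation setup may differ.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers the search returned (hypothetical IDs).
    relevant:  identifiers judged relevant for the query.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually returned
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four elements returned, two of which are among the three relevant ones.
    p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Precision penalizes irrelevant results in the answer set, while recall penalizes relevant elements that were missed, which is why the two are usually reported together.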

As the number of documents published as XML increases, there is a need for selective dissemination of XML documents based on user interests. In the proposed technique, a combination of Self Adaptive Genetic Algorithms and a multi-class Support Vector Machine (SVM) is used to learn a user model. Based on feedback from the users, the system automatically adapts to the users' preferences and interests. The user model and a similarity metric are used for selective dissemination of a continuous stream of XML documents. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the selective dissemination task with respect to accuracy and efficiency.
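
The general idea of filtering a document stream against a user model with a similarity metric can be sketched as follows. The abstract does not say which metric the chapter uses, so cosine similarity over term counts is an assumption made purely for illustration, as are the names `cosine_similarity` and `disseminate`, the threshold value, and the representation of the user model as a bag of weighted terms. The learned model described above (genetic algorithm plus SVM) is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def disseminate(stream, profile, threshold=0.5):
    """Yield IDs of streamed documents that are similar enough to the profile.

    stream:    iterable of (doc_id, list_of_terms) pairs (hypothetical format).
    profile:   term -> weight mapping standing in for the user model.
    threshold: assumed cut-off; a real system would have to tune it.
    """
    for doc_id, terms in stream:
        if cosine_similarity(Counter(terms), profile) >= threshold:
            yield doc_id


if __name__ == "__main__":
    profile = Counter({"xml": 2, "search": 1})
    stream = [
        ("d1", ["xml", "search", "xml"]),
        ("d2", ["cooking", "recipes"]),
    ]
    print(list(disseminate(stream, profile)))
```

Because `disseminate` is a generator, documents are tested one at a time as they arrive, which matches the continuous-stream setting without holding the whole stream in memory.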

Similarly, there is a need to categorize XML documents into specific user interest categories. However, performing the categorization manually is not feasible given the sheer number of XML documents available on the Internet. A machine learning approach to topic categorization, which uses a multi-class SVM to exploit the semantic content of XML documents, is also presented. The SVM is supplemented by a feature selection technique that extracts the useful features. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the topic categorization task with respect to accuracy and efficiency.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Venugopal, K.R., Srinivasa, K.G., Patnaik, L.M. (2009). Evolutionary Approach for XML Data Mining. In: Soft Computing for Data Mining Applications. Studies in Computational Intelligence, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00193-2_5

  • DOI: https://doi.org/10.1007/978-3-642-00193-2_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00192-5

  • Online ISBN: 978-3-642-00193-2

  • eBook Packages: Engineering, Engineering (R0)
