Clustering Deep Web Databases Semantically

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Clustering Deep Web databases is a key operation in organizing Deep Web resources. Traditional approaches compute similarity as the cosine similarity in the Vector Space Model (VSM), which cannot capture the semantic similarity between the contents of two databases. This paper discusses how to cluster Deep Web databases semantically. First, a fuzzy semantic measure is proposed that integrates ontology and fuzzy set theory to compute the semantic similarity between the visible features of two Deep Web forms. Then, a hybrid Particle Swarm Optimization (PSO) algorithm is provided for clustering Deep Web databases. Finally, the clustering results are evaluated using the Average Similarity of Document to the Cluster Centroid (ASDC) and the Rand Index (RI). Experiments show that: 1) the hybrid PSO approach achieves higher ASDC values than the PSO and K-Means approaches, which means it has higher intra-cluster similarity and the lowest inter-cluster similarity; 2) clustering based on fuzzy semantic similarity yields higher ASDC and RI values than clustering based on cosine similarity, which suggests that the fuzzy semantic similarity approach can explore latent semantics.
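Two of the quantities named in the abstract have standard textbook definitions and can be sketched briefly. The Python below is a minimal illustration, not the authors' implementation: the function names, the four-term vocabulary, and the toy vectors are assumptions made for this example. It shows the cosine similarity of two term-weight vectors and the Rand Index of two partitions, and why cosine similarity in a plain VSM scores two forms that use synonymous labels as unrelated.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical vocabulary: ["author", "title", "writer", "name"].
# Form A is labeled with "author" and "title", form B with "writer" and "name".
form_a = [1, 1, 0, 0]
form_b = [0, 0, 1, 1]

# The forms share no literal terms, so their cosine similarity is zero
# even though the labels are near-synonyms.
print(cosine_similarity(form_a, form_b))

# Rand Index of a predicted clustering against reference labels.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

The first call illustrates the limitation the abstract describes: a measure that only counts shared terms cannot relate "author" to "writer", which is the gap an ontology-based measure is meant to close. The Rand Index depends only on which items are grouped together, so renaming the cluster labels does not change its value.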

Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Song, L., Ma, J., Yan, P., Lian, L., Zhang, D. (2008). Clustering Deep Web Databases Semantically. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_35

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science (R0)
