An Effective Data Processing Method for Fast Clustering

Moon, Hyun-Joo; Kim, Sangheon; Moon, Jongbae; Lee, Eun-Ser

doi:10.1007/978-3-540-69848-7_27

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5073)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

Driven by widespread Internet usage, heterogeneous computing platforms, and ubiquitous computing technologies, the volume of Web data, which is usually written in XML, has grown explosively. As Web data grow and clustering them becomes more important, a similarity detection method is needed, because it is a fundamental technology for efficient document management. In this paper, we introduce a similarity detection method that checks both the semantic similarity and the structural similarity between XML DTDs. For semantic checking, we adopt ontology technology; for structural checking, we apply the longest common string and longest nesting common string methods. Our method operates on multi-tag sequences instead of traversing XML schema trees, so it produces reasonable results quickly.
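
To make the idea of comparing tag sequences rather than schema trees concrete, the following is a minimal sketch, not the method of the chapter. It assumes that each DTD has already been flattened into a list of tag names, that "longest common string" can be read as the longest contiguous run of tags shared by two sequences, and that a small hand-written synonym table can stand in for the ontology lookup. The names `normalize`, `longest_common_run`, `structural_similarity`, and the `synonyms` table are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def normalize(tags: Sequence[str], synonyms: Dict[str, str]) -> List[str]:
    """Map each tag to a canonical name (hypothetical stand-in for an ontology lookup)."""
    return [synonyms.get(tag, tag) for tag in tags]


def longest_common_run(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return the longest contiguous run of tags that appears in both sequences."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return list(a[best_end - best_len:best_end])


def structural_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed score: shared run length divided by the longer sequence length."""
    if not a or not b:
        return 0.0
    return len(longest_common_run(a, b)) / max(len(a), len(b))


# Illustrative tag sequences, not taken from the chapter.
synonyms = {"writer": "author"}
dtd_a = normalize(["book", "title", "author", "price"], synonyms)
dtd_b = normalize(["journal", "title", "writer", "year"], synonyms)

print(longest_common_run(dtd_a, dtd_b))
print(structural_similarity(dtd_a, dtd_b))
```

The dynamic-programming table is walked one row at a time, so the comparison takes time proportional to the product of the two sequence lengths and never builds or traverses a tree. The normalization by the longer sequence is one possible choice among several.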

Author information

Authors and Affiliations

Dept. of Cultural Contents, Hankuk University of Foreign Studies, 270 Imun 2-dong, Dongdaemun-gu, Seoul, 130-082, Korea
Hyun-Joo Moon & Sangheon Kim
Korea Institute of Science and Technology Information, 52-11, Eoeun-dong, Yuseong-gu, Daejeon, 305-806, Korea
Jongbae Moon
Dept. of Computer Engineering, Andong National University, 388 Songcheon-dong, Andong-city, Gyeongsangbuk-do, 760-749, Korea
Eun-Ser Lee

Editor information

Osvaldo Gervasi, Beniamino Murgante, Antonio Laganà, David Taniar, Youngsong Mun, Marina L. Gavrilova

About this paper

Cite this paper

Moon, HJ., Kim, S., Moon, J., Lee, ES. (2008). An Effective Data Processing Method for Fast Clustering. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2008. Lecture Notes in Computer Science, vol 5073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69848-7_27

DOI: https://doi.org/10.1007/978-3-540-69848-7_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69840-1
Online ISBN: 978-3-540-69848-7
eBook Packages: Computer Science, Computer Science (R0)
