Abstract
Grouping operators summarize data in a DBMS by arranging elements into groups using identity comparisons. For metric data, however, grouping by identity is seldom useful; similarity is often a better fit. Operators that group data elements by similarity exist, but they do not achieve good results for certain data domains or distributions. The major contributions of this work are a novel operator, SGB-Vote, which assigns groups through an election involving already assigned groups, and an extension for existing operators that bounds each group to a maximum number of nearest neighbors. The operators were implemented in a framework and evaluated on real and synthetic datasets from diverse domains, considering both group quality and execution time. The results show that the proposed operators produce higher-quality groups on all tested datasets and that they can run efficiently inside a DBMS.
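To make the idea of election-based grouping concrete, the following is a minimal sketch of one possible reading of the abstract, not the algorithm defined in the paper. It assumes Euclidean distance, a similarity threshold `eps`, an optional cap `k` on the number of nearest neighbors allowed to vote, and a tie-break that favors the smallest group id; the function name `vote_group` and all parameters are hypothetical.

```python
from collections import Counter
from math import dist


def vote_group(points, eps, k=None):
    """Assign a group id to each point by a vote among already grouped neighbors.

    Illustrative assumption only: elements are processed in input order, and
    each element joins the group most common among the previously assigned
    elements within distance eps (optionally only the k nearest of them).
    """
    labels = []
    next_group = 0
    for i, p in enumerate(points):
        # Previously assigned elements within eps of p, nearest first.
        neighbors = sorted(
            (dist(p, points[j]), j)
            for j in range(i)
            if dist(p, points[j]) <= eps
        )
        if k is not None:
            neighbors = neighbors[:k]  # bound the electorate to k nearest neighbors
        if not neighbors:
            # No similar element has a group yet: open a new group.
            labels.append(next_group)
            next_group += 1
            continue
        votes = Counter(labels[j] for _, j in neighbors)
        top = max(votes.values())
        # Hypothetical tie-break: prefer the smallest group id.
        labels.append(min(g for g, c in votes.items() if c == top))
    return labels


points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0.05, 0.05)]
print(vote_group(points, eps=1.0))  # [0, 0, 1, 1, 0]
```

In this sketch the result depends on input order, and the `k` cap only limits how many neighbors vote; how the paper's operators handle ordering, ties, and the neighbor bound may differ.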
Acknowledgments
This research is partially funded by FAPESP, CNPq, CAPES, and the RESCUER Project, as well as by the European Commission (Grant: 614154) and by the CNPq/MCTI (Grant: 490084/2013-3).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Laverde, N.A., Cazzolato, M.T., Traina, A.J.M., Traina, C. (2017). Semantic Similarity Group By Operators for Metric Data. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds) Similarity Search and Applications. SISAP 2017. Lecture Notes in Computer Science, vol 10609. Springer, Cham. https://doi.org/10.1007/978-3-319-68474-1_17
DOI: https://doi.org/10.1007/978-3-319-68474-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68473-4
Online ISBN: 978-3-319-68474-1
eBook Packages: Computer Science, Computer Science (R0)