Database and Expert Systems Applications, pp. 59-64

# Duplicates Detection, Counting, and Removal

## Abstract

The need to detect and eliminate duplicate elements arises in many applications, such as the processing of relational database operations, the comparison of complex objects, transitive closure, and protocol verification. Given a multiset, detecting duplicates, eliminating them, sorting the remaining distinct elements, and counting the number of occurrences of each in the multiset is a computationally intensive task, especially when complex objects are involved. This paper addresses the computational complexity of performing such a task. The study is based on a modified comparison-based decision tree. It is shown that a multiset M(n,L) of n elements, L of which are distinct, can be associated with a decision tree of L!S(n,L) external nodes, where S(n,L) is the Stirling number of the second kind. This result suggests that the upper and lower bounds for performing such a task differ from those for sorting a set of the same cardinality. It is also shown that a comparison-based sorting algorithm can be adapted to perform such a task. In addition, the analytical performance of the adapted sorting algorithm is addressed.
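The two technical claims above can be illustrated with a minimal Python sketch. The function names (`stirling2`, `external_nodes`, `sort_and_count`) are hypothetical and do not come from the paper. The first part evaluates the external-node count L!S(n,L) and compares its base-2 logarithm with that of n!, the count for sorting n distinct elements. How this translates into a comparison bound depends on the branching of the modified decision tree, which the abstract does not specify. The second part shows one possible way to adapt a comparison-based sort (here, merge sort) so that duplicates are collapsed and counted during merging. It is an assumed adaptation, not necessarily the algorithm analyzed in the paper.

```python
from math import factorial, log2


def stirling2(n, k):
    """Stirling number of the second kind S(n, k), via
    S(i, j) = j * S(i-1, j) + S(i-1, j-1)."""
    if k < 0 or k > n:
        return 0
    row = [1] + [0] * k  # row[j] holds S(i, j); starts at i = 0
    for i in range(1, n + 1):
        # Update right to left so row[j-1] still holds S(i-1, j-1).
        for j in range(min(i, k), 0, -1):
            row[j] = j * row[j] + row[j - 1]
        row[0] = 0  # S(i, 0) = 0 for i >= 1
    return row[k]


def external_nodes(n, L):
    """External-node count L! * S(n, L) quoted in the abstract."""
    return factorial(L) * stirling2(n, L)


def sort_and_count(items):
    """Merge sort that returns sorted (value, count) pairs,
    merging equal values as soon as they meet (assumed adaptation)."""
    if len(items) <= 1:
        return [(x, 1) for x in items]
    mid = len(items) // 2
    left = sort_and_count(items[:mid])
    right = sort_and_count(items[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (a, ca), (b, cb) = left[i], right[j]
        if a < b:
            out.append(left[i])
            i += 1
        elif b < a:
            out.append(right[j])
            j += 1
        else:
            # Equal values: keep one copy and add the counts.
            out.append((a, ca + cb))
            i += 1
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


if __name__ == "__main__":
    n = 16
    for L in (1, 2, 4, 8, 16):
        nodes = external_nodes(n, L)
        print(f"L={L:2d}  log2(L!S(n,L))={log2(nodes):7.2f}  "
              f"log2(n!)={log2(factorial(n)):7.2f}")
    print(sort_and_count([3, 1, 3, 2, 1, 3]))
```

When L = n the count reduces to n!, the familiar figure for sorting a set, and when L = 1 it reduces to 1. Because the merge step discards a duplicate the first time two equal values are compared, the intermediate lists shrink as the recursion unwinds, which is one intuition for why duplicates can lower the cost relative to sorting n distinct elements.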

## Keywords

Relational Database, Distinct Element, Transitive Closure, External Node, Database Machine
