Abstract
The purchase of customer databases is becoming increasingly common, and it has created a serious problem: purchasers can mount a collusion attack, in which they illegally combine their versions of a database in order to de-anonymize the private information it contains. At present, however, customer databases can be purchased only on the black market. In this paper, we first investigate the relationship between the price of distributing a database and its collusion resistance. A fingerprint is embedded in the database so that illegal distributors can be identified. The fingerprints are constructed on the basis of concatenated codes. After the fingerprint is embedded, the price of distributing the database and its collusion resistance are modelled as decreasing functions. The less expensive the database is, the lower the collusion resistance available to the database owner. There are upper and lower bounds on the collusion capabilities. To the best of our knowledge, this scheme is unique in that the tradeoff between the price of distributing a database and its collusion resistance is based on a mathematical model. Second, we propose a guideline for selling customer databases legally, together with a profit and risk evaluation.
Notes
1. A zettabyte is 1,000,000,000,000,000,000,000 bytes. Imagine that every person in Vietnam (population of 92.5 million in 2014) took a digital photo every second of every day for over three months. All of those photos put together would equal about one zettabyte.
2. Big Data is a term that refers to “large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future” [30].
References
White House. Big Data: Seizing Opportunities, Preserving Values (2014)
Financial Times. Digital hunter-gatherers (2013). http://www.ft.com/intl/cms/s/0/f840dbc0-d34f-11e2-b3ff-00144feab7de.html#axzz3eS2n1CZx
The Guardian. How much is your personal data worth? (2014). http://www.theguardian.com/news/datablog/2014/apr/22/how-much-is-personal-data-worth
Forbes. The black market price of your personal info (2010). http://www.forbes.com/2010/11/29/black-market-price-of-your-info-personal-finance.html
Gantz, J., Reinsel, D.: The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze Future 2007, 1–16 (2012)
Kieseberg, P., Schrittwieser, S., Mulazzani, M., Echizen, I., Weippl, E.: An algorithm for collusion-resistant anonymization and fingerprinting of sensitive microdata. Electron. Markets 24(2), 113–124 (2014)
Motwani, R., Xu, Y.: Efficient algorithms for masking and finding quasi-identifiers. In: Proceedings of the Conference on Very Large Data Bases (VLDB), pp. 83–93 (2007)
Lodha, S.P., Thomas, D.: Probabilistic anonymity. In: Bonchi, F., Malin, B., Saygın, Y. (eds.) PInKDD 2007. LNCS, vol. 4890, pp. 56–79. Springer, Heidelberg (2008)
Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain., Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain., Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)
El Emam, K., Dankar, F.K., Issa, R., Jonker, E., Amyot, D., Cogo, E., Corriveau, J.-P., et al.: A globally optimal k-anonymity method for the de-identification of health data. J. Am. Med. Inform. Assoc. 16(5), 670–682 (2009)
Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International (1998)
Schrittwieser, S., Kieseberg, P., Echizen, I., Wohlgemuth, S., Sonehara, N., Weippl, E.: An algorithm for k-anonymity-based fingerprinting. In: Shi, Y.Q., Kim, H.-J., Perez-Gonzalez, F. (eds.) IWDW 2011. LNCS, vol. 7128, pp. 439–452. Springer, Heidelberg (2012)
Willenborg, L., Kardaun, J.: Fingerprints in microdata sets. In: CBS (1999)
Bui, T.V., Nguyen, B.Q., Nguyen, T.D., Sonehara, N., Echizen, I.: Robust fingerprinting codes for database. In: Aversa, R., Kołodziej, J., Zhang, J., Amato, F., Fortino, G. (eds.) ICA3PP 2013, Part II. LNCS, vol. 8286, pp. 167–176. Springer, Heidelberg (2013)
Bui, T.V., Nguyen, B.Q., Nguyen, T.D., Sonehara, N., Echizen, I.: Robust fingerprinting codes for database using non-adaptive group testing. Int. J. Big Data Intell. 2(2), 81–90 (2015)
Li, Y., Swarup, V., Jajodia, S.: Fingerprinting relational databases: schemes and specialties. IEEE Trans. Dependable Secure Comput. 2(1), 34–45 (2005)
Guo, F., Wang, J., Li, D.: Fingerprinting relational databases. In: Proceedings of the ACM Symposium on Applied Computing, pp. 487–492. ACM (2006)
Agrawal, R., Kiernan, J.: Watermarking relational databases. In: Proceedings of the 28th International Conference on VLDB, pp. 155–166 (2002)
Forney Jr., G.D.: Concatenated codes. DTIC Document (1965)
Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8(2), 300–304 (1960)
Wicker, S.B., Bhargava, V.K.: Reed-Solomon Codes and Their Applications. Wiley-IEEE Press, New York (1999)
Gilbert, E.N.: A comparison of signalling alphabets. Bell Syst. Tech. J. 31(3), 504–522 (1952)
Varshamov, R.R.: Estimate of the number of signals in error correcting codes. Dokl. Akad. Nauk SSSR 117(5), 739–741 (1957)
Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. 105(4), 1118–1123 (2008)
McAfee, A., Brynjolfsson, E.: Big data: the management revolution. Harvard Bus. Rev. 90, 60–68 (2012)
Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspect. 28, 3–27 (2014)
Wu, X., Zhu, X., Wu, G.-Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)
NSF-NIH Interagency Initiative. Core techniques and technologies for advancing big data science and engineering (BIGDATA) (2012)
Google. http://investor.fb.com/releasedetail.cfm?ReleaseID=893395
A Omitted Proofs from Sect. 5
Proof of Theorem 19
Proof
Set \(n = n_1 n_2\). Since C is an \(n_1n_2 \times 2^{k_1k_2}\) matrix, each codeword of C has length \(n_1n_2\). Each database is represented by a codeword of C. Since the Hamming weight of a codeword is \(n_1w\), each entry of a codeword is 1 with probability \(p_j = \frac{n_1 w_j}{n_1 n_2} = \frac{w_j}{n_2}\).
Colluders achieve perfect collusion if \(\prod _{i \in C, i=1}^{|C|} x_{ij}=0\) for all \(j=1,\ldots ,n\), that is, if every row contains at least one 0 among the colluders' codewords. The probability that a row contains at least one 0 is:
The probability that each of the n rows contains at least one 0 is:
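As a sketch of these two quantities, assume that the entries of the colluders' codewords are independent, so that a row consists only of 1s with probability \(p_j^{|C|}\). The labels \(P_{\text{row}}(j)\) and \(P_{\text{perfect}}\) are introduced here for illustration only, and the expressions are a reconstruction from the definitions in this proof rather than the typeset equations of the theorem:

```latex
$$\begin{aligned}
P_{\text{row}}(j) &= 1 - p_j^{|C|} = 1 - \left(\frac{w_j}{n_2}\right)^{|C|}, \\
P_{\text{perfect}} &= \prod_{j=1}^{n} \left(1 - \left(\frac{w_j}{n_2}\right)^{|C|}\right), \\
\left(1 - \left(\frac{w_{max}}{n_2}\right)^{|C|}\right)^{n} &\le P_{\text{perfect}} \le \left(1 - \left(\frac{w_{min}}{n_2}\right)^{|C|}\right)^{n}.
\end{aligned}$$
```

The bounds follow from \(w_{min} \le w_j \le w_{max}\), and they coincide when all weights are equal.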
When \(C_{in}\) is a constant codeword weight code, \(w_{min} = w_{max} = w_i\). Therefore, Eq. 4 holds.
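The argument of this proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that each codeword entry is 1 independently with probability \(w/n_2\), so that a row holds at least one 0 among the colluders with probability \(1 - (w/n_2)^{|C|}\). The function names and parameters are hypothetical and chosen only for illustration.

```python
def perfect_collusion_probability(n: int, n2: int, w: int, colluders: int) -> float:
    """Probability that each of n rows holds at least one 0 among the colluders.

    Assumption (not taken from the paper's code): entries are independent and
    equal to 1 with probability p = w / n2, where w is the constant weight of
    an inner codeword of length n2.
    """
    if n < 1 or n2 < 1 or colluders < 1 or not 0 <= w <= n2:
        raise ValueError("need n, n2, colluders >= 1 and 0 <= w <= n2")
    p = w / n2                           # probability that a single entry is 1
    row_has_zero = 1.0 - p ** colluders  # at least one colluder has a 0 in the row
    return row_has_zero ** n             # all n rows, assumed independent


def collusion_bounds(n: int, n2: int, w_min: int, w_max: int, colluders: int):
    """Lower and upper bounds when inner weights lie between w_min and w_max."""
    if w_min > w_max:
        raise ValueError("need w_min <= w_max")
    lower = perfect_collusion_probability(n, n2, w_max, colluders)
    upper = perfect_collusion_probability(n, n2, w_min, colluders)
    return lower, upper


if __name__ == "__main__":
    # Constant-weight case: both bounds coincide.
    print(collusion_bounds(n=2, n2=4, w_min=2, w_max=2, colluders=2))
    # Varying weights: the bounds separate.
    print(collusion_bounds(n=2, n2=4, w_min=1, w_max=2, colluders=2))
```

Under these assumptions, heavier inner codewords make perfect collusion less likely, and the two bounds agree exactly when \(w_{min} = w_{max}\).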
Proof of Theorem 22
Proof
In accordance with Definition 12, the price of distributing a database is
Since C has \(2^{k_1k_2}\) codewords, the total price of distributing databases embedded with these codewords is
Since the number of attributes of the database is unchanged, the block length of \(C_{in}\) is unchanged. Suppose that there is another \(w'_{max}\)-weight code \(C'_{in}\) \([n_2, k'_2, d'_2]_2\). The price of distributing the database when using \(C'\) generated using \(C_{out}\) and \(C'_{in}\) is:
To complete our proof, we prove that if the price of distributing the databases when using \(C'\) is larger than the price of distributing the databases when using C, then \(k_2 > k'_2\). Indeed, we have:
We consider three cases:
1. If \(k_2 < k'_2\),
$$\begin{aligned} 2^{k_1 (k_2 - k'_2)} \le \frac{1}{2^{k_1}} < \frac{1}{n_1} < \frac{n_1 w_{min}}{n_1 w'_{max} + 1} \end{aligned} \tag{10}$$
because \(0 < w_{min}, w'_{max} \le n_2 \le n_1\). This contradicts inequality 6.
2. If \(k_2 = k'_2\), then \(d_2 = d'_2\), because \(k_2\) and \(k'_2\) are the largest numbers such that \(k_2 \le n_2 - d_2 + 1\) and \(k'_2 \le n_2 - d'_2 + 1\). Therefore, the price of distributing the databases when using \(C'\) is equal to the price of distributing the databases when using C. This contradicts our hypothesis.
3. If \(k_2 > k'_2\), inequality 6 holds.
Thus, if the price of distributing the databases when using \(C'\) is larger than the price of distributing the databases when using C, then \(k_2 > k'_2\). Moreover, if \(k'_2 < k_2\), then \(d'_2 > d_2\).
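The last step relies on the relation \(k \le n - d + 1\) (the Singleton bound) used in case 2. A minimal sketch of that relation follows, assuming the inner code attains the bound with equality, as the phrase "the largest numbers such that" suggests. The function names are hypothetical, and a given binary code need not actually attain the bound.

```python
def max_dimension(n: int, d: int) -> int:
    """Largest dimension k permitted by the Singleton bound k <= n - d + 1."""
    if not 1 <= d <= n:
        raise ValueError("need 1 <= d <= n")
    return n - d + 1


def distance_at_bound(n: int, k: int) -> int:
    """Minimum distance d of a code that attains the bound, i.e. d = n - k + 1."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return n - k + 1


if __name__ == "__main__":
    n2 = 7  # block length of the inner code, held fixed as in the proof
    for k in (5, 4, 3):
        # A smaller dimension goes with a larger distance at the bound.
        print(f"k = {k}: d = {distance_at_bound(n2, k)}")
```

With the block length fixed, lowering the dimension by one raises the distance at the bound by one, which is the monotonic relation between \(k'_2 < k_2\) and \(d'_2 > d_2\) used above.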
According to Corollary 20, if the relative distance of \(C_{in}\) increases (decreases), the probability of perfect collusion decreases (increases). According to Corollary 15, if the code rate of \(C_{in}\) increases, the price of distributing databases using C increases. Therefore, the lower the price of distributing a database, the lower the collusion resistance available to the database owner.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bui, T.V., Nguyen, T.D., Sonehara, N., Echizen, I. (2015). Tradeoff Between the Price of Distributing a Database and Its Collusion Resistance Based on Concatenated Codes. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_12
DOI: https://doi.org/10.1007/978-3-319-27122-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)