
Tradeoff Between the Price of Distributing a Database and Its Collusion Resistance Based on Concatenated Codes

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Abstract

The purchasing of customer databases, which is becoming more and more common, has led to a serious problem: purchased databases can be used to mount a collusion attack, in which purchasers of a database illegally combine their versions of it in order to de-anonymize the private information it contains. At present, however, customer databases can be purchased only on the black market. In this paper, we first investigate the relationship between the price of distributing a database and its collusion resistance. A fingerprint is embedded in each database so that illegal distributors can be identified. The fingerprints are constructed on the basis of concatenated codes. After the fingerprint is embedded, the price of distributing the database and its collusion resistance are modelled as decreasing functions. The less expensive the database is, the less collusion resistance the database owner deals with. There are upper and lower bounds for the collusion capabilities. To the best of our knowledge, this scheme is unique in that the tradeoff between the price of distributing a database and its collusion resistance is based on a mathematical model. Second, we propose a guideline for selling customer databases legally, with profit and risk evaluation.

Notes

  1. A zettabyte is 1,000,000,000,000,000,000,000 (\(10^{21}\)) bytes. Imagine that every person in Vietnam (population of 92.5 million in 2014) took a digital photo every second of every day for over three months. All of those photos put together would equal about one zettabyte.

  2. Big Data is a term that refers to “large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future” [30].

References

  1. White House. Big Data: Seizing Opportunities, Preserving Values (2014)

  2. Financial Times. Digital hunter-gatherers (2013). http://www.ft.com/intl/cms/s/0/f840dbc0-d34f-11e2-b3ff-00144feab7de.html#axzz3eS2n1CZx

  3. The Guardian. How much is your personal data worth? (2014). http://www.theguardian.com/news/datablog/2014/apr/22/how-much-is-personal-data-worth

  4. Forbes. The black market price of your personal info (2010). http://www.forbes.com/2010/11/29/black-market-price-of-your-info-personal-finance.html

  5. Gantz, J., Reinsel, D.: The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze Future 2007, 1–16 (2012)

  6. Kieseberg, P., Schrittwieser, S., Mulazzani, M., Echizen, I., Weippl, E.: An algorithm for collusion-resistant anonymization and fingerprinting of sensitive microdata. Electron. Markets 24(2), 113–124 (2014)

  7. Motwani, R., Xu, Y.: Efficient algorithms for masking and finding quasi-identifiers. In: Proceedings of the Conference on Very Large Data Bases (VLDB), pp. 83–93 (2007)

  8. Lodha, S.P., Thomas, D.: Probabilistic anonymity. In: Bonchi, F., Malin, B., Saygın, Y. (eds.) PInKDD 2007. LNCS, vol. 4890, pp. 56–79. Springer, Heidelberg (2008)

  9. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain., Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)

  10. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain., Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)

  11. El Emam, K., Dankar, F.K., Isaa, R., Jonker, E., Amyot, D., Cogo, E., Corriveau, J.-P., et al.: A globally optimal k-anonymity method for the de-identification of health data. J. Am. Med. Inform. Assoc. 16(5), 670–682 (2009)

  12. Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)

  13. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International (1998)

  14. Schrittwieser, S., Kieseberg, P., Echizen, I., Wohlgemuth, S., Sonehara, N., Weippl, E.: An algorithm for k-anonymity-based fingerprinting. In: Shi, Y.Q., Kim, H.-J., Perez-Gonzalez, F. (eds.) IWDW 2011. LNCS, vol. 7128, pp. 439–452. Springer, Heidelberg (2012)

  15. Willenborg, L., Kardaun, J.: Fingerprints in microdata sets. In: CBS (1999)

  16. Bui, T.V., Nguyen, B.Q., Nguyen, T.D., Sonehara, N., Echizen, I.: Robust fingerprinting codes for database. In: Aversa, R., Kołodziej, J., Zhang, J., Amato, F., Fortino, G. (eds.) ICA3PP 2013, Part II. LNCS, vol. 8286, pp. 167–176. Springer, Heidelberg (2013)

  17. Bui, T.V., Nguyen, B.Q., Nguyen, T.D., Sonehara, N., Echizen, I.: Robust fingerprinting codes for database using non-adaptive group testing. Int. J. Big Data Intell. 2(2), 81–90 (2015)

  18. Li, Y., Swarup, V., Jajodia, S.: Fingerprinting relational databases: schemes and specialties. IEEE Trans. Dependable Secure Comput. 2(1), 34–45 (2005)

  19. Guo, F., Wang, J., Li, D.: Fingerprinting relational databases. In: Proceedings of the ACM Symposium on Applied Computing, pp. 487–492. ACM (2006)

  20. Agrawal, R., Kiernan, J.: Watermarking relational databases. In: Proceedings of the 28th International Conference on VLDB, pp. 155–166 (2002)

  21. Forney Jr., G.D.: Concatenated codes. DTIC Document (1965)

  22. Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8(2), 300–304 (1960)

  23. Wicker, S.B., Bhargava, V.K.: Reed-Solomon Codes and Their Applications. Wiley-IEEE Press, New York (1999)

  24. Gilbert, E.N.: A comparison of signalling alphabets. Bell Syst. Tech. J. 31(3), 504–522 (1952)

  25. Varshamov, R.R.: Estimate of the number of signals in error correcting codes. Dokl. Akad. Nauk SSSR 117(5), 739–741 (1957)

  26. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. 105(4), 1118–1123 (2008)

  27. McAfee, A., Brynjolfsson, E.: Big data: the management revolution. Harvard Bus. Rev. 90, 60–68 (2012)

  28. Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspect. 28, 3–27 (2014)

  29. Wu, X., Zhu, X., Wu, G.-Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)

  30. NSF-NIH Interagency Initiative. Core techniques and technologies for advancing big data science and engineering (BIGDATA) (2012)

  31. Google. http://investor.fb.com/releasedetail.cfm?ReleaseID=893395

  32. Google. http://www.google.com/zeitgeist/2012/#the-world

Author information

Corresponding author

Correspondence to Thach V. Bui.

A Omitted Proofs from Sect. 5

Proof of Theorem 19

Proof

Set \(n = n_1 n_2\). Since C is an \(n_1 n_2 \times 2^{k_1 k_2}\) matrix, each codeword of C has length \(n_1 n_2\). Each database is represented by a codeword of C. Since a codeword built from inner codewords of weight \(w_j\) has Hamming weight \(n_1 w_j\), each of its entries is 1 with probability \(p_j = \frac{n_1 w_j}{n_1 n_2} = \frac{w_j}{n_2}\).

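Colluders achieve perfect collusion if \(\prod _{i \in C, i=1}^{|C|} x_{ij}=0\) for all \(j=1,\ldots ,n\), that is, if every row contains at least one 0 among the colluders' codewords. With \(c = |C|\) colluders, the probability that a row contains at least one 0 is: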

$$\begin{aligned} 1 - \prod _{i=1}^c p_{j_i} = 1 - \frac{\prod _{i=1}^{c}w_{j_i}}{n_2^c} \le 1 - \left( \frac{w_{min}}{n_2} \right) ^c \end{aligned}$$
(3)

The probability that each of the n rows contains at least one 0 is:

$$\begin{aligned} \left( 1 - \prod _{i=1}^c p_{j_i} \right) ^n \le \left( 1 - \left( \frac{w_{min}}{n_2} \right) ^c \right) ^{n_1 n_2} \end{aligned}$$
(4)

When \(C_{in}\) is a constant-weight code, \(w_{min} = w_{max} = w_i\) for every codeword, and Eq. 4 holds with equality.
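
The right-hand side of Eq. 4 is easy to evaluate numerically. The following is a minimal sketch rather than the authors' implementation; the function name `perfect_collusion_bound` and the parameter values in the usage lines are hypothetical and chosen only for illustration.

```python
def perfect_collusion_bound(n1: int, n2: int, w_min: int, c: int) -> float:
    """Right-hand side of Eq. 4: (1 - (w_min / n2) ** c) ** (n1 * n2)."""
    if not 0 < w_min <= n2:
        raise ValueError("expected 0 < w_min <= n2")
    if c < 1:
        raise ValueError("expected at least one colluder")
    per_row = 1.0 - (w_min / n2) ** c  # Eq. 3: bound for a single row
    return per_row ** (n1 * n2)        # Eq. 4: all n = n1 * n2 rows


if __name__ == "__main__":
    # Hypothetical parameters, not taken from the paper.
    for colluders in (2, 3, 4, 5):
        bound = perfect_collusion_bound(n1=15, n2=8, w_min=4, c=colluders)
        print(colluders, bound)
```

For fixed \(n_1\), \(n_2\) and \(w_{min} < n_2\), the bound grows with the number of colluders c, because \((w_{min}/n_2)^c\) shrinks as c increases.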

Proof of Theorem 22

Proof

In accordance with Definition 12, the price of distributing a database is

$$\begin{aligned} \frac{1}{wt(\texttt {a~codeword}) + 1} &< price(\texttt {a~database}) = \frac{1}{wt(\texttt {a~codeword}) + \frac{sum(\texttt {a~codeword}) - wt(\texttt {a~codeword})}{sum(\texttt {a~codeword})}} \\ &< \frac{1}{w_{min}} \end{aligned}$$
(5)

Since C has \(2^{k_1k_2}\) codewords, each of Hamming weight at least \(n_1 w_{min}\), the total price of distributing databases embedded with these codewords is

$$\begin{aligned} \sum _{i=1}^{2^{k_1 k_2}} \frac{1}{wt(\texttt {a~codeword}) + \frac{sum(\texttt {a~codeword}) - wt(\texttt {a~codeword})}{sum(\texttt {a~codeword})}} < 2^{k_1 k_2} \times \frac{1}{n_1w_{min}} \end{aligned}$$
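
The price expression in Eq. 5 and the total above can be sketched in a few lines of Python. Definition 12 is not reproduced in this appendix, so this sketch assumes that `wt` is the number of nonzero entries of a codeword and that `sum` is the sum of its non-negative entries; the function names and the example codeword are hypothetical.

```python
from typing import Sequence


def hamming_weight(codeword: Sequence[int]) -> int:
    """Number of nonzero entries of the codeword."""
    return sum(1 for symbol in codeword if symbol != 0)


def price(codeword: Sequence[int]) -> float:
    """Price of one database, following the expression in Eq. 5.

    Assumption: sum(a codeword) is the sum of its non-negative entries.
    """
    wt = hamming_weight(codeword)
    total = sum(codeword)
    if total <= 0:
        raise ValueError("the codeword must have a positive entry sum")
    return 1.0 / (wt + (total - wt) / total)


def total_price(codewords: Sequence[Sequence[int]]) -> float:
    """Sum of the prices over all distributed databases."""
    return sum(price(codeword) for codeword in codewords)


if __name__ == "__main__":
    example = [1, 0, 2, 0, 3]  # hypothetical codeword: wt = 3, sum = 6
    p = price(example)
    wt = hamming_weight(example)
    print(p)
    print(1 / (wt + 1) < p <= 1 / wt)  # the per-codeword bounds of Eq. 5
```

Under this reading, a binary codeword has `sum` equal to `wt`, so its price reduces to `1 / wt`.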

Since the number of attributes of the database is unchanged, the block length of \(C_{in}\) is unchanged. Suppose that there is another \(w'_{max}\)-weight code \(C'_{in}\) \([n_2, k'_2, d'_2]_2\). The total price of distributing the databases when using \(C'\), generated from \(C_{out}\) and \(C'_{in}\), satisfies:

$$\begin{aligned} \sum _{i=1}^{2^{k_1 k'_2}} \frac{1}{wt(\texttt {a~codeword}) + \frac{sum(\texttt {a~codeword}) - wt(\texttt {a~codeword})}{sum(\texttt {a~codeword})}} > 2^{k_1 k'_2} \times \frac{1}{n_1 w'_{max} + 1} \end{aligned}$$

To complete our proof, we prove that if the price of distributing the databases when using \(C'\) is smaller than the price of distributing the databases when using C, then \(k_2 > k'_2\). Indeed, under this hypothesis we have:

$$\begin{aligned} 2^{k_1 k_2} \times \frac{1}{n_1w_{min}} > \sum _{i=1}^{2^{k_1 k_2}} \frac{1}{wt(\texttt {a~codeword}) + \frac{sum(\texttt {a~codeword}) - wt(\texttt {a~codeword})}{sum(\texttt {a~codeword})}} \end{aligned}$$
(6)
$$\begin{aligned} {} > \sum _{i=1}^{2^{k_1 k'_2}} \frac{1}{wt(\texttt {a~codeword}) + \frac{sum(\texttt {a~codeword}) - wt(\texttt {a~codeword})}{sum(\texttt {a~codeword})}} \end{aligned}$$
(7)
$$\begin{aligned} {} > 2^{k_1 k'_2} \times \frac{1}{n_1 w'_{max} + 1} \end{aligned}$$
(8)
$$\begin{aligned} \Rightarrow 2^{k_1 (k_2 - k'_2)} > \frac{n_1 w_{min}}{n_1 w'_{max} + 1} \end{aligned}$$
(9)

We consider three cases:

  1. If \(k_2 < k'_2\),

    $$\begin{aligned} 2^{k_1 (k_2 - k'_2)} \le \frac{1}{2^{k_1}} < \frac{1}{n_1} < \frac{n_1 w_{min}}{n_1 w'_{max} + 1} \end{aligned}$$
    (10)

    because \(0 < w_{min}, w'_{max} \le n_2 \le n_1\). This contradicts inequality 9.

  2. If \(k_2 = k'_2\), then \(d_2 = d'_2\) because \(k_2\) and \(k'_2\) are the largest numbers such that \(k_2 \le n_2 - d_2 + 1\) and \(k'_2 \le n_2 - d'_2 + 1\). Therefore, the price of distributing the databases when using \(C'\) is equal to the price of distributing the databases when using C. This contradicts our hypothesis.

  3. If \(k_2 > k'_2\), inequality 9 holds.

Thus, if the price of distributing the databases when using \(C'\) is smaller than the price of distributing the databases when using C, then \(k_2 > k'_2\). Moreover, if \(k'_2 < k_2\), then \(d'_2 > d_2\).

According to Corollary 20, if the relative distance of \(C_{in}\) increases (decreases), the probability of perfect collusion decreases (increases). According to Corollary 15, if the code rate of \(C_{in}\) increases, the price of distributing databases using C increases. Therefore, the lower the price of distributing a database, the less collusion resistance the database owner deals with.
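
The case analysis on Eq. 9 can be checked numerically for concrete parameters. This is a minimal sketch under stated assumptions: the helper name `inequality_9_holds` and the parameter values are hypothetical, chosen so that \(0 < w_{min}, w'_{max} \le n_2 \le n_1\) and \(2^{k_1} > n_1\), as the bound in Eq. 10 requires. Exact rational arithmetic avoids floating-point rounding.

```python
from fractions import Fraction


def inequality_9_holds(k1: int, k2: int, k2_prime: int,
                       n1: int, w_min: int, w_max_prime: int) -> bool:
    """Check 2^(k1 * (k2 - k2')) > n1 * w_min / (n1 * w'_max + 1) exactly."""
    lhs = Fraction(2) ** (k1 * (k2 - k2_prime))
    rhs = Fraction(n1 * w_min, n1 * w_max_prime + 1)
    return lhs > rhs


if __name__ == "__main__":
    # Hypothetical parameters: k1 = 4, n1 = 15, w_min = 3, w'_max = 4.
    params = dict(k1=4, n1=15, w_min=3, w_max_prime=4)
    print(inequality_9_holds(k2=3, k2_prime=4, **params))  # case k2 < k2'
    print(inequality_9_holds(k2=4, k2_prime=3, **params))  # case k2 > k2'
```

With these values the left-hand side is 1/16 when \(k_2 < k'_2\) and 16 when \(k_2 > k'_2\), while the right-hand side is 45/61 in both cases, so Eq. 9 fails in the first case and holds in the second.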

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bui, T.V., Nguyen, T.D., Sonehara, N., Echizen, I. (2015). Tradeoff Between the Price of Distributing a Database and Its Collusion Resistance Based on Concatenated Codes. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_12

  • DOI: https://doi.org/10.1007/978-3-319-27122-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27121-7

  • Online ISBN: 978-3-319-27122-4

  • eBook Packages: Computer Science (R0)
