Privacy-Preserving Hypothesis Testing for Reduced Cancer Risk on Daily Physical Activity
Abstract
Privacy-preserving data mining of medical information is important for guaranteeing confidentiality when multiple data sets are integrated. In this paper, we propose a secure scheme to estimate the relative risk of cancers accurately and efficiently in a privacy-preserving way. We study models for choosing an appropriate set of attributes that reduces the risk of an individual being identified. We evaluate the proposed privacy-preserving protocol for encrypted hypothesis testing using actual cohort data supplied by the National Cancer Center.
Keywords
Privacy, Privacy-preserving data mining, Epidemiology, Hypothesis testing
Introduction
Background
Risk factors for cancers have been widely investigated. For example, Cardis et al. [1] at the International Agency for Research on Cancer carried out a collaborative study of cancer risk after low doses of ionizing radiation among nearly 600,000 radiation workers in the nuclear industry in 15 countries. The results indicate that the excess relative risk for cancers other than leukemia was 0.97 per Sv (95% confidence interval 0.14 to 1.97). They also found that an excess risk of cancer exists for nuclear workers even at low doses and dose rates.
However, such studies must protect the confidentiality of highly sensitive personal information, because they can expose who suffers from cancer. In big data mining, integrating multiple data sets collected via ubiquitous sensors, smartphones, and portable devices makes epidemiological studies more accurate. To achieve accurate data processing while preserving privacy, we consider the following issues:
 privacy issues of patients

Even when the data are held in a confidential dataset and used only for medical studies, cancer patients do not want their condition to be exposed.
 inconsistent identities in multiple datasets

Each institution uses proprietary identifiers to identify individuals. When multiple datasets are integrated, however, a global identifier cannot be assumed, and finding an alternative identifier for datasets with inconsistent identifiers remains a challenge.
A set of personal attributes, such as name, address, and telephone number, can be used to identify individuals. However, no single personal attribute is guaranteed to be unique, so models are needed to choose an appropriate set of attributes, and in particular the optimal combination of attributes.
Related works
Statistical inference applied to single or multiple datasets has been studied in many works. Privacy-preserving algorithms are required because, as Homer et al. [2] pointed out, individual participants can be identified from publicly available aggregate statistics. Statistical estimators are studied by Smith [3] and Agrawal et al. [4], and the risk-utility tradeoff is discussed by Fienberg et al. [5]. Binary hypothesis testing under privacy constraints for large datasets is studied by Liao et al. [6]. Privacy-preserving protocols for radiation data and for partitioned data are discussed by Kikuchi et al. [7] and Vaidya et al. [8].
To characterize medical data and to clarify correlations between parameters within a dataset, or causal relationships between attributes from different datasets, hypothesis testing provides an effective form of statistical inference about the distribution of a dataset. In most conventional works, however, raw data are used for the analysis.
In this work, we propose a new privacy-preserving hypothesis testing protocol and specify the best combination of significant personal attributes to serve as a quasi-identifier that identifies a particular individual across multiple datasets for statistical inference.
Our contributions
In this paper, we propose privacy-preserving schemes to estimate the relative risk (RR) of cancers. The schemes use the cryptographic protocol Private Set Intersection (PSI) to carry out epidemiological processing securely, including set intersection for mortality rates and evaluation of test statistics for hypothesis testing. Confidentiality of the data is preserved even after the intersection of two subsets is computed.

We propose a privacy-preserving protocol for hypothesis testing that uses a set of personal attributes as quasi-identifiers.

We conduct an experiment with the proposed protocol to estimate the relative risk of cancer in terms of the quantity of daily physical activity.
Privacy-preserving hypothesis testing
Purpose statement
In this paper, we consider a use case of data analysis over distributed datasets supplied by different providers.
As an example, consider the risk from radiation. Suppose party A is an agency that maintains lists of workers who are exposed to radiation. Such data are available because regulations in many countries limit the total annual dose of radiation, and employees of nuclear power stations are required to record their doses. In Japan, working under an exposure of more than 50 mSv is prohibited [9, 10]. Party B is a cancer hospital that keeps a dataset of cancer patients.
Both parties A and B must keep their datasets X_{A} and X_{B} confidential. However, knowing the correlation between the risk of cancer and the dose of radiation would be useful for further medical care and research. Thus, a privacy-preserving scheme for confidential computation over distributed datasets is required.
Death rates (mortality rates) of the two datasets are compared to clarify the correlation. The mortality rate is adjusted for the different distributions of age groups in the two datasets. Let X_{A,y} be the subset of X_{A} for the age group starting at y years, in increments of 10 years. Then X_{A} can be partitioned as X_{A} = X_{A,30} ∪ X_{A,40} ∪ … ∪ X_{A,80}. The expected number of deaths computed over these age groups gives the standardized mortality rate.
Relative risk
We examine the risk factors for a disease by dividing the participants of a cohort study into two groups, with and without exposure. The relative risk is defined as the ratio of the probability of disease among members who received the exposure to the probability of disease among members who did not [12].
Contingency table for a case-control study
Outcome  Smoking  Nonsmoking  Total
Cancer  a  b  n_{1}
Noncancer  c  d  n_{2}
Total  m_{1}  m_{2}  N
An RR greater than 1.0 indicates an increased risk of disease in the exposed group. In this paper, a hypothesis test is used to examine the significance of the RR as follows:
 null hypothesis:

H_{0}: The proportion of participants who suffer from cancer is equal among smoking and nonsmoking participants;
 alternative hypothesis:

H_{A}: The proportion of participants who suffer from cancer differs between smoking and nonsmoking participants.
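Using the cell counts of the contingency table above, the relative risk and a test statistic for H_{0} can be written out explicitly. The following is a sketch that assumes the standard Pearson chi-square statistic for a 2 × 2 table without continuity correction, with m_{1} = a + c, m_{2} = b + d, n_{1} = a + b, and n_{2} = c + d:

\[
\mathrm{RR} = \frac{a / m_{1}}{b / m_{2}}, \qquad
\chi^{2} = \frac{N (ad - bc)^{2}}{n_{1}\, n_{2}\, m_{1}\, m_{2}}
\]

Under H_{0}, this statistic approximately follows a chi-square distribution with one degree of freedom, so H_{0} would be rejected at the 5% level when χ^{2} exceeds 3.84.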
Private set intersection
Private set intersection (PSI) is a cryptographic protocol that allows multiple parties to compute the intersection of their private sets without revealing anything else about the sets. Many protocols have been proposed since Freedman, Nissim, and Pinkas proposed the first PSI protocol, which uses a polynomial representation of sets [17]. Abadi et al. presented a delegated PSI protocol on outsourced private datasets, intended for use with cloud data stores [18]. Among them, the following three works can be used to estimate the size of the intersection X ∩ Y, which corresponds to the number of smokers (X) suffering from cancer (Y).
FNP04 (oblivious polynomial evaluation) [17]
The scheme presented in [17] uses oblivious polynomial evaluation, in which the elements of a set are represented by a polynomial f(x) over a finite field. It is a two-party protocol in which one party encodes its elements x_{1}, x_{2}, … as the roots of the polynomial and the other party evaluates f(y_{1}), f(y_{2}), … in a privacy-preserving way. The polynomial evaluates to 0 if and only if y_{j} = x_{i} for some i, that is, y_{j} ∈ X ∩ Y. The drawback of the protocol is its computational complexity: the running cost is proportional to the degree of the polynomial, which equals the number of elements in the set.
SSP (Secure Scalar Product) [16]
The scalar product of two vectors is computed securely using an additively homomorphic public-key algorithm. It is a two-party protocol that can serve as a primary building block in many applications. Instances of additively homomorphic algorithms include Paillier encryption [14], lattice-based encryption, and elliptic-curve cryptosystems. The basic version allows vectors of arbitrary values, whereas set intersection requires only Boolean vectors whose 0 or 1 entries indicate membership in a subset. Hence, it is too expensive in terms of computational cost for evaluating the size of a set intersection in a privacy-preserving manner.
AES03 (commutative one-way function) [13]
Agrawal et al. used a commutative Pohlig-Hellman cipher [15] and a secure hash function as building blocks to construct a two-party protocol that computes the set intersection. As an extension, they also showed a modified protocol that yields only the size of the intersection without revealing its elements. The idea is that the commutative property of two independent encryptions makes it possible to find the common elements of two sets. Let f and g be (symmetric but commutative) encryption functions privately generated by Alice and Bob, respectively. Because f(g(x)) = g(f(x)) holds for every x, an element x held by Alice and an element y held by Bob are equal exactly when g(f(x)) = f(g(y)).
Algorithm 1 instantiates the commutative encryptions f and g with the Pohlig-Hellman cipher, defined simply as f(m) = H(m)^{u} mod p and g(m) = H(m)^{v} mod p, respectively. Note that the algorithm is the version that outputs only the size of the intersection; it can be used to determine the elements of the intersection by modifying Step 2 to send (H(x)^{u})^{v} together with the ciphertext H(x)^{u}.
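The size-only variant described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the protocol as implemented in this work: both parties run in one process, the 127-bit Mersenne prime modulus is a toy parameter chosen for speed, SHA-256 stands in for H, and all function names (`hash_to_group`, `keygen`, `encrypt`, `intersection_size`) are hypothetical.

```python
import hashlib
import math
import secrets

# Toy modulus (assumption): the Mersenne prime 2^127 - 1 keeps the demo fast.
# A deployment would use a much larger prime.
P = 2**127 - 1

def hash_to_group(item):
    """Map an identifier string to an integer in [1, P-1] (stand-in for H)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def keygen():
    """Pick a secret exponent coprime to P-1 so exponentiation is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def encrypt(values, key):
    """Pohlig-Hellman style encryption: raise each value to the secret key mod P."""
    return [pow(v, key, P) for v in values]

def intersection_size(alice_set, bob_set):
    u, v = keygen(), keygen()  # Alice's and Bob's private exponents
    # Alice sends H(x)^u for each of her elements, in sorted (unlinkable) order.
    alice_once = sorted(encrypt(map(hash_to_group, alice_set), u))
    # Bob raises Alice's values to v and also sends his own H(y)^v.
    alice_twice = set(encrypt(alice_once, v))
    bob_once = sorted(encrypt(map(hash_to_group, bob_set), v))
    # Alice raises Bob's values to u; equal double encryptions mean equal elements.
    bob_twice = set(encrypt(bob_once, u))
    return len(alice_twice & bob_twice)

print(intersection_size({"alice", "bob", "carol"}, {"bob", "carol", "dave"}))
```

Because (H(x)^{u})^{v} = (H(x)^{v})^{u} mod p, only the count of matching double encryptions is revealed, not which elements matched.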
Hypothesis testing
Privacy search oracle model
Personal attributes may be used to identify individuals. The accuracy of identification depends on the type of attribute. For example, the attribute sex carries 1 bit of information and divides the set of individuals into two classes. A birthday has a domain of 365 values, which is equivalent to log_{2}(365) = 8.51 bits of entropy. Hence, combining sex and birthday gives 1 + 8.51 = 9.51 bits of entropy, which can distinguish 2^{9.51} ≈ 729 individuals on average.
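The arithmetic in this paragraph can be reproduced with a short sketch. It assumes that each attribute is uniformly distributed over its domain and that the attributes are independent, so that entropies add; the function name `entropy_bits` is illustrative.

```python
import math

def entropy_bits(domain_size):
    """Entropy in bits of an attribute uniformly distributed over its domain."""
    return math.log2(domain_size)

sex_bits = entropy_bits(2)          # 1 bit
birthday_bits = entropy_bits(365)   # about 8.51 bits
combined_bits = sex_bits + birthday_bits

# Number of individuals the combined attributes can distinguish on average.
distinguishable = 2 ** combined_bits

print(round(birthday_bits, 2), round(combined_bits, 2), round(distinguishable))
```

Without intermediate rounding the last value is 2 × 365 = 730; the figure of 729 in the text comes from rounding the entropy to 9.51 bits first.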
Entropy of personal attributes
Personal attribute  Entropy [bit]  Ambiguity  Max # of duplicated IDs  Description
Name in Chinese char. (Kanji)  27  High  24  Same name can be written in several ways. 
Name in Japanese char. (Kana)  N/A  Low  30  Several representations can be unified. 
Sex  1  None  61,020  Male or female (1 bit) 
Birthday and year  15  None  86  365 days (15 bit) 
Mailing address  26  Low  56  Almost unique but several representations in font. 
City (ku, machi, mura)  14  High  12,131  Not very unique for historical reason. 
Prefecture (states)  6  None  22,336  Unique. 
C/A  2  High  N/A  Occasionally specified; not a complete attribute.
Privacy-preserved estimation of reduced cancer risk
Problem description
Suppose Alice is a national cancer center that maintains comprehensive personal attributes of patients with gastric, lung, colon, and other cancers. In our study, we focus mainly on colon cancer because its risk is known to be significantly correlated with daily physical activity. Alice owns the set of colon cancer patients, X.
Bob is a provider of datasets on personal physical activity. Examples include a sports club that records the frequency of exercise of each member, a public health center that periodically surveys personal information of citizens, or a commercial health company that monitors the quantity of daily physical activity with vital-sign devices. Metabolic equivalents (METs) are used to quantify the daily total physical activity level, based on questionnaires about hours per day of heavy physical work, walking, and sitting, and days per week of leisure-time sports or exercise [19]. With the METs score, Bob classifies the people into four (q = 4) disjoint classes, Lowest (L), Second (S), Third (T), and Highest (H), specified by four subsets of U, namely Y_{1}, Y_{2}, Y_{3}, and Y_{4}, respectively.
Number of people with exactly the same name
Institutions that own datasets use proprietary identifiers to identify individuals. However, to join multiple datasets with inconsistent identifiers, alternative identifiers are necessary. We study a set of significant personal attributes that identifies a unique individual as a quasi-identifier. For example, a name is known to be an almost unique attribute, but there are exceptions in which multiple people have exactly the same surname and given name. Figures 1 and 2 show the number of people for whom x individuals have exactly the same surname and given name in several datasets: JPHC,^{1} Univac [20], and NTT.^{2} Both the vertical and horizontal axes are plotted in log scale. In JPHC, there are about 100 thousand people with a unique name (x = 1), which drops to about 2 thousand when two people share the same name (x = 2).
Based on the observed distribution of people with exactly the same name, we adopt Zipf's law as a mathematical model, which states that the number of people f(x) at rank x is proportional to 1/x.
Combination of attributes as a unique identifier
Entropies of some combinations of personal attributes
Option  Set of attributes  Entropy [bit]  Max # duplicated records  # of unresolved records
A  Name in Kana, sex  14  30  30,180 
B  Name in Kana, sex, birthday  30  2  16 
C  Name in Kana, sex, birthday, state  36  2  12 
D  Name in Kana, sex, birthday, address  56  0  0 
E  Name in Kana, birthday, address  55  0  0 
F  Name in Kana, address  40  2  16 
G  Sex, birthday, address  42  2  10 
According to Table 3, options D (name, sex, birthday, address) and E (name, birthday, address) uniquely identify all individuals in the JPHC dataset.
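How a candidate combination of attributes can be checked for uniqueness is sketched below. The records are hypothetical toy data, not JPHC records, and the two returned quantities are assumed readings of the table columns: the size of the largest group of records sharing a key (0 when every key is unique) and the number of records that share their key with at least one other record.

```python
from collections import Counter

def duplicate_stats(records, attributes):
    """Return (max duplicated records, unresolved records) for a quasi-identifier."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    group_sizes = [n for n in Counter(keys).values() if n > 1]
    max_duplicated = max(group_sizes, default=0)
    unresolved = sum(group_sizes)
    return max_duplicated, unresolved

# Hypothetical toy records for illustration only.
records = [
    {"name": "Sato Hanako", "sex": "F", "birthday": "1950-01-01", "address": "Tokyo A"},
    {"name": "Sato Hanako", "sex": "F", "birthday": "1950-01-01", "address": "Osaka B"},
    {"name": "Suzuki Ichiro", "sex": "M", "birthday": "1948-05-05", "address": "Tokyo C"},
]

print(duplicate_stats(records, ["name", "sex", "birthday"]))             # option B style
print(duplicate_stats(records, ["name", "sex", "birthday", "address"]))  # option D style
```

In this toy data the first key leaves two records unresolved, while adding the address resolves all of them, which mirrors the pattern between options B and D.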
Proposed scheme
In Algorithm 2, we propose a cryptographic protocol between Alice, who holds X, and Bob, who holds Y_{1}, …, Y_{q}, for privacy-preserved relative risk estimation without revealing identities to the other party. It uses Algorithm 1 as a subprotocol.
The protocol aims to compute relative risks with respect to attributes of interest. It is a two-party cryptographic protocol between a party (Alice) who holds a set of (colon cancer) patients and a party (Bob) who holds the results of a questionnaire survey. Alice and Bob could be, for example, a national cancer registry and a local government that conducts a survey of its citizens. The parties are not allowed to share their databases without the consent of all subjects. Instead of sharing, Algorithm 2 allows them to evaluate the relative risk of cancer with respect to the questionnaire survey without revealing who suffered from cancer or the survey results. In our experiment, we are interested in the relative risk of colon cancer in terms of sex and the frequency of daily physical activity. Algorithm 2 also gives the test statistic χ^{2} needed to perform hypothesis testing without revealing any personal attribute.
Contingency table and relative risks with test statistic
Group  X ∩ Y_{p}  Y_{p} − (X ∩ Y_{p})  Y_{p}  RR  N_{p}  \( \chi^{2}_{p} \)
Y_{1}  c  d  c + d  1.0  –  Reference
Y_{2}  a_{2}  b_{2}  a_{2} + b_{2}  \( \frac{a_{2}}{a_{2}+b_{2}} / \frac{c}{c+d} \)  a_{2} + b_{2} + c + d  \( \frac{N_{2}(a_{2}d - b_{2}c)^{2}}{(a_{2}+b_{2})(c+d)(a_{2}+c)(b_{2}+d)} \)
⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
Y_{q}  a_{q}  b_{q}  a_{q} + b_{q}  \( \frac{a_{q}}{a_{q}+b_{q}} / \frac{c}{c+d} \)  a_{q} + b_{q} + c + d  \( \frac{N_{q}(a_{q}d - b_{q}c)^{2}}{(a_{q}+b_{q})(c+d)(a_{q}+c)(b_{q}+d)} \)
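The RR and χ^{2} columns of this table can be evaluated directly from the four counts once the intersection sizes are known. The sketch below uses illustrative function names and, as input, the counts reported for men in the Lowest (reference) and Second activity classes in the experimental evaluation of this paper.

```python
def relative_risk(a, b, c, d):
    """RR of group p (a cases, b non-cases) against the reference (c cases, d non-cases)."""
    return (a / (a + b)) / (c / (c + d))

def chi_square(a, b, c, d):
    """Pearson chi-square statistic of the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Men, reference class L: 79 cases and 13915 non-cases.
c, d = 79, 13915
# Men, class S: 36 cases and 8229 non-cases.
a, b = 36, 8229

print(round(relative_risk(a, b, c, d), 2))  # 0.77
print(round(chi_square(a, b, c, d), 2))     # 1.68
```

Only the counts a, b, c, and d enter the computation, which is why the size-only output of the set intersection protocol is sufficient.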
Experimental evaluation
Experiment with JPHC dataset
Relative risk of colon cancer according to daily total physical activity level
Level  X ∩ Y_{i}  Y_{i} − (X ∩ Y_{i})  Y_{i}  RR  \( \chi^{2}_{(i)} \)
Men n = 46,236
(178)  (41,108)  (41,286)
L(16,374)  79  13915  13994  1.00  Reference 
S(9,594)  36  8229  8265  0.77  1.68 
T(9,085)  25  7865  7890  0.56  6.54 
H(11,184)  32  9830  9862  0.57  7.20 
Women n = 52,891
(130)  (46,330)  (46,460)  
L(17,404)  40  14347  14387  1.00  Reference 
S(13,795)  32  11703  11735  0.98  0.01 
T(11,865)  32  10283  10315  1.12  0.21 
H(9,827)  19  8473  8492  0.80  0.61 
However, according to the experimental results, the test statistics for women are not significant. Possible reasons are that the age distribution was skewed, or that other exposure factors such as smoking habits had an effect.
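The significance of each test statistic in the table can be checked with a short sketch. It assumes that each statistic comes from a 2 × 2 comparison against the reference class, so that it has one degree of freedom, and it assumes a 5% significance level; the function names are illustrative.

```python
import math

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Test statistics copied from the table above (S, T, H versus the L reference).
men = {"S": 1.68, "T": 6.54, "H": 7.20}
women = {"S": 0.01, "T": 0.21, "H": 0.61}

ALPHA = 0.05  # assumed significance level

def significant_levels(stats):
    """Activity levels whose statistic is significant at level ALPHA."""
    return [level for level, chi2 in stats.items() if p_value_1df(chi2) < ALPHA]

print(significant_levels(men))
print(significant_levels(women))
```

Under these assumptions, only the Third and Highest classes for men exceed the critical value of 3.84, and none of the classes for women do.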
Evaluation on performance
Experimental environments
Modulus size p  2048 bit 
Order of G  160 bit 
Domain of u, v  160 bit 
Application impl.  Scala 
SHA-1  Java sphlib 
Modular arithmetic  Java BigInteger 
Data Structure  Java HashSet Collection 
OS  Ubuntu 12.10 amd64 
CPU  Intel Celeron Processor G1610 
Memory  4 GB (DDR3 SDRAM PC3-10600) 
Network speed  46 Mbps (measured values average) 
Modular exponentiation becomes the dominant cost of the computation when the number of individuals exceeds 140k. However, this problem can be addressed, and the performance improved, by distributed computing across multiple machines or by parallel computation.
Evaluation on security
In [13], it is shown that, in the random oracle model with no hash collisions and in the semi-honest model, no polynomial-time algorithm can distinguish H(x)^{u} from a random value given x.
In summary, in our use case the data providers are assumed to be honest-but-curious, which is also known as the semi-honest model: the providers, who own private datasets, follow the protocol properly but try to learn additional information about the other party's dataset from the messages they receive. It is reasonable to assume the honest-but-curious model because either party may be interested in learning personal data, since personal data such as names linked with diseases could be traded in underground markets. The malicious model is too strong an assumption in the two-party case, where a malicious party could be excluded easily.
We make the following remarks about the security of the proposed scheme.
Remark 1
Under the Decisional Diffie-Hellman (DDH) assumption, no party learns any element that does not belong to the intersection from the output of Algorithm 1.
Proof
DDH states that, for any element g ∈ Z_{q}, the distribution of 〈g^{a}, g^{b}, g^{ab}〉 is indistinguishable from the distribution of 〈g^{a}, g^{b}, g^{c}〉, where a, b, c ∈ Z_{q−1}. Without loss of generality, assume that Alice can determine H(y)^{v} for any y ∉ X ∩ Y when she has H(x)^{u} and H(x)^{uv} for some known x and u. By setting g^{a} = H(x)^{u} and g^{ab} = H(x)^{uv}, she can distinguish g^{ab} from g^{c}, because by the above assumption she distinguishes H(x)^{uv} from H(y)^{v}. This contradicts the DDH assumption, which completes the proof. □
Remark 2
No party learns any element that does not belong to any intersections for q subsets.
Proof
It follows directly from the construction of Algorithm 2. Given c = |X ∩ Y_{1}|, there are |Y_{1}| − c possible elements in Y_{1} outside the intersection, none of which can be guessed with probability better than the trivial 1/(|Y_{1}| − c). Similarly, no information can be learned from a_{i} = |X ∩ Y_{i}| and b_{i} = |Y_{i}| − a_{i} for i = 2, …, q. □
Comparison to conventional works
Comparison of building blocks
Other applications
The PSI protocol is applicable to a broad range of fields, not only cancer risk evaluation but also enterprise and government. We give some potential applications.
Intellectual property and patents
Enterprises that hold confidential technologies are interested in searching their competitors' patents. However, it is not clear whether a competitor owns unpublished intellectual property that conflicts with the enterprise's private technology. If the two share common technologies, they may wish to collaborate or to license them to each other. If their confidential technologies are disjoint, they want to keep them secret. This can be solved by applying the PSI protocol to the sets of technical terms in their documents. They can determine whether their unpublished intellectual properties are common or disjoint without revealing the confidential documents.
Epidemiological studies
Epidemiological studies aim to clarify the outcome of a dose whose effect is not well known. A dose-response test aims to clarify the positive (or negative) correlation between the amount of a dose and the outcome in a clinical laboratory test. The subjects are divided into several groups under the same conditions, each group is given a specified amount of the dose, and the responses are observed. The proposed protocol can be applied to a privacy-preserving dose-response test between two parties.
Big data study
Linking the database of individual tax payments held by the tax and customs office with educational records could reveal correlations between working at a university and yearly income.
Conclusions
We have proposed a privacy-preserving hypothesis testing scheme for epidemiological studies that calculates the relative risk of cancer from datasets held by distributed providers. The proposed schemes allow independent providers with confidential datasets to compute correlations between any attributes of interest. Our experiment yields relative risks close to those of conventional work with raw data, indicating that daily physical activity reduces the risk of cancer at a significant level of confidence in some of the experiments.
Notes
Acknowledgment
We appreciate the support from Dr. Kawamura, Mr. Uozumi, Mr. Higashi, Mr. Koyanagi, Mr. Taguchi, Mr. Kato, and Mr. Ohkubo for giving insightful suggestions and cooperation for the experiments.
Compliance with Ethical Standards
Conflict of interests
Hiroaki Kikuchi has received research grants from Cyber Communications Inc.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the National Cancer Center.
Informed consent
Informed consent was obtained from all individual participants included in the study.
References
 1. Cardis, E., Vrijheid, M., Blettner, M., and Gilbert, E.: Risk of cancer after low doses of ionizing radiation: retrospective cohort study in 15 countries, BMJ Online First, pp. 1–6, 2005.
 2. Homer, N., Szelinger, S., and Redman, M.: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays, PLoS Genetics, 4(8), 2008.
 3. Smith, A.: Privacy-preserving statistical estimation with optimal convergence rates, in proc. of the forty-third annual ACM symposium on Theory of computing, pp. 813–822, 2011.
 4. Agrawal, R., Evfimievski, A., and Srikant, R.: Information sharing across private databases, in proc. of ACM SIGMOD International Conference on Management of Data, 2003.
 5. Fienberg, S. E., Rinaldo, A., and Yang, X.: Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In: Domingo-Ferrer, J., and Magkos, E. (Eds.) PSD 2010, LNCS 6344, pp. 187–199, 2010.
 6. Liao, J. C., and Sankar, L.: Hypothesis Testing in the High Privacy Limit, in proc. of 54th Annual Allerton Conference on Communication, Control, and Computing, pp. 649–656, 2016.
 7. Kikuchi, H., Sato, T., and Sakuma, J.: Privacy-Preserving Protocol for Epidemiology in Effect of Radiation, Proceedings of the 2013 Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS 13), pp. 831–836, IEEE, 2013.
 8. Vaidya, J., and Clifton, C.: Privacy preserving association rule mining in vertically partitioned data, in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD, ACM Press, Edmonton, Canada, pp. 639–644, 2002.
 9. Radiation Effects Association, Annual Report on radiation epidemiological study for workers in nuclear-power station, 2010. (written in Japanese)
 10. Ministry of Health, Labor and Welfare, the Vital Statistics of Japan. (available from http://www.mhlw.go.jp/english/database/index.html)
 11. Yasui, R., Sato, K., Harigaya, T., Kanai, A., Hirota, K., and Tanimoto, S., A proposal of privacy search oracle model for estimating personal information disclosure level of blog articles. IPSJ SIG Technical Report 2009-EIP-43:9–16, 2009. (in Japanese).
 12. Pagano, M., and Gauvreau, K.: Principles of biostatistics 2nd ed., Brooks/Cole, 2000.
 13. Agrawal, R., Evfimievski, A., and Srikant, R.: Information sharing across private databases, in proc. of ACM SIGMOD International Conference on Management of Data, 2003.
 14. Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes, In Advances in Cryptology - EUROCRYPT 1999, pp. 223–238, Springer, 1999.
 15. Pohlig, S., and Hellman, M.: An Improved Algorithm for Computing Logarithms over GF(p) and its Cryptographic Significance, IEEE Transactions on Information Theory, (24), pp. 106–110, 1978.
 16. Goethals, B., Laur, S., Lipmaa, H., and Mielikainen, T.: On Secure Scalar Product Computation for Privacy-Preserving Data Mining, In Choonsik Park and Seongtaek Chee, editors, The 7th Annual International Conference in Information Security and Cryptology (ICISC 2004), volume 3506, pp. 104–120, December 2–3, 2004.
 17. Freedman, M. J., Nissim, K., and Pinkas, B.: Efficient private matching and set intersection, EUROCRYPT 2004, LNCS 3027, pp. 1–19, Springer, 2004.
 18. Abadi, A., Terzis, S., Metere, R., and Dong, C.: Efficient Delegated Private Set Intersection on Outsourced Private Datasets, Proceedings of the 30th International Conference on ICT Systems Security and Privacy Protections (SEC 2015), pp. 3–17, 2015.
 19. Inoue, Daily total physical activity level and total cancer risk in men and women: results from a large-scale population-based cohort study in Japan. Am. J. Epidemiol. 168:391–403, 2008.
 20. Tanaka, Y., Frequency of people with same first and last names. IPSJ SIG Technical Report on Natural Language (NL) 1977-NL-010:1–7, 1977.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.