True lies
Abstract
Garbarino et al. (J Econ Sci Assoc. https://doi.org/10.1007/s40881-018-0055-4, 2018) describe a new method to calculate the probability distribution of the proportion of lies told in “coin flip” style experiments. I show that their estimates and confidence intervals are flawed. I demonstrate two better ways to estimate the probability distribution of what we really care about, the proportion of liars, and I provide R software to do this.
Keywords
Lying, Experiment, Estimation
JEL Classification
C91, C81, D03
Some people are honest, while others are likely to lie whenever it benefits them. We would like to understand the prevalence of lying, because dishonesty may be economically and socially harmful. Since we cannot simply ask people if they are liars, one way to estimate the proportion of liars in a group is to ask them to report the result of a coin flip or other random device, offering them a payment if they report heads. Liars do not always lie: they only lie when it benefits them. So, they always report heads irrespective of the true coin flip.^{1} If there are many more heads than we would expect by chance, we can assume many people are lying. But how many?
Garbarino et al. (2018)—GSV from here on—point out this problem and introduce an alternative method. They claim that their method corrects for this problem and can estimate the full distribution of lying outcomes, and they recommend using it for confidence intervals, hypothesis testing and power calculations.
Table 1 Parameter values
Parameter                              Values
Sample size (N)                        10, 50, 100, 500
Probability of lying (\(\lambda\))     0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
Probability of bad random outcome (P)  0.2, 0.5, 0.8
Table 2 Coverage levels for GSV and alternative methods
Method       CI 90%  CI 95%  CI 99%
GSV          65.2%   71.0%   78.9%
Frequentist  91.4%   94.4%   96.5%
Bayesian     91.3%   95.5%   99.1%
By definition, 95% of 95% confidence intervals ought to contain the true value, on average. This is called “achieving nominal coverage”. GSV confidence intervals are too narrow.^{2}
To deal with this problem, I test two alternative methods for calculating confidence intervals on my simulated data. The first (“Frequentist”) is the standard method of deriving confidence intervals from a binomial test. The second is a Bayesian method.
There are numerous ways to calculate confidence intervals in a test of proportions. See, e.g., Agresti and Coull (1998). Here, I use the binomial exact test of Clopper and Pearson (1934), which is known to be conservative.
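As a rough illustration of the frequentist route, the sketch below computes a Clopper-Pearson interval for the probability of reporting heads and maps it to \(\lambda\) through the naïve estimator. It is written in Python rather than R, uses only the standard library, and is not the code used for the simulations; the function names (`binom_tail`, `solve_increasing`, `clopper_pearson`, `lambda_interval`) are hypothetical. Lower bounds below 0 are set to 0.

```python
from math import comb

def binom_tail(r, n, q):
    """P(X >= r) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(r, n + 1))

def solve_increasing(f, target, steps=60):
    """Bisection on [0, 1] for f(q) = target, where f is increasing in q."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(r, n, level=0.95):
    """Exact (Clopper-Pearson) interval for q, the probability of reporting heads."""
    alpha = 1 - level
    # Lower bound: P(X >= r | q) = alpha / 2; upper bound: P(X <= r | q) = alpha / 2.
    lower = 0.0 if r == 0 else solve_increasing(
        lambda q: binom_tail(r, n, q), alpha / 2)
    upper = 1.0 if r == n else solve_increasing(
        lambda q: binom_tail(r + 1, n, q), 1 - alpha / 2)
    return lower, upper

def lambda_interval(r, n, p_bad, level=0.95):
    """Point estimate and interval for lambda, using q = (1 - p_bad) + p_bad * lambda."""
    def to_lambda(q):
        return max(0.0, (q - (1 - p_bad)) / p_bad)  # negative values are set to 0
    q_lo, q_hi = clopper_pearson(r, n, level)
    return to_lambda(r / n), to_lambda(q_lo), to_lambda(q_hi)

# Illustrative input: 33 reports of heads out of 50, fair coin.
point, low, high = lambda_interval(33, 50, 0.5)
print(f"lambda estimate {point:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Bisection is used here only to avoid a dependency on a beta quantile function; any routine that inverts the binomial tail probabilities would serve equally well.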
The Bayesian method requires a prior. Here, I used a uniform prior, \(\varphi (\lambda ) = 1\) on \([0, 1]\).
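The Bayesian calculation can be sketched on a grid. The Python code below is a minimal illustration, not the R implementation that accompanies this paper: it assumes each subject reports heads with probability \((1 - P) + P\lambda\), discretises \(\lambda\) on bin midpoints, and approximates the highest density region by keeping the densest grid cells. The names `posterior_grid`, `posterior_mean` and `hdr` are hypothetical, and the prior passed in must be strictly positive on \((0, 1)\).

```python
from math import exp, log

def posterior_grid(r, n, p_bad, prior=lambda lam: 1.0, size=2000):
    """Discretised posterior over lambda, evaluated at the midpoints of `size` equal bins."""
    lams = [(i + 0.5) / size for i in range(size)]
    log_w = []
    for lam in lams:
        q = (1 - p_bad) + p_bad * lam  # probability that a subject reports heads
        log_w.append(r * log(q) + (n - r) * log(1 - q) + log(prior(lam)))
    top = max(log_w)                   # subtract the maximum to avoid underflow
    w = [exp(v - top) for v in log_w]
    total = sum(w)
    return lams, [v / total for v in w]

def posterior_mean(lams, w):
    """Posterior mean of lambda on the grid."""
    return sum(lam * v for lam, v in zip(lams, w))

def hdr(lams, w, level=0.95):
    """Interval spanned by the densest grid cells holding `level` of the mass.

    Assumes a unimodal posterior, so the kept cells form a single interval.
    """
    order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += w[i]
        if mass >= level:
            break
    half = 0.5 / len(lams)
    return lams[min(kept)] - half, lams[max(kept)] + half

# Illustrative input: 57 reports of the best outcome from 98 subjects, P = 5/6.
lams, w = posterior_grid(57, 98, 5 / 6)
print(posterior_mean(lams, w), hdr(lams, w, 0.95))
```

With a uniform prior the posterior is proportional to the binomial likelihood restricted to \(\lambda \in [0, 1]\), which is why no bound can fall below 0 in this approach.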
Results in Table 2 show that both frequentist and Bayesian methods mostly achieve the nominal confidence level, with more than 90/95/99% of intervals containing the true value of \(\lambda\). The exception is the frequentist 99% confidence interval, which is too narrow.
Table 3 Confidence interval coverage by sample size
Method       N   CI 90%  CI 95%  CI 99%
GSV          10  61.4%   65.7%   69.8%
GSV          50  67.9%   73.3%   81.1%
Frequentist  10  94.1%   95.7%   96.8%
Frequentist  50  91.6%   94.5%   96.6%
Bayesian     10  91.3%   95.3%   99.1%
Bayesian     50  90.9%   95.5%   99.1%
1 Understanding the GSV approach
Why does the GSV method produce narrow confidence intervals? We can get a clue by running the GSV method when there are 10 reports of “heads” out of 10 for a fair coin flip (\(R = N = 10, P = 0.5\)). The resulting point estimate is that 100% of subjects lied. The upper and lower bounds of the 99% confidence interval are also 100%.

With probability \(\frac{1}{1024}\), there were really 10 heads. Nobody lied in the sample.^{4}

Otherwise, 1 or more people saw tails, and they all lied. The proportion of liars is 100%.
There are two problems with this approach: one statistical, and one conceptual.
First, if many heads are reported, you should learn two things. On the one hand, there are probably many liars in your sample. On the other hand, probably a lot of coins really landed heads. The probability distribution in Eq. (5) does not take account of this.
For example, suppose we are certain that everyone in the sample is a liar who always reports heads. In this case, observing \(R = N = 10\) gives us no information about the true number of heads. The posterior probability that \(T = 10\) is then indeed 1/1024, the same as the prior. Now, suppose we know that nobody in the sample is a liar. Then on observing \(R = 10\), we are sure that there were truly 10 heads: the posterior that \(T = 10\) is 1. If exactly 5 out of 10 subjects are liars, then observing \(R = 10\) means that all 5 truth-tellers really saw heads. The posterior probability that \(T = 10\) is then \(1/32\), the chance that all 5 liars saw heads, and so on.
When we are uncertain about the number of liars, our posterior that \(T = 10\) will be some weighted combination of these beliefs. Unless we are certain everyone in the sample is a liar, the probability that \(T = 10\) will be greater than 1 in 1024. Equation (5) is, therefore, not correct. In this case, it is equivalent to assuming that everybody in the sample is a liar, whose report is uninformative about the true number of heads. One then uses the prior distribution of heads to estimate the proportion of those who actually saw tails and lied.
Indeed, in the simulations with \(P = 0.5\) and across all values of \(\lambda\), the overall probability that there were 10 true heads, conditional on \(R = N = 10\), was about 1 in 161, not 1 in 1024. Fixing \(\lambda = 0.2\), it was about 1 in 4.
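The arithmetic behind these posteriors can be checked directly. The Python sketch below assumes a sample containing a fixed number of liars who always report heads, with everyone else reporting truthfully; the function names and the uniform weights over the number of liars are illustrative assumptions, so the mixed value need not match the simulated figure quoted above.

```python
from fractions import Fraction

def prob_all_true_heads(n, liars, p_bad=Fraction(1, 2)):
    """P(T = n | R = n) when exactly `liars` of the n subjects always report heads."""
    p_good = 1 - p_bad
    # R = n only requires the n - liars truth-tellers to have seen heads;
    # T = n requires all n subjects to have seen heads.
    return p_good**n / p_good**(n - liars)

def prob_all_true_heads_mixed(n, weights, p_bad=Fraction(1, 2)):
    """Same posterior when the number of liars is uncertain.

    weights[k] is the prior probability that exactly k subjects are liars.
    """
    p_good = 1 - p_bad
    joint = sum(wt * p_good**n for wt in weights)                        # P(T = n, R = n)
    marginal = sum(wt * p_good**(n - k) for k, wt in enumerate(weights)) # P(R = n)
    return joint / marginal

for k in (10, 5, 2, 0):
    print(k, "liars:", prob_all_true_heads(10, k))

uniform = [Fraction(1, 11)] * 11  # equal weight on 0, 1, ..., 10 liars
print("uncertain number of liars:", prob_all_true_heads_mixed(10, uniform))
```

Any set of weights that puts some mass on fewer than 10 liars gives a posterior above 1 in 1024, which is the point made in the text.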
This problem means that the GSV estimator of Lies is biased. In the “Appendix”, I show that the GSV estimator can have substantial bias, and performs worse than the naïve estimator from Eq. (1), \(\frac{R/N - (1 - P)}{P}\). Also, the GSV confidence intervals do not always achieve nominal coverage of Lies. When the number of heads reported is either high or low, the percentage of confidence intervals containing Lies may fall below the nominal value.
There is a second, more important problem. The GSV approach attempts to estimate Lies in Eq. (6). This is the proportion of lies actually told, among the subsample of people who saw tails. But we are not usually interested in the proportion of lies actually told. We care about the probability that a subject in the sample would lie if they saw tails, which is \(\lambda\) in Eq. (2). This \(\lambda\) can be interpreted in different ways. Maybe on seeing a tail, each person in the sample lies with probability \(\lambda\). Or maybe the sample is drawn from a population of whom \(\lambda\) are (always) liars, and \(1 - \lambda\) are truth-tellers. Lies has no interpretation in the population, because the rest of the population has no chance to tell a lie in the experiment.
Lies can be treated as an estimate of \(\lambda\). It is unbiased: it estimates \(\lambda\) from the random, and randomly sized, sample of \(N - T\) people who saw tails. But it can be a very noisy estimate. Again, suppose 10 heads out of 10 are reported, and 9 heads were really observed. Lies is 100%. But it is 100% of just one person.
This means that even the correct confidence intervals for Lies would not be correct for \(\lambda\). For example, if 3 out of 3 subjects report heads, the GSV software reports a lower bound of 100% for any confidence interval. Indeed, since anyone who had the opportunity to lie clearly did so, this is the correct lower bound (if we arbitrarily define Lies = 1 when \(T = N\)). But it makes no sense as a confidence interval for \(\lambda\): we clearly cannot rule out that one or two subjects truly saw heads, and would have reported tails if they had seen tails.
Table 4 GSV confidence interval coverage by proportion of heads reported (R/N)
R/N          Percentage of simulations  CI 90%  CI 95%  CI 99%
[0.00,0.25)  3.8                        84.3%   87.9%   91.5%
[0.25,0.50)  10.6                       76.3%   82.4%   89.2%
[0.50,0.75)  25.0                       68.2%   75.7%   84.1%
[0.75,1.00]  60.6                       60.8%   66.1%   74.1%
2 Point estimation
We can also compare the accuracy of point estimates of \(\lambda\) between GSV, Frequentist and Bayesian methods. Table 5 shows bias (the estimated value minus the true value of \(\lambda\)) for different methods by different N. The Bayesian method is always the least biased until \(N = 500\), and the GSV method is the most biased.
Table 6 shows the mean squared error for methods by different N. For low N, the best method is Bayesian and the worst is Frequentist, with GSV in between. When N gets large, all methods give about the same estimates and are equally accurate.
Table 5 Mean bias by method and N
Method       N: 10   N: 50    N: 100   N: 500
Bayesian     0.0025  0.00417  0.00438  0.00275
Frequentist  0.0354  0.01     0.0071   0.0025
GSV          0.048   0.016    0.0109   0.00419
Table 6 Mean squared errors by method and N
Method       N: 10   N: 50   N: 100   N: 500
Bayesian     0.0409  0.0136  0.00789  0.00184
Frequentist  0.0661  0.0159  0.00852  0.00184
GSV          0.0571  0.0142  0.00799  0.00184
3 Comparing different groups
Bayesian estimates are accurate, but rely on a choice of prior. A noninformative prior is a reasonable choice. Alternatively, one might use information from previous meta-analyses such as Abeler et al. (2016). If the sample size is large enough, the choice of prior should not matter much.
When comparing the dishonesty rates of different groups, an interesting approach is to use the “empirical Bayes” method (Casella 1985). This piece of statistical jiu-jitsu involves estimating a common prior from the pooled data, before updating the prior for each individual group.
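The two steps of empirical Bayes can be sketched as follows. This Python example makes several assumptions that the text does not specify: the common beta prior is fitted by the method of moments to the naïve per-group estimates (clamped to lie strictly inside \((0, 1)\)), the posterior is evaluated on a grid, and the group data are invented purely for illustration. All function names are hypothetical.

```python
from math import exp, log
from statistics import mean, variance

def naive_estimate(r, n, p_bad):
    """Naive estimate of lambda from r reports of heads out of n."""
    return (r / n - (1 - p_bad)) / p_bad

def fit_beta_moments(xs):
    """Method-of-moments fit of Beta(a, b); needs variance(xs) < mean * (1 - mean)."""
    m, v = mean(xs), variance(xs)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def shrunk_mean(r, n, p_bad, a, b, size=2000):
    """Posterior mean of lambda for one group under a Beta(a, b) prior, on a grid."""
    lams = [(i + 0.5) / size for i in range(size)]
    logs = []
    for lam in lams:
        q = (1 - p_bad) + p_bad * lam  # probability of reporting heads
        logs.append(r * log(q) + (n - r) * log(1 - q)
                    + (a - 1) * log(lam) + (b - 1) * log(1 - lam))
    top = max(logs)
    w = [exp(v - top) for v in logs]
    return sum(lam * v for lam, v in zip(lams, w)) / sum(w)

# Hypothetical data: (reports of heads, group size) for five groups, fair coin.
groups = [(33, 50), (40, 50), (28, 50), (36, 50), (30, 50)]
p_bad = 0.5
naive = [min(max(naive_estimate(r, n, p_bad), 0.01), 0.99) for r, n in groups]
a, b = fit_beta_moments(naive)  # common prior estimated from the pooled groups
for (r, n), est in zip(groups, naive):
    print(f"naive {est:.3f} -> empirical Bayes {shrunk_mean(r, n, p_bad, a, b):.3f}")
```

Groups with extreme naïve estimates are pulled towards the pooled mean, and the pull weakens as group size grows relative to the spread of the fitted prior.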
4 Applications
Benndorf et al. (2017) use the GSV method to calculate confidence intervals for the proportion of liars in a lying task with a die roll (\(P = 5/6\)). From 57 reports of the best outcome, out of 98 subjects, they calculate a lying rate of 49.68%, with a 95% CI of (45.3%, 53.95%). Using the Bayesian method with a uniform prior, the confidence interval becomes (38.0%, 61.1%), about twice as big.
Banerjee et al. (2018) use the GSV method to estimate confidence intervals for the proportion of liars in a die roll task. They estimate the proportion of liars who report a die roll above 3 (\(P = 0.5\)), for several treatments. Table 8 in the “Appendix” shows GSV confidence intervals, along with recalculated Bayesian confidence intervals (from a uniform prior), and confidence intervals for the difference between lying to the “Same” and “Other” caste. The Bayesian confidence intervals are much larger than GSV confidence intervals. Only a couple of significant results survive. (Note that significance tests in the original paper were done with standard frequentist techniques, not the GSV method.) More importantly, the N is rather low to make useful inferences about the differences between groups. For example, for the T2winnersGC group in the “aligned payoffs” treatment, differences in lying could be as much as 40% in either direction.
Hugh-Jones (2016) estimates the dishonesty rates of 15 nations using a coin flip experiment. I use empirical Bayes to check these results. For my prior over \(\lambda\), I fit a beta distribution using the 15 observations of \(2R/N - 1\). I then updated this prior separately for each country to find new confidence intervals and point estimates of the means.^{6} There is some “shrinkage” towards the pooled mean from the naïve per-country estimates found by calculating \(2R/N - 1\) separately for each country. One of the strengths of empirical Bayes, as Casella (1985) points out, is that it “anticipates regression to the mean”. Using Eq. (7), I calculated the probability of different \(\lambda\) values for each pair of countries in the data. Reassuringly, there were still significant differences between countries.
5 Software
The Bayesian methods described here are implemented in R code, available at https://github.com/hughjonesd/GSVcomment. In this section, I give some simple examples of how to use it. More details are available at the website.
To load the code, download the file “bayesianheadscts.R” from GitHub, and source it in the R command line:
Suppose 33 people report heads out of an N of 50, where the probability of the bad outcome is 0.5. To create a posterior distribution over \(\lambda\), we use the update_prior() function:
Here, we have started with a uniform prior, using R’s built-in dunif() function.
To calculate the point estimate of lambda, call the dist_mean() function on the updated posterior:
To calculate the 95% confidence interval (the highest density region), use dist_hdr():
Lastly, we can run power tests by simulating multiple experiments. GSV argue that existing sample sizes may be too small to reject “no lying” (\(\lambda = 0\)). With a uniform prior and an \(N\) of 100, the Bayesian method has 80.6% power to detect \(\lambda\) of 25% and 21.4% power to detect \(\lambda\) of 10%. So, this paper confirms that important point. To run power calculations, use power_calc(). Here, we calculate the power to detect \(\lambda = 0.1\) in a sample of 300, where the probability of the bad outcome is 0.5, with an alpha level of 0.05 and a uniform prior:
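For readers who want to see the logic of such a power calculation, here is a minimal Python sketch. It is not the power_calc() function from the R code, and it differs in two ways that should be read as assumptions: it treats an experiment as “detecting” lying when the \(1 - \alpha\) highest density region of the uniform-prior posterior leaves out the grid cell nearest \(\lambda = 0\), and it replaces simulation with exact enumeration over all possible numbers of heads reported. The function names are hypothetical.

```python
from math import comb, exp, log

def hdr_excludes_zero(r, n, p_bad, level=0.95, size=1000):
    """True if the `level` HDR of the uniform-prior posterior omits the lowest grid cell."""
    lams = [(i + 0.5) / size for i in range(size)]
    # log-likelihood: heads reported with prob (1 - p_bad) + p_bad * lam,
    # tails reported with prob p_bad * (1 - lam)
    logs = [r * log((1 - p_bad) + p_bad * lam) + (n - r) * log(p_bad * (1 - lam))
            for lam in lams]
    top = max(logs)
    w = [exp(v - top) for v in logs]
    # The lowest cell is outside the HDR when strictly denser cells already hold `level`.
    denser = sum(v for v in w if v > w[0]) / sum(w)
    return denser >= level

def exact_power(lam, n, p_bad, alpha=0.05):
    """Probability, under true lambda = lam, that the HDR excludes lambda = 0."""
    q = (1 - p_bad) + p_bad * lam  # probability that a subject reports heads
    return sum(comb(n, r) * q**r * (1 - q)**(n - r)
               for r in range(n + 1)
               if hdr_excludes_zero(r, n, p_bad, 1 - alpha))

for lam in (0.25, 0.10):
    print(f"lambda = {lam}: power {exact_power(lam, 100, 0.5):.3f}")
```

Because the enumeration is exact and the detection rule is an assumption, the output should be read as a cross-check on the order of magnitude of the simulated power figures, not as a reproduction of them.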
6 Conclusion
1. Use power tests to ensure that your N is big enough.
2. If your N is reasonably large, say at least 100, you can safely use standard frequentist confidence intervals and tests.
3. If your N is small, consider Bayesian estimates and confidence intervals. To estimate differences between subgroups, consider empirical Bayes with a prior derived from the pooled sample.
Footnotes
1. Garbarino et al. (2018) maintain this assumption and so shall I.
2. This problem holds across all simulated probabilities of the low outcome, confidence levels, and sample sizes. See the “Appendix”.
3. In particular, this method can estimate confidence bounds for \(\lambda\) lower than 0. If so, we can set them to 0.
4. But the proportion of people who lied out of those who saw tails is undefined, because no one saw tails. The GSV software seems to resolve this by fixing the proportion of lies to 100%.
5. When L/N = 1, all subjects deterministically report heads, and both Frequentist and GSV point estimates are exactly correct.
6. Results available on request.
References
Abeler, J., Nosenzo, D., & Raymond, C. (2016). Preferences for truth-telling. CESIFO working paper.
Agresti, A., & Coull, B. A. (1998). Approximate is better than ‘exact’ for interval estimation of binomial proportions. The American Statistician, 52(2), 119–26.
Banerjee, R., Gupta, N. D., & Villeval, M. C. (2018). The spillover effects of affirmative action on competitiveness and unethical behavior. European Economic Review, 101, 567–604.
Benndorf, V., Moellers, C., & Normann, H.-T. (2017). Experienced vs. inexperienced participants in the lab: Do they behave differently? Journal of the Economic Science Association, 3(1), 12–25.
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2), 83–87.
Clopper, C. J., & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4), 404–13.
Garbarino, E., Slonim, R., & Villeval, M. C. (2018). A method to estimate mean lying rates and their full distribution. Journal of the Economic Science Association. https://doi.org/10.1007/s40881-018-0055-4.
Hugh-Jones, D. (2016). Honesty, beliefs about honesty, and economic growth in 15 countries. Journal of Economic Behavior & Organization, 127, 99–114.
Hyndman, R. J. (1996). Computing and graphing highest density regions. The American Statistician, 50(2), 120–26.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.