Abstract
In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. Methods are demonstrated to overcome the data size-dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.
References
Afroz, S., Brennan, M., Greenstadt, R.: Detecting hoaxes, frauds, and deception in writing style online. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy, pp. 461–475 (2012)
Amitay, E., Yogev, S., Yom-Tov, E.: Serial sharers: detecting split identities of web authors. In: ACM SIGIR 2007 Amsterdam. Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (2007)
Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks (2009). http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan, E., Quirk, R.: Longman Grammar of Spoken and Written English, vol. 2. MIT Press, Cambridge (1999)
Bond, C.F., DePaulo, B.M.: Accuracy of deception judgments. Pers. Soc. Psychol. Rev. 10, 214–234 (2006)
Brennan, M., Afroz, S., Greenstadt, R.: Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 12:1–12:22 (2012)
Brennan, M., Greenstadt, R.: Practical attacks against authorship recognition techniques. In: Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA (2009)
Burgoon, J.K., Blair, J.P., Qin, T., Nunamaker, J.F.: Detecting deception through linguistic analysis. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds.) ISI 2003. LNCS, vol. 2665, pp. 91–101. Springer, Heidelberg (2003). doi:10.1007/3-540-44853-5_7
Cilibrasi, R., Vitányi, P.M.B., de Wolf, R.: Algorithmic clustering of music based on string compression. Comput. Music J. 28, 49–67 (2004)
Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32, 396–402 (1984)
Feng, S., Banerjee, R., Choi, Y.: Syntactic stylometry for deception detection. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 171–175 (2012)
Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proceedings of Data Compression Conference, DCC 2000 (2000)
Hancock, J.T., Curry, L.E., Goorha, S., Woodworth, M.: On lying and being lied to: a linguistic analysis of deception in computer-mediated communication. Discourse Process. 57, 1–23 (2006)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50, 3250–3264 (2004)
Marton, Y., Wu, N., Hellerstein, L.: On compression-based text classification. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 300–314. Springer, Heidelberg (2005). doi:10.1007/978-3-540-31865-1_22
Mayzlin, D., Dover, Y., Chevalier, J.: Promotional reviews: an empirical investigation of online review manipulation. Am. Econ. Rev. 104(8), 2421–2455 (2012). doi:10.1257/aer.104.8.2421
Moffat, A.: Implementing the PPM data compression scheme. IEEE Trans. Commun. 38, 1917–1921 (1990)
Newman, M.L., Pennebaker, J.W., Berry, D.S., Richards, J.M.: Lying words: predicting deception from linguistic styles. Pers. Soc. Psychol. Bull. 29, 665–675 (2003)
Nishida, K., Banno, R., Fujimura, K., Hoshide, T.: Tweet classification by data compression. In: Proceedings of the 2011 International Workshop on Detecting and Exploiting Cultural Diversity on the Social Web, pp. 29–34 (2011)
Ott, M., Cardie, C., Hancock, J.T.: Estimating the prevalence of deception in online review communities. In: Proceedings of the 21st International Conference on the World Wide Web, pp. 201–210 (2012)
Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 497–501 (2013)
Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 309–319 (2011)
Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., Booth, R.J.: The development and psychometric properties of LIWC 2007. LIWC.net, Austin, TX (2007)
Pennebaker, J.W., Frances, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum, Mahwah (2001)
Rayson, P., Wilson, A., Leech, G.: Grammatical word class variation within the British National Corpus sampler. Lang. Comput. 36, 295–306 (2001)
Zheng, R., Li, J., Chen, H., Huang, Z.: A framework of authorship identification for online messages: writing style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57, 378–393 (2006)
Zhou, L., Shi, Y., Zhang, D.: A statistical language modeling approach to online deception detection. IEEE Trans. Knowl. Data Eng. 20, 1077–1081 (2008)
Zhou, L., Twitchell, D.P., Qin, T., Burgoon, J.K., Nunamaker, J.F.: An exploratory study into deception detection in text-based computer mediated communication. In: Proceedings of the 36th Hawaii International Conference on System Sciences (2003)
Acknowledgements
The authors acknowledge funding support from the U.S. Department of Energy. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND NO. SAND2017-7685 C.
A1 Appendix
1.1 A1.1 Normalized Compression Distance
Our first method of distinguishing between truthful and deceptive documents is to use a type of similarity measure called the normalized compression distance (NCD) [14]. NCD attempts to determine how similar two strings are to each other while also taking their size into account. Rather than providing an absolute distance, NCD is normalized in the sense that a pair of small strings should not be considered closer to one another than a pair of large strings simply because they are smaller. NCD is defined by
\[ \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}, \]
where C(x) denotes the compressed size of a string x under a chosen compression algorithm and xy denotes the concatenation of two strings x and y. We use LZMA as the compression algorithm.
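As a concrete sketch, the definition above can be implemented in a few lines using Python's standard-library `lzma` module; the paper specifies LZMA but not a particular implementation, so the function names here are our own:

```python
import lzma

def compressed_size(data: bytes) -> int:
    """C(x): size of the LZMA-compressed string, in bytes."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)
```

Because real compressors add fixed container overhead, NCD values for very short strings are noisy; the measure behaves best on document-length inputs.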
To use the NCD in a clustering scheme, we compute the pairwise NCD between every pair of documents. We then use a threshold on the size of the distance in order to identify clusters.
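The text does not name a specific clustering algorithm, so the sketch below makes one simple assumption: documents whose pairwise NCD falls below the threshold are linked, and clusters are taken to be the connected components of the resulting graph (via a small union-find):

```python
import lzma
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    C = lambda s: len(lzma.compress(s))
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def cluster_by_threshold(docs, threshold):
    """Link documents whose pairwise NCD is below `threshold`;
    return the connected components as lists of document indices."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if ncd(docs[i], docs[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```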
1.2 A1.2 Prediction by Partial Matching and Arithmetic Coding
PPM combined with AC is a powerful compression method and served for a number of years as the benchmark against which compression algorithms were measured. The two algorithms complement each other: PPM provides a model for predicting the next character in a document, while AC takes that prediction model and uses the probability provided for each character to produce an encoded binary output. When the probability for a particular character or symbol is provided to AC, we say that it is emitted. This terminology is helpful because PPM does not always provide the probability for the next character; instead, it may emit a sequence of other symbols, called escapes, to encode the fact that the PPM model is changing its internal state. These escapes, in turn, allow a decoder to make the same changes.
The complementary nature of PPM and AC allows us to use them as part of a supervised learning scheme. We can use PPM to create models for truthful documents, called the truthful model, and for deceptive documents, called the deceptive model. We can then classify a document by using each predictive model together with AC and assigning the document to the class whose model produces the fewest number of bits for the compressed output.
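To illustrate the decision rule (not the paper's PPM models), the sketch below substitutes a simple order-0 character model with add-one smoothing for PPM; the rule itself, assigning the document to the class whose model yields the shorter ideal code length, is the same one described above:

```python
import math
from collections import Counter

def train_order0(text: bytes):
    """Stand-in for a PPM model: an order-0 byte model with add-one
    smoothing (a real PPM model conditions on longer contexts and
    uses escapes)."""
    counts = Counter(text)
    total = len(text) + 256  # add one count for every byte value
    return {b: (counts.get(b, 0) + 1) / total for b in range(256)}

def code_length_bits(model, text: bytes) -> float:
    """Ideal AC output length: -sum of log2 P(char)."""
    return -sum(math.log2(model[b]) for b in text)

def classify(doc: bytes, truthful_model, deceptive_model) -> str:
    """Assign the label whose model compresses the document best."""
    t = code_length_bits(truthful_model, doc)
    d = code_length_bits(deceptive_model, doc)
    return "truthful" if t < d else "deceptive"
```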
This scheme works with a standard implementation of AC, whose details are not necessary for understanding our method or results. Only two facts are needed. First, only non-zero probabilities are allowed. Second, the higher the probability assigned to the next character, the fewer bits are eventually needed for the compressed output. Put another way, the better the predictor is at predicting the next character, the fewer bits are required.
Unlike AC, our way of using PPM is not the standard method. Moreover, in Sect. 4 we use two variants of PPM, known as PPMA and PPMC, and we make additional modifications to the PPM algorithm. To understand these changes, it is necessary to give a detailed overview of PPM, which we do next.
In general, PPM uses a set of tables of predictors of the next character that are conditioned on previous characters. The previous characters used are called the context, and the number of characters in the context, d, is called the order. Each table holds all of the predictions for a given context of a given order, and is called an order-d table. Note that since multiple contexts share the same order, there are several order-d tables for each order \(d\ge 1\). For example, if one has the characters abab as the context, then the prediction that the next character is an a given abab is contained in an order-4 table with context abab. Thus, an order-d table for context \(c_{0}c_{1}\ldots c_{d-1}\) is really the collection of conditional probabilities \(P(c|c_{0}c_{1}\ldots c_{d-1})\) providing the probability of seeing a token c given that \(c_{0}c_{1}\ldots c_{d-1}\) are the preceding characters. The special case when the order is 0 is P(c), which is just the probability of c occurring.
For the versions of PPM that we consider (PPMA and PPMC), the orders that are allowed are bounded by a user-specified maximum order \(d_\text{max}\). When predicting the next character c, PPM first considers the previous \(d=d_\text{max}\) characters \(c_{0}c_{1}\ldots c_{d-1}\) as the context, where \(c_{d-1}\) is the character directly preceding c. If \(P(c|c_{0}c_{1}\ldots c_{d-1})\) is non-zero, then this probability is provided to AC. However, it may be the case that \(P(c|c_{0}c_{1}\ldots c_{d-1}) = 0\), which is not an allowable probability for AC. When this occurs, we say that the table does not contain c, and PPM issues a probability for a special symbol called an escape to indicate that the character was not found. PPM then changes state and attempts to provide the probability for the next smaller context, \(P(c|c_{1}\ldots c_{d-1})\). If \(P(c|c_{1}\ldots c_{d-1}) = 0\), then \(P(c|c_{2}\ldots c_{d-1})\) is considered, and so on. If the character is not found with any context, then P(c) is considered. Finally, if PPM has no prediction for P(c), then a default uniform distribution is assumed for all characters. The table for this distribution is considered to have order \(-1\). In addition to the characters, every table of order \(d > -1\) has two additional symbols: an end-of-file symbol eof and an escape symbol esc. These symbols are present even when the probabilities for all characters are zero.
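The escape cascade can be sketched as a generator that yields the (symbol, probability) pairs handed to AC, walking from the longest context down to the order \(-1\) uniform default. The table layout and function name are hypothetical, PPMA escape counts are assumed, and the exclusion optimization used by full PPM implementations is ignored:

```python
def emit_probabilities(tables, context, c, d_max, alphabet_size=256):
    """Yield the (symbol, probability) pairs handed to AC when
    encoding character `c` after `context` (len(context) >= d_max).
    `tables` maps a context string to its character counts; under
    PPMA, eof and esc each contribute a count of 1 to every table."""
    for d in range(d_max, -1, -1):
        ctx = context[len(context) - d:]
        counts = tables.get(ctx, {})
        total = sum(counts.values()) + 2  # +1 for eof, +1 for esc
        if counts.get(c, 0) > 0:
            yield (c, counts[c] / total)  # character found: emit it
            return
        yield ("esc", 1 / total)  # not found: emit escape, shorten context
    # order -1: fall back to the uniform default distribution
    yield (c, 1 / alphabet_size)
```

Running this on the rowrowrow tables reproduces the behavior described in the text: a character seen in the longest context is emitted immediately, while an unseen character produces a run of escapes before the uniform fallback.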
PPM uses an adaptive strategy for generating the tables. Instead of keeping track of the probabilities for a given context directly, a character count is maintained for each character seen with the given context, up to context order \(d_\text{max}\). For example, consider the string rowrowrow and a maximum context order of \(d_\text{max}=2\). After receiving the first character r, the count for r is incremented by one in the order-0 table. When o is received, the count for o is incremented by one in the order-1 table with context r, and in the order-0 table. When w is received, the count of w is incremented by one in the order-2 table with context ro, in the order-1 table with context o, and in the order-0 table. When the next r is received, the count of r is incremented by one in the order-2 table with context ow, the order-1 table with context w, and the order-0 table. This process continues until all characters are received. The resulting character counts for this example are shown in Table 1. Note that this is the exact process used for PPMA and the process we use for PPMC, though technically a pure implementation of PPMC updates the tables differently. In our results, it is the escapes that seem particularly influential on the number of output bits, so we chose to adopt only the change PPMC makes to the escapes, not the change it makes to the tables. More details on the pure implementation of PPMC are in [17].
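The adaptive counting described above can be sketched directly; running it on rowrowrow with \(d_\text{max}=2\) reproduces the counts described in the text (function name and table layout are our own):

```python
from collections import Counter, defaultdict

def build_tables(text: str, d_max: int):
    """Count tables keyed by context string, orders 0..d_max.
    tables[ctx][c] is the number of times character c followed ctx."""
    tables = defaultdict(Counter)
    for i, c in enumerate(text):
        for d in range(d_max + 1):
            if i - d < 0:
                break  # not enough preceding characters for this order
            tables[text[i - d:i]][c] += 1
    return tables
```

For rowrowrow this gives, e.g., a count of 3 for r in the order-0 table (context "") and a count of 2 for r in the order-2 table with context ow, matching the worked example.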
As mentioned above, it is not only the characters that are contained in the table, but also two symbols: eof and esc. The end-of-file symbol, eof, is always allocated a count of one. The count associated with the esc is different depending on whether we are using PPMA or PPMC. In fact, the difference in escape counts is the only difference between PPMA and PPMC that we use in our implementation. For PPMA, the esc symbol is allocated a single count while for PPMC, the escape is given a count equal to the number of non-zero character entries in the table, or 1 if no non-zero character entries are present. Thus, for the order-0 table in Table 1 the corresponding count for the esc according to PPMC is 3 while for any of the order-1 tables, the count is 1. The symbol counts for both methods are shown in Table 2.
To get the probabilities that are sent to AC, PPM just normalizes the entries in the tables. For example, for PPMA \(P(r|ow) = \frac{2}{4}\) since r has a character count of 2, eof has a count of 1, and esc has a count of 1. So, the total count is 4, which is the normalization factor. The rest of the probabilities for PPMA and PPMC are shown in Table 3. Note that each row in this table is really a full PPM table for a given context of a given order-d, as defined in the beginning of this section.
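Combining the escape-count rules with this normalization, a small helper (hypothetical, mirroring the construction of Tables 2 and 3) turns one context's count table into the probabilities handed to AC:

```python
def table_probs(counts, method="PPMA"):
    """Normalize one context's character counts into probabilities.
    eof always has count 1; esc has count 1 under PPMA, or the number
    of distinct characters seen (at least 1) under PPMC."""
    symbols = dict(counts)
    symbols["eof"] = 1
    symbols["esc"] = 1 if method == "PPMA" else max(1, len(counts))
    total = sum(symbols.values())
    return {s: n / total for s, n in symbols.items()}

# Context ow from the rowrowrow example: P(r|ow) = 2/4 under PPMA,
# since r has count 2 and eof and esc contribute one count each.
```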
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ting, C.L., Fisher, A.N., Bauer, T.L. (2017). Compression-Based Algorithms for Deception Detection. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science(), vol 10539. Springer, Cham. https://doi.org/10.1007/978-3-319-67217-5_16
Print ISBN: 978-3-319-67216-8
Online ISBN: 978-3-319-67217-5