
Compression-Based Algorithms for Deception Detection

Conference paper. In: Social Informatics (SocInfo 2017).

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10539)

Abstract

In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. Methods are demonstrated to overcome the data size-dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.


References

1. Afroz, S., Brennan, M., Greenstadt, R.: Detecting hoaxes, frauds, and deception in writing style online. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy, pp. 461–475 (2012)
2. Amitay, E., Yogev, S., Yom-Tov, E.: Serial sharers: detecting split identities of web authors. In: SIGIR 2007 Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, Amsterdam (2007)
3. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks (2009). http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
4. Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan, E., Quirk, R.: Longman Grammar of Spoken and Written English, vol. 2. MIT Press, Cambridge (1999)
5. Bond, C.F., DePaulo, B.M.: Accuracy of deception judgments. Pers. Soc. Psychol. Rev. 10, 214–234 (2006)
6. Brennan, M., Afroz, S., Greenstadt, R.: Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 12:1–12:22 (2012)
7. Brennan, M., Greenstadt, R.: Practical attacks against authorship recognition techniques. In: Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA (2009)
8. Burgoon, J.K., Blair, J.P., Qin, T., Nunamaker, J.F.: Detecting deception through linguistic analysis. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds.) ISI 2003. LNCS, vol. 2665, pp. 91–101. Springer, Heidelberg (2003). doi:10.1007/3-540-44853-5_7
9. Cilibrasi, R., Vitányi, P.M.B., de Wolf, R.: Algorithmic clustering of music based on string compression. Comput. Music J. 28, 49–67 (2004)
10. Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32, 396–402 (1984)
11. Feng, S., Banerjee, R., Choi, Y.: Syntactic stylometry for deception detection. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 171–175 (2012)
12. Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proceedings of the Data Compression Conference (DCC 2000) (2000)
13. Hancock, J.T., Curry, L.E., Goorha, S., Woodworth, M.: On lying and being lied to: a linguistic analysis of deception in computer-mediated communication. Discourse Process. 57, 1–23 (2006)
14. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50, 3250–3264 (2004)
15. Marton, Y., Wu, N., Hellerstein, L.: On compression-based text classification. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 300–314. Springer, Heidelberg (2005). doi:10.1007/978-3-540-31865-1_22
16. Mayzlin, D., Dover, Y., Chevalier, J.: Promotional reviews: an empirical investigation of online review manipulation. Am. Econ. Rev. 104(8), 2421–2455 (2012). doi:10.1257/aer.104.8.2421
17. Moffat, A.: Implementing the PPM data compression scheme. IEEE Trans. Commun. 38, 1917–1921 (1990)
18. Newman, M.L., Pennebaker, J.W., Berry, D.S., Richards, J.M.: Lying words: predicting deception from linguistic styles. Pers. Soc. Psychol. Bull. 29, 665–675 (2003)
19. Nishida, K., Banno, R., Fujimura, K., Hoshide, T.: Tweet classification by data compression. In: Proceedings of the 2011 International Workshop on Detecting and Exploiting Cultural Diversity on the Social Web, pp. 29–34 (2011)
20. Ott, M., Cardie, C., Hancock, J.T.: Estimating the prevalence of deception in online review communities. In: Proceedings of the 21st International Conference on the World Wide Web, pp. 201–210 (2012)
21. Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 497–501 (2013)
22. Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 309–319 (2011)
23. Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., Booth, R.J.: The Development and Psychometric Properties of LIWC2007. LIWC.net, Austin, TX (2007)
24. Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum, Mahwah (2001)
25. Rayson, P., Wilson, A., Leech, G.: Grammatical word class variation within the British National Corpus sampler. Lang. Comput. 36, 295–306 (2001)
26. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework of authorship identification for online messages: writing style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57, 378–393 (2006)
27. Zhou, L., Shi, Y., Zhang, D.: A statistical language modeling approach to online deception detection. IEEE Trans. Knowl. Data Eng. 20, 1077–1081 (2008)
28. Zhou, L., Twitchell, D.P., Qin, T., Burgoon, J.K., Nunamaker, J.F.: An exploratory study into deception detection in text-based computer-mediated communication. In: Proceedings of the 36th Hawaii International Conference on System Sciences (2003)


Acknowledgements

The authors acknowledge funding support from the U.S. Department of Energy. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND NO. SAND2017-7685 C.

Author information

Correspondence to Christina L. Ting.

A1 Appendix

1.1 A1.1 Normalized Compression Distance

Our first method of distinguishing between truthful and deceptive documents is to use a type of similarity measure called the normalized compression distance (NCD) [14]. NCD attempts to determine how similar two strings are to each other while also taking their size into account. Rather than providing an absolute distance, NCD is normalized in the sense that a pair of small strings should not be considered closer to one another than a pair of large strings simply because they are smaller. NCD is defined by:

$$\begin{aligned} \text {NCD}(x, y) = \frac{C(xy) - \min {\{C(x), C(y)\}}}{\max \{C(x), C(y)\}}, \end{aligned}$$

where C(x) is the compressed size of a string x under a chosen compression algorithm and xy denotes the concatenation of the two strings x and y. We use LZMA as the compression algorithm.

To use the NCD in a clustering scheme, we compute the NCD between every pair of documents. We then apply a threshold to these distances to identify clusters.
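As a concrete illustration, the following is a minimal Python sketch of this clustering step using the standard-library lzma module for compression; the helper names, toy documents, and threshold value are ours and purely illustrative, not taken from the paper.

```python
import lzma
from itertools import combinations

def c(data: bytes) -> int:
    """Compressed size of a byte string under LZMA."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Pairwise NCD over a toy document collection, then a threshold to form
# edges of a similarity graph (clusters = connected components).
docs = [b"the room was clean and the staff were friendly",
        b"the staff were friendly and the room was clean",
        b"this product changed my life, five stars, amazing"]

edges = [(i, j) for (i, x), (j, y) in combinations(enumerate(docs), 2)
         if ncd(x, y) < 0.95]  # threshold is illustrative only
print(edges)
```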

1.2 A1.2 Prediction by Partial Matching and Arithmetic Coding

PPM combined with AC is a powerful compression method and served for many years as the benchmark for compression algorithms. The two algorithms complement each other: PPM provides a model for predicting the next character in a document, while AC takes a prediction model and uses the provided probabilities for each character to produce an encoded binary output. When the probability for a particular character or symbol is provided to AC, we say that it is emitted. This terminology is helpful because PPM does not always provide the probability for the next character; instead, it may emit a sequence of other symbols, called escapes, to encode the fact that the PPM model is changing its internal state. These escapes, in turn, allow a decoder to make the same changes.

The complementary nature of PPM and AC allows us to use them as part of a supervised learning scheme. We can use PPM to create a model for truthful documents, called the truthful model, and a model for deceptive documents, called the deceptive model. We can then classify a document by running each predictive model together with AC and assigning the document to the class whose model produces the fewest bits of compressed output.

For this scheme, a standard implementation of AC can be used, and its details are not necessary for understanding our method or results. Only two facts are needed. The first is that only non-zero probabilities are allowed; the second is that the higher the probability assigned to the next character, the fewer bits are needed for the compressed output. Put another way, the better the predictor is at predicting the next character, the fewer bits are required.
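Because an ideal arithmetic coder spends roughly \(-\log_2 p\) bits on a symbol emitted with probability p, the classification rule amounts to comparing summed code lengths under the two models. A minimal sketch of that comparison, with made-up probability values for illustration:

```python
import math

def code_length(probs):
    """Ideal arithmetic-coding cost, in bits, of a sequence of emitted
    probabilities (characters and escapes): about -log2(p) per symbol."""
    return sum(-math.log2(p) for p in probs)

# Probabilities emitted for one document under each class model
# (purely illustrative numbers, not from the paper).
truthful_probs = [0.50, 0.25, 0.40, 0.10]
deceptive_probs = [0.20, 0.05, 0.30, 0.10]

label = ("truthful" if code_length(truthful_probs) < code_length(deceptive_probs)
         else "deceptive")
print(label)  # the class whose model compresses the document best
```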

Unlike AC, our way of using PPM is not the standard method. Moreover, in Sect. 4 we use two variants of PPM, known as PPMA and PPMC, and we make additional modifications to the PPM algorithm. To understand these changes, it is necessary to give a detailed overview of PPM, which we do next.

In general, PPM uses a set of tables of predictions for the next character, conditioned on the preceding characters. The preceding characters used are called the context, and the number of characters in the context, d, is called the order. Each table contains all of the predictions for a given context of a given order and is called an order-d table. Note that because multiple contexts share the same order, there are several order-d tables for each order \(d\ge 1\). For example, if the context is the characters abab, then the prediction that the next character is an a given abab is contained in an order-4 table with context abab. Thus, an order-d table for context \(c_{0}c_{1}\ldots c_{d-1}\) is really the collection of conditional probabilities \(P(c|c_{0}c_{1}\ldots c_{d-1})\) giving the probability of seeing a token c given that \(c_{0}c_{1}\ldots c_{d-1}\) are the preceding characters. The special case of order 0 is P(c), the unconditional probability of c occurring.

For the versions of PPM that we consider (PPMA and PPMC), the orders that are allowed are bounded by a user-specified maximum order \(d_\text {max}\). When predicting the next character c, PPM first considers the previous \(d=d_\text {max}\) characters \(c_{0}c_{1}\ldots c_{d-1}\) as the context, where \(c_{d-1}\) is the character directly preceding c. If \(P(c|c_{0}c_{1}\ldots c_{d-1})\) is non-zero, then this probability is provided to AC. However, it may be the case that \(P(c|c_{0}c_{1}\ldots c_{d-1}) = 0\), which is not an allowable probability for AC. When this occurs, we say that the table does not contain c, and PPM issues a probability for a special symbol called an escape to indicate that the character was not found. PPM then changes state and attempts to provide the probability for the next-smaller context, \(P(c|c_{1}\ldots c_{d-1})\). If \(P(c|c_{1}\ldots c_{d-1}) = 0\), then \(P(c|c_{2}\ldots c_{d-1})\) is considered, and so on. If the character is not found with any context, then P(c) is considered. Finally, if PPM has no prediction for P(c), then a default uniform distribution is assumed for all characters. The table for this distribution is considered to have order \(-1\). In addition to the characters, every table of order \(d > -1\) has two additional symbols: an end-of-file symbol eof and an escape symbol esc. These symbols are present even when the probabilities for all characters are zero.
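The escape-and-fall-back lookup described above can be sketched as follows. This is our own schematic: the function name and table layout are assumptions, the escape count of one follows PPMA, and the exclusion refinements used by production PPM coders are ignored. It returns the sequence of probabilities that would be emitted for a single character.

```python
def emitted_probs(tables, context, char, alphabet_size, esc_count=1):
    """Probabilities PPM would hand to the arithmetic coder for one character:
    an escape for each context in which `char` is unseen, then the character
    itself (or the order -1 uniform fallback).  `tables` maps a context string
    to a dict of character counts; eof always counts 1."""
    probs = []
    for d in range(len(context), -1, -1):              # longest context first
        counts = tables.get(context[len(context) - d:])
        if not counts:
            continue                                   # context never seen; no table to escape from
        total = sum(counts.values()) + 1 + esc_count   # character counts + eof + esc
        if counts.get(char, 0) > 0:
            probs.append(counts[char] / total)         # character found: emit its probability
            return probs
        probs.append(esc_count / total)                # character unseen here: emit an escape
    probs.append(1.0 / alphabet_size)                  # order -1 fallback: uniform distribution
    return probs

# Tables from rowrowrow with d_max = 2 (see the counting sketch below); the
# context passed in should already be truncated to the last d_max characters.
tables = {"": {"r": 3, "o": 3, "w": 3}, "r": {"o": 3}, "o": {"w": 3},
          "w": {"r": 2}, "ro": {"w": 3}, "ow": {"r": 2}, "wr": {"o": 2}}
print(emitted_probs(tables, "ow", "r", alphabet_size=256))  # [0.5] -- found at order 2
print(emitted_probs(tables, "ow", "w", alphabet_size=256))  # two escapes, then order-0 w
```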

Table 1. Resulting character counts for the PPM tables after receiving the character string rowrowrow.

PPM uses an adaptive strategy for generating the tables. Instead of keeping track of the probabilities for a given context directly, a character count is maintained for each character seen with the given context, up to context order \(d_\text {max}\). For example, consider the string rowrowrow and a maximum context order of \(d_\text {max}=2\). After receiving the first character r, the count for r is incremented by one in the order-0 table. When o is received, the count for o is incremented by one in the order-1 table with context r and in the order-0 table. When w is received, the count for w is incremented by one in the order-2 table with context ro, in the order-1 table with context o, and in the order-0 table. When the next r is received, the count for r is incremented by one in the order-2 table with context ow, the order-1 table with context w, and the order-0 table. This process continues until all characters are received. The resulting character counts for this example are shown in Table 1. Note that this is exactly the process used for PPMA and the process we use for PPMC, though a pure implementation of PPMC technically updates the tables differently. In our results, the escapes have a particularly strong influence on the number of output bits, so we adopt only the change PPMC makes for the escapes and not the change it makes for the tables. More details on the pure implementation of PPMC are given in [17].
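A minimal sketch of this adaptive counting, reproducing the rowrowrow walk-through with \(d_\text {max}=2\); the function name and data-structure choices are ours.

```python
from collections import defaultdict

def build_tables(text, d_max=2):
    """Character-count tables keyed by context string (orders 0..d_max).
    Each incoming character increments its count under every suffix of the
    preceding d_max characters, exactly as in the rowrowrow example."""
    tables = defaultdict(lambda: defaultdict(int))
    for i, char in enumerate(text):
        for d in range(0, min(i, d_max) + 1):
            context = text[i - d:i]        # last d characters ("" for order 0)
            tables[context][char] += 1
    return tables

tables = build_tables("rowrowrow", d_max=2)
print(dict(tables[""]))     # order-0 counts: {'r': 3, 'o': 3, 'w': 3}
print(dict(tables["ow"]))   # order-2 table with context ow: {'r': 2}
```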

Table 2. Resulting symbol counts for PPMA and PPMC tables after receiving the character string rowrowrow.
Table 3. Resulting probabilities for the characters and symbols provided by PPMA and PPMC.
Table 4. Machine learning on the hotel corpus.
Fig. 9. Normalizing the extended BG dataset: expected values for the counts \(\bar{n}\) and probabilities \(\bar{P}\) as a function of context order for max order \(d_\text {max} = 15\). Panels (a) and (c) are for characters; (b) and (d) are for escapes. Figure label indicates test and model category, respectively.

As mentioned above, it is not only the characters that are contained in the table, but also two symbols: eof and esc. The end-of-file symbol, eof, is always allocated a count of one. The count associated with the esc is different depending on whether we are using PPMA or PPMC. In fact, the difference in escape counts is the only difference between PPMA and PPMC that we use in our implementation. For PPMA, the esc symbol is allocated a single count while for PPMC, the escape is given a count equal to the number of non-zero character entries in the table, or 1 if no non-zero character entries are present. Thus, for the order-0 table in Table 1 the corresponding count for the esc according to PPMC is 3 while for any of the order-1 tables, the count is 1. The symbol counts for both methods are shown in Table 2.

To get the probabilities that are sent to AC, PPM just normalizes the entries in the tables. For example, for PPMA \(P(r|ow) = \frac{2}{4}\) since r has a character count of 2, eof has a count of 1, and esc has a count of 1. So, the total count is 4, which is the normalization factor. The rest of the probabilities for PPMA and PPMC are shown in Table 3. Note that each row in this table is really a full PPM table for a given context of a given order-d, as defined in the beginning of this section.
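The count-to-probability step, including the PPMA and PPMC escape rules described above, can be sketched as follows; the function name and dictionary layout are ours, and the example reproduces \(P(r|ow) = \frac{2}{4}\) for PPMA.

```python
def table_probs(char_counts, method="PPMA"):
    """Turn a context's character counts into the probabilities PPM hands to
    the arithmetic coder.  eof always gets a count of 1; the escape gets 1
    under PPMA, or the number of distinct characters seen (min 1) under PPMC."""
    esc = 1 if method == "PPMA" else max(1, sum(1 for n in char_counts.values() if n > 0))
    total = sum(char_counts.values()) + 1 + esc       # character counts + eof + esc
    probs = {c: n / total for c, n in char_counts.items()}
    probs["eof"] = 1 / total
    probs["esc"] = esc / total
    return probs

# Context ow after rowrowrow: r has been seen twice.
print(table_probs({"r": 2}, "PPMA"))  # {'r': 0.5, 'eof': 0.25, 'esc': 0.25}
print(table_probs({"r": 2}, "PPMC"))  # same here: only one distinct character

# Order-0 table (r, o, w each seen three times): PPMC gives the escape a
# count of 3, so the normalization factor is 9 + 1 + 3 = 13.
print(table_probs({"r": 3, "o": 3, "w": 3}, "PPMC"))
```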


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ting, C.L., Fisher, A.N., Bauer, T.L. (2017). Compression-Based Algorithms for Deception Detection. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science, vol. 10539. Springer, Cham. https://doi.org/10.1007/978-3-319-67217-5_16


  • DOI: https://doi.org/10.1007/978-3-319-67217-5_16


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67216-8

  • Online ISBN: 978-3-319-67217-5

