Abstract
In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. Methods are demonstrated to overcome the data size-dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.
References
Afroz, S., Brennan, M., Greenstadt, R.: Detecting hoaxes, frauds, and deception in writing style online. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy, pp. 461–475 (2012)
Amitay, E., Yogev, S., Yom-Tov, E.: Serial sharers: detecting split identities of web authors. In: ACM SIGIR 2007 Amsterdam. Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (2007)
Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks (2009). http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
Biber, D., Johansson, S., Leech, G., Conrad, S., Finegan, E., Quirk, R.: Longman Grammar of Spoken and Written English, vol. 2. MIT Press, Cambridge (1999)
Bond, C.F., DePaulo, B.M.: Accuracy of deception judgments. Pers. Soc. Psychol. Rev. 10, 214–234 (2006)
Brennan, M., Afroz, S., Greenstadt, R.: Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Trans. Inf. Syst. Secur. 15, 12:1–12:22 (2012)
Brennan, M., Greenstadt, R.: Practical attacks against authorship recognition techniques. In: Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA (2009)
Burgoon, J.K., Blair, J.P., Qin, T., Nunamaker, J.F.: Detecting deception through linguistic analysis. In: Chen, H., Miranda, R., Zeng, D.D., Demchak, C., Schroeder, J., Madhusudan, T. (eds.) ISI 2003. LNCS, vol. 2665, pp. 91–101. Springer, Heidelberg (2003). doi:10.1007/3-540-44853-5_7
Cilibrasi, R., Vitányi, P.M.B., de Wolf, R.: Algorithmic clustering of music based on string compression. Comput. Music J. 28, 49–67 (2004)
Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32, 396–402 (1984)
Feng, S., Banerjee, R., Choi, Y.: Syntactic stylometry for deception detection. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 171–175 (2012)
Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Proceedings of Data Compression Conference, DCC 2000 (2000)
Hancock, J.T., Curry, L.E., Goorha, S., Woodworth, M.: On lying and being lied to: a linguistic analysis of deception in computer-mediated communication. Discourse Process. 57, 1–23 (2006)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50, 3250–3264 (2004)
Marton, Y., Wu, N., Hellerstein, L.: On compression-based text classification. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 300–314. Springer, Heidelberg (2005). doi:10.1007/978-3-540-31865-1_22
Mayzlin, D., Dover, Y., Chevalier, J.: Promotional reviews: an empirical investigation of online review manipulation. Am. Econ. Rev. 104(8), 2421–2455 (2012). doi:10.1257/aer.104.8.2421
Moffat, A.: Implementing the PPM data compression scheme. IEEE Trans. Commun. 38, 1917–1921 (1990)
Newman, M.L., Pennebaker, J.W., Berry, D.S., Richards, J.M.: Lying words: predicting deception from linguistic styles. Pers. Soc. Psychol. Bull. 29, 665–675 (2003)
Nishida, K., Banno, R., Fujimura, K., Hoshide, T.: Tweet classification by data compression. In: Proceedings of the 2011 International Workshop on Detecting and Exploiting Cultural Diversity on the Social Web, pp. 29–34 (2011)
Ott, M., Cardie, C., Hancock, J.T.: Estimating the prevalence of deception in online review communities. In: Proceedings of the 21st International Conference on the World Wide Web, pp. 201–210 (2012)
Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 497–501 (2013)
Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 309–319 (2011)
Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., Booth, R.J.: The development and psychometric properties of LIWC 2007. LIWC.net, Austin, TX (2007)
Pennebaker, J.W., Frances, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum, Mahwah (2001)
Rayson, P., Wilson, A., Leech, G.: Grammatical word class variation within the British National Corpus sampler. Lang. Comput. 36, 295–306 (2001)
Zheng, R., Li, J., Chen, H., Huang, Z.: A framework of authorship identification for online messages: writing style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57, 378–393 (2006)
Zhou, L., Shi, Y., Zhang, D.: A statistical language modeling approach to online deception detection. IEEE Trans. Knowl. Data Eng. 20, 1077–1081 (2008)
Zhou, L., Twitchell, D.P., Qin, T., Burgoon, J.K., Nunamaker, J.F.: An exploratory study into deception detection in text-based computer mediated communication. In: Proceedings of the 36th Hawaii International Conference on System Sciences (2003)
Acknowledgements
The authors acknowledge funding support from the U.S. Department of Energy. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND NO. SAND2017-7685 C.
A1 Appendix
1.1 A1.1 Normalized Compression Distance
Our first method of distinguishing between truthful and deceptive documents is to use a type of similarity measure called the normalized compression distance (NCD) [14]. NCD attempts to determine how similar two strings are to each other while also taking their size into account. Rather than providing an absolute distance, NCD is normalized in the sense that a pair of small strings should not be considered closer to one another than a pair of large strings simply because they are smaller. NCD is defined by
\[ \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}, \]
where C(x) denotes the compressed size of a string x under a chosen compression algorithm and xy denotes the concatenation of two strings x and y. We use LZMA as the compression algorithm.
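As a concrete sketch, the definition above can be implemented in a few lines using Python's standard-library `lzma` module; the paper specifies LZMA but not a particular implementation, so the function names here are our own:

```python
import lzma

def compressed_size(data: bytes) -> int:
    """C(x): size of the LZMA-compressed string, in bytes."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)
```

Because real compressors add fixed container overhead, NCD values for very short strings are noisy; the measure behaves best on document-length inputs.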
To use the NCD in a clustering scheme, we compute the pairwise NCD between every pair of documents. We then use a threshold on the size of the distance in order to identify clusters.
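The text does not name a specific clustering algorithm, so the sketch below makes one simple assumption: documents whose pairwise NCD falls below the threshold are linked, and clusters are taken to be the connected components of the resulting graph (via a small union-find):

```python
import lzma
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    C = lambda s: len(lzma.compress(s))
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def cluster_by_threshold(docs, threshold):
    """Link documents whose pairwise NCD is below `threshold`;
    return the connected components as lists of document indices."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if ncd(docs[i], docs[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```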
1.2 A1.2 Prediction by Partial Matching and Arithmetic Coding
PPM combined with AC is a powerful compression method and served for a number of years as the benchmark against which compression algorithms were measured. The two algorithms complement each other: PPM provides a model for predicting the next character in a document, while AC takes that prediction model and uses the probability provided for each character to produce an encoded binary output. When the probability for a particular character or symbol is provided to AC, we say that it is emitted. This terminology is helpful because PPM does not always provide the probability for the next character; instead, it may emit a sequence of other symbols, called escapes, to encode the fact that the PPM model is changing its internal state. These escapes, in turn, allow a decoder to make the same changes.
The complementary nature of PPM and AC allows us to use them as part of a supervised learning scheme. We can use PPM to create models for truthful documents, called the truthful model, and for deceptive documents, called the deceptive model. We can then classify a document by using each predictive model together with AC and assigning the document to the class whose model produces the fewest number of bits for the compressed output.
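To illustrate the decision rule (not the paper's PPM models), the sketch below substitutes a simple order-0 character model with add-one smoothing for PPM; the rule itself, assigning the document to the class whose model yields the shorter ideal code length, is the same one described above:

```python
import math
from collections import Counter

def train_order0(text: bytes):
    """Stand-in for a PPM model: an order-0 byte model with add-one
    smoothing (a real PPM model conditions on longer contexts and
    uses escapes)."""
    counts = Counter(text)
    total = len(text) + 256  # add one count for every byte value
    return {b: (counts.get(b, 0) + 1) / total for b in range(256)}

def code_length_bits(model, text: bytes) -> float:
    """Ideal AC output length: -sum of log2 P(char)."""
    return -sum(math.log2(model[b]) for b in text)

def classify(doc: bytes, truthful_model, deceptive_model) -> str:
    """Assign the label whose model compresses the document best."""
    t = code_length_bits(truthful_model, doc)
    d = code_length_bits(deceptive_model, doc)
    return "truthful" if t < d else "deceptive"
```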
This scheme works with a standard implementation of AC, whose details are not necessary for understanding our method or results. Only two facts are needed. First, only non-zero probabilities are allowed. Second, the higher the probability assigned to the next character, the fewer bits are eventually needed for the compressed output. Put another way, the better the predictor is at predicting the next character, the fewer bits are required.
Unlike AC, our way of using PPM is not the standard method. Moreover, in Sect. 4 we use two variants of PPM, known as PPMA and PPMC, and we make additional modifications to the PPM algorithm. To understand these changes, it is necessary to give a detailed overview of PPM, which we do next.
In general, PPM uses a set of tables of predictors of the next character that are conditioned on previous characters. The previous characters used are called the context, and the number of characters in the context, d, is called the order. Each table holds all of the predictions for a given context of a given order, and is called an order-d table. Note that since multiple contexts share the same order, there are several order-d tables for each order \(d\ge 1\). For example, if one has the characters abab as the context, then the prediction that the next character is an a given abab is contained in an order-4 table with context abab. Thus, an order-d table for context \(c_{0}c_{1}\ldots c_{d-1}\) is really the collection of conditional probabilities \(P(c|c_{0}c_{1}\ldots c_{d-1})\) providing the probability of seeing a token c given that \(c_{0}c_{1}\ldots c_{d-1}\) are the preceding characters. The special case when the order is 0 is P(c), which is just the probability of c occurring.
For the versions of PPM that we consider (PPMA and PPMC), the orders that are allowed are bounded by a user-specified maximum order \(d_\text{max}\). When predicting the next character c, PPM first considers the previous \(d=d_\text{max}\) characters \(c_{0}c_{1}\ldots c_{d-1}\) as the context, where \(c_{d-1}\) is the character directly preceding c. If \(P(c|c_{0}c_{1}\ldots c_{d-1})\) is non-zero, then this probability is provided to AC. However, it may be the case that \(P(c|c_{0}c_{1}\ldots c_{d-1}) = 0\), which is not an allowable probability for AC. When this occurs, we say that the table does not contain c, and PPM issues a probability for a special symbol called an escape to indicate that the character was not found. PPM then changes state and attempts to provide the probability for the next smaller context, \(P(c|c_{1}\ldots c_{d-1})\). If \(P(c|c_{1}\ldots c_{d-1}) = 0\), then \(P(c|c_{2}\ldots c_{d-1})\) is considered, and so on. If the character is not found with any context, then P(c) is considered. Finally, if PPM has no prediction for P(c), then a default uniform distribution is assumed for all characters. The table for this distribution is considered to have order \(-1\). In addition to the characters, every table of order \(d > -1\) has two additional symbols: an end-of-file symbol eof and an escape symbol esc. These symbols are present even when the probabilities for all characters are zero.
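The escape cascade can be sketched as a generator that yields the (symbol, probability) pairs handed to AC, walking from the longest context down to the order \(-1\) uniform default. The table layout and function name are hypothetical, PPMA escape counts are assumed, and the exclusion optimization used by full PPM implementations is ignored:

```python
def emit_probabilities(tables, context, c, d_max, alphabet_size=256):
    """Yield the (symbol, probability) pairs handed to AC when
    encoding character `c` after `context` (len(context) >= d_max).
    `tables` maps a context string to its character counts; under
    PPMA, eof and esc each contribute a count of 1 to every table."""
    for d in range(d_max, -1, -1):
        ctx = context[len(context) - d:]
        counts = tables.get(ctx, {})
        total = sum(counts.values()) + 2  # +1 for eof, +1 for esc
        if counts.get(c, 0) > 0:
            yield (c, counts[c] / total)  # character found: emit it
            return
        yield ("esc", 1 / total)  # not found: emit escape, shorten context
    # order -1: fall back to the uniform default distribution
    yield (c, 1 / alphabet_size)
```

Running this on the rowrowrow tables reproduces the behavior described in the text: a character seen in the longest context is emitted immediately, while an unseen character produces a run of escapes before the uniform fallback.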
PPM uses an adaptive strategy for generating the tables. Instead of keeping track of the probabilities for a given context directly, a character count is maintained for each character seen with the given context, up to context order \(d_\text{max}\). For example, consider the string rowrowrow and a maximum context order of \(d_\text{max}=2\). After receiving the first character r, the count for r is incremented by one in the order-0 table. When o is received, the count for o is incremented by one in the order-1 table with context r, and in the order-0 table. When w is received, the count of w is incremented by one in the order-2 table with context ro, in the order-1 table with context o, and in the order-0 table. When the next r is received, the count of r is incremented by one in the order-2 table with context ow, the order-1 table with context w, and the order-0 table. This process continues until all characters are received. The resulting character counts for this example are shown in Table 1. Note that this is the exact process used for PPMA and the process we use for PPMC, though technically a pure implementation of PPMC updates the tables differently. In our results, it is the escapes that seem particularly influential on the number of output bits, so we chose to adopt only the change PPMC makes to the escapes, not the change it makes to the tables. More details on the pure implementation of PPMC are in [17].
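The adaptive counting described above can be sketched directly; running it on rowrowrow with \(d_\text{max}=2\) reproduces the counts described in the text (function name and table layout are our own):

```python
from collections import Counter, defaultdict

def build_tables(text: str, d_max: int):
    """Count tables keyed by context string, orders 0..d_max.
    tables[ctx][c] is the number of times character c followed ctx."""
    tables = defaultdict(Counter)
    for i, c in enumerate(text):
        for d in range(d_max + 1):
            if i - d < 0:
                break  # not enough preceding characters for this order
            tables[text[i - d:i]][c] += 1
    return tables
```

For rowrowrow this gives, e.g., a count of 3 for r in the order-0 table (context "") and a count of 2 for r in the order-2 table with context ow, matching the worked example.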
As mentioned above, it is not only the characters that are contained in the table, but also two symbols: eof and esc. The end-of-file symbol, eof, is always allocated a count of one. The count associated with the esc is different depending on whether we are using PPMA or PPMC. In fact, the difference in escape counts is the only difference between PPMA and PPMC that we use in our implementation. For PPMA, the esc symbol is allocated a single count while for PPMC, the escape is given a count equal to the number of non-zero character entries in the table, or 1 if no non-zero character entries are present. Thus, for the order-0 table in Table 1 the corresponding count for the esc according to PPMC is 3 while for any of the order-1 tables, the count is 1. The symbol counts for both methods are shown in Table 2.
To get the probabilities that are sent to AC, PPM just normalizes the entries in the tables. For example, for PPMA \(P(r|ow) = \frac{2}{4}\) since r has a character count of 2, eof has a count of 1, and esc has a count of 1. So, the total count is 4, which is the normalization factor. The rest of the probabilities for PPMA and PPMC are shown in Table 3. Note that each row in this table is really a full PPM table for a given context of a given order-d, as defined in the beginning of this section.
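Combining the escape-count rules with this normalization, a small helper (hypothetical, mirroring the construction of Tables 2 and 3) turns one context's count table into the probabilities handed to AC:

```python
def table_probs(counts, method="PPMA"):
    """Normalize one context's character counts into probabilities.
    eof always has count 1; esc has count 1 under PPMA, or the number
    of distinct characters seen (at least 1) under PPMC."""
    symbols = dict(counts)
    symbols["eof"] = 1
    symbols["esc"] = 1 if method == "PPMA" else max(1, len(counts))
    total = sum(symbols.values())
    return {s: n / total for s, n in symbols.items()}

# Context ow from the rowrowrow example: P(r|ow) = 2/4 under PPMA,
# since r has count 2 and eof and esc contribute one count each.
```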
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ting, C.L., Fisher, A.N., Bauer, T.L. (2017). Compression-Based Algorithms for Deception Detection. In: Ciampaglia, G., Mashhadi, A., Yasseri, T. (eds) Social Informatics. SocInfo 2017. Lecture Notes in Computer Science(), vol 10539. Springer, Cham. https://doi.org/10.1007/978-3-319-67217-5_16
Print ISBN: 978-3-319-67216-8
Online ISBN: 978-3-319-67217-5