Abstract
The need to store and process SMILES data efficiently in chemoinformatics creates a demand for techniques that compress this data well. General-purpose transforms and compressors can compress this type of data to a certain extent; however, these techniques are not specific to SMILES data. We develop a transform specific to SMILES data that can be used alongside general-purpose compressors as a preprocessor and post-processor to improve the compression of SMILES data. We test our transform with six general-purpose compressors, and we compare our results on our SMILES data corpus against another transform and against untransformed data.
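The preprocessor and post-processor arrangement described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the transform itself, so everything below is an assumption made for illustration: the substitution table `TABLE`, the function names `forward`, `inverse`, `pack`, and `unpack`, and the choice of `zlib`, `bz2`, and `lzma` as stand-ins for general-purpose compressors. The sketch only demonstrates that such a pipeline can be fully reversible; it makes no claim about compression gains.

```python
import bz2
import lzma
import zlib

# Hypothetical substitution table (an assumption, not the paper's dictionary):
# frequent SMILES substrings mapped to single bytes outside the ASCII range.
# Longer patterns are listed first so they are replaced before their prefixes.
TABLE = {
    b"c1ccccc1": b"\x80",  # benzene ring
    b"C(=O)O": b"\x81",    # carboxyl group
    b"C(=O)": b"\x82",     # carbonyl group
    b"CC": b"\x83",        # carbon pair
}

# Stand-ins for general-purpose compressors: name -> (compress, decompress).
COMPRESSORS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def forward(data: bytes) -> bytes:
    """Preprocessor: replace known substrings with one-byte codes."""
    # The codes are only unambiguous if the input never contains them.
    if any(b >= 0x80 for b in data):
        raise ValueError("input must be pure ASCII")
    for pattern, code in TABLE.items():
        data = data.replace(pattern, code)
    return data


def inverse(data: bytes) -> bytes:
    """Post-processor: expand every one-byte code back to its substring."""
    for pattern, code in TABLE.items():
        data = data.replace(code, pattern)
    return data


def pack(data: bytes, name: str) -> bytes:
    """Apply the transform, then the chosen general-purpose compressor."""
    compress, _ = COMPRESSORS[name]
    return compress(forward(data))


def unpack(blob: bytes, name: str) -> bytes:
    """Decompress, then undo the transform."""
    _, decompress = COMPRESSORS[name]
    return inverse(decompress(blob))


# A few well-known SMILES strings, one per line (aspirin, benzene, ethanol).
SAMPLE = b"CC(=O)Oc1ccccc1C(=O)O\nc1ccccc1\nCCO\n"

if __name__ == "__main__":
    for name in COMPRESSORS:
        plain_size = len(COMPRESSORS[name][0](SAMPLE))
        packed_size = len(pack(SAMPLE, name))
        assert unpack(pack(SAMPLE, name), name) == SAMPLE
        print(name, "untransformed:", plain_size, "transformed:", packed_size)
```

Because the codes lie outside the ASCII range and the input is checked to be pure ASCII, every code in the transformed stream maps back to exactly one original substring, which is what makes the round trip lossless. On a sample this small the sizes printed are not meaningful; a comparison like the one in the abstract would need a realistic corpus.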
Copyright information
© 2013 IFIP International Federation for Information Processing
About this paper
Cite this paper
Scanlon, S., Ridley, M. (2013). A Fully Reversible Data Transform Technique Enhancing Data Compression of SMILES Data. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds) Availability, Reliability, and Security in Information Systems and HCI. CD-ARES 2013. Lecture Notes in Computer Science, vol 8127. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40511-2_5
DOI: https://doi.org/10.1007/978-3-642-40511-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40510-5
Online ISBN: 978-3-642-40511-2
eBook Packages: Computer Science (R0)