Abstract
A problem that must be faced very early when dealing with digital documents is how to represent them in a suitable machine-readable format. This chapter introduces the formats currently in widespread use for digital document representation, divided by category according to the degree of structure they express. Formats that do not exhibit any high-level structure for their content are dealt with first: plain text and image formats (the latter in turn divided into vector and raster ones). Then, the formats containing information on the spatial placement of the document elements are introduced: PostScript and its evolution, the Portable Document Format, which represent the current standard in document exchange. Lastly, a selection of formats organized according to the content and function of the document components is presented, including Web formats and the official standard for text processing. Some insight into file representation and organization, and into the methods involved in producing and exploiting these formats, is given, with the aim of ensuring a sufficient understanding and the ability to recognize their pros and cons, without entering into subtle technical details.
Notes
- 1.
As proved by Shannon [23], the number of bits required to specify sequences of length N, for large N, is equal to N⋅H (where H is the source entropy).
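As a rough illustration of the N⋅H figure, the following sketch estimates H from symbol frequencies and multiplies it by the sequence length. It assumes a memoryless source whose symbol probabilities equal the observed frequencies; the function names and the sample string are hypothetical and not taken from the chapter.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(sequence):
    """Empirical Shannon entropy H, in bits per symbol.

    Assumption: symbols are independent draws from a memoryless source,
    with probabilities estimated from their frequencies in `sequence`.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def approx_bits_needed(sequence):
    """Approximate bit budget N*H for the whole sequence."""
    return len(sequence) * entropy_bits_per_symbol(sequence)

sample = "aaaabbcd"  # hypothetical source output, N = 8
h = entropy_bits_per_symbol(sample)
n_h = approx_bits_needed(sample)
print(f"H = {h} bits/symbol, N*H = {n_h} bits")
```

For this sample the symbol probabilities are 1/2, 1/4, 1/8 and 1/8, so H is 1.75 bits per symbol and N⋅H is 14 bits, against the 16 bits of a fixed 2-bit code.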
- 2.
In the rest of this section, both notations will be used interchangeably, as needed.
- 3.
Another coding standard in use during the early days of computing, later abandoned, was EBCDIC (Extended Binary Coded Decimal Interchange Code). It started from the BCD (Binary Coded Decimal) binary representation of decimal digits, from 0000₂ = 0 to 1001₂ = 9, used by early computers for performing arithmetic operations, and extended it by prepending four additional bits. The decimal digits are characterized by an initial 1111₂ sequence, while the other available configurations allow defining various kinds of characters (alphabetic ones, punctuation marks, etc.).
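The digit layout described in this note can be checked with a short sketch. It assumes that Python's built-in `cp037` codec (EBCDIC code page 037) is an acceptable stand-in for EBCDIC in general, since the note does not name a specific variant.

```python
# Assumption: code page 037 is used as a representative EBCDIC variant;
# the note itself does not say which variant it describes.
digits = "0123456789"
ebcdic = digits.encode("cp037")

for ch, byte in zip(digits, ebcdic):
    high = byte >> 4      # the four prepended bits
    low = byte & 0x0F     # the original BCD value of the digit
    print(f"{ch}: {high:04b} {low:04b}")
```

Each printed row shows the prepended four bits followed by the BCD nibble of the digit.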
- 4.
As a trivial example of how tricky UTF-16 can be: usual C string handling cannot be applied because it would consider as string terminators the many 00000000 byte configurations in UTF-16 codes.
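A minimal sketch of the problem follows. The helper `c_strlen` is a hypothetical imitation of C's `strlen`, which stops at the first zero byte; the sample text and the choice of the little-endian UTF-16 variant are likewise assumptions made for illustration.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count the bytes that precede the first zero byte."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx

text = "Hi"                          # hypothetical sample text
utf16 = text.encode("utf-16-le")     # two bytes per character here
utf8 = text.encode("utf-8")          # one byte per character here

print(utf16, c_strlen(utf16))
print(utf8, c_strlen(utf8))
```

In the UTF-16 encoding the high byte of every ASCII character is zero, so a byte-oriented C routine sees the string end right after the first character, whereas the UTF-8 bytes contain no zero byte at all.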
- 5.
An indication that one of the main interests towards images is their transmission.
- 6.
An image compressed lossily will tend to lose quality each time it is saved again, to the point where its content may no longer be recognizable.
- 7.
The subsampling scheme is commonly expressed as an R:f:s code that refers to a conceptual region with a height of 2 rows (pixels), where:
- R: width of the conceptual region (horizontal sampling reference), usually 4;
- f: number of chrominance samples in the first row of R pixels;
- s: number of (additional) chrominance samples in the second row of R pixels.
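The effect of an R:f:s code can be worked out with a small sketch. It assumes one luminance channel kept at full resolution plus two chrominance channels that are subsampled identically, which the note does not state explicitly; the function names are hypothetical.

```python
from fractions import Fraction

def parse_scheme(code):
    """Split an 'R:f:s' string into its three integer components."""
    r, f, s = (int(part) for part in code.split(":"))
    return r, f, s

def chroma_fraction(code):
    """Chrominance samples kept per pixel in the R-wide, 2-row region."""
    r, f, s = parse_scheme(code)
    return Fraction(f + s, 2 * r)

def relative_size(code):
    """Data volume relative to no subsampling.

    Assumption: one full-resolution luminance channel and two
    chrominance channels subsampled in the same way.
    """
    c = chroma_fraction(code)
    return (1 + 2 * c) / 3

for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    print(scheme, chroma_fraction(scheme), relative_size(scheme))
```

Under these assumptions 4:4:4 keeps every chrominance sample, 4:2:2 keeps half of them, and 4:2:0 keeps a quarter, which halves the overall amount of data.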
- 8.
A variant of LZ77, developed by J.-L. Gailly for the compression part (used in zip and gzip) and by M. Adler for the decompression part (used in gzip and unzip), and almost always used nowadays in ZIP compression. It can be optimized for specific types of data.
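A compression round trip can be sketched with Python's standard `zlib` module, which wraps this compression method in the zlib container format (references RFC1950 and RFC1951 in the list below). The sample data and the compression level are arbitrary choices made for illustration.

```python
import zlib

# Hypothetical, highly repetitive input: long repeated substrings are
# what an LZ77-style scheme replaces with back-references.
data = b"abcabcabc" * 100

packed = zlib.compress(data, 9)      # level 9 = strongest compression
restored = zlib.decompress(packed)

print(len(data), "->", len(packed), "bytes")
```

The decompressed bytes are identical to the input, since the method is lossless, and the repetitive sample shrinks to a small fraction of its original size.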
- 9.
An open-source project (http://pages.cs.wisc.edu/~ghost/) that does not handle the PS and PDF formats directly, and is thereby able to cope with some differences among the various versions or dialects of such formats. The project also maintains an associated viewer for PS files, called GSview.
- 10.
In this section, PostScript code and operators will be denoted using a teletype font. Operators that accept different numbers of parameters are disambiguated by appending the number n of parameters in the form operator/n.
- 11.
The most common operations that modify the matrix are translation of the axes origin, rotation of the system of Cartesian axes by a given angle, scaling that independently changes the unit of measure of the axes, and concatenation that applies a linear transformation to the coordinate system.
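These operations can be sketched with 3×3 homogeneous matrices. The sketch uses the column-vector convention and hypothetical helper names; it is not PostScript's own six-element matrix layout, only an illustration of how the operations compose.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices: concatenation of two transformations."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(tx, ty):
    """Move the origin of the axes by (tx, ty)."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(degrees):
    """Rotate the axes counterclockwise by the given angle."""
    r = math.radians(degrees)
    c, s = math.cos(r), math.sin(r)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    """Change the unit of measure of each axis independently."""
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def apply(m, point):
    """Map a point (x, y) through the matrix m."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

# Translate the origin to (10, 20), then scale the units by (2, 3).
ctm = matmul(translation(10, 20), scaling(2, 3))
moved = apply(ctm, (1, 1))
turned = apply(rotation(90), (1, 0))
print(moved, turned)
```

With the column-vector convention, the most recently applied transformation is the rightmost factor, so the point (1, 1) is scaled first and translated afterwards.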
- 12.
Some software applications that produce HTML documents exploit extensions not included in the official format definition, and hence part of their output represents semi-proprietary code that might not be properly displayed on some platforms.
- 13.
Note that HTML is an application, not a subset, of SGML.
- 14.
This representation is inspired by JAR files (Java ARchives), used by the Java programming language to save applications.
References
Graphics Interchange Format (sm) specification—version 89a. Tech. rep., Compuserve Inc. (1990)
TIFF specification—revision 6.0. Tech. rep., Adobe Systems Incorporated (1992)
HTML 4.01 specification—W3C recommendation. Tech. rep., W3C (1999)
XML Path Language (XPath) 1.0—W3C recommendation. Tech. rep., W3C (1999)
XSL Transformations (XSLT) 1.0—W3C recommendation. Tech. rep., W3C (1999)
International standard ISO/IEC 10646: Information technology—Universal Multiple-octet coded Character Set (UCS). Tech. rep., ISO/IEC (2003)
Portable Network Graphics (PNG) specification, 2nd edn.—W3C recommendation. Tech. rep., W3C (2003)
Lizardtech djvu reference—version 3. Tech. rep., Lizardtech, A Celartem Company (2005)
Extensible Markup Language (XML) 1.1, 2nd edn.—W3C recommendation. Tech. rep., W3C (2006)
Extensible Stylesheet Language (XSL) 1.1—W3C recommendation. Tech. rep., W3C (2006)
Microsoft Office Word 97–2007 binary file format specification [*.doc]. Tech. rep., Microsoft Corporation (2007)
Open Document Format for office applications (OpenDocument) v1.1—OASIS standard. Tech. rep., OASIS (2007)
Extensible Markup Language (XML) 1.0, 5th edn.—W3C recommendation. Tech. rep., W3C (2008)
Adobe Systems Incorporated: PDF Reference—Adobe Portable Document Format Version 1.3, 2nd edn. Addison-Wesley, Reading (2000)
International Telegraph and Telephone Consultative Committee (CCITT): Recommendation T.81. Tech. rep., International Telecommunication Union (ITU) (1992)
Deutsch, P.: Deflate compressed data format specification 1.3. Tech. rep. RFC1951 (1996)
Deutsch, P., Gailly, J.L.: Zlib compressed data format specification 3.3. Tech. rep. RFC1950 (1996)
Eisenberg, J.: OASIS OpenDocument Essentials—Using OASIS OpenDocument XML. Friends of OpenDocument (2005)
Hamilton, E.: JPEG file interchange format—version 1.2. Tech. rep. (1992)
Huffman, D.: A method for the construction of minimum-redundancy codes. In: Proceedings of the I.R.E, pp. 1098–1102 (1952)
Lamport, L.: LaTeX, A Document Preparation System: User’s Guide and Reference Manual, 2nd edn. Addison-Wesley, Reading (1994)
Reid, G.: Thinking in PostScript. Addison-Wesley, Reading (1990)
Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Champaign (1949)
The Unicode Consortium: The Unicode Standard, Version 5.0, 5th edn. Addison-Wesley, Reading (2006)
W3C SVG Working Group: Scalable Vector Graphics (SVG) 1.1 specification. Tech. rep., W3C (2003)
Welch, T.: A technique for high-performance data compression. IEEE Computer 17(6), 8–19 (1984)
Wood, L.: Programming the Web: The W3C DOM specification. IEEE Internet Computing 3(1), 48–54 (1999)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)
© 2011 Springer-Verlag London Limited
About this chapter
Cite this chapter
Ferilli, S. (2011). Digital Formats. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_2
DOI: https://doi.org/10.1007/978-0-85729-198-1_2
Publisher Name: Springer, London
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1
eBook Packages: Computer Science (R0)