Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

A problem to be faced very early when dealing with digital documents is how to represent them in a suitable machine-readable format. This chapter introduces the formats currently in widespread use for digital document representation, divided by category according to the degree of structure they express. Formats that do not exhibit any high-level structure for their content are dealt with first: plain text and image formats (in turn divided into vector and raster ones). Then, the formats containing information on the spatial placement of the document elements are introduced: PostScript and its evolution, the Portable Document Format, which represent the current standard in document exchange. Lastly, a selection of formats organized according to the content and function of the document components is presented, including Web formats and the official standard for text processing. Some insight is given into file representation and organization, and into the methods involved in producing and exploiting these formats. The aim is to ensure a sufficient understanding of them and the ability to recognize their pros and cons, without entering into subtle technical details.

Notes

  1. As proved by Shannon [23], the number of bits required to specify sequences of length N, for large N, is equal to NH (where H is the source entropy).

  2. In the rest of this section, both notations will be used interchangeably, as needed.

  3. Another coding standard, used in the early days of computing and later abandoned, was EBCDIC (Extended Binary Coded Decimal Interchange Code). It started from the BCD (Binary Coded Decimal) binary representation of decimal digits, from 0000₂ = 0 to 1001₂ = 9, used by early computers for performing arithmetic operations, and extended it by prepending four additional bits. The decimal digits are characterized by an initial 1111₂ sequence, while the other available configurations allow defining various kinds of characters (alphabetic ones, punctuation marks, etc.).

  4. As a trivial example of how tricky UTF-16 can be: usual C string handling cannot be applied, because it would treat the many all-zero bytes (00000000) that occur in UTF-16 codes as string terminators.

  5. An indication that one of the main interests in images is their transmission.

  6. An image compressed lossily will, if repeatedly saved, tend to lose quality, up to the point where its content can no longer be recognized.

  7. The subsampling scheme is commonly expressed as an R:f:s code that refers to a conceptual region with a height of 2 rows (pixels), where:

       • R: width of the conceptual region (horizontal sampling reference), usually 4;
       • f: number of chrominance samples in the first row of R pixels;
       • s: number of (additional) chrominance samples in the second row of R pixels.

  8. A variant of LZ77, developed by J.-L. Gailly for the compression part (used in zip and gzip) and by M. Adler for the decompression part (used in gzip and unzip), and almost always used nowadays in ZIP compression. It can be optimized for specific types of data.

  9. An open-source project (http://pages.cs.wisc.edu/~ghost/) that does not directly handle PS and PDF formats, and is thus able to handle some differences among the various versions or dialects of such formats. An associated viewer for PS files, called GSview, is also maintained in the project.

  10. In this section, PostScript code and operators will be denoted using a teletype font. Operators that can be used with several numbers of parameters are disambiguated by appending the number n of parameters in the form operator/n.

  11. The most common operations that modify the matrix are translation of the axes origin, rotation of the system of Cartesian axes by a given angle, scaling that independently changes the unit of measure of the axes, and concatenation that applies a linear transformation to the coordinate system.

  12. Some software applications that produce HTML documents exploit extensions not included in the official format definition, and hence part of their output represents semi-proprietary code that might not be properly displayed on some platforms.

  13. Note that HTML is an application, not a subset, of SGML.

  14. This representation is inspired by JAR files (Java ARchives), used by the Java programming language to save applications.
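
As an illustration of Notes 3 and 4, the following minimal Python sketch shows how an EBCDIC digit can be obtained by placing the bits 1111 before the four BCD bits, and why the zero bytes in UTF-16 confuse C-style string handling. The helper names `bcd_to_ebcdic_digit` and `c_strlen` are hypothetical and introduced only for this example; `cp037` is one EBCDIC code page among Python's standard codecs and is assumed here to be representative.

```python
def bcd_to_ebcdic_digit(digit: int) -> int:
    """Prepend the bits 1111 to the 4-bit BCD code of a decimal digit."""
    if not 0 <= digit <= 9:
        raise ValueError("BCD encodes only the decimal digits 0-9")
    return 0b1111_0000 | digit


def c_strlen(buffer: bytes) -> int:
    """Mimic C's strlen: count the bytes that precede the first zero byte."""
    end = buffer.find(b"\x00")
    return len(buffer) if end == -1 else end


# Note 3: digit 9 is 1001 in BCD, hence 1111 1001 in EBCDIC.
ebcdic_nine = bcd_to_ebcdic_digit(9)
print(f"{ebcdic_nine:08b}")                  # 11111001
print(bytes([ebcdic_nine]).decode("cp037"))  # 9

# Note 4: in UTF-16, every ASCII character carries an all-zero byte.
utf16_text = "Hi".encode("utf-16-le")
print(utf16_text)            # b'H\x00i\x00'
print(c_strlen(utf16_text))  # 1
```

The simulated `strlen` stops after the first byte of the two-character string, which is the behaviour Note 4 warns about.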

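The R:f:s code described in Note 7 can be turned into a quick calculation of how much chrominance information survives subsampling. The sketch below is only an illustration under the definitions given in that note: the function name `chroma_fraction` is hypothetical, and the result is the share of pixels in the R × 2 region that keep their own chrominance sample (per chrominance channel, with luminance assumed to be kept for every pixel).

```python
from fractions import Fraction


def chroma_fraction(scheme: str) -> Fraction:
    """Share of pixels in the R x 2 region that keep a chrominance sample."""
    r, f, s = (int(part) for part in scheme.split(":"))
    if r <= 0 or not (0 <= f <= r and 0 <= s <= r):
        raise ValueError(f"invalid subsampling scheme: {scheme}")
    # f samples in the first row plus s in the second, out of 2 * R pixels.
    return Fraction(f + s, 2 * r)


for scheme in ("4:4:4", "4:2:2", "4:2:0", "4:1:1"):
    print(scheme, chroma_fraction(scheme))
# 4:4:4 1
# 4:2:2 1/2
# 4:2:0 1/4
# 4:1:1 1/4
```

Under this reading, 4:2:0 and 4:1:1 retain the same amount of chrominance data and differ only in how the samples are distributed over the two rows.
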
References

  1. Graphics Interchange Format (sm) specification—version 89a. Tech. rep., Compuserve Inc. (1990)

  2. TIFF specification—revision 6.0. Tech. rep., Adobe Systems Incorporated (1992)

  3. HTML 4.01 specification—W3C recommendation. Tech. rep., W3C (1999)

  4. XML Path Language (XPath) 1.0—W3C recommendation. Tech. rep., W3C (1999)

  5. XSL Transformations (XSLT) 1.0—W3C recommendation. Tech. rep., W3C (1999)

  6. International standard ISO/IEC 10646: Information technology—Universal Multiple-octet coded Character Set (UCS). Tech. rep., ISO/IEC (2003)

  7. Portable Network Graphics (PNG) specification, 2nd edn.—W3C recommendation. Tech. rep., W3C (2003)

  8. Lizardtech djvu reference—version 3. Tech. rep., Lizardtech, A Celartem Company (2005)

  9. Extensible Markup Language (XML) 1.1, 2nd edn.—W3C recommendation. Tech. rep., W3C (2006)

  10. Extensible Stylesheet Language (XSL) 1.1—W3C recommendation. Tech. rep., W3C (2006)

  11. Microsoft Office Word 97–2007 binary file format specification [*.doc]. Tech. rep., Microsoft Corporation (2007)

  12. Open Document Format for office applications (OpenDocument) v1.1—OASIS standard. Tech. rep., OASIS (2007)

  13. Extensible Markup Language (XML) 1.0, 5th edn.—W3C recommendation. Tech. rep., W3C (2008)

  14. Adobe Systems Incorporated: PDF Reference—Adobe Portable Document Format Version 1.3, 2nd edn. Addison-Wesley, Reading (2000)

  15. International Telegraph and Telephone Consultative Committee (CCITT): Recommendation T.81. Tech. rep., International Telecommunication Union (ITU) (1992)

  16. Deutsch, P.: Deflate compressed data format specification 1.3. Tech. rep. RFC1951 (1996)

  17. Deutsch, P., Gailly, J.L.: Zlib compressed data format specification 3.3. Tech. rep. RFC1950 (1996)

  18. Eisenberg, J.: OASIS OpenDocument Essentials—Using OASIS OpenDocument XML. Friends of OpenDocument (2005)

  19. Hamilton, E.: JPEG file interchange format—version 1.2. Tech. rep. (1992)

  20. Huffman, D.: A method for the construction of minimum-redundancy codes. In: Proceedings of the I.R.E, pp. 1098–1102 (1952)

  21. Lamport, L.: LaTeX, A Document Preparation System—User’s Guide and Reference Manual, 2nd edn. Addison-Wesley, Reading (1994)

  22. Reid, G.: Thinking in PostScript. Addison-Wesley, Reading (1990)

  23. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Champaign (1949)

  24. The Unicode Consortium: The Unicode Standard, Version 5.0, 5th edn. Addison-Wesley, Reading (2006)

  25. W3C SVG Working Group: Scalable Vector Graphics (SVG) 1.1 specification. Tech. rep., W3C (2003)

  26. Welch, T.: A technique for high-performance data compression. IEEE Computer 17(6), 8–19 (1984)

  27. Wood, L.: Programming the Web: The W3C DOM specification. IEEE Internet Computing 3(1), 48–54 (1999)

  28. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)

  29. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

Author information

Corresponding author

Correspondence to Stefano Ferilli.

Copyright information

© 2011 Springer-Verlag London Limited

About this chapter

Cite this chapter

Ferilli, S. (2011). Digital Formats. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_2

  • DOI: https://doi.org/10.1007/978-0-85729-198-1_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-0-85729-197-4

  • Online ISBN: 978-0-85729-198-1

  • eBook Packages: Computer Science (R0)
