Digital Formats

  • Stefano Ferilli
Part of the Advances in Pattern Recognition book series (ACVPR)


A problem that arises very early when dealing with digital documents is how to represent them in a suitable machine-readable format. This chapter introduces the formats currently in widespread use for digital document representation, grouped by the degree of structure they express. Formats that do not exhibit any high-level structure for their content are dealt with first: plain text and image formats (the latter divided into vector and raster ones). Next come the formats that carry information on the spatial placement of document elements: PostScript and its evolution, the Portable Document Format (PDF), which represent the current standard in document exchange. Lastly, a selection of formats organized according to the content and function of the document components is presented, including Web formats and the official standard for text processing. The chapter gives some insight into how these files are represented and organized, and into the methods involved in producing and exploiting them, with the aim of providing enough understanding to recognize their pros and cons without entering into subtle technical details.


Keywords: Color Space; Graphic State; User Space; Raster Image; Document Content



Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Dipartimento di Informatica, Università di Bari, Bari, Italy
