Floating-Point Formats and Environment

Handbook of Floating-Point Arithmetic

Abstract

Our main focus in this chapter is the IEEE 754-2008 Standard for Floating-Point Arithmetic [267], a revision and merge of the earlier IEEE 754-1985 [12] and IEEE 854-1987 [13] standards. A paper written in 1981 by Kahan, Why Do We Need a Floating-Point Standard? [315], depicts the rather messy situation of floating-point arithmetic before the 1980s. Anybody who takes the view that the current standard is too constraining, and that circuit and system manufacturers could build much more efficient machines without it, should read that paper and think about it. Even if there were at that time a few reasonably good environments, the various systems available then were so different that writing portable yet reasonably efficient numerical software was extremely difficult. For instance, as pointed out in [553], sometimes a programmer had to insert multiplications by 1.0 to make a program work reliably.


Notes

  1.

    IEEE is an acronym for the Institute of Electrical and Electronics Engineers. For more details, see https://www.ieee.org/about/index.html.

  2.

    Platforms may exchange character strings or binary data. In the latter case, endianness problems (see Section 3.1.1.5) must be considered.
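As an illustration (not taken from the text), the endianness problem for exchanged binary data can be observed with Python's `struct` module: the same binary64 value produces two byte streams that are reversals of each other, and decoding with the wrong byte order silently yields a different value rather than an error.

```python
import struct

# Pack the binary64 value 1.0 with explicit byte orders.
big    = struct.pack('>d', 1.0)   # big-endian
little = struct.pack('<d', 1.0)   # little-endian

# 1.0 is 0x3FF0000000000000 in binary64, so the two byte
# streams are reversals of each other.
print(big.hex())     # 3ff0000000000000
print(little.hex())  # 000000000000f03f

# Decoding with the wrong byte order silently yields a
# completely different value, not an error.
wrong = struct.unpack('>d', little)[0]
print(wrong == 1.0)  # False
```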

  3.

    Those of the form 01x11x111x, 10x11x111x, or 11x11x111x.

  4.

    The case where these floating-point numbers both have an odd least significant significand digit (this can occur only in precision 1, for instance when converting a number such as 9.5 into a decimal string) was overlooked in the standard; for the next revision, it has been proposed—see http://speleotrove.com/misc/IEEE754-errata.html—to deliver the one larger in magnitude.

  5.

    This implies that the language standards must specify what that literal meaning is: order of operations, destination formats of operations, etc.

  6.

    This rule (which may help in implementing complex functions [317]) may seem strange, but the most important point is that any sequence of exact computations on real numbers will give the correct result, even when \(\sqrt{-0}\) is involved. Also, let us recall that −0 is regarded as a null value, not a negative number.
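As a quick illustration (a sketch in Python, not from the text), the rule \(\sqrt{-0} = -0\) and the "null value" status of −0 can both be checked directly:

```python
import math

# IEEE 754 specifies sqrt(-0) = -0: the result is a null value
# carrying the sign of its input, so exact computations stay exact.
r = math.sqrt(-0.0)
print(r)                      # -0.0
print(math.copysign(1.0, r))  # -1.0 (the sign bit is set)

# -0 compares equal to +0: it is a signed zero, not a negative number.
print(-0.0 == 0.0)            # True

# Any genuinely negative input is invalid; Python maps the
# invalid operation to an exception rather than returning a NaN.
try:
    math.sqrt(-1.0)
except ValueError as e:
    print("invalid:", e)
```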

  7.

    In most cases, x ⋅ β n is exactly representable so that there is no rounding at all, but requiring correct rounding is the simplest way of defining what should be returned if the result is outside the normal range.
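For radix 2, this scaling operation is available in Python as `math.ldexp` (a hedged illustration, not from the text): `ldexp(x, n)` computes x · 2ⁿ with correct rounding, which is exact whenever the result stays in the normal range, and only rounds when the result falls into the subnormal range or beyond.

```python
import math

# scaleB(x, n) in radix 2 is ldexp: x * 2**n, correctly rounded.
# Inside the normal range the scaling is exact:
print(math.ldexp(0.75, 4))    # 12.0  (0.75 * 2**4, no rounding)

# Scaling down into the subnormal range can lose bits, which is
# where the correct-rounding requirement actually matters:
tiny = math.ldexp(1.0, -1074)  # smallest positive subnormal, 2**-1074
print(tiny)                    # 5e-324
print(math.ldexp(tiny, -1))    # 0.0: 2**-1075 rounds to zero (ties to even)
```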

  8.

    This problem was already present in the 1985 version of the standard. The choice of not giving a clearer specification in IEEE 754-2008 probably results from the desire to keep existing implementations conforming to the standard.

  9.

    There are a few exceptions to this rule (but not for the basic arithmetic operations): for instance, it is recommended that pow(+1, y) be 1 even if y is a quiet NaN (this partial function being constant).
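This exception is visible in practice (a Python illustration, not from the text): C's `pow`, which Python's `math.pow` wraps, returns 1 for a constant partial function even when the other argument is a quiet NaN.

```python
import math

nan = float("nan")

# NaN normally propagates through arithmetic:
print(nan + 1.0)           # nan

# but pow is one of the exceptions: x**y with x = 1 is the
# constant function 1, so a quiet NaN exponent is ignored.
print(math.pow(1.0, nan))  # 1.0

# Similarly, pow(x, 0) is 1 even for a NaN base:
print(math.pow(nan, 0.0))  # 1.0
```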

  10.

    These operations regard qNaN as missing data, but not sNaN.

  11.

    This means that in the case where a NaN is not regarded as missing data, qNaN must propagate even when the partial function is constant. For instance, the minimum function applied to (qNaN, −∞) will return qNaN instead of −∞.

  12.

    But it is the default on recent systems only.

  13.

    Which is not very realistic but suffices to get a rough estimate of the frequency of occurrences of double roundings.
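The double-rounding phenomenon itself is easy to exhibit (a hedged sketch using Python's `decimal` module, not an example from the text): rounding first to an intermediate, wider precision and then to the target precision can give a different result than rounding directly, because the intermediate rounding can manufacture a tie.

```python
from decimal import Decimal, ROUND_HALF_EVEN

x = Decimal("0.1251")

# Direct rounding to two fractional digits:
direct = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
print(direct)   # 0.13

# Rounding in two steps: first to three digits, then to two.
step1  = x.quantize(Decimal("0.001"), rounding=ROUND_HALF_EVEN)   # 0.125
double = step1.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
print(double)   # 0.12 -- the intermediate rounding created a tie,
                # which then broke the "wrong" way

print(direct == double)  # False
```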

  14.

    Warning! The instructions called FMADDs and so on in SPARC64 V, which share their names and encodings with those of SPARC64 VI, are not real FMA instructions, as they perform two roundings [208, page 56].

  15.

    These instructions also suppress (“silence”) exceptions, which induces similar problems in the case of out-of-order execution (although their correct management in hardware must be implemented anyway).

  16.

    The performance of AVX instructions can be degraded on some computations involving subnormals.

  17.

    CVE-2006-6499 / https://bugzilla.mozilla.org/show_bug.cgi?id=358569.

References

  1. Advanced Micro Devices. 128-bit SSE5 instruction set, 2007. Available at http://pdinda.org/icsclass/doc/AMD_ARCH_MANUALS/AMD64_128_Bit_SSE5_Instrs.pdf.

  2. B. Akbarpour, A. T. Abdel-Hamid, S. Tahar, and J. Harrison. Verifying a synthesized implementation of IEEE-754 floating-point exponential function using HOL. The Computer Journal, 53(4):465, 2010.

  3. American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Standard 754–1985, 1985.

  4. American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE Standard for Radix Independent Floating-Point Arithmetic. ANSI/IEEE Standard 854–1987, 1987.

  5. A. Blot, J.-M. Muller, and L. Théry. Formal correctness of comparison algorithms between binary64 and decimal64 floating-point numbers. In 10th International Workshop on Numerical Software Verification (NSV), pages 25–37, 2017.

  6. S. Boldo, J.-H. Jourdan, X. Leroy, and G. Melquiond. Verified compilation of floating-point computations. Journal of Automated Reasoning, 54(2):135–163, 2015.

  7. R. T. Boute. The Euclidean definition of the functions div and mod. ACM Transactions on Programming Languages and Systems, 14(2):127–144, 1992.

  8. N. Brisebarre, C. Lauter, M. Mezzarobba, and J.-M. Muller. Comparison between binary and decimal floating-point numbers. IEEE Transactions on Computers, 65(7):2032–2044, 2016.

  9. N. Brunie, F. de Dinechin, and B. Dupont de Dinechin. A mixed-precision fused multiply and add. In 45th Asilomar Conference on Signals, Systems, and Computers, November 2011.

  10. N. Brunie, F. de Dinechin, and B. Dupont de Dinechin. Mixed-precision merged multiplication and addition operator. Patent WO/2012/175828, December 2012.

  11. T. C. Chen and I. T. Ho. Storage-efficient representation of decimal data. Communications of the ACM, 18(1):49–52, 1975.

  12. W. J. Cody. MACHAR: a subroutine to dynamically determine machine parameters. ACM Transactions on Mathematical Software, 14(4):301–311, 1988.

  13. W. J. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice-Hall, Englewood Cliffs, NJ, 1980.

  14. T. Coe and P. T. P. Tang. It takes six ones to reach a flaw. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 140–146, July 1995.

  15. S. Collange, M. Daumas, and D. Defour. État de l’intégration de la virgule flottante dans les processeurs graphiques. Technique et Science Informatiques, 27(6):719–733, 2008. In French.

  16. M. Cornea, C. Anderson, J. Harrison, P. T. P. Tang, E. Schneider, and C. Tsen. A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format. In 18th IEEE Symposium on Computer Arithmetic (ARITH-18), pages 29–37, June 2007.

  17. M. Cornea, J. Harrison, C. Anderson, P. T. P. Tang, E. Schneider, and E. Gvozdev. A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format. IEEE Transactions on Computers, 58(2):148–162, 2009.

  18. M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium®-based Systems. Intel Press, Hillsboro, OR, 2002.

  19. M. F. Cowlishaw. Decimal floating-point: algorism for computers. In 16th IEEE Symposium on Computer Arithmetic (ARITH-16), pages 104–111, June 2003.

  20. A. Edelman. The mathematics of the Pentium division bug. SIAM Review, 39(1):54–67, 1997.

  21. L. Eisen, J. W. Ward, H. W. Tast, N. Mäding, J. Leenstra, S. M. Mueller, C. Jacobi, J. Preiss, E. M. Schwarz, and S. R. Carlough. IBM POWER6 accelerators: VMX and DFU. IBM Journal of Research and Development, 51(6):1–21, 2007.

  22. M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Boston, MA, 1994.

  23. M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, CA, 2004.

  24. M. A. Erle, M. J. Schulte, and B. J. Hickmann. Decimal floating-point multiplication. IEEE Transactions on Computers, 58(7):902–916, 2009.

  25. Fujitsu. SPARC64™ VI Extensions. Fujitsu Limited, 1.3 edition, March 2007.

  26. D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991. An edited reprint is available at https://docs.oracle.com/cd/E19059-01/fortec6u2/806-7996/806-7996.pdf from Sun’s Numerical Computation Guide; it contains an addendum Differences Among IEEE 754 Implementations, also available at http://www.validlab.com/goldberg/addendum.html.

  27. J. Harrison. Formal verification of floating-point trigonometric functions. In 3rd International Conference on Formal Methods in Computer-Aided Design (FMCAD), number 1954 in Lecture Notes in Computer Science, pages 217–233, Austin, TX, USA, 2000.

  28. J. Harrison. Formal verification of IA-64 division algorithms. In 13th International Conference on Theorem Proving in Higher Order Logics (TPHOLs), volume 1869 of Lecture Notes in Computer Science, pages 233–251, 2000.

  29. IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, August 2008. Available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.

  30. Intel Corporation. Intrinsics Guide. Available at https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Accessed in June 2017.

  31. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, March 2017. Order Number: 325462-062US.

  32. International Organization for Standardization. ISO/IEC/IEEE Standard 60559:2011: Information technology – Microprocessor Systems – Floating-Point arithmetic. International Electrotechnical Commission, 1st edition, 2011.

  33. W. Kahan. Why do we need a floating-point standard? Technical report, Computer Science, UC Berkeley, 1981. Available at http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf.

  34. W. Kahan. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analysis, pages 165–211, 1987.

  35. W. Kahan. Lecture notes on the status of IEEE-754. Available at http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF, 1997.

  36. R. Karpinsky. PARANOIA: a floating-point benchmark. BYTE, 10(2), 1985.

  37. D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.

  38. D. Lutz. Fused multiply-add microarchitecture comprising separate early-normalizing multiply and add pipelines. In 20th IEEE Symposium on Computer Arithmetic (ARITH-20), pages 123–128, 2011.

  39. D. Lutz and N. Burgess. Overcoming double-rounding errors under IEEE 754-2008 using software. In 44th Asilomar Conference on Signals, Systems, and Computers, pages 1399–1401, November 2010.

  40. É. Martin-Dorel, G. Melquiond, and J.-M. Muller. Some issues related to double rounding. BIT Numerical Mathematics, 53(4):897–924, 2013.

  41. D. Monniaux. The pitfalls of verifying floating-point computations. ACM TOPLAS, 30(3):1–41, 2008. A preliminary version is available at http://hal.archives-ouvertes.fr/hal-00128124.

  42. J. S. Moore, T. Lynch, and M. Kaufmann. A mechanically checked proof of the correctness of the kernel of the AMD5K86 floating point division algorithm. IEEE Transactions on Computers, 47(9):913–926, 1998.

  43. R. E. Moore. Interval analysis. Prentice Hall, 1966.

  44. J.-M. Muller. Algorithmes de division pour microprocesseurs: illustration à l’aide du “bug” du pentium. Technique et Science Informatiques, 14(8), 1995.

  45. S. F. Oberman and M. J. Flynn. Division algorithms and implementations. IEEE Transactions on Computers, 46(8):833–854, 1997.

  46. S. F. Oberman and M. Y. Siu. A high-performance area-efficient multifunction interpolator. In 17th IEEE Symposium on Computer Arithmetic (ARITH-17), June 2005.

  47. M. Parks. Number-theoretic test generation for directed roundings. IEEE Transactions on Computers, 49(7):651–658, 2000.

  48. D. M. Russinoff. A mechanically checked proof of IEEE compliance of a register-transfer-level specification of the AMD-K7 floating-point multiplication, division, and square root instructions. LMS Journal of Computation and Mathematics, 1:148–200, 1998.

  49. D. M. Russinoff. A mechanically checked proof of correctness of the AMD K5 floating point square root microcode. Formal Methods in System Design, 14(1):75–125, 1999.

  50. D. M. Russinoff. A case study in formal verification of register-transfer logic with ACL2: The floating point adder of the AMD Athlon processor. Lecture Notes in Computer Science, 1954:3–36, 2000.

  51. E. M. Schwarz, J. S. Kapernick, and M. F. Cowlishaw. Decimal floating-point support on the IBM system Z10 processor. IBM Journal of Research and Development, 53(1):36–45, 2009.

  52. C. Severance. IEEE 754: An interview with William Kahan. Computer, 31(3):114–115, 1998.

  53. Sun Microsystems. Interval arithmetic in high performance technical computing. Technical report, 2002.

  54. N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. Technical report, NVIDIA, 2011. Available at http://developer.download.nvidia.com/devzone/devcenter/cuda/files/NVIDIA-CUDA-Floating-Point.pdf.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Muller, J.-M., et al. (2018). Floating-Point Formats and Environment. In: Handbook of Floating-Point Arithmetic. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-76526-6_3
