Floating-Point Formats and Environment

  • Jean-Michel Muller
  • Nicolas Brunie
  • Florent de Dinechin
  • Claude-Pierre Jeannerod
  • Mioara Joldes
  • Vincent Lefèvre
  • Guillaume Melquiond
  • Nathalie Revol
  • Serge Torres


Our main focus in this chapter is the IEEE 754-2008 Standard for Floating-Point Arithmetic [267] , a revision and merge of the earlier IEEE 754-1985 [12] and IEEE 854-1987 [13] standards. A paper written in 1981 by Kahan, Why Do We Need a Floating-Point Standard? [315], depicts the rather messy situation of floating-point arithmetic before the 1980s. Anybody who takes the view that the current standard is too constraining and that circuit and system manufacturers could build much more efficient machines without it should read that paper and think about it. Even if there were at that time a few reasonably good environments, the various systems available then were so different that writing portable yet reasonably efficient numerical software was extremely difficult. For instance, as pointed out in [553], sometimes a programmer had to insert multiplications by 1. 0 to make a program work reliably.


  1. [2]
    Advanced Micro Devices. 128-bit SSE5 instruction set, 2007. Available at
  2. [6]
    B. Akbarpour, A. T. Abdel-Hamid, S. Tahar, and J. Harrison. Verifying a synthesized implementation of IEEE-754 floating-point exponential function using HOL. The Computer Journal, 53(4):465, 2010.CrossRefGoogle Scholar
  3. [12]
    American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Standard 754–1985, 1985.Google Scholar
  4. [13]
    American National Standards Institute and Institute of Electrical and Electronic Engineers. IEEE Standard for Radix Independent Floating-Point Arithmetic. ANSI/IEEE Standard 854–1987, 1987.Google Scholar
  5. [40]
    A. Blot, J.-M. Muller, and L. Théry. Formal correctness of comparison algorithms between binary64 and decimal64 floating-point numbers. In 10th International Workshop on Numerical Software Verification (NSV), pages 25–37, 2017.CrossRefGoogle Scholar
  6. [51]
    S. Boldo, J.-H. Jourdan, X. Leroy, and G. Melquiond. Verified compilation of floating-point computations. Journal of Automated Reasoning, 54(2):135–163, 2015.MathSciNetCrossRefGoogle Scholar
  7. [62]
    R. T. Boute. The Euclidean definition of the functions div and mod. ACM Transactions on Programming Languages and Systems, 14(2):127–144, 1992.CrossRefGoogle Scholar
  8. [73]
    N. Brisebarre, C. Lauter, M. Mezzarobba, and J.-M. Muller. Comparison between binary and decimal floating-point numbers. IEEE Transactions on Computers, 65(7):2032–2044, 2016.MathSciNetCrossRefGoogle Scholar
  9. [83]
    N. Brunie, F. de Dinechin, and B. Dupont de Dinechin. A mixed-precision fused multiply and add. In 45th Asilomar Conference on Signals, Systems, and Computers, November 2011.Google Scholar
  10. [84]
    N. Brunie, F. de Dinechin, and B. Dupont de Dinechin. Mixed-precision merged multiplication and addition operator. Patent WO/2012/175828, December 2012.Google Scholar
  11. [95]
    T. C. Chen and I. T. Ho. Storage-efficient representation of decimal data. Communications of the ACM, 18(1):49–52, 1975.MathSciNetCrossRefGoogle Scholar
  12. [106]
    W. J. Cody. MACHAR: a subroutine to dynamically determine machine parameters. ACM Transactions on Mathematical Software, 14(4):301–311, 1988.CrossRefGoogle Scholar
  13. [107]
    W. J. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice-Hall, Englewood Cliffs, NJ, 1980.zbMATHGoogle Scholar
  14. [108]
    T. Coe and P. T. P. Tang. It takes six ones to reach a flaw. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 140–146, July 1995.Google Scholar
  15. [109]
    S. Collange, M. Daumas, and D. Defour. État de l’intégration de la virgule flottante dans les processeurs graphiques. Technique et Science Informatiques, 27(6):719–733, 2008. In French.CrossRefGoogle Scholar
  16. [116]
    M. Cornea, C. Anderson, J. Harrison, P. T. P. Tang, E. Schneider, and C. Tsen. A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format. In 18th IEEE Symposium on Computer Arithmetic (ARITH-18), pages 29–37, June 2007.Google Scholar
  17. [117]
    M. Cornea, J. Harrison, C. Anderson, P. T. P. Tang, E. Schneider, and E. Gvozdev. A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format. IEEE Transactions on Computers, 58(2):148–162, 2009.MathSciNetCrossRefGoogle Scholar
  18. [118]
    M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium ®; -based Systems. Intel Press, Hillsboro, OR, 2002.Google Scholar
  19. [121]
    M. F. Cowlishaw. Decimal floating-point: algorism for computers. In 16th IEEE Symposium on Computer Arithmetic (ARITH-16), pages 104–111, June 2003.Google Scholar
  20. [183]
    A. Edelman. The mathematics of the Pentium division bug. SIAM Review, 39(1):54–67, 1997.MathSciNetCrossRefGoogle Scholar
  21. [184]
    L. Eisen, J. W. Ward, H. W. Tast, N. Mäding, J. Leenstra, S. M. Mueller, C. Jacobi, J. Preiss, E. M. Schwarz, and S. R. Carlough. IBM POWER6 accelerators: VMX and DFU. IBM Journal of Research and Development, 51(6):1–21, 2007.CrossRefGoogle Scholar
  22. [186]
    M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Boston, MA, 1994.zbMATHGoogle Scholar
  23. [187]
    M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, CA, 2004.Google Scholar
  24. [192]
    M. A. Erle, M. J. Schulte, and B. J. Hickmann. Decimal floating-point multiplication. IEEE Transactions on Computers, 58(7):902–916, 2009.MathSciNetCrossRefGoogle Scholar
  25. [208]
    Fujitsu. SPARC64 TM VI Extensions. Fujitsu Limited, 1.3 edition, March 2007.Google Scholar
  26. [214]
    D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991. An edited reprint is available at from Sun’s Numerical Computation Guide; it contains an addendum Differences Among IEEE 754 Implementations, also available at Scholar
  27. [241]
    J. Harrison. Formal verification of floating-point trigonometric functions. In 3rd International Conference on Formal Methods in Computer-Aided Design (FMCAD), number 1954 in Lecture Notes in Computer Science, pages 217–233, Austin, TX, USA, 2000.Google Scholar
  28. [242]
    J. Harrison. Formal verification of IA-64 division algorithms. In 13th International Conference on Theorem Proving in Higher Order Logics (TPHOLs), volume 1869 of Lecture Notes in Computer Science, pages 233–251, 2000.Google Scholar
  29. [267]
    IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, August 2008. Available at
  30. [271]
    Intel Corporation. Intrinsics Guide. Available at Accessed in June 2017.
  31. [272]
    Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, March 2017. Order Number: 325462-062US.Google Scholar
  32. [277]
    International Organization for Standardization. ISO/IEC/IEEE Standard 60559:2011: Information technology – Microprocessor Systems – Floating-Point arithmetic. International Electrotechnical Commission, 1st edition, 2011.Google Scholar
  33. [315]
    W. Kahan. Why do we need a floating-point standard? Technical report, Computer Science, UC Berkeley, 1981. Available at
  34. [317]
    W. Kahan. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analysis, pages 165–211, 1987.Google Scholar
  35. [318]
    W. Kahan. Lecture notes on the status of IEEE-754. Available at, 1997.
  36. [329]
    R. Karpinsky. PARANOIA: a floating-point benchmark. BYTE, 10(2), 1985.Google Scholar
  37. [342]
    D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.Google Scholar
  38. [399]
    D. Lutz. Fused multiply-add microarchitecture comprising separate early-normalizing multiply and add pipelines. In 20th IEEE Symposium on Computer Arithmetic (ARITH-20), pages 123–128, 2011.Google Scholar
  39. [400]
    D. Lutz and N. Burgess. Overcoming double-rounding errors under IEEE 754-2008 using software. In 44th Asilomar Conference on Signals, Systems, and Computers, pages 1399–1401, November 2010.Google Scholar
  40. [409]
    É. Martin-Dorel, G. Melquiond, and J.-M. Muller. Some issues related to double rounding. BIT Numerical Mathematics, 53(4):897–924, 2013.MathSciNetCrossRefGoogle Scholar
  41. [423]
    D. Monniaux. The pitfalls of verifying floating-point computations. ACM TOPLAS, 30(3):1–41, 2008. A preliminary version is available at
  42. [426]
    J. S. Moore, T. Lynch, and M. Kaufmann. A mechanically checked proof of the correctness of the kernel of the AMD5K86 floating point division algorithm. IEEE Transactions on Computers, 47(9):913–926, 1998.MathSciNetCrossRefGoogle Scholar
  43. [428]
    R. E. Moore. Interval analysis. Prentice Hall, 1966.Google Scholar
  44. [437]
    J.-M. Muller. Algorithmes de division pour microprocesseurs: illustration à l’aide du “bug” du pentium. Technique et Science Informatiques, 14(8), 1995.Google Scholar
  45. [469]
    S. F. Oberman and M. J. Flynn. Division algorithms and implementations. IEEE Transactions on Computers, 46(8):833–854, 1997.MathSciNetCrossRefGoogle Scholar
  46. [470]
    S. F. Oberman and M. Y. Siu. A high-performance area-efficient multifunction interpolator. In 17th IEEE Symposium on Computer Arithmetic (ARITH-17), June 2005.Google Scholar
  47. [484]
    M. Parks. Number-theoretic test generation for directed roundings. IEEE Transactions on Computers, 49(7):651–658, 2000.MathSciNetCrossRefGoogle Scholar
  48. [534]
    D. M. Russinoff. A mechanically checked proof of IEEE compliance of a register-transfer-level specification of the AMD-K7 floating-point multiplication, division, and square root instructions. LMS Journal of Computation and Mathematics, 1:148–200, 1998.MathSciNetCrossRefGoogle Scholar
  49. [535]
    D. M. Russinoff. A mechanically checked proof of correctness of the AMD K5 floating point square root microcode. Formal Methods in System Design, 14(1):75–125, 1999.CrossRefGoogle Scholar
  50. [536]
    D. M. Russinoff. A case study in formal verification of register-transfer logic with ACL2: The floating point adder of the AMD Athlon processor. Lecture Notes in Computer Science, 1954:3–36, 2000.Google Scholar
  51. [548]
    E. M. Schwarz, J. S. Kapernick, and M. F. Cowlishaw. Decimal floating-point support on the IBM system Z10 processor. IBM Journal of Research and Development, 53(1):36–45, 2009.CrossRefGoogle Scholar
  52. [553]
    C. Severance. IEEE 754: An interview with William Kahan. Computer, 31(3):114–115, 1998.CrossRefGoogle Scholar
  53. [579]
    Sun Microsystems. Interval arithmetic in high performance technical computing. Technical report, 2002.Google Scholar
  54. [631]
    N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. Technical report, NVIDIA, 2011. Available at

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jean-Michel Muller
    • 1
  • Nicolas Brunie
    • 2
  • Florent de Dinechin
    • 3
  • Claude-Pierre Jeannerod
    • 4
  • Mioara Joldes
    • 5
  • Vincent Lefèvre
    • 4
  • Guillaume Melquiond
    • 6
  • Nathalie Revol
    • 4
  • Serge Torres
    • 7
  1. 1.CNRS - LIPLyonFrance
  2. 2.KalrayGrenobleFrance
  3. 3.INSA-Lyon - CITIVilleurbanneFrance
  4. 4.Inria - LIPLyonFrance
  5. 5.CNRS - LAASToulouseFrance
  6. 6.Inria - LRIOrsayFrance
  7. 7.ENS-Lyon - LIPLyonFrance

Personalised recommendations