Abstract
The aim of this section is to provide the reader with some basic concepts of floating-point arithmetic, and to define notations that are used throughout the book.
Notes
- 1.
We should mention a few exceptions, such as some HP pocket calculators and the Intel 8087 coprocessor, that were precursors of the standard.
- 2.
Only partly, because some bit strings must be reserved for representing exceptional values, such as infinities and the results of forbidden operations (e.g., 0/0).
- 3.
In practice, that information is encoded in the exponent field; see Section 2.1.6.
- 4.
A major difference between computers and pocket calculators is that usually computers do much computation between input and output of data, so that the time needed to perform a radix conversion is negligible compared to the whole processing time. If pocket calculators used radix 2, they would perform radix conversions before and after almost every arithmetic operation. Another reason for using radix 10 in pocket calculators is the fact that many simple decimal numbers such as 0.1 are not exactly representable in radix 2.
- 5.
- 6.
At least in theory: one must make sure that the order of execution of the operations is not changed by the compiler, that there is no “double rounding” due to the possible use of a wider format in intermediate calculations, and that an FMA instruction is called only if one has decided to use it.
- 7.
Let us say, as does the IEEE-754 standard, that an operation underflows when the result is subnormal and inexact.
- 8.
Horner’s scheme consists in evaluating a degree-n polynomial \(a_nx^n+a_{n-1}x^{n-1}+\cdots {}+a_0\) as \((\cdots {}(((a_nx + a_{n-1})x + a_{n-2})x+a_{n-3})\cdots {})x+a_0\). This requires n multiplications and n additions if we use conventional operations, or n fused multiply–add operations. See Chapter 5 for more information.
- 9.
Beware: that property is not always true with rounding functions different from \({{\mathrm{RN}}}\). The error of a floating-point addition with one of these other rounding functions may sometimes fail to be exactly representable by a floating-point number of the same format.
- 10.
As a matter of fact, there can be two such numbers (if \(a+b\) is the exact middle of two consecutive floating-point numbers).
- 11.
Throughout the book, we call “target format” the floating-point format specified for the returned result, and “target precision” its precision.
- 12.
Coq can be downloaded at https://coq.inria.fr.
- 13.
This condition is stronger than the condition \(2a \ge r-1\) that is required to represent every number.
- 14.
The carry-save and borrow-save systems are roughly equivalent: everything that is computable using one of these systems is computable at approximately the same cost as with the other one.
Copyright information
© 2016 Springer Science+Business Media New York
About this chapter
Cite this chapter
Muller, JM. (2016). Introduction to Computer Arithmetic. In: Elementary Functions. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4899-7983-4_2
Print ISBN: 978-1-4899-7981-0
Online ISBN: 978-1-4899-7983-4
eBook Packages: Mathematics and Statistics (R0)