Software engineering keeps getting more abstract, but one thing is unchanging: the importance of floating-point arithmetic. Every computer programmer is bound to work with numbers (they call them computers for a reason), so it's genuinely useful to understand the way machines do math, no matter if your code is for a todo app, a stock exchange, or a fridge. How are numbers stored exactly? What's the significance of special values? And why is 0.1 + 0.2 not equal to 0.3? Let's explore this all!
The Standard
Let's start with one key assumption: in all the world, on every continent, there's one and only one way of doing floating-point arithmetic.
This makes things a lot easier, and it's true in practice. That standard's name?
IEEE 754.
How's this so simple? I mean, we all know how attempts at standardization usually turn out.
You'd be correct to think more than one format must have been invented. Plenty were – in the early days of computing, practically every system with floating-point capabilities had its own. Later on, brand-specific formats emerged: IBM went for hexadecimal floating-point in their mainframes, Microsoft created the Microsoft Binary Format for its BASIC products, and DEC cooked up yet something else for the VAX architecture.
This changed when Intel decided in the late '70s to design the floating-point chip to rule them all – which required a format to rule them all too, the best possible one. The effort culminated in the Intel 8087 coprocessor of 1980, but even before that, other companies in the space caught wind of this work and set up a common effort at the Institute of Electrical and Electronics Engineers (IEEE) to standardize floating-point arithmetic – the IEEE 754 working group. Two competing drafts prevailed: the Intel 8087 spec vs. the DEC VAX one. After some more arguments and error analysis, in 1981 Intel's draft won out, rapidly got adopted by everyone, and the rest is history (though it took the committee another four years of bickering to publish that draft, of course).
For a detailed look at historic floating-point formats, see this great article by John Savard.
The Specs
When you type let x = 0.5 (might be JavaScript, Rust, Swift, or perhaps something else), that x needs to be stored in a usable way. For a computer, usable equals binary – ones and zeros. Were we talking about an integer, say let x = 5, the solution would be simple – integers can be expressed in binary just as easily as in decimal, so a quick conversion of 5 to 101₂ and we're all set.
Things get trickier when we want to represent real numbers though. The difference there is arbitrary precision. Whole numbers are spaced uniformly apart on the number line, so between 0.5 and 2.5 there are exactly two integers: 1 and 2. Real numbers also include every number in between, so there's an infinite amount of points between 0.5 and 2.5. (If you come up with a number with an insane amount of fractional digits, another digit can always be tacked on to get a brand-new value.)
That's the mathematical theory in a nutshell. Sadly, computing capabilities are limited by the physical world, so infinite precision is out of the question. To be very pedestrian, machine word sizes are a major limitation – handling 32-bit-long values comes naturally to a 32-bit processor, but anything longer than that and things become slow. Nowadays, 64-bit architectures rule the world, and this is reflected in the way floating-point is used. Two floating-point formats are generally used:
- 32 bits, technically named binary32, but commonly single precision. Values of this size are called floats.
- 64 bits, technically named binary64, but commonly double precision. Values of this size are called doubles.
With this introduction out of the way, let's venture into the workings of those formats.
The Notation
As the name suggests, floating-point values don't have a fixed number of integer and fractional digits – instead, the radix point floats so that there's rather a certain number of significant digits. This allows representing a wide range of magnitudes usefully.
Much like how humans use scientific notation (an example: 6.022 * 10^23) to express real numbers of arbitrary magnitude in a standard way, computers under the hood store each floating-point value as three numbers cleverly put together:

(-1)^sign * significand * 2^exponent
- The sign is a single bit – 0 if the number is positive, 1 if negative.
- The significand (also called the mantissa) is a fixed-point number in the range [1, 2). It might be something like 1, 1.0625 or 1.984375, but it can't be 2. What it intuitively does is "fine-tune" the value within the range set by the sign and exponent.
  Because the significand's leading digit normally* is 1 (normal number being the technical term for such a case), only the fractional part of the significand is included in the binary representation – if the significand is, say, 1.001₂ (that's binary for 1.125), only the 001 part ends up being stored. This has the nice property of ensuring there's only one way to store a given number.
  The significand's size determines the type's precision – 24 significant bits in single precision and 53 in double.
- The exponent is a signed integer. In a way, it establishes the magnitude; for instance, an exponent of 8 means that the absolute value of the number must be in the range [256, 512) (that is, [2^8, 2^9)).
  Even though the exponent is signed, it's not stored using two's complement like regular signed integers. Instead, the stored value is an unsigned integer, and the computer subtracts a format-specific bias from it to obtain the true value. The raw, pre-subtraction value is called the biased exponent. The bias is specifically 127 in single precision and 1023 in double.
  The exponent's size determines the range of the type – 8 bits in single precision and 11 in double.
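To make the layout concrete, here's a small Python sketch (the `decompose` helper name is mine, not a standard function) that extracts the three fields from a double's raw bits using the struct module:

```python
import struct

def decompose(x: float) -> tuple[int, int, int]:
    """Split a double into its sign bit, biased exponent, and stored significand bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the raw 64-bit pattern
    sign = bits >> 63                       # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF  # 11 bits
    fraction = bits & ((1 << 52) - 1)       # 52 stored bits of the significand
    return sign, biased_exponent, fraction

# 0.5 = (-1)^0 * 1.0 * 2^-1, so the biased exponent is -1 + 1023 = 1022
print(decompose(0.5))   # (0, 1022, 0)
print(decompose(-2.0))  # (1, 1024, 0): -2 = (-1)^1 * 1.0 * 2^1
```

The stored fraction for 0.5 is all zeros because the significand is exactly 1.0 and only the part after the implicit leading 1 gets stored.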
See for yourself how this all comes together using the calculator below! Toggle bits by clicking on them and see what number comes out, or type in a number to see what it looks like in your computer's memory:
The Precision
Since value sizes are very much finite, their precision is too, and this means there's a minimum distance between two values. The thing is, this spacing differs depending on the magnitude (as defined by the exponent). In effect, for small values the distance is tiny in absolute terms, while for larger ones it's roughly proportionally bigger.
A value's immediate neighbors are essentially a ±1 in its last significant digit away, and the difference such an increment makes is our measure of minimum distance – called the unit in the last place, or ULP for short. You can easily see how ULP size differs depending on the exponent in the calculator above. Simply change the exponent value around and you'll notice the size of the ULP when you toggle the last bit of the significand!
ULP is the best way of defining floating-point precision as it's valid at any magnitude, but another related definition is machine epsilon. This one, as opposed to the magnitude-relative ULP, is a constant value – specifically, the ULP between 1 and the next largest number.
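In Python (3.9+), math.ulp exposes this directly, which makes the magnitude dependence easy to see – a quick illustration, separate from the calculator above:

```python
import math
import sys

print(math.ulp(1.0))     # 2.220446049250313e-16 - this IS machine epsilon
print(math.ulp(1.0) == sys.float_info.epsilon)  # True, by definition
print(math.ulp(1e16))    # 2.0 - above 2^53 even whole numbers start to blur
print(math.ulp(1e-300))  # minuscule in absolute terms, same in relative terms
```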
It's tempting to use machine epsilon for checking if the result of a routine is within some error bounds, especially because it's often exposed by standard libraries (e.g. in JS it's available as Number.EPSILON). Beware! Machine epsilon makes some sense for values around 1, but it's useless for values much larger than that, as the ULP is then bigger too, and for smaller values it might be surprisingly large in ULP terms. Moreover, in reality, error tolerance rarely depends on the floating-point format of all things. In summary, you should almost always just use a custom epsilon value tuned for your specific application.
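A sketch of that advice in Python: a comparison with an application-specific relative tolerance (the 1e-9 below is an arbitrary example, not a recommendation), alongside the stdlib's math.isclose which implements the same idea:

```python
import math

def roughly_equal(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    # Scale the tolerance by the operands' magnitude instead of using a fixed epsilon
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

print(roughly_equal(0.1 + 0.2, 0.3))               # True
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9))  # True - stdlib equivalent
print(roughly_equal(1.0, 1.1))                     # False - genuinely different
```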
The Zero(s)
We've omitted something though: how to represent 0? Mathematically, the only way to do that is to set the significand to 0… but that implicit leading 1 is standing in the way.
Here's the trick: when the significand is 0, setting the biased exponent to 0 makes the significand's leading digit also 0. Voilà, 0 as a result! That's a useful number to have.
Hmm, what if we set the sign to 1 at the same time? That signifies a negative value, but it's obviously ridiculous for zero to be neg– WHAT?! According to all sources (what sources now) we do actually get -0 this way. It's not even as absurd as it seems at first glance: for practically all intents and purposes -0 == +0, and this only doesn't hold when there's an important reason – we'll go into it in The Enormous.
See how the signed zero is stored in binary by trying -0 in the calculator.
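A couple of Python one-liners show both faces of the signed zero – equal in comparisons, yet distinguishable when you look at the sign bit:

```python
import math

print(0.0 == -0.0)               # True - equal for almost all purposes
print(str(-0.0))                 # -0.0 - yet the sign survives printing
print(math.copysign(1.0, -0.0))  # -1.0 - copysign reads the sign bit directly
```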
The Undefined
Some mathematical expressions simply cannot be evaluated. Take 0 / 0 for instance: the result of this is said to be undefined in mathematics. This means there is no answer at all (not to be confused with JavaScript's undefined value, which means a variable hasn't been initialized) or, in other words, the answer certainly is not a number.
Programs used to crash immediately whenever they encountered such expressions, and in fact they still do when this happens with integers – integer formats don't have a way of safely representing an undefined value, so there's no recourse. Floating-point has a solution though: the NaN value, short for "not a number".
See how NaN is stored in binary by trying NaN in the calculator.
Specifically, you get a NaN whenever:
- An indeterminate form is encountered:
  - 0 / 0
  - infinity / infinity
  - 0 * infinity
  - infinity - infinity
  - x % 0
  - infinity % x
- The result of an operation would have to be a complex number (floating-point only represents reals):
  - sqrt(x) for x below 0
  - log(x) for x below 0
  - asin(x) and acos(x) for x below -1 or above 1
Additionally, NaNs propagate, so a NaN anywhere in an expression makes it almost guaranteed that the final result will be NaN too.
Be careful with comparisons, as NaNs uniquely have an unordered relation to all values, instead of being smaller, greater, or equal. That means <, >, <=, >=, and == comparisons involving a NaN always come out false – no matter if the special value is on the left or right side, or even if it's on both (so a NaN is never ever equal to another NaN). On the other hand, != always comes out true. An important implication of the above is that >= cannot be implemented as just a negation of < for floating-point!
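These comparison rules are easy to verify in Python; note how the only trustworthy NaN test is a dedicated function, never ==:

```python
import math

nan = float("nan")
print(nan == nan)             # False - unordered even against itself
print(nan < 1.0, nan >= 1.0)  # False False - so >= is not just "not <"
print(nan != nan)             # True
print(math.isnan(nan))        # True - the reliable way to detect a NaN
```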
The Microscopic
When the result of a calculation is so tiny that it can't be represented by a normal number, we see underflow occur. It's an edge case, but an important one. How it gets handled has significant implications for some operations.
Historically, underflow was handled by returning zero – a solution called flush-to-zero. It's very straightforward, but not quite optimal for the accuracy of calculations taking place near zero. The issue is, the absolute difference between neighboring floating-point values (i.e. the ULP) gets smaller as the values themselves do – but with the significand's leading digit always being 1, the jump between 0 and the smallest representable value is MUCH larger than the jump between that smallest value and the second-smallest one. You can see this in figure 2, which shows what this'd look like for double precision. Note that the 2^{-1023} marker is dashed because it couldn't be reached – we'd go straight to 0 instead.
During development of IEEE 754 it turned out there's a verifiably better way. Remember how 0 is stored in binary? It relies on the significand's leading digit being 0 when the biased exponent is 0. We can extend this to nonzero values of the significand – this way, there's no odd gap when going from 0 up. The gap has merely moved away from 0 though – to get rid of it, we make the unbiased exponent the same for a biased exponent of 0 as for a biased exponent of 1. As you can see in figure 3, this way we trade away some precision at the bottom range of our number line… but we no longer return 0 when the real result is (in relative terms) quite far away from 0. We call this solution gradual underflow, and those extremely small values that have 0 as the significand's leading digit – subnormal numbers (or denormalized).
See how subnormal numbers are stored in binary by trying 8e-323 in the calculator.
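Here's how gradual underflow looks from Python, where sys.float_info.min is the smallest normal double and 5e-324 parses to the smallest subnormal:

```python
import math
import sys

print(sys.float_info.min)     # 2.2250738585072014e-308 - smallest NORMAL double (2**-1022)
tiny = 5e-324                 # the smallest subnormal, 2**-1074
print(tiny > 0.0)             # True - gradual underflow kept it from flushing to zero
print(tiny / 2.0)             # 0.0 - halving it finally underflows all the way
print(math.ulp(0.0) == tiny)  # True
```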
The Enormous
On the other hand, when the result of a calculation is so large that it can't be represented – a situation called overflow – another special value is returned: infinity. It can be positive or negative, depending on the direction of overflow.
Infinity is obviously almost never the correct result at all, but that's a feature, not a bug. Returning the maximum representable number would be much more unsafe – you'd end up with a seemingly normal value that would actually be off by 1, or perhaps by orders of magnitude, with no way to tell. Infinity makes it evident an overflow happened, and the rules of calculations involving it are well-defined: basically, any such operation that doesn't produce a NaN (listed in The Undefined) results in infinity (with sign rules applying, i.e. Infinity * (-3) == -Infinity).
See how infinity is stored in binary by trying Infinity in the calculator.
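A Python sketch of overflow behaving as described – the largest finite double times two doesn't wrap or saturate, it becomes infinity, and arithmetic on infinity stays well-defined:

```python
import math
import sys

print(sys.float_info.max)         # 1.7976931348623157e+308
print(sys.float_info.max * 2.0)   # inf - overflow, not a silently wrong finite value
print(-sys.float_info.max * 2.0)  # -inf - overflow in the other direction
print(math.inf - 1e308)           # inf - subtracting a finite value changes nothing
```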
Interestingly, one extra case where you get infinity is division by 0. Ordinarily in such a case the answer is considered to be undefined, but IEEE 754 stipulates that the sign ought to be preserved, which is why NaN is only returned if the numerator is 0 too (0 divided by 0 – when there is no way to interpret the expression at all). Otherwise, the interpretation that's used is that the limit of the expression is taken, approaching from either positive or negative numbers – here's where the sign of the zero uniquely plays a role. For instance: 123 / 0 == Infinity, while 123 / -0 == -Infinity.
The MoreorLess
In floating-point, what you see is usually not what you get. As outlined in The Precision, bits don't quite grow on trees, so only a limited subset of points on the number line can be stored, and those points are all in base 2. These two facts combined are a source of constant friction between humans and computers, kept at bay thanks to an array of tricks.
Whether it's a user providing data or you hardcoding values, the starting point for many real numbers is a string of characters representing the decimal value. We can, for instance, parse "0.2" as a double. Print that back and you get 0.2, as expected. That's not exactly what's stored though. If we calculate the value a bit more accurately based on the binary data, using The Notation, we get 0.2000000000000000111022…. That's evidently off! But there just isn't such a number as 0.2 in binary.
As an illustration of the problem behind this, let's take 1/3. It doesn't have an exact decimal counterpart, so we humans resort to limited approximations, such as 0.3333. The reason for this predicament is that for a rational number x to be representable in base b, x's denominator cannot have any prime factor that isn't also a prime factor of b. In the case of 1/3, the denominator's only prime factor is 3, while the prime factors of our target base, 10, are 2 and 5. That's a mismatch. In the same vein, binary means base 2, so it's only completely compatible with denominators that are a power of 2. 0.2's rational form is 1/5, and it is that 5 which precludes finite representation of the value in binary!
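Python can display the exact value a double holds, which makes the 1/5 problem tangible; Decimal(float) expands the stored binary value in full, and Fraction shows the power-of-two denominator 0.2 was forced into:

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) expands the exact binary value the literal was rounded to
print(Decimal(0.2))
# 0.200000000000000011102230246251565404236316680908203125

# Fraction exposes the power-of-two denominator it had to settle for
print(Fraction(0.2))  # 3602879701896397/18014398509481984 - denominator is 2**54
```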
See how 0.2 is stored in binary by trying 0.2 in the calculator.
The display shows the number's standard decimal representation, but the significand is extraordinarily precise (specifically, it's shown with 5 extra digits of decimal precision thanks to being parsed with decimal.js instead of as a double), so that you can see how far the floating-point value is from the original by pasting the decomposed form into a much more precise calculator. Try this out in Wolfram Alpha, which is what I've done above.
There are some assurances to keep the errors in check. IEEE 754 requires that parsing a base-10 string representation of a number results in the closest binary representation possible. The same guarantee applies to the results of elementary arithmetic operations: addition, subtraction, multiplication, division, and square root.
Define "closest" though. Oh, actually the standard includes that too. It describes five rounding modes:
- roundTowardPositive – takes the ceiling (i.e. rounds towards positive infinity),
- roundTowardNegative – takes the floor (i.e. rounds towards negative infinity),
- roundTowardZero – truncates (i.e. rounds towards zero),
- roundTiesToAway – chooses the nearest value, breaking ties by rounding away from zero,
- roundTiesToEven – chooses the nearest value, breaking ties by rounding to the value ending in an even digit.
See table 1 for a demonstration of each mode, on decimal values being rounded to integers.
| Original | -12.3 | 19.6 | 3.5 | 4.5 | -2.5 |
| --- | --- | --- | --- | --- | --- |
| roundTowardPositive | -12 | 20 | 4 | 5 | -2 |
| roundTowardNegative | -13 | 19 | 3 | 4 | -3 |
| roundTowardZero | -12 | 19 | 3 | 4 | -2 |
| roundTiesToAway | -12 | 20 | 4 | 5 | -3 |
| roundTiesToEven | -12 | 20 | 4 | 4 | -2 |
The one you're using, even if you don't know it yet, is roundTiesToEven. It's the default, because:
- It takes the nearest value in the common case, which is almost always what you'd expect – this way the error cannot be greater than ±0.5 ULP; but also…
- When the result is smack-dab in the middle between two floating-point values, it rounds up in 50% of cases and down in the other 50%, making the bias zero on average.
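Python's built-in round() happens to follow roundTiesToEven for floats, so the tie-breaking behavior from table 1 can be reproduced directly:

```python
# round() rounds halves to the even neighbor, exactly like roundTiesToEven
print(round(3.5))   # 4
print(round(4.5))   # 4 - the tie goes to the even value, not always up
print(round(-2.5))  # -2
print(round(19.6))  # 20 - non-ties simply go to the nearest integer
```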
The Unexpected
Unfortunately, even tolerable errors add up. This can result in some odd outcomes, like in the (in)famous case of… 0.1 + 0.2.
This equals 0.3, right? Not in double precision, no, it doesn't. The actual result: 0.30000000000000004.
Due to friction between bases 2 and 10 (explained in The More-or-Less), what you see as 0.1 in double precision is more precisely (by a few digits) 0.1000000000000000055511…, and that 0.2 is rather 0.2000000000000000111022…. Now, when you add those in binary, approx. 0.3000000000000000444089… ensues. Close enough? Not so fast. 0.3 is stored as approx. 0.2999999999999999888978…, while the next immediate value (i.e. exactly 1 ULP bigger) is approx. 0.3000000000000000444089…. Our result is clearly much closer to the latter! And that's how 0.30000000000000004 is deemed the correct result.
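The whole story compresses into a few lines of Python – and the final error turns out to be exactly one ULP at this magnitude:

```python
import math

a = 0.1 + 0.2
print(a == 0.3)                  # False
print(a)                         # 0.30000000000000004
print(a - 0.3 == math.ulp(0.3))  # True - the result sits exactly 1 ULP above 0.3
```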
This category of errors, where parsing and arithmetic add up to a result a human would not expect, is why it's crucial NOT to use binary floating-point formats where this would be straight-up unacceptable. Prime example: finance. At the scale of institutions managing billions, those errors would be inevitable. The solution: decimal floating-point formats.
Two such formats are in fact part of IEEE 754: decimal64 and decimal128. As you can infer from those technical names, they have a base of 10 instead of 2, and are respectively 64 and 128 bits long. Unfortunately, their adoption isn't nearly as universal as that of the standardized binary formats. That's largely because hardware acceleration of decimal floating-point arithmetic is extremely niche, so the math is practically always implemented in software – resulting in less pressure to use a standard and also much poorer performance in comparison with binary arithmetic. Nevertheless, when correct handling of decimal values is of utmost importance, there's no better way than decimal floating-point.
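Software decimal arithmetic is readily available, though – Python's decimal module is one example of base-10 arithmetic in the IEEE mold. Constructed from strings, the values stay exact:

```python
from decimal import Decimal

# String-constructed Decimals are stored in base 10 exactly - no binary rounding
total = Decimal("0.10") + Decimal("0.20")
print(total)                    # 0.30
print(total == Decimal("0.3"))  # True - what finance code expects
```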
The ThereandBackAgain
Suppose you've got a float but need to convert it to a double. To perform this transmutation, just take the existing exponent and significand values and pad them with zeros. Simple enough! Beware of false precision though: we've found in The More-or-Less that a parsed value is off by a bit (but less than 0.5 ULP) immediately, unless the denominator of the original value was a power of 2. So, say, "5.9" parsed as a float will be printed back as 5.9, as expected. Cast that to a double though and what you see is… 5.9000000953674316. Where did the processor get all this extra information from? Truth is, that's what a lack of information looks like. The extra digits are simply the initial error, made glaring because of the ULP being orders of magnitude smaller in the more precise format.
The other way around – from a higher-precision format to a lower-precision one – the situation is straightforward. Information is unambiguously lost as the least significant digits of the significand are cut. Just one thing to watch out for here is the exponent falling outside the target format's range – that's an example of overflow, so infinity is the result.
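Python floats are doubles, but the struct module can round-trip a value through single precision, which reproduces the 5.9 example from above:

```python
import struct

# Store 5.9 as a single, then read the surviving bits back as a double
single = struct.unpack("<f", struct.pack("<f", 5.9))[0]
print(single == 5.9)             # False - the widening cast exposed the parse error
print(single)                    # roughly 5.9000001 rather than 5.9
print(abs(single - 5.9) < 1e-7)  # True - negligible at float precision...
print(abs(single - 5.9) > 1e-8)  # ...but huge compared to a double's ULP
```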
The End
You've reached the finish line. Congratulations. I hope you feel enlightened, or at least marginally smarter than before opening this page. Now go and build great things using floating-point!
If you're somehow yearning for more, there are a few topics I deemed less relevant to the day-to-day of most programmers, which nonetheless might be useful or interesting. Venture out at your own discretion with this non-exhaustive list:
- How hardware uses guard bits and the sticky bit to minimize calculation errors
- Tricks to squeeze the last bits of performance out of floating-point operations
- Minimizing error in implementations of arithmetic operations (e.g. the Kahan summation algorithm)
- Quiet vs. signaling NaNs
- Not just NaNs – how status flags can be used to detect exceptions
- Managing exceptions manually with trap handlers
And finally, I could not end this without crediting David Goldberg and his classic What Every Computer Scientist Should Know About Floating-Point Arithmetic paper. It's been my direct inspiration for writing this post and spreading the knowledge.