Numerical Computing & Linear Algebra Essentials
Floating Point Systems & Numerical Error
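A Floating Point (FP) System with base β and t digits represents numbers as: x = ± (d0 + d1/β + d2/β^2 + ... + d_(t-1)/β^(t-1)) ⋅ β^e, where each digit satisfies 0 ≤ di ≤ β - 1. The Unit Roundoff (u) is defined as εmachine/2, where εmachine = β^(1-t) is the gap between 1 and the next larger representable number, so fl(1 + εmachine) > 1.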
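As a rough illustration, εmachine and u can be estimated for IEEE double precision (β = 2, t = 53) by repeated halving. This is a minimal sketch that assumes Python floats are IEEE doubles with round-to-nearest; the function name machine_epsilon is hypothetical.

```python
import sys

def machine_epsilon():
    """Return the gap between 1.0 and the next larger double, found by halving."""
    eps = 1.0
    # Keep halving while 1 + eps/2 is still distinguishable from 1.
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
u = eps / 2.0  # unit roundoff under round-to-nearest (assumed rounding mode)

print(eps == sys.float_info.epsilon)  # compare with the value Python reports
print(1.0 + eps > 1.0)                # 1 + eps_machine is representable and exceeds 1
print(1.0 + u == 1.0)                 # 1 + u is a tie and rounds back to 1
```

The loop stops at the first eps for which 1 + eps/2 rounds back to 1, which is why the result is the spacing of doubles just above 1 rather than u itself.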
Rounding to Nearest
When rounding to the nearest representable number, fl(x) = x(1 + ε) where |ε| ≤ u, with u the unit roundoff.
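The bound on the relative rounding error can be checked exactly with rational arithmetic. The sketch below assumes IEEE double precision, so u = 2^-53; the helper relative_rounding_error and the sample denominators are illustrative choices, not part of the notes.

```python
from fractions import Fraction
import sys

# Unit roundoff for doubles as an exact rational (assumes IEEE double precision).
u = Fraction(sys.float_info.epsilon) / 2

def relative_rounding_error(numerator, denominator):
    """Exact relative error |fl(x) - x| / |x| for x = numerator/denominator."""
    x = Fraction(numerator, denominator)      # exact rational value of x
    fl_x = Fraction(numerator / denominator)  # nearest double, converted exactly
    return abs(fl_x - x) / abs(x)

errors = [relative_rounding_error(1, d) for d in (3, 7, 10, 49)]
print(all(err <= u for err in errors))  # every sample respects |ε| ≤ u
```

Fraction converts a float to its exact rational value, so the comparison against u involves no further rounding.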
IEEE 754 Standard for Floating Point
Normalized Numbers
If the exponent (e) is not equal to 0, it's a normalized FP number. The value is x = (-1)^sign ⋅ β^(e - offset) ⋅ (1.d1 d2...d_(t-1)).
Denormalized Numbers
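If the exponent (e) is 0, the number is denormalized. The value is x = (-1)^sign ⋅ β^(e - offset + 1) ⋅ (0.d1 d2...d_(t-1)), which with e = 0 gives the fixed scale β^(1 - offset). The leading bit d0 (the hidden bit) is free, meaning it does not need to be stored, because it is always determined by the value of exponent e: it is 1 when e ≠ 0 and 0 when e = 0.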
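The normalized and denormalized formulas can be tried out by decoding the bits of a double by hand. This is a minimal sketch assuming the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, 52 stored fraction bits, offset 1023); decode_double is a hypothetical helper name.

```python
import struct

def decode_double(x):
    """Rebuild a finite double from its sign, exponent field, and fraction bits."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF           # stored 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)  # stored digits d1 ... d52
    offset = 1023                      # binary64 offset (bias)

    if e == 0x7FF:
        # All-ones exponent is reserved for exceptional values; not decoded here.
        raise ValueError("infinity or NaN")
    if e != 0:
        # Normalized: hidden leading bit is 1, scale is 2^(e - offset).
        significand = 1 + fraction / 2**52
        exponent = e - offset
    else:
        # Denormalized: hidden leading bit is 0, scale is 2^(e - offset + 1).
        significand = fraction / 2**52
        exponent = e - offset + 1
    return (-1) ** sign * significand * 2.0 ** exponent

print(decode_double(-0.15625) == -0.15625)  # normalized case
print(decode_double(5e-324) == 5e-324)      # smallest positive denormalized double
```

Every intermediate value in the reconstruction is exactly representable as a double, so the decoded result can be compared to the input with ==.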
Exceptional Values
- If