IEEE standard (IEEE 754)
Single precision
F(2,24,−126,127), each number is stored in 32 bits (4 bytes)
Double precision
F(2,53,−1022,1023), each number is stored in 64 bits (8 bytes)
Biased exponent
추가적으로 exponent area에 들어가는 값은 biased
된 값이다. (즉 exponent are에 들어가는 값은 unsigned integer임에 유의해야 한다.)
Representation of numbers in IEEE standard
Floating-point Arithmetic
- Addition or subtraction : Shifting mantissa to make exponents match may cause loss of some digits of smaller number
- Multiplication : Product of two p-digit mantissa contains up to 2p digits, so result may not be representable
- Division : Quotient of two p-digit mantissa may contain more than p digits, such as nonterminating binary expansion of 1/10
Example
Guard Digit
Cancellation error
Cancellation error (losing significant digits) occurs in the subtraction of two close numbers
Example 1
Example 2 : Approximation of π in double precision
이렇게 되는 이유는 위의 빨간색이 빼기가 있기 때문이다.
k가 커짐에 따라서 2−kpk−1이 0으로 수렴하고 결과적으로 매우 가까운 2개의 수를 빼는 것으로 수렴한다. 두 식은 수학적으로는 동일하지만, 컴퓨터로 처리하는 과정에서 큰 차이를 발생시킨다.
Conditioning and Stability
The input data of a model (problem) is always subjected to some perturbations (like as roundoff errors)
Therefore, the fundamental question then is “How much the solution of the perturbed problem is close to the solution of the original problem?”
- well-conditioned : sensitivity is low (즉 input에 noise가 들어가도 output이 크게 변동되지 않는다는 의미이다.)
- ill-conditioned : sensitivity is high (input에 noise가 들어가면 output이 크게 변동된다는 의미이다)
If we replace the term “problem” with “algorithm” then we say that the algorithm is either stable
or unstable
Example : Hilbert matrix
Condition number of a matrix
Consider the system Ax=b and the perturbed system A(x+δx)=b+δb. We get δx=A−1(δb) (Assuming perturbation only in b for simplicity)
Take the norm from both sides
From original system Ax=b we have ∥b∥≤∥A∥∥x∥, or
Multiplying both sides of recent equations we get
At the worst case the error in the output is amplified by ∥A∥⋅∥A−1∥. So the condition number
of A is defined as