Problem 1 [5 points] Consider solving the scalar equation ax=b, for given a and b and assume that you have computed x^. To measure the quality of x^, we can compute the residual r=b−ax^. Derive the error in fl(r), that is the relative error in the floating point representation of r. Can it be large? Explain.
Answer:
Given r=b−ax^,
Let fl(a) is the floating point representation of a
Let fl(b) be the floating point representation of b
Let fl(x^) be the floating point representation of x^
Assuming relative error of fl(x^) is δx^⇒fl(x^)=x^true(1+δx^)
Therefore: a∗x^=a∗x^true(1+δx^)
Assuming relative error of fl(ax^) is δm⇒fl(ax^)=a∗x^true(1+δx^)(1+δm)
Computed residual r=b−a∗x^true(1+δx^)
Assuming relative error of fl(b−ax^) is δs⇒fl(b−ax^)=b−a∗x^true(1+δx^)(1+δm)(1+δs)
Thus, the error in fl(r) is δr=(1+δx^)(1+δm)(1+δs)−1
The error can be large if:
the relative error of x^ is large
significant rounding error in multiplication and subtraction (otherwise δm and δs is large)
value of a and b such that b−ax^ introduces “catastrophic cancellation”, or b≈ax^
Problem 2 [2 points] Explain the output of the following code
Is the result accurate?
Answer:
The following includes steps for the above MATLAB code:
clear all clears all variables in current workspace
x = 10/9 initialise the first value of x to 910
The for loop runs for 20 times, where it updates x using the following formula x:=10∗(x−1)
Finally, x prints out the value of x into the MATLAB terminal window.
The output of the code is not correct, due to floating point errors. Machine epsilon ϵmach by default in MATLAB (which is in double precision) is approx. 2.2204e−16
Since x is a floating point, every iteration in the for loop will include a floating point error, and thus after 20 iterations, the results won’t be accurate to its mathematical value.
Problem 3 [3 points] Suppose you approximate ex by its truncated Taylor series. For given x=0.1, derive the smallest number of terms of the series needed to achieve accuracy of 10−8 . Write a Matlab program to check that your approximation is accurate up to 10−8. Name your program check_exp.m.
Answer:
Taylor series of real or complex f at c is defined by f(x)=∑k=0infk!f(k)(c)(x−c)k
Given f has n+1 continuous derivative [a,b], or f∈Cn+1[a,b] , then the truncated Taylor series can be defined as
f(x)=∑k=0infk!f(k)(c)(x−c)k+En+1 where En+1=(n+1)!fn+1(ξ(c,x))(x−c)n+1=(n+1)!fn+1(ξ)(x−c)n+1
Hence, with x:=x+h we have f(x+h)=∑kinfk!f(k)(x)(h)k+En+1 where En+1=(n+1)!fn+1(ξ)hn+1 and ξ is between x and x+h
Thus, we need to find n terms such that ∣En+1=(n+1)!ex(ξ)xn+1∣≤10−8 with ξ between 0 and x
From the above function, with n=6 the Taylor Series will be accurate up to 10−8
The below is the Matlab to examine the above terms:
Problem 4 [3 points] The sine function has the Taylor series expansion sin(x)=x−3!x3+5!x5−7!x7+⋅⋅⋅+ Suppose we approximate sin(x) by x−3!x3+5!x5. What are the absolute and relative errors in this approximation for x=0.1,0.5,1.0? Write a Matlab program to produce these errors; name it sin_approx.m.
Answer:
Assuming y=sin(x) as exact value and y~ is the approximate value of sin(x), which is y~=x−3!x3+5!x5
Absolute error is given by ∣y−y~∣
Relative error is given by y∣y−y~∣
For the following x∈0.1,0.5,1.0, the following table represents the error:
Error
x=0.1
x=0.5
x=1.0
Absolute
1.983852e-11
1.544729e-06
1.956819e-04
Relative
1.987162e-10
3.222042e-06
2.325474e-04
Problem 5 [2 points] How many terms are needed in the series arccot(x)=2π−x+3x3−5x5+7x7+⋅⋅⋅ to compute arccot(x) for ∣x∣≤0.5 accurate to 12 decimal places.
Answer:
To calculate arccot(x) for ∣x∣≤0.5 accurate to 12 decimal places, we need to find n such that ∣En+1∣<10−12
Substitute for error term, needs to find n such that ∣(n+1)!fn+1(ξ)hn+1∣<10−12
We know that the general term for Taylor series of arccot(x) is an=2n+1(−1)nx2n+1
Since we are considering on interval ∣x∣≤0.5, and arccot(x) is an alternating series, the largest possible value of the error term will occur when x=0.5
Thus, the equation to solve for n term is ∣(2n+1)∗(n+1)!(−1)n+1∗x2n+1∣<10−12⇌(2n+1)∗(n+1)!x2n+1<10−12
Using the following function find_nth_term, we can find that when n=17 will ensure the arccot(x) for ∣x∣≤0.5 to be accurate to 12 decimal places.
Problem 6 [2 points] Consider the expression 1024+x. Derive for what values of x this expression evaluates to 1024.
Answer:
In IEEE 754 double precision, ϵmach=2−52≈2.2∗10−16
From the definition of machine epsilon (1024+ϵmach>1024), the difference between N and the next representable numbers is proportional to N, that is N∗ϵmach
Thus the problem implies there is such x that exists within a range such that x<21∗ϵmach∗N
Substitute value for N=1024 and ϵmach≈2.2∗10−16
⇒x<21∗2.2∗10−16∗1024≈1.1368448×10−13
∀x⪅1.1368448×10−13→(1024+x)evaluates1024
Problem 7 [2 points] Give an example in base-10 floating-point arithmetic when
a. (a+b)+c=a+(b+c)
b. (a∗b)∗c=a∗(b∗c)
Answer:
For the first example (a+b)+c=a+(b+c), assuming using double precision:
Let:
a=1.0
b=1.0∗10−16
c=−1.0
⇒(a+b)+c=0, whereas a+(b+c)=1.11022∗10−16
The explanation from Problem 6 can be used to explain that (a+b)=a since b<1.1368448×10−13, therefore (a+b)+c=0, whereas in a+(b+c)≈1.0−0.999999999≈1.11022∗10−16 due to round up for floating point arithmetic.
For the second example (a∗b)∗c=a∗(b∗c), assuming the following FP system (10,3,L,U) where x=±d0.d1d2∗10e,d0=0,e∈[L,U]
Let:
a=1.23
b=4.56
c=7.89
⇒(a∗b)∗c=44.3 (a∗b=5.61 rounded and 5.61∗c=44.3), whereas a∗(b∗c)=44.2 (b∗c=35.9 rounded and 35.9∗a=44.2)
Problem 8 [8 points] Consider a binary floating-point (FP) system with normalised FP numbers and 8 binary digits after the binary point:
x=±1.d1d2⋅⋅⋅d8×2e
For this problem, assume that we do not have a restriction on the exponent e. Name this system B8.
(a) [2 points] What is the value (in decimal) of the unit roundoff in B8?
(b) (1 point) What is the next binary number after 1.10011001?
(c) [5 points] The binary representation of the decimal 0.1 is infinite: 0.00011001100110011001100110011⋅⋅⋅. Assume it is rounded to the nearest FP number in B8. What is this number (in binary)?
Answer:
B8 system can also be defined as FP(2,8,L,U)
(a). For a binary FP system with p binary digits after binary point, the unit roundoff u is given by u=2−p
With t=8, unit roundoff for this system in decimal is u=2−8=0.00390625
(b). Given u=2−8=0.00000001 in binary, the next binary number can be calculated as:
1.10011001
+
0.00000001
=
1.10011010
(c).
first 9 digits after the binary point to determine how to round: 0.000110011
Given the unit roundoff is 2−8 and 9th digit is 1 (or 2−9) → round up
Therefore, 0.1 rounded to nearest FP system in B8 is 0.00011010 in binary
Problem 9 [10 points] For a scalar function f consider the derivative approximations
f′(x)≈g1(x,h):=2hf(x+2h)−f(x)
and
f′(x)≈g2(x,h):=2hf(x+h)−f(x−h) .
a. [4 points] Let f(x)=esin(x) and x0=4π.
Write a Matlab program that computes the errors ∣f′(x0)−g1(x0,h)∣ and ∣f′(x0)−g2(x0,h)∣ for each h=10−k,k=1,1.5,2,2.5...,16.
Using loglog, plot on the same plot these errors versus h. Name your program derivative_approx.m. For each of these approximations:
b. [4 points] Derive the value of h for which the error is the smallest.
c. [2 points] What is the smallest error and for what value of h is achieved? How does this value compare to the theoretically “optimum” value?
Answer:
(a).
(b).
The Taylor’s series expansion of function f(x) around point a is:
For the second approximation g2(x,h): the error term is −61h2f′′′(x)
(c).
For g1, the smallest error is at h = 1.000000e-08
For g2, the smallest error is at h = 3.162278e-06
Problem 10 [7 points] In the Patriot disaster example, the decimal value 0.1 was converted to a single precision number with chopping.
Suppose that it is converted to a double precision number with chopping.
(a). [5 points] What is the error in this double precision representation of 0.1.
(b). [2 points] What is the error in the computed time after 100 hours?
Answer:
(a).
Given the binary representation of 0.1 in double precision:
Sign: 0
Exponent: 0111111101101111111011, which is 1019 in decimal ⇒ effective exponent is 1029−1023=−4
Significand: 10011001100110011001100110011001100110011001100110101001100110011001100110011001100110011001100110011010
the binary digits will be chopped off at 52 bit. Therefore, ϵmach=2−52 and thus roundoff error=21ϵmach=2−53≈1.11×10−16
(b).
After 100 hours: 100×60×60×10×1.11×10−16≈3.996×10−10sec