
STAT3006/STAT7305 Assignment 1: Probability and Optimization Theory


Due Date: 26th August 2022

Problem 1 [25 Marks]
Let (Ω,F,P) be a probability space, and let $(X_i(\omega))_{i\in[n]}$ be a sequence of independent and identically distributed (IID) random variables from (Ω,F) to (R,B(R)), each with the same image measure $P_X$ as the random variable $X(\omega)$.
(a) Assuming that $X$ has expectation $\mu_X = E(X)$ and finite variance
\[ \sigma_X^2 = E\left[ (X - \mu_X)^2 \right] < \infty, \]
prove that the sample mean,
\[ \bar{X}_n(\omega) = \frac{1}{n} \sum_{i=1}^{n} X_i(\omega), \]
is a consistent estimator of $\mu_X$ in the sense that $\bar{X}_n$ converges in probability to $\mu_X$, as $n \to \infty$. Such a result is usually referred to as a weak law of large numbers.
[10 Marks]
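Although the required argument is analytic, the behaviour described in (a) is easy to visualize by simulation. The following is a minimal R sketch, an illustration only, assuming standard normal data generated with rnorm (so that $\mu_X = 0$):

# Illustration of the weak law: the sample mean of IID N(0,1) draws
# concentrates around mu_X = 0 as n grows.
set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  cat("n =", n, " sample mean =", mean(rnorm(n)), "\n")
}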
Recall that the cumulative distribution function (CDF) of $X$ is defined as:
\[ F(x) = P(\omega : X(\omega) \le x). \]
Using the sequence $(X_i)_{i\in[n]}$, estimate the CDF $F(x)$ using the empirical CDF:
\[ \bar{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \chi_{(-\infty,x]}(X_i). \]
(b) For each fixed $x \in \mathbb{R}$, show that
\[ I_{\alpha,n}(x) = \left\{ y \in \mathbb{R} : \bar{F}_n(x) - \frac{1}{2\sqrt{n\alpha}} \le y \le \bar{F}_n(x) + \frac{1}{2\sqrt{n\alpha}} \right\} \]
is a $(1-\alpha) \times 100\%$ confidence interval for $F(x)$ in the sense that
\[ P\left(\omega : F(x) \in I_{\alpha,n}(x)\right) \ge 1 - \alpha. \]
[10 Marks]
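As an aside (not a substitute for the proof), the coverage claim can be sanity-checked by simulation in R. The sketch below assumes standard normal data, so that the true CDF value is pnorm(x):

# Illustrative check of the interval I_{alpha,n}(x) at a single point x,
# assuming X_i ~ N(0,1) so that the true CDF is pnorm.
set.seed(2)
n     <- 200
alpha <- 0.05
x     <- 0.5
covered <- replicate(5000, {
  xs    <- rnorm(n)
  F_hat <- mean(xs <= x)                 # empirical CDF at x
  half  <- 1 / (2 * sqrt(n * alpha))     # half-width from part (b)
  abs(pnorm(x) - F_hat) <= half          # does the interval cover F(x)?
})
mean(covered)   # should be at least 1 - alpha (typically much larger)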
Kolmogorov's strong law of large numbers improves upon the result from (a), stating that if the expectation of $X$ is finite (i.e., $|\mu_X| < \infty$), then $\bar{X}_n$ converges almost surely to $\mu_X$, as $n \to \infty$.
(c) Assuming that $P_X \ll \mathrm{Leb}$, use the strong law of large numbers to prove the Glivenko–Cantelli Theorem; i.e., show that
\[ \sup_{x \in \mathbb{R}} \left| \bar{F}_n(x) - F(x) \right| \]
converges to 0, almost surely, as $n \to \infty$.
[5 Marks]
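Again purely as an illustration (assuming standard normal data), the uniform discrepancy in (c) can be computed in R by comparing the empirical CDF to pnorm at the sorted sample points:

# Approximate sup_x |F_n(x) - F(x)| for N(0,1) data; the supremum over x
# is attained at (or just before) the sorted sample points.
set.seed(3)
for (n in c(100, 1000, 10000, 100000)) {
  xs <- sort(rnorm(n))
  i  <- seq_len(n)
  d  <- max(pmax(i / n - pnorm(xs), pnorm(xs) - (i - 1) / n))
  cat("n =", n, " sup|F_n - F| ~", d, "\n")
}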
This result, along with the laws of large numbers, is generally considered a fundamental law of statistics, in the sense that such results provide a mechanism for probability distributions to be realized in practice; i.e., repeated observation of IID random variables and averaging of the outcomes yields insight regarding the properties of the underlying probability measure P.
Problem 2
Let (Ω,F,P) be a probability space, and let $(X_i(\omega))_{i\in[n]}$ be a sequence of IID random variables from (Ω,F) to (R,B(R)), each with the same image measure $P_X$ as the random variable $X(\omega)$. Suppose that $P_X \ll \mathrm{Leb}$, and has Radon–Nikodym derivative (probability density function; PDF) $p_{\theta_0} : \mathbb{R} \to \mathbb{R}_{\ge 0}$, where $\theta_0 \in \mathbb{R}$, the so-called data generating parameter corresponding to $P_X$, is unknown to us. That is, we only know that $P_X$ has a PDF of the form $p_\theta$, for some value of $\theta \in \mathbb{R}$, but not which particular value of $\theta$ (i.e., $\theta_0$).
We wish to test the simple hypotheses that either the null hypothesis
\[ H_0 : \theta_0 = \theta_1^* \]
is true, or the alternative hypothesis
\[ H_1 : \theta_0 = \theta_2^* \]
is true, for a pair of predetermined values $\theta_1^*, \theta_2^* \in \mathbb{R}$, where $\theta_1^* \ne \theta_2^*$. To construct a test between the two hypotheses, we construct the so-called likelihood ratio statistic:
\[ L_n(\omega) = \frac{\prod_{i=1}^{n} p_{\theta_2^*}(X_i(\omega))}{\prod_{i=1}^{n} p_{\theta_1^*}(X_i(\omega))}. \]
(a) Under the null hypothesis (i.e., we assume that $\theta_0 = \theta_1^*$ is true, and subsequently $p_{\theta_0} = p_{\theta_1^*}$), show that
\[ E[L_n(\omega)] = 1. \]
You may use the fact that if $(Y_i(\omega))_{i\in[n]}$ is a sequence of independent random variables, then $(f_i \circ Y_i)_{i\in[n]}$ is also a sequence of independent random variables, where $f_i : \mathbb{R} \to \mathbb{R}$ is a function, for each $i \in [n]$.
[5 Marks]
We say that a random variable $E : \Omega \to \mathbb{R}_{\ge 0}$ is an e-value if
\[ E[E] \le 1, \]
and we say that a random variable $P(\omega)$ is a p-value if for any $\alpha \in [0,1]$:
\[ P(\omega : P(\omega) \le \alpha) \le \alpha. \]
(b) Prove that P = min{1,1/E} is a p-value.
[5 Marks]
We say that $R_n : (X_i)_{i\in[n]} \mapsto \{0,1\}$ is a rejection rule (or test) for the hypotheses $H_0$ and $H_1$, where we say that we reject $H_0$ if $R_n = 1$ and we say that we accept $H_0$ if $R_n = 0$. We further say that $R_n$ has size (or significance level) $\alpha \in (0,1)$ if
\[ P(\omega : R_n = 1) \le \alpha, \]
under the assumption that $H_0$ is true.
(c) Using the conclusions from (a) and (b), for any $\alpha \in (0,1)$, construct a rejection rule with size $\alpha$ for the test of $H_0 : \theta_0 = \theta_1^*$ versus $H_1 : \theta_0 = \theta_2^*$.
[5 Marks]
In the R programming language, we can generate a sequence $(Y_i)_{i\in[n]}$ of IID random variables, each with the same image probability measure $P_Y$ and PDF
\[ p_\theta(y) = \phi_\theta(y) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}(y - \theta)^2 \right\}, \]
using the command rnorm(n,θ). We can also compute $\phi_\theta(y)$ using the command dnorm(y,θ).
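For concreteness, here is a minimal sketch of how rnorm and dnorm can be combined to evaluate the likelihood ratio statistic $L_n$ (on the log scale for numerical stability); the sample size and the value of $\theta_2^*$ below are arbitrary choices for illustration only:

# Simulate data under H_0 (theta_0 = 0) and evaluate L_n for an
# illustrative alternative value theta_2^* = 1.
set.seed(4)
n      <- 50
theta2 <- 1
y      <- rnorm(n, 0)                    # data generated under H_0
log_Ln <- sum(dnorm(y, theta2, log = TRUE)) - sum(dnorm(y, 0, log = TRUE))
exp(log_Ln)                              # the likelihood ratio statistic L_n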
(d) Assuming that the image measure $P_X$ of $X$ has a PDF of the form $p_{\theta_0}(x) = \phi_{\theta_0}(x)$, use the commands above to program an R script that implements the rejection rule from (c) to test the hypotheses $H_0 : \theta_0 = 0$ versus $H_1 : \theta_0 = \theta_2^*$, and assess how often the test rejects $H_0$ as $\theta_2^*$ increases. We say that a test is powerful if $H_0$ is rejected with high probability when $H_1$ is true. Via computational experiments or otherwise, comment on the power of the test as $\theta_2^*$ increases away from 0.
[10 Marks]
Problem 3
(a) For non-negative numbers $u, v \in \mathbb{R}_{\ge 0}$ and positive coefficients $p, q \in \mathbb{R}_{>0}$, such that $1/p + 1/q = 1$, show that
\[ uv \le \frac{u^p}{p} + \frac{v^q}{q}. \]
You may use the fact that $x \mapsto \log(x)$ is concave on $\mathbb{R}_{>0}$.
[5 Marks]
Let X (ω) and Y (ω) be functions from measure space (Ω,F,M) to (R,B(R)).
(b) Use the result from (a) to prove Hölder's inequality. That is, show that if
\[ \int_\Omega |X|^p \, dM < \infty \quad \text{and} \quad \int_\Omega |Y|^q \, dM < \infty, \]
for $1/p + 1/q = 1$, then
\[ \int_\Omega |XY| \, dM \le \left( \int_\Omega |X|^p \, dM \right)^{1/p} \left( \int_\Omega |Y|^q \, dM \right)^{1/q}. \]
[10 Marks]
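Although the exercise asks for a measure-theoretic proof, the inequality itself is easy to sanity-check numerically, which may help in spotting algebra errors. The R sketch below uses an assumed discrete measure placing mass 1/m on m points, so that integrals reduce to averages:

# Numeric sanity check of Hölder's inequality for a discrete uniform measure.
set.seed(5)
m <- 1000
p <- 3; q <- p / (p - 1)                       # conjugate exponents
X <- rnorm(m); Y <- rnorm(m)
lhs <- mean(abs(X * Y))
rhs <- mean(abs(X)^p)^(1 / p) * mean(abs(Y)^q)^(1 / q)
cat("lhs =", lhs, " rhs =", rhs, " lhs <= rhs:", lhs <= rhs, "\n")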
(c) Use Hölder's inequality (or otherwise) to prove Minkowski's inequality. That is, show that if
\[ \int_\Omega |X|^p \, dM < \infty \quad \text{and} \quad \int_\Omega |Y|^p \, dM < \infty, \]
for $p \ge 1$, then
\[ \left( \int_\Omega |X + Y|^p \, dM \right)^{1/p} \le \left( \int_\Omega |X|^p \, dM \right)^{1/p} + \left( \int_\Omega |Y|^p \, dM \right)^{1/p}. \]
[5 Marks]
Let $(X_i)_{i\in[n]}$ be a sequence of functions from (Ω,F,M) to (R,B(R)).
(d) Consider a generalization of Hölder's inequality of the form:
\[ \int_\Omega \left| \prod_{i=1}^{n} X_i \right| dM \le \prod_{i=1}^{n} \left( \int_\Omega |X_i|^{q_i} \, dM \right)^{1/q_i}, \tag{1} \]
where $(q_i)_{i\in[n]}$ is a sequence of constants, such that $q_i \in \mathbb{R}$ for each $i \in [n]$. Propose conditions on the values that $(q_i)_{i\in[n]}$ can take so that (1) is true, and argue (formally or informally) why you believe that your suggested conditions are correct.
[5 Marks]
Problem 4
Let $f : T \to \mathbb{R}$ be a function on the domain $T \subset \mathbb{R}^d$, where $T$ is assumed to be convex. We say that $f$ is $\mu$-strongly convex on $T$ (with respect to the Euclidean norm $\|\cdot\|$) if there exists a $\mu > 0$ such that for any $\theta, \tau \in T$ and $\lambda \in [0,1]$, we have
\[ f(\lambda\theta + (1-\lambda)\tau) \le \lambda f(\theta) + (1-\lambda) f(\tau) - \mu\lambda(1-\lambda)\|\theta - \tau\|^2. \tag{2} \]
This more structured notion of convexity allows for proofs of convergence and rates for many modern optimization algorithms, and problems relating to strongly convex functions are mathematically "easier" to solve than others.
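As a concrete example of (2) (under the sign convention written above), the function $f(\theta) = \|\theta\|^2$ satisfies (2) with $\mu = 1$, in fact with equality; a hedged numeric check in R at randomly drawn points:

# Check that f(theta) = ||theta||^2 satisfies (2) with mu = 1 (with equality).
set.seed(6)
f      <- function(v) sum(v^2)
theta  <- rnorm(3); tau <- rnorm(3); lambda <- runif(1); mu <- 1
lhs <- f(lambda * theta + (1 - lambda) * tau)
rhs <- lambda * f(theta) + (1 - lambda) * f(tau) -
  mu * lambda * (1 - lambda) * sum((theta - tau)^2)
cat("lhs =", lhs, " rhs =", rhs, "\n")   # the two values coincide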
(a) Characterization (2) is one of many characterizations of $\mu$-strong convexity. Another useful characterization is that
\[ g(\theta) = f(\theta) - \mu\|\theta\|^2 \tag{3} \]
is convex on $T$. That is, $f$ is so convex that even after we remove a quadratic function from $f$, the remainder (i.e., $g$) is still a convex function. Prove that these two characterizations ((2) and (3)) of $\mu$-strong convexity are equivalent.
[5 Marks]
(b) Using either (2) or (3), prove that if f is µ-strongly convex on T then it is also strictly
convex on T, and propose a counterexample of a strictly convex function that is not
µ-strongly convex for any µ > 0.
[5 Marks]
We now assume that $f$ is differentiable on an open set $\bar{T} \supset T$, and thus is differentiable on $T$. Then, we can further characterize $\mu$-strong convexity on $T$ by the inequality
\[ f(\tau) \ge f(\theta) + (\tau - \theta)^\top \nabla f(\theta) + \frac{1}{2}\mu\|\tau - \theta\|^2. \tag{4} \]
(c) Using (4), show that if $f$ is $\mu$-strongly convex on $T$ and $\theta^* \in T$ is a critical point of $f$ in the sense that
\[ \nabla f(\theta^*) = 0, \]
then for all $\theta \in T$ we have
\[ f(\theta) \ge f(\theta^*) + \frac{1}{2}\mu\|\theta - \theta^*\|^2 \]
and
\[ f(\theta^*) \ge f(\theta) - \frac{1}{2\mu}\|\nabla f(\theta)\|^2. \]
[10 Marks]
For a differentiable function $f$, we say that it is $\beta$-smooth on $T$ if
\[ \|\nabla f(\theta) - \nabla f(\tau)\| \le \beta\|\theta - \tau\|, \]
for every $\theta, \tau \in T$.
(d) A typical algorithm for computing
\[ \theta^* = \operatorname*{argmin}_{\theta \in T} f(\theta) \]
is the gradient descent method, where we generate a sequence $(\theta_t)_{t\in\mathbb{N}}$ such that
\[ \theta_{t+1} = \theta_t - \eta\nabla f(\theta_t), \]
for some positive constant $\eta > 0$, starting from some initial value $\theta_1 \in T$. When $f$ is convex and $\beta$-smooth on $T$, it is provable that after $T \in \mathbb{N}$ steps of the gradient descent algorithm, we have the inequality:
\[ f(\theta_T) - f(\theta^*) \le \frac{2\beta\|\theta_1 - \theta^*\|^2}{T - 1}, \tag{5} \]
by taking $\eta = 1/\beta$. On the other hand, if $f$ is also $\mu$-strongly convex on $T$, then after $T + 1$ steps of gradient descent, we have
\[ f(\theta_T) - f(\theta^*) \le \exp\left( -\frac{4(T-1)}{\beta/\mu} \right) \frac{\beta}{2}\|\theta_1 - \theta^*\|^2, \tag{6} \]
when taking $\eta = 2/(\mu + \beta)$. Discuss the meanings of inequalities (5) and (6), and argue whether or not gradient descent performs better when $f$ is strongly convex.
[5 Marks]
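To complement the discussion, the following is a minimal R sketch of gradient descent on an assumed toy quadratic objective (not part of the assignment), which allows the two step-size rules from (5) and (6) to be compared directly:

# Gradient descent on the toy objective f(theta) = 0.5 * theta' A theta,
# with minimizer theta* = 0; for the step-size rules we simply take
# beta = 10 and mu = 1 (the extreme eigenvalues of A) under the usual conventions.
A      <- diag(c(10, 1))
f      <- function(th) 0.5 * sum(th * (A %*% th))
gradf  <- function(th) as.vector(A %*% th)
run_gd <- function(eta, steps = 100) {
  th <- c(5, 5)
  for (t in seq_len(steps)) th <- th - eta * gradf(th)
  f(th)                               # objective gap, since f(theta*) = 0
}
run_gd(eta = 1 / 10)        # eta = 1 / beta, as in (5)
run_gd(eta = 2 / (1 + 10))  # eta = 2 / (mu + beta), as in (6)

Running the sketch for different numbers of steps gives a feel for the 1/T rate in (5) versus the geometric rate in (6).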

bottom of page