Models, statistical inference, and learning

MACS 33001 University of Chicago

\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}}\]

Statistical inference

Process of using data to infer the probability distribution/random variable that generated the data
Given a sample \(X_1, \ldots, X_n \sim F\), how do we infer \(F\)?
All parameters? Some parameters? One parameter?

Parametric vs. nonparametric models

Statistical model \(\xi\)
Parametric model
\[\xi \equiv \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0 \right\}\]
General form
\[\xi \equiv \left\{ f(x; \theta) : \theta \in \Theta \right\}\]
Nonparametric model
- A set \(\xi\) that cannot be parameterized by a finite number of parameters

Parametric estimation

One dimension
Two dimensions

Nonparametric density estimation

Let \(X_1, \ldots, X_n\) be independent observations from a cumulative distribution function (CDF) \(F\)
Let \(f = F'\) be the probability density function (PDF)
How do we estimate \(f\) when we only know that \(F \in \xi_{\text{ALL}} = \{\text{all CDFs} \}\)?
Assume some smoothness on \(f\)
- \(f \in \xi = \xi_{\text{DENS}} \cap \xi_{\text{SOB}}\)
- \(\xi_{\text{DENS}}\) is the set of all PDFs and
  
  \[\xi_{\text{SOB}} \equiv \left\{ f: \int (f''(x))^2 dx < \infty \right\}\]

Nonparametric density estimation

General form

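\[g(x) = \frac{1}{nh} \sum_{i = 1}^n f \left( \frac{x - X_i}{h} \right)\]

Here \(f\) denotes the kernel function and \(h > 0\) is the bandwidth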

Infant mortality

Gaussian kernel

\[f(x) = \frac{1}{\sqrt{2 \pi}}\exp\left[-\frac{1}{2} x^2 \right]\]

Rectangular (uniform) kernel

\[f(x) = \frac{1}{2} \mathbf{1}_{\{ |x| \leq 1 \} }\]
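
A kernel density estimate averages a kernel centered at each observation and scaled by a bandwidth \(h\). The following is a minimal sketch of that idea using the Gaussian and rectangular kernels; the function names, the toy sample, and the bandwidth \(h = 1\) are assumptions made for illustration and do not come from the lecture.

```python
import math

def gaussian_kernel(u):
    """Standard Normal density, used as the Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def rectangular_kernel(u):
    """Uniform kernel: 1/2 on [-1, 1] and 0 elsewhere."""
    return 0.5 if abs(u) <= 1.0 else 0.0

def kde(x, data, h, kernel=gaussian_kernel):
    """Density estimate at x: (1 / (n h)) * sum_i kernel((x - X_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical toy sample, invented for illustration
sample = [1.0, 1.5, 2.0, 4.0, 4.5]
for x in (1.5, 3.0, 4.5):
    print(x, kde(x, sample, h=1.0), kde(x, sample, h=1.0, kernel=rectangular_kernel))
```

The rectangular kernel only counts observations within \(h\) of \(x\), so its estimate jumps as points enter or leave the window, while the Gaussian kernel gives a smooth curve.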

Comparison of kernels

Regression

Suppose we observe pairs of data \((X_1, Y_1), \ldots, (X_n, Y_n)\)
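\(X\): predictor (also called regressor, feature, or independent variable)
\(Y\): outcome (also called response or dependent variable)
\(r(x) = \E(Y | X = x)\): regression function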
Parametric regression
Nonparametric regression

Naive non-parametric regression

\[\mu = E(\text{Income}|\text{Education}) = f(\text{Education})\]

Naive non-parametric regression

\[\mu = E(Y|x) = f(x)\]

Binning
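
Binning can be sketched as follows: split the range of \(X\) into intervals and estimate the conditional mean of \(Y\) by the sample mean within each interval. This is a minimal sketch; the function name, the bin width, and the (education, income) pairs are hypothetical and not taken from the lecture data.

```python
from collections import defaultdict

def binned_means(xs, ys, bin_width):
    """Estimate E(Y | X in bin) by the sample mean of Y within each bin of X."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        # Bin k covers the interval [k * bin_width, (k + 1) * bin_width)
        groups[int(x // bin_width)].append(y)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}

# Hypothetical (education, income) pairs, invented for illustration
education = [8, 9, 12, 12, 13, 16, 16, 17]
income = [20, 24, 30, 34, 36, 50, 54, 58]
print(binned_means(education, income, bin_width=4))
```

Wider bins average over more observations (lower variance) but blur differences within a bin (higher bias).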

Naive non-parametric regression

\[X_1 \in \{1, 2, \dots ,10 \}\] \[X_2 \in \{1, 2, \dots ,10 \}\] \[X_3 \in \{1, 2, \dots ,10 \}\]

\(10^3 = 1000\) possible combinations of the explanatory variables and \(1000\) conditional expectations of \(Y\) given \(X\):

\[\mu = E(Y|x_1, x_2, x_3) = f(x_1, x_2, x_3)\]
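
To see why this becomes a problem, the sketch below enumerates the cells and counts how many of them contain at least one observation for a given sample size. Drawing the predictors uniformly at random is an assumption made purely for illustration, as are the function name and the sample sizes.

```python
import itertools
import random

LEVELS = range(1, 11)  # each of X_1, X_2, X_3 takes the values 1, ..., 10

# Every combination of the three predictors is one cell with its own conditional mean
cells = list(itertools.product(LEVELS, repeat=3))
print(len(cells))  # 1000

def occupied_cells(n, seed=0):
    """Number of distinct cells containing at least one of n random observations."""
    rng = random.Random(seed)
    seen = {tuple(rng.randint(1, 10) for _ in range(3)) for _ in range(n)}
    return len(seen)

for n in (100, 1000, 10000):
    print(n, occupied_cells(n))
```

Cells with no observations have no sample mean at all, and cells with only a handful of observations give very noisy estimates.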

Point estimates

Single “best guess” of some quantity of interest
- Parameter in a parametric model
- CDF \(F\)
- PDF \(f\)
- Regression function \(r\)
Point estimate of \(\theta\) is \(\hat{\theta}\) or \(\hat{\theta}_n\)
- \(\theta\) is a fixed, unknown quantity
- \(\hat{\theta}\) is a random variable
Let \(X_1, \ldots, X_n\) be \(n\) IID data points from some distribution \(F\). A point estimator \(\hat{\theta}_n\) of a parameter \(\theta\) is some function of \(X_1, \ldots, X_n\):

\[\hat{\theta}_n = g(X_1, \ldots, X_n)\]

Properties of point estimates

Bias \[\text{bias}(\hat{\theta}_n) = \E_\theta (\hat{\theta}_n) - \theta\]
Unbiasedness
- Is this necessary?
Consistency

Properties of point estimates

Sampling distribution
Standard error
\[\se = \se(\hat{\theta}_n) = \sqrt{\Var (\hat{\theta}_n)}\]
- Depends on the unknown \(F\)
- Usually estimated from the data (\(\widehat{\se}\))
Mean squared error

\[ \begin{align} \text{MSE} &= \E_\theta (\hat{\theta}_n - \theta)^2 \\ &= \text{bias}^2(\hat{\theta}_n) + \Var_\theta (\hat{\theta}_n) \end{align} \]
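
The decomposition follows from adding and subtracting the mean of the estimator. Writing \(\bar{\theta}_n = \E_\theta (\hat{\theta}_n)\) (notation introduced here only for this derivation):

\[ \begin{align} \E_\theta (\hat{\theta}_n - \theta)^2 &= \E_\theta (\hat{\theta}_n - \bar{\theta}_n + \bar{\theta}_n - \theta)^2 \\ &= \E_\theta (\hat{\theta}_n - \bar{\theta}_n)^2 + 2 (\bar{\theta}_n - \theta) \E_\theta (\hat{\theta}_n - \bar{\theta}_n) + (\bar{\theta}_n - \theta)^2 \\ &= \Var_\theta (\hat{\theta}_n) + \text{bias}^2(\hat{\theta}_n) \end{align} \]

The cross term vanishes because \(\E_\theta (\hat{\theta}_n - \bar{\theta}_n) = 0\).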

Properties of point estimates

Many estimators are approximately Normally distributed in large samples

\[\frac{\hat{\theta}_n - \theta}{\se} \leadsto N(0,1)\]

Example: Bernoulli distributed random variable

Let \(X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\)
Let \(\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i\). Then

\[\E(\hat{\pi}_n) = \frac{1}{n} \sum_{i=1}^n \E(X_i) = \pi\]
so \(\hat{\pi}_n\) is unbiased
Standard error is

\[\se = \sqrt{\Var (\hat{\pi}_n)} = \sqrt{\frac{\pi (1 - \pi)}{n}}\]

\[\widehat{\se} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\]
Bias \[ \begin{align} \text{bias}(\hat{\pi}_n) &= \E_\pi (\hat{\pi}_n) - \pi \\ &= \pi - \pi \\ &= 0 \end{align} \]
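Consistency: \(\text{bias}(\hat{\pi}_n) = 0\) and \[\se = \sqrt{\frac{\pi (1 - \pi)}{n}} \rightarrow 0 \quad \text{as } n \rightarrow \infty\] so \(\hat{\pi}_n\) is a consistent estimator of \(\pi\)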
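
These properties can be checked by simulation. The sketch below repeatedly draws Bernoulli samples, computes \(\hat{\pi}_n\) each time, and compares the mean and standard deviation of the estimates with \(\pi\) and \(\sqrt{\pi (1 - \pi) / n}\). The values \(\pi = 0.3\), \(n = 100\), the number of replications, and the function name are assumptions chosen for illustration.

```python
import math
import random

def simulate_pi_hat(pi, n, reps, seed=0):
    """Draw `reps` samples of n Bernoulli(pi) trials and return each sample mean."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n)) / n for _ in range(reps)]

pi, n = 0.3, 100
estimates = simulate_pi_hat(pi, n, reps=5000)

# Mean of the estimates should be close to pi (unbiasedness)
mean_est = sum(estimates) / len(estimates)
# Standard deviation of the estimates approximates the standard error
sd_est = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (len(estimates) - 1))
theoretical_se = math.sqrt(pi * (1 - pi) / n)

print(mean_est, sd_est, theoretical_se)
```

Increasing `n` shrinks both the simulated and the theoretical standard error, which is the consistency argument in numerical form.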

Confidence sets

\(1 - \alpha\) confidence interval for \(\theta\) is an interval \(C_n = (a,b)\)
- \(a = a(X_1, \ldots, X_n)\)
- \(b = b(X_1, \ldots, X_n)\)
  \[\Pr_{\theta} (\theta \in C_n) \geq 1 - \alpha, \quad \forall \theta \in \Theta\]
\((a,b)\) traps \(\theta\) with probability at least \(1- \alpha\)

Caution interpreting confidence intervals

\(C_n\) is random and \(\theta\) is fixed
95% confidence intervals, which correspond to \(\alpha = 0.05\), are the most common choice
A confidence interval is not a probability statement about \(\theta\)

Proper interpretation

On day 1, you collect data and construct a 95% confidence interval for a parameter \(\theta_1\). On day 2, you collect new data and construct a 95% confidence interval for a parameter \(\theta_2\). You continue this way constructing confidence intervals for a sequence of unrelated parameters \(\theta_1, \theta_2, \ldots\). Then 95% of your intervals will trap the true parameter value.
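
This interpretation can be illustrated with a simulation in which every "day" has its own unrelated parameter. As a simplifying assumption (not part of the lecture), each day's data are \(n = 25\) draws from \(N(\theta_k, 1)\) with known variance, and the interval is the sample mean \(\pm 1.96 / \sqrt{n}\); the function name and the range of \(\theta_k\) are likewise hypothetical.

```python
import math
import random

def coverage_over_days(days=2000, n=25, seed=0):
    """Fraction of days on which the 95% interval traps that day's parameter."""
    rng = random.Random(seed)
    z = 1.96  # approximate 0.975 quantile of N(0, 1)
    half_width = z / math.sqrt(n)  # sigma = 1 is assumed known
    hits = 0
    for _ in range(days):
        theta = rng.uniform(-5.0, 5.0)  # a new, unrelated parameter each day
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if xbar - half_width < theta < xbar + half_width:
            hits += 1
    return hits / days

print(coverage_over_days())
```

The printed fraction should be close to 0.95, even though no single interval carries a 95% probability statement about its own fixed \(\theta_k\).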

Constructing confidence intervals

Approximately Normal - use the Normal distribution
Suppose that \(\hat{\theta}_n \approx N(\theta, \widehat{\se}^2)\)
Let \(\Phi\) be the CDF of a standard Normal distribution

\[z_{\frac{\alpha}{2}} = \Phi^{-1} \left(1 - \frac{\alpha}{2} \right)\]

\[\Pr (Z > z_{\frac{\alpha}{2}}) = \frac{\alpha}{2}\]

\[\Pr (-z_{\frac{\alpha}{2}} \leq Z \leq z_{\frac{\alpha}{2}}) = 1 - \alpha\]
where \(Z \sim N(0,1)\)
Let
\[C_n = (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se}, \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se})\]
Then

\[ \begin{align} \Pr_\theta (\theta \in C_n) &= \Pr_\theta (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se} < \theta < \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se}) \\ &= \Pr_\theta (- z_{\frac{\alpha}{2}} < \frac{\hat{\theta}_n - \theta}{\widehat{\se}} < z_{\frac{\alpha}{2}}) \\ &\rightarrow \Pr ( - z_{\frac{\alpha}{2}} < Z < z_{\frac{\alpha}{2}}) \\ &= 1 - \alpha \end{align} \]
For 95% confidence intervals, \(\alpha = 0.05\) and \(z_{\frac{\alpha}{2}} = 1.96 \approx 2\)
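
The Normal-based interval can be computed directly from an estimate and its estimated standard error. This is a minimal sketch using the standard library's `statistics.NormalDist` for \(\Phi^{-1}\); the function name and the example inputs are assumptions for illustration.

```python
from statistics import NormalDist

def normal_ci(theta_hat, se_hat, alpha=0.05):
    """Return (lower, upper) = theta_hat -/+ z_{alpha/2} * se_hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2} = Phi^{-1}(1 - alpha/2)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

# Hypothetical estimate and standard error
print(normal_ci(theta_hat=0.42, se_hat=0.05))
print(normal_ci(theta_hat=0.42, se_hat=0.05, alpha=0.10))
```

A smaller \(\alpha\) gives a larger \(z_{\frac{\alpha}{2}}\) and therefore a wider interval.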

Actual vs. approximate confidence intervals

Let \(X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\) and let \(\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i\)
Let \(C_n = (\hat{\pi}_n - \epsilon_n, \hat{\pi}_n + \epsilon_n)\) where \(\epsilon_n^2 = \frac{\log(\frac{2}{\alpha})}{2n}\)
By Hoeffding's inequality, for every \(\pi\),
\[\Pr (\pi \in C_n) \geq 1 - \alpha\]
\(C_n\) is a precise \(1 - \alpha\) confidence interval
Approximate confidence interval

\[ \begin{align} \Var (\hat{\pi}_n) &= \frac{1}{n^2} \sum_{i=1}^n \Var(X_i) \\ &= \frac{1}{n^2} \sum_{i=1}^n \pi(1 - \pi) \\ &= \frac{1}{n^2} n\pi(1 - \pi) \\ &= \frac{\pi(1 - \pi)}{n} \\ \se &= \sqrt{\frac{\pi(1 - \pi)}{n}} \\ \widehat{\se} &= \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} \end{align} \]

\[\hat{\pi}_n \pm z_{\frac{\alpha}{2}} \widehat{\se} = \hat{\pi}_n \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}\]
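
The two intervals can be compared numerically. The sketch below computes the interval with half-width \(\epsilon_n\) and the Normal-based interval for the same \(\hat{\pi}_n\) and \(n\); the function names and the example values (\(\hat{\pi}_n = 0.5\), \(n = 100\)) are assumptions for illustration.

```python
import math
from statistics import NormalDist

def hoeffding_ci(pi_hat, n, alpha=0.05):
    """Interval pi_hat -/+ eps_n with eps_n^2 = log(2 / alpha) / (2 n)."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return pi_hat - eps, pi_hat + eps

def normal_ci_bernoulli(pi_hat, n, alpha=0.05):
    """Approximate interval pi_hat -/+ z_{alpha/2} * sqrt(pi_hat (1 - pi_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se_hat = math.sqrt(pi_hat * (1 - pi_hat) / n)
    return pi_hat - z * se_hat, pi_hat + z * se_hat

# Hypothetical sample: 50 successes in 100 trials
print(hoeffding_ci(0.5, 100))
print(normal_ci_bernoulli(0.5, 100))
```

In this example the first interval is wider: it guarantees coverage for every \(n\), while the Normal-based interval is shorter but only approximately valid in large samples.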

Hypothesis testing

Default theory
Null vs. alternative hypothesis
Sufficient evidence to reject the null hypothesis?
Let
\[X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\]
Is it a fair coin?
- \(H_0: \pi = 0.5\)
- \(H_1: \pi \neq 0.5\)
- Reasonable to reject \(H_0\) if
  
  \[T = | \hat{\pi}_n - 0.5|\] is large
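
One way to judge whether \(T\) is "large" is to ask how often a fair coin would produce a value of \(T\) at least as big as the one observed. The sketch below computes that probability exactly from the Binomial distribution under \(H_0\). Using this tail probability as the yardstick, along with the function name and the example of 60 heads in 100 flips, is an assumption made for illustration rather than a procedure stated in the lecture.

```python
import math

def prob_T_at_least(t_obs, n, pi0=0.5):
    """P(|pi_hat - pi0| >= t_obs) when X_1, ..., X_n ~ Bernoulli(pi0)."""
    total = 0.0
    for k in range(n + 1):
        # Small tolerance guards against floating-point error in k / n
        if abs(k / n - pi0) >= t_obs - 1e-12:
            total += math.comb(n, k) * pi0 ** k * (1 - pi0) ** (n - k)
    return total

# Hypothetical data: 60 heads in 100 flips
n, heads = 100, 60
T = abs(heads / n - 0.5)
print(T, prob_T_at_least(T, n))
```

A small probability means the observed \(T\) would be unusual for a fair coin, which is the sense in which the data provide evidence against \(H_0\).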