Models, statistical inference, and learning

MACS 33001 University of Chicago

\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\se}{\text{se}}\]

Statistical inference

  • Process of using data to infer the probability distribution/random variable that generated the data
  • Given a sample \(X_1, \ldots, X_n \sim F\), how do we infer \(F\)?
  • All parameters? Some parameters? One parameter?

Parametric vs. nonparametric models

  • A statistical model \(\xi\) is a set of distributions (or densities or regression functions)
  • Parametric model: \(\xi\) can be parameterized by a finite number of parameters, e.g. the Normal model

    \[\xi \equiv \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right], \quad \mu \in \Re, \sigma > 0 \right\}\]
  • General form

    \[\xi \equiv \{ f(x; \theta) : \theta \in \Theta \}\]
  • Nonparametric model
    • \(\xi\) is an infinite-dimensional set that cannot be parameterized by a finite number of parameters

Parametric estimation

  • One dimension: estimate a single parameter (e.g. \(\mu\), treating \(\sigma\) as known)
  • Two dimensions: estimate both parameters (e.g. \(\mu\) and \(\sigma\)), as in the sketch below
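
A minimal sketch of parametric estimation for the Normal model above on simulated data; the true \(\mu\) and \(\sigma\) and the sample size are arbitrary choices for illustration, and the estimators shown (sample mean and sample standard deviation) are one natural option:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulated sample from the parametric Normal model (true values are arbitrary)
mu_true, sigma_true = 3.0, 1.5
x = rng.normal(loc=mu_true, scale=sigma_true, size=1000)

# One dimension: estimate mu only (sigma treated as known)
mu_hat = x.mean()

# Two dimensions: estimate both mu and sigma
sigma_hat = x.std(ddof=1)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```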

Nonparametric density estimation

  • Let \(X_1, \ldots, X_n\) be independent observations from a cumulative distribution function (CDF) \(F\)
  • Let \(f = F'\) be the probability density function (PDF)
  • How do we estimate \(f\) when \(F \in \xi_{\text{ALL}}\), where \(\xi_{\text{ALL}} = \{\text{all CDFs} \}\)?
  • Assume some smoothness on \(f\)
    • \(f \in \xi = \xi_{\text{DENS}} \cap \xi_{\text{SOB}}\)
    • \(\xi_{\text{DENS}}\) is the set of all PDFs and

      \[\xi_{\text{SOB}} \equiv \left\{ f: \int (f''(x))^2 dx < \infty \right\}\]

Nonparametric density estimation

  • General form (kernel density estimation)

    \[\hat{f}_n(x) = \frac{1}{nh} \sum_{i = 1}^n K \left( \frac{x - X_i}{h} \right)\]

    where \(K\) is a kernel function and \(h > 0\) is the bandwidth
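
A minimal sketch of this estimator, assuming a Gaussian kernel \(K\) and a hand-picked bandwidth \(h\) (both are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate f_hat(x) = (1/nh) * sum_i K((x - X_i) / h)."""
    x = np.asarray(x)[:, None]          # evaluation points, shape (m, 1)
    u = (x - np.asarray(data)) / h      # pairwise scaled distances, shape (m, n)
    return kernel(u).sum(axis=1) / (len(data) * h)

# Illustrative use on simulated data
rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.5))
```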

Infant mortality

Gaussian kernel

\[K(x) = \frac{1}{\sqrt{2 \pi}}\exp\left[-\frac{1}{2} x^2 \right]\]

Rectangular (uniform) kernel

\[K(x) = \frac{1}{2} \mathbf{1}_{\{ |x| \leq 1 \} }\]

Comparison of kernels
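
A short self-contained sketch comparing the two kernels above on the same simulated sample; the data and the common bandwidth are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def rectangular_kernel(u):
    return 0.5 * (np.abs(u) <= 1)

rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 7)

h = 0.5  # same (arbitrary) bandwidth for both kernels
for name, K in [("gaussian", gaussian_kernel), ("rectangular", rectangular_kernel)]:
    u = (grid[:, None] - data) / h
    f_hat = K(u).sum(axis=1) / (len(data) * h)   # kernel density estimate on the grid
    print(name, np.round(f_hat, 3))
```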

Regression

  • Suppose we observe pairs of data \((X_1, Y_1), \ldots, (X_n, Y_n)\)
  • \(X\): the predictor (independent variable)
  • \(Y\): the outcome (response variable)
  • The regression function is \(r(x) = \E(Y | X = x)\)
  • Parametric regression: assume a functional form for \(r\) (e.g. linear)
  • Nonparametric regression: estimate \(r\) without assuming a functional form (see the sketch below)
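
A minimal sketch contrasting a parametric fit (a straight line) with a simple nonparametric estimate of \(r(x) = \E(Y | X = x)\) (a local average over a hand-picked window); the simulated data, the true function, and the window width are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)   # true r(x) = sin(x)

# Parametric regression: assume r(x) = b0 + b1 * x and estimate the coefficients
b1, b0 = np.polyfit(x, y, deg=1)

# Naive nonparametric regression: average the Y's whose X falls near x0
def local_mean(x0, x, y, width=1.0):
    mask = np.abs(x - x0) <= width / 2
    return y[mask].mean()

for x0 in [2.0, 5.0, 8.0]:
    print(f"x0={x0}: linear fit {b0 + b1 * x0:.2f}, "
          f"local mean {local_mean(x0, x, y):.2f}, true {np.sin(x0):.2f}")
```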

Naive non-parametric regression

\[\mu = \E(\text{Income}|\text{Education}) = f(\text{Education})\]

\[\mu = \E(Y|x) = f(x)\]

  • Binning: divide the explanatory variable(s) into discrete bins and use the mean of \(Y\) within each bin as the estimate of \(\mu\)

  • Suppose each of three explanatory variables takes 10 possible values:

    \[X_1, X_2, X_3 \in \{1, 2, \dots, 10 \}\]
  • There are \(10^3 = 1000\) possible combinations of the explanatory variables, and hence \(1000\) conditional expectations of \(Y\) given \(X\) (see the sketch below):

    \[\mu = \E(Y|x_1, x_2, x_3) = f(x_1, x_2, x_3)\]
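
A quick sketch of why binning breaks down here: with three predictors taking 10 values each there are \(10^3 = 1000\) cells, and with a moderately sized sample many cells contain no observations at all (the sample size below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # arbitrary sample size for illustration

# Three explanatory variables, each taking values 1..10
X = rng.integers(1, 11, size=(n, 3))

# Index each observation's cell among the 10^3 = 1000 possible combinations
cells = (X[:, 0] - 1) * 100 + (X[:, 1] - 1) * 10 + (X[:, 2] - 1)
occupied = np.unique(cells).size

print(f"occupied cells: {occupied} / 1000")   # many cells are empty,
print(f"empty cells:    {1000 - occupied}")   # so their conditional means cannot be estimated
```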

Point estimates

  • Single “best guess” of some quantity of interest
    • Parameter in a parametric model
    • CDF \(F\)
    • PDF \(f\)
    • Regression function \(r\)
  • Point estimate of \(\theta\) is \(\hat{\theta}\) or \(\hat{\theta}_n\)
    • \(\theta\) is a fixed, unknown quantity
    • \(\hat{\theta}\) is a random variable
  • Let \(X_1, \ldots, X_n\) be \(n\) IID data points from some distribution \(F\). A point estimator \(\hat{\theta}_n\) of a parameter \(\theta\) is some function of \(X_1, \ldots, X_n\):

    \[\hat{\theta}_n = g(X_1, \ldots, X_n)\]
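
For instance, the sample mean is one such function \(g\); a minimal sketch on simulated data (the underlying distribution and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200)  # some distribution F with mean theta = 2

theta_hat = x.mean()  # hat(theta)_n = g(X_1, ..., X_n), here the sample mean
print(round(theta_hat, 3))
```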

Properties of point estimates

  • Bias \[\text{bias}(\hat{\theta}_n) = \E_\theta (\hat{\theta}_n) - \theta\]
  • Unbiasedness: \(\E_\theta (\hat{\theta}_n) = \theta\)
    • Is this necessary?
  • Consistency: \(\hat{\theta}_n \xrightarrow{P} \theta\) as \(n \rightarrow \infty\)

Properties of point estimates

  • Sampling distribution
  • Standard error

    \[\se = \se(\hat{\theta}_n) = \sqrt{\Var (\hat{\theta}_n)}\]
    • Depends on the unknown \(F\)
    • Usually estimated from the data (\(\widehat{\se}\))
  • Mean squared error

    \[ \begin{align} \text{MSE} &= \E_\theta (\hat{\theta}_n - \theta)^2 \\ &= \text{bias}^2(\hat{\theta}_n) + \Var_\theta (\hat{\theta}_n) \end{align} \]
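
A small Monte Carlo sketch of these quantities for the sample mean of a Normal sample; the true parameter, sample size, and number of replications are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 50, 10_000

# Sampling distribution of hat(theta)_n = sample mean, approximated by simulation
theta_hats = rng.normal(loc=theta, scale=1.0, size=(reps, n)).mean(axis=1)

bias = theta_hats.mean() - theta
se = theta_hats.std(ddof=1)          # standard deviation of the sampling distribution
mse = np.mean((theta_hats - theta)**2)

print(f"bias ~ {bias:.4f}, se ~ {se:.4f}, "
      f"bias^2 + var ~ {bias**2 + se**2:.4f}, mse ~ {mse:.4f}")
```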

Properties of point estimates

  • Many estimators are approximately Normally distributed:

    \[\frac{\hat{\theta}_n - \theta}{\se} \leadsto N(0,1)\]
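
A quick simulation sketch of this approximation for a Bernoulli proportion (used in the example below); the true \(\pi\), \(n\), and number of replications are arbitrary. The standardized estimator should land in \((-1.96, 1.96)\) about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(5)
pi, n, reps = 0.3, 200, 10_000

pi_hats = rng.binomial(n, pi, size=reps) / n   # many replications of hat(pi)_n
se = np.sqrt(pi * (1 - pi) / n)
z = (pi_hats - pi) / se                        # standardized estimator

print(f"P(|Z| < 1.96) ~ {np.mean(np.abs(z) < 1.96):.3f}")  # close to 0.95
```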

Example: Bernoulli distributed random variable

  • Let \(X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\)
  • Let \(\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i\). Then

    \[\E(\hat{\pi}_n) = \frac{1}{n} \sum_{i=1}^n \E(X_i) = \pi\]

    so \(\hat{\pi}_n\) is unbiased
  • Standard error is

    \[\se = \sqrt{\Var (\hat{\pi}_n)} = \sqrt{\frac{\pi (1 - \pi)}{n}}\]

    \[\widehat{\se} = \sqrt{\frac{\hat{\pi} (1 - \hat{\pi})}{n}}\]

  • Since \(\E_\pi (\hat{\pi}_n) = \pi\), the bias is zero:

    \[ \begin{align} \text{bias}(\hat{\pi}_n) &= \E_\pi (\hat{\pi}_n) - \pi \\ &= \pi - \pi \\ &= 0 \end{align} \]
  • Consistency: the bias is \(0\) and \[\se = \sqrt{\frac{\pi (1 - \pi)}{n}} \rightarrow 0,\] so \(\text{MSE} \rightarrow 0\) and \(\hat{\pi}_n \xrightarrow{P} \pi\)
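
A brief sketch of the shrinking standard error on simulated coin flips; the true \(\pi\) is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(6)
pi = 0.3  # arbitrary true value

for n in [10, 100, 1_000, 10_000]:
    x = rng.binomial(1, pi, size=n)
    pi_hat = x.mean()
    se_hat = np.sqrt(pi_hat * (1 - pi_hat) / n)
    print(f"n={n:>6}: pi_hat={pi_hat:.3f}, se_hat={se_hat:.4f}")  # se shrinks as n grows
```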

Confidence sets

  • A \(1 - \alpha\) confidence interval for \(\theta\) is an interval \(C_n = (a,b)\) where
    • \(a = a(X_1, \ldots, X_n)\) and
    • \(b = b(X_1, \ldots, X_n)\) are functions of the data such that

      \[\Pr_{\theta} (\theta \in C_n) \geq 1 - \alpha, \quad \forall \theta \in \Theta\]
  • \((a,b)\) traps \(\theta\) with probability \(1- \alpha\)

Caution interpreting confidence intervals

  • \(C_n\) is random and \(\theta\) is fixed
  • We typically report 95% confidence intervals, corresponding to \(\alpha = 0.05\)
  • A confidence interval is not a probability statement about \(\theta\)

Proper interpretation

On day 1, you collect data and construct a 95% confidence interval for a parameter \(\theta_1\). On day 2, you collect new data and construct a 95% confidence interval for a parameter \(\theta_2\). You continue this way constructing confidence intervals for a sequence of unrelated parameters \(\theta_1, \theta_2, \ldots\). Then 95% of your intervals will trap the true parameter value.
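
A minimal sketch of this frequentist interpretation, constructing many 95% intervals for a sequence of unrelated Normal means; the distribution of the parameters, the sample size, and the Normal-approximation interval (detailed in the next section) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, z = 100, 5_000, 1.96

covered = 0
for _ in range(reps):
    theta = rng.uniform(-5, 5)                     # a new, unrelated parameter each "day"
    x = rng.normal(loc=theta, scale=1.0, size=n)
    se_hat = x.std(ddof=1) / np.sqrt(n)
    lo, hi = x.mean() - z * se_hat, x.mean() + z * se_hat
    covered += (lo < theta < hi)

print(f"fraction of intervals trapping theta: {covered / reps:.3f}")  # ~ 0.95
```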

Constructing confidence intervals

  • If the estimator is approximately Normal, use the Normal distribution to construct an approximate confidence interval
  • Suppose that \(\hat{\theta}_n \approx N(\theta, \widehat{\se}^2)\)
  • Let \(\Phi\) be the CDF of a standard Normal distribution

    \[z_{\frac{\alpha}{2}} = \Phi^{-1} \left(1 - \frac{\alpha}{2} \right)\]

    \[\Pr (Z > z_{\frac{\alpha}{2}}) = \frac{\alpha}{2}\]

    \[\Pr (-z_{\frac{\alpha}{2}} \leq Z \leq z_{\frac{\alpha}{2}}) = 1 - \alpha\]

    where \(Z \sim N(0,1)\)
  • Let

    \[C_n = (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se}, \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se})\]
  • Then

    \[ \begin{align} \Pr_\theta (\theta \in C_n) &= \Pr_\theta (\hat{\theta}_n - z_{\frac{\alpha}{2}} \widehat{\se} < \theta < \hat{\theta}_n + z_{\frac{\alpha}{2}} \widehat{\se}) \\ &= \Pr_\theta (- z_{\frac{\alpha}{2}} < \frac{\hat{\theta}_n - \theta}{\widehat{\se}} < z_{\frac{\alpha}{2}}) \\ &\rightarrow \Pr ( - z_{\frac{\alpha}{2}} < Z < z_{\frac{\alpha}{2}}) \\ &= 1 - \alpha \end{align} \]

  • For \(\alpha = 0.05\), \(z_{\frac{\alpha}{2}} = 1.96 \approx 2\)
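
A minimal sketch of this construction for the mean of a simulated sample; the data-generating distribution is an arbitrary choice, and \(\alpha = 0.05\) as on the slide:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(loc=2.0, scale=3.0, size=400)   # illustrative data
z = 1.96                                       # z_{alpha/2} for alpha = 0.05

theta_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(len(x))       # estimated standard error of the mean

ci = (theta_hat - z * se_hat, theta_hat + z * se_hat)
print(f"hat(theta) = {theta_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```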

Actual vs. approximate confidence intervals

  • Let \(X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\) and let \(\hat{\pi}_n = \frac{1}{n} \sum_{i=1}^n X_i\)
  • Let \(C_n = (\hat{\pi}_n - \epsilon_n, \hat{\pi}_n + \epsilon_n)\) where \(\epsilon_n^2 = \frac{\log(\frac{2}{\alpha})}{2n}\)
  • By Hoeffding's inequality,

    \[\Pr (\pi \in C_n) \geq 1 - \alpha\]
  • \(C_n\) is an exact (finite-sample) \(1 - \alpha\) confidence interval
  • Approximate confidence interval

    \[ \begin{align} \Var (\hat{\pi}_n) &= \frac{1}{n^2} \sum_{i=1}^n \Var(X_i) \\ &= \frac{1}{n^2} \sum_{i=1}^n \pi(1 - \pi) \\ &= \frac{1}{n^2} n\pi(1 - \pi) \\ &= \frac{\pi(1 - \pi)}{n} \\ \se &= \sqrt{\frac{\pi(1 - \pi)}{n}} \\ \widehat{\se} &= \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} \end{align} \]

    \[\hat{\pi}_n \pm z_{\frac{\alpha}{2}} \widehat{\se} = \hat{\pi}_n \pm z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}}\]
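
A short sketch comparing the half-widths of the exact (Hoeffding-based) interval and the Normal-approximation interval on simulated coin flips; the true \(\pi\) and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(9)
pi, alpha, z = 0.3, 0.05, 1.96

for n in [50, 500, 5_000]:
    pi_hat = rng.binomial(1, pi, size=n).mean()
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))      # exact (Hoeffding) half-width
    half = z * np.sqrt(pi_hat * (1 - pi_hat) / n)   # approximate (Normal) half-width
    print(f"n={n:>5}: exact half-width={eps:.4f}, approx half-width={half:.4f}")
```

The exact interval is wider: it guarantees coverage for every \(n\), while the Normal interval only achieves approximate coverage as \(n\) grows.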

Hypothesis testing

  • Start with a default theory (the null hypothesis)
  • Null hypothesis \(H_0\) vs. alternative hypothesis \(H_1\)
  • Do the data provide sufficient evidence to reject the null hypothesis?
  • Let

    \[X_1, \ldots, X_n \sim \text{Bernoulli}(\pi)\]
  • Is it a fair coin?
    • \(H_0: \pi = 0.5\)
    • \(H_1: \pi \neq 0.5\)
    • Reasonable to reject \(H_0\) if \(T = | \hat{\pi}_n - 0.5|\) is large (see the sketch below)
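
A minimal sketch of this test statistic on simulated flips, rejecting when \(T\) exceeds a Normal-approximation cutoff \(z_{\frac{\alpha}{2}} \cdot \se_0\); the simulated data, \(\alpha\), and this particular calibration of "large" are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1_000
x = rng.binomial(1, 0.55, size=n)   # simulated flips from a slightly unfair coin

pi_hat = x.mean()
T = abs(pi_hat - 0.5)               # test statistic
se0 = np.sqrt(0.5 * 0.5 / n)        # standard error under H0: pi = 0.5
cutoff = 1.96 * se0                 # alpha = 0.05 cutoff via the Normal approximation

print(f"T = {T:.4f}, cutoff = {cutoff:.4f}, reject H0: {T > cutoff}")
```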