Overview

Due before class on November 6th November 13th.

Fork the hw05 repository

Go here to fork the repo for homework 05.

Part 1: Exercises on continuous random variables

Complete each of the following exercises. Some exercises require an analytical answer, others require you to write code to complete the exercise. When writing your answer to analytical exercises, be sure to use appropriate \(\LaTeX\) mathematical notation.

\[\newcommand{\E}{\mathrm{E}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\sd}{\mathrm{sd}} \newcommand{\Cov}{\mathrm{Cov}}\]

  1. A recent college graduate is moving to Houston, Texas to take a new job, and is looking to purchase a home. Since Houston comprises a relatively large metropolitan area of about 7 million people, there are a lot of homes from which to choose. When searching for properties on the real estate websites, it is possible to select the price range of housing in which one is most interested. Suppose the potential buyer specifies a price range of $200,000 to $250,000, and the result of the search returns thousands of homes with prices distributed uniformly throughout that range.
    1. Analytically identify \(\E[X]\) and \(\sigma\) of the probability density function associated with this random variable.
    2. If the buyer ultimately selects a home from this initial list of possibilities, what is the probability she will have to pay more than $235,000? Use R to find the solution.
  2. Let \(X\) be uniformly distributed in the unit interval \([0,1]\). Consider the random variable \(Y = g(X)\), where \[g(x) = \left\{ \begin{array}{ll} 1 & \quad \text{if } x \leq 1/3 \\ 2 & \quad \text{if } x > 1/3 \end{array} \right.\]

    Find the expected value of \(Y\) by first calculating its probability mass function (PMF). Verify the result using the expected value rule.
  3. Let \(X\) and \(Y\) be normal random variables with means 0 and 1, respectively, and variances 1 and 4, respectively.
    1. Find \(\Pr(X \leq 1.5)\) and \(\Pr(X \leq -1)\).
    2. Find the PDF of \((Y - 1) / 2\).
    3. Find \(\Pr(-1 \leq Y \leq 1)\).
  4. Let \(X\) be a random variable with PDF \[ f_X(x) = \left\{ \begin{array}{ll} \dfrac{x}{4} & \quad \text{if } 1 < x \leq 3 \\ 0 & \quad \text{otherwise} \end{array} \right. \]

    and let \(A\) be the event \(\{ X \geq 2 \}\).

    1. Find \(\E[X]\), \(\Pr (A)\), \(f_{X|A}(x)\), and \(\E [X | A]\).
    2. Let \(Y = X^2\). Find \(\E[Y]\) and \(\Var (Y)\).
  5. Simulating fake data
    1. Generate 1000 independent random draws \(X_j\) from a uniform distribution on the unit interval \([0,1]\). Briefly describe (graphically and in words) the resulting “data”, and compare them to the theoretical distribution of values from that distribution.
    2. Repeat (a) 1000 times, saving each set of draws. You should wind up with 1,000,000 random \(U(0,1)\) draws \(U_{ij}\), organized into 1000 objects/variables (here, indexed by \(i\)), each of length 1000 (indexed by \(j\)).
    3. For each of the 1000 \(U_i\) objects, create a new object \(V_i\) by adding the integer corresponding to the order of the observation to the value of \(U_{ij}\). That is, to the value of the first observation, add 1, to the second, add 2, etc., finishing by adding 1000 to the last listed observation in each object. Be sure to retain the uniform variates as well.
    4. Generate an object containing the sums of the values of the observations in each of the 1000 \(V\)s (that is, one where the \(i\)th entry is \(A_i = \sum_{j=1}^{1000} V_{ij}\)). Plot and describe the distribution of the resulting variable \(A\).
    5. Next, create a second object containing the sums of the \(j\)th observations across each of the 1000 \(V\)s. That is, create an object of length 1000 where the \(j\)th entry is \(S_j = \sum_{i=1}^{1000} V_{ij}\). Describe (graphically and in words) the distribution of the resulting variable \(S\).
  6. A Gumbel distribution (here denoted \(\zeta(\alpha, \beta)\)) is a two-parameter (\(\alpha \in \Re, \beta > 0\)) distribution with PDF \(f_X(x) = \frac{1}{\beta} e^{-(z + e^{-z})}\) where \(z = \frac{x - \alpha}{\beta}\) and CDF \(F_X(x) = e^{-e^{- \frac{z - \alpha}{\beta}}}\). It is related to the standard uniform distribution \(U\) by \(X = \alpha - \beta \log[-\log(U)]\).
    1. Transform your 1000 bundles of 1000 random \(U(0,1)\) draws into 1000 bundles of 1000 draws \(G_{ij}\) from a Gumbel \((1,2)\) distribution.
    2. Plot the empirical variances of \(G\), and discuss how they compare to the theoretical variance of a \(\zeta(1,2)\) distribution.
    3. Plot the density of the values of the 75th percentiles from those 1000 bundles of Gumbel variates.

Part 2: Optimization

  1. Maximum likelihood estimation is a technique for developing estimators. Consider the following likelihood function for the sample mean: \[ \begin{align} f(\mu) &= \prod_{i=1}^{N} \exp( \frac{-(Y_{i} - \mu)^2}{ 2}) \\ \log f(\mu) &= -\frac{1}{2} \left(\sum_{i=1}^{N} Y_{i}^2 - 2\mu \sum_{i=1}^{N} Y_{i} + N\times\mu^2 \right) \end{align} \]
    1. Write a function mean_lik() for the log-likelihood of the sample mean. This function should takes a vector of guesses for the value of the sample mean and a vector of data for which we want to estimate the sample mean. The function should return the resulting log-likelihoods for a given guess of the sample mean.
    2. Import trade.csv data file from the data/ folder.
    3. Use a grid search to find the optimal estimate of the sample mean, testing values ranging from 30,000 to 80,000. Use your function to find the resulting log-likelihood values for the income variable in the trade data frame. Plot the results with the parameter guesses on the x-axis and the resulting log-likelihoods on the y-axis. Label the axes in the graph and provide a title.
    4. Using optim() and your function mean_lik(), find the maximum value of the log-likelihood function. At what parameter value is the log-likelihood of the sample mean of income maximized?
    5. Check your work using the mean() function on income. Do you get the same estimate from optim() and mean()?
    6. Draw a vertical line at the value of your answer to (e) on your graph of the log-likelihood from (c).
  2. Now let’s use this same basic approach to examine the likelihood for logistic regression, a type of regression sometimes employed instead of ordinary least squares (OLS) when a dependent variable is assumed to be drawn from a Bernoulli probability distribution.

    Consider the following log-likelihood function for a logistic regression:

    \[ \begin{align} L(\beta | y_i, x_{1i}, x_{2i}) &= \prod_{i=1}^n \left[ \left( \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))} \right)^{y_i} \left(1 - \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))} \right)^{1 - y_i} \right] \\ \log L(\beta | y_i, x_{1i}, x_{2i}) &= \sum_{i=1}^n \left[ -y_i \log(1 + \exp (-(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))) + (1 - y_i) \log \left(1 - \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))} \right)^{1 - y_i} \right] \end{align} \]

    where \(y\) is a vector containing the dependent variable, \(x_1\) is a vector containing the first independent variable, and \(x_2\) is a vector containing the second independent variable.

    1. Write a function llik_logistic() that accepts as arguments a vector of three parameter values, a data frame, and some method for identifying the dependent and independent variables in the model. The function should then return the log-likelihood for those parameter values. Be sure either in the function or prior to passing in the data frame to remove any rows with missing values.
    2. Use a grid search to determine the values for the parameters that maximize the log-likelihood of the function, where free_trade_support is the dependent variable and income and education are the independent variables. Iterate over all possible combinations of the intercept seq(from = -1, 1, by = 0.05), the income parameter seq(from = 0.000005, to = 0.00001, by = 0.000001), and the education parameter seq(from = 0, to = 1, by = 0.05).
    3. Use the optim() function to calculate the parameters that maximize the log-likelihood for the the same variables as before.
    4. Compare these results to what you would obtain from R’s canned function that does this. Run the glm() function with family = binomial and the same model you used above. Compare the resulting estimates to what you obtained using your own function and both the grid search/optim() approaches.

Submit the assignment

Your assignment should be submitted as an R Markdown document rendered as an HTML/PDF document. Don’t know what an R Markdown document is? Read this! Or this! I have included starter files for you to modify to complete the assignment, so you are not beginning completely from scratch.

Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.