Graphical data analysis

MACS 33001 University of Chicago

Graphical data analysis

  1. Generate questions about your data
  2. Search for answers by visualizing, transforming, and modeling your data
  3. Use what you learn to refine your questions and/or generate new questions
  4. Rinse and repeat until you publish a paper

Graphical data analysis

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?
  3. Are there outliers in the data?
  4. Do I have missingness? Are there patterns to it?
  5. How much variation/error exists in my statistical estimates? Is there a pattern to it?

Differences between GDA and modeling

Tips dataset

Variable  Explanation
obs       Observation number
totbill   Total bill (cost of the meal), including tax, in US dollars
tip       Tip (gratuity) in US dollars
sex       Sex of person paying for the meal (0=male, 1=female)
smoker    Smoker in party? (0=No, 1=Yes)
day       3=Thur, 4=Fri, 5=Sat, 6=Sun
time      0=Day, 1=Night
size      Size of the party

Tips regression

## # A tibble: 8 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.207     0.0249     8.29   8.65e-15
## 2 sexM        -0.00854   0.00835   -1.02   3.07e- 1
## 3 smokerYes    0.00364   0.00850    0.428  6.69e- 1
## 4 daySat      -0.00177   0.0183    -0.0967 9.23e- 1
## 5 daySun       0.0167    0.0190     0.876  3.82e- 1
## 6 dayThu      -0.0182    0.0232    -0.784  4.34e- 1
## 7 timeNight   -0.0234    0.0261    -0.895  3.72e- 1
## 8 size        -0.00962   0.00422   -2.28   2.34e- 2
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0420        0.0136 0.0607      1.48   0.175     8   342. -665. -634.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

Exploring tips

GDA vs. CDA

  • Graphical data analysis
  • Confirmatory data analysis

Histograms

Nonparametric density estimation

\[x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh\]

\[\hat{p}(x) = \frac{\#_{i = 1}^n [x_0 + 2(j - 1)h \leq X_i < x_0 + 2jh]}{2nh}\]

\[\hat{p}(x) = \frac{\#_{i = 1}^n [x-h \leq X_i < x+h]}{2nh}\]

\[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n W \left( \frac{x - X_i}{h} \right)\]

\[W(z) = \begin{cases} \frac{1}{2} & \text{for } |z| < 1 \\ 0 & \text{otherwise} \\ \end{cases}\]

\[z = \frac{x - X_i}{h}\]
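
The counting form of the estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `naive_density` and the sample values are made up for the example.

```python
def naive_density(x, data, h):
    """Naive density estimate at x.

    Counts the observations in the half-open window [x - h, x + h)
    and divides by 2nh, as in the counting form of the estimator.
    """
    n = len(data)
    count = sum(1 for xi in data if x - h <= xi < x + h)
    return count / (2 * n * h)


# Hypothetical data, chosen only to make the arithmetic easy to follow.
sample = [1.0, 1.5, 2.0, 2.5, 4.0]

# Window [1.5, 2.5) holds two of the five points: 2 / (2 * 5 * 0.5) = 0.4
print(naive_density(2.0, sample, h=0.5))
```

Because every observation inside the window gets the same weight, the estimate jumps whenever a point enters or leaves the window, which is why the resulting curve looks rough.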

Naive density estimation

Kernels

\[\hat{p}(x) = \frac{1}{nh} \sum_{i = 1}^n K \left( \frac{x - X_i}{h} \right)\]

Gaussian kernel

\[K(z) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2} z^2}\]
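
A minimal sketch of the kernel density estimator with the Gaussian kernel, using only the Python standard library. The function names `gaussian_kernel` and `kde` and the sample values are assumptions made for illustration.

```python
import math


def gaussian_kernel(z):
    """Standard normal density, K(z) = exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def kde(x, data, h):
    """Kernel density estimate at x: (1 / nh) * sum of K((x - X_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)


# Hypothetical observations; evaluate the estimate on a coarse grid.
sample = [0.0, 1.0, 3.0]
for x in (-1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, round(kde(x, sample, h=0.5), 4))
```

Unlike the rectangular weight function, the Gaussian kernel lets each observation's contribution fade smoothly with distance, so the estimated density is a smooth curve.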

Comparison of kernels

Selecting the bandwidth \(h\)
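
The slides do not say which bandwidth rule is used. As one common default, Silverman's rule of thumb sets \(h = 0.9 \min(s, \text{IQR}/1.34) \, n^{-1/5}\). The sketch below assumes that rule; the function name `silverman_bandwidth` is hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose default method can differ slightly from other software.

```python
import statistics


def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5).

    Assumption: this is one widely used default, not necessarily the
    rule behind the plots in these slides.
    """
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)


# Hypothetical sample.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(silverman_bandwidth(sample))
```

A smaller \(h\) follows the data more closely at the cost of a noisier curve, while a larger \(h\) smooths away detail, so a rule like this is best treated as a starting point.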

Boxplot
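
A boxplot is built from the quartiles, and points beyond the whiskers are conventionally flagged using the 1.5 × IQR rule. The sketch below assumes that convention; the function name `boxplot_outliers` and the data are made up, and plotting software may compute the quartiles slightly differently.

```python
import statistics


def boxplot_outliers(data, k=1.5):
    """Return (lower fence, upper fence, points outside the fences).

    Fences sit k * IQR below the first quartile and above the third.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in data if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_outliers(sample))
```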

Violin plot

Things to look for in continuous variables

  • Asymmetry
  • Outliers
  • Multimodality
  • Gaps
  • Heaping
  • Rounding
  • Impossibilities
  • Errors

Galton’s heights

Investigate for gaps or heaping
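
One simple check for heaping is to tabulate the fractional part of each recorded value: if a few fractions dominate, the measurements were probably rounded or recorded unevenly. The sketch below uses made-up heights rather than Galton's data, and the function name `fraction_counts` is hypothetical.

```python
from collections import Counter


def fraction_counts(values, digits=1):
    """Count how often each fractional part (rounded to `digits`) occurs."""
    return Counter(round(v % 1, digits) for v in values)


# Hypothetical heights in inches, recorded to one decimal place.
heights = [68.0, 68.5, 70.0, 70.2, 71.5, 72.0]
for fraction, count in sorted(fraction_counts(heights).items()):
    print(fraction, count)
```

Gaps can be investigated the same way by sorting the distinct values and looking for unusually large differences between neighbors.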

Comparing the distributions

Outlier detection

## # A tibble: 3 x 24
##   title  year length budget rating votes    r1    r2    r3    r4    r5
##   <chr> <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cure…  1987   5220     NA    3.8    59  44.5   4.5   4.5   4.5     0
## 2 Four…  1967   1100     NA    3      12  24.5   0     4.5   0       0
## 3 Long…  1970   2880     NA    6.4    15  44.5   0     0     0       0
## # ... with 13 more variables: r6 <dbl>, r7 <dbl>, r8 <dbl>, r9 <dbl>,
## #   r10 <dbl>, mpaa <chr>, Action <int>, Animation <int>, Comedy <int>,
## #   Drama <int>, Documentary <int>, Romance <int>, Short <int>

Filter outliers

Compare distributions of subgroups

Multiple windows plot

Boxplot

Categorical variables

  • Discrete variables with a fixed set of possible values

Bar chart

Omitted categories

Order matters

Variations on bar charts

Stacked bar chart

Dodged bar chart

Proportional bar chart
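
A proportional bar chart rescales each bar to height 1, so the segments show the share of each category within a group rather than raw counts. A minimal sketch of that calculation, with made-up (day, smoker) records and a hypothetical function name `within_group_proportions`:

```python
from collections import Counter


def within_group_proportions(records):
    """Map each (group, category) pair to its share of the group's total."""
    pair_counts = Counter(records)
    group_totals = Counter(group for group, _ in records)
    return {
        (group, category): count / group_totals[group]
        for (group, category), count in pair_counts.items()
    }


# Hypothetical (day, smoker) pairs, not taken from the tips data.
records = [
    ("Sat", "Yes"), ("Sat", "No"), ("Sat", "No"),
    ("Sun", "Yes"), ("Sun", "Yes"), ("Sun", "No"), ("Sun", "No"),
]
for key, share in sorted(within_group_proportions(records).items()):
    print(key, round(share, 3))
```

The trade-off is that group sizes disappear from the picture, so a proportional chart is easiest to read alongside a stacked or dodged chart of the counts.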

Scatterplots

  • Causal relationships (linear and nonlinear)
  • Associations (correlations)
  • Outliers or groups of outliers
  • Clusters
  • Gaps
  • Barriers
  • Conditional relationships

movies example

Smoothing lines

Adding jitter to the graph
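
Jitter adds a small amount of random noise to each coordinate so that points sharing the same value no longer sit on top of each other. A minimal sketch, assuming uniform noise of a chosen half-width; the function name `jitter` and the values are made up for the example.

```python
import random


def jitter(values, width, seed=None):
    """Return a copy of values with uniform noise in [-width, width] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]


# Hypothetical discrete ratings where many points share the same value.
ratings = [3, 3, 3, 4, 4, 5]
print(jitter(ratings, width=0.2, seed=1))
```

The noise is for display only: keep the width small relative to the spacing between values, and run any calculations on the original data.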

Comparing groups within scatterplots

Scatterplot matrix

Heatmap of correlation coefficients
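
The heatmap colors each cell by the correlation between a pair of variables. A minimal sketch of the underlying matrix, assuming Pearson correlation; the function names and the small data set are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical columns, not taken from the movies data.
data = {
    "length": [90, 100, 110, 120, 95],
    "rating": [6.1, 6.5, 7.0, 7.4, 6.0],
    "votes": [500, 300, 450, 200, 350],
}
for name, row in correlation_matrix(data).items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle carries information.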
