Introduction to the course

Me (Dr. Benjamin Soltoff)

I am a lecturer in the Masters in Computational Social Science program. I earned my PhD in political science from Penn State University in 2015. My research interests focus on judicial politics, state courts, and agenda-setting. Methodologically I am interested in statistical learning and text analysis. I have training and experience in:

  • Probability and statistical theory
  • Descriptive and univariate/bivariate inference
  • Least squares regression
  • Generalized linear models
  • Maximum likelihood estimation
  • Machine learning/statistical learning algorithms
  • Multilevel models
  • Causal inference

I have implemented these methods across a range of software, starting with Stata and eventually making the transition to R and Python. I am not a computer scientist, nor am I a statistician. I am a social scientist who uses statistical methods and computational tools to answer my research questions. That said, I feel fairly well trained in the theory and application of these methods at the graduate-level.

Course website

Go to https://css18.github.io for the course site. This contains the course objectives, required readings, schedules, slides, etc.

Course objectives

  • Construct and execute basic programs in R using elementary programming techniques and tidyverse packages (e.g. loops, conditional statements, user-defined functions)
  • Visualize information and data using appropriate graphical techniques
  • Generate reproducible documents with R Markdown
  • Utilize professional document typesetting with mathematical notation via Markdown and \(\LaTeX\)
  • Define key concepts related to theory of probability
  • Identify and simulate random variables
  • Calculate expected value and variance of random variables
  • Write likelihood functions and optimize using a range of algorithms
  • Conduct hypothesis tests
  • Implement Bayesian methods of inference
  • Evaluate via simulations the major properties and assumptions of ordinary least squares (OLS) regression
  • Conduct inference and hypothesis testing for single and multivariable regression models

This is an extension of the Computational Math/Stats camp. Based on your placement exam results, we will be reviewing some of this material more in-depth while omitting other topics entirely.

What are we skipping

  • Linear algebra - we’ll discuss vectors/matricies/arrays as data objects in R and perform basic algebraic operations, but we are not going to be performing complex linear algebra methods in this course (e.g. matrix decompositions, singular value decomposition [SVD], principal components analysis [PCA]). We will return to those methods in the winter Perspectives quarter.
  • Calculus - we are not reviewing all the rules for calculating derivatives and antiderivatives. We will use calculus in this class, but mostly this will be in a computational approach (i.e. optimizing functions). We are not going to calculate derivatives/integrals by hand.

This class will focus mostly on probability and statistical inference, with some overlap and extension of materials from the camp.

How we will do this

  • This is not an analytically-based course. That is, we will not be exclusively completing pencil-and-paper exercises. There may be some exercises based on this and you will still need to use appropriate mathematical notation to explain your solutions to exercises, but all assignments will be submitted electronically via Git pull request.
  • We will focus more on the theory of how methods work, rather than rote application. The course is not proof-based; instead we will use simulation methods to prove different theorems and statistical principles. This will require programming on your part, performed in R.

Complete the readings

Many classes will have assigned readings. You need to complete these before coming to class. I will assume you have done so and have at least a basic understanding of the material. Classes will be a mix of lecture and live-coding/in-class exercises. If you do not come to class prepared, then there is no point in coming to class.

15 minute rule

We will follow the 15 minute rule in this class. If you encounter a problem in your assignments, spend 15 minutes troubleshooting the problem on your own. Make use of Google and StackOverflow to resolve the error. However, if after 15 minutes you still cannot solve the problem, ask for help. We will use GitHub to ask and answer class-related questions.

Plagiarism

I am trying to balance two competing perspectives:

  1. Collaboration is good - researchers usually collaborate with one another on projects. Developers work in teams to write programs. Why reinvent the wheel when it has already been done?
  2. Collaboration is cheating - this is academia. You are expected to complete your own work. If you copy from someone else, how do I know you actually learned the material?

The point is that collaboration in this class is good - to a point. You are always, unless otherwise noted, expected to write and submit your own work. You should not blindly copy from your peers. You should not copy large chunks of code from the internet. That said, using the internet to debug programs is fine. Asking a classmate to help you is fine (the key phrase is help you, not do it for you).

As Computer Coding Classes Swell, So Does Cheating

The bottom line - if you don’t understand what the program is doing and are not prepared to explain it in detail, you should not submit it.

Evaluations

Students will complete weekly problem sets with a combination of analytical and computational problems. Each problem set is worth 10 points. Final grades will be determined based on cumulative performance across the problem sets. We will follow this basic homework workflow. If you have never used Git before, you’ll quickly learn as you will use Git in both the Perspectives and CAPP sequence.

Software for the course

R

R is open-source software, which means using it is completely free. Second, open-source software is developed collaboratively, meaning the source code is open to public inspection, modification, and improvement.

Lack of point-and-click interface

R, like any computing language, relies on programmatic execution of functions. That is, to do anything you must write code. This differs from popular statistical software such as Stata or SPSS which at their core utilize a command language but overlay them with drop-down menus that enable a point-and-click interface. While much easier to operate, there are several downsides to this approach - mainly that it makes it impossible to reproduce one’s analysis.

Things R does well

  • Data analysis - R was written by statisticians for statisticians, so it is designed first and foremost as a language for statistical and data analysis. Much of the cutting-edge research in machine learning occurs in R, and every week there are packages added to CRAN implementing these new methods. Furthermore, many models in R can be exported to other programming languages such as C, C++, Python, tensorflow, stan, etc.
  • Data visualization - while the base R graphics package is comprehensive and powerful, additional libraries such as ggplot2 and lattice make R the go-to language for power data visualization approaches.

Things R does not do as well

  • Speed - while by no means a slug, R is not written to be a fast, speedy language. Depending on the complexity of the task and the size of your data, you may find R taking a long time to execute your program.

Why are we not using Python?

Python was developed in the 1990s as a general-purpose programming language. It emphasizes simplicity over complexity in both its syntax and core functions. As a result, code written in Python is (relatively) easy to read and follow as compared to more complex languages like Perl or Java. As you can see in the above references, Python is just as, if not more, popular than R. It does many things well, like R, but is perhaps better in some aspects:

  • General computation - since Python is a general computational language, it is more versatile at non-statistical tasks and is a bit more popular outside the statistics community.
  • Speed - because it is a general computing language, Python is optimized to be fast (assuming you write your code optimally). As your data becomes larger or more complex, you might find Python to be faster than R for your analytical needs.
  • Workflow - since Python is a general-purpose language, you can build entire applications using it. R, not so much.

That said, there are also things it does not do as well as R:

  • Visualizations - visual graphics libraries in Python are increasing in number and quality (see matplotlib, pygal, and seaborn), but are still behind R in terms of comprehensiveness and ease of use. Of course, once you wish to create interactive and advanced information visualizations, you can also used more specialized software such as Tableau or D3.
  • Add-on libraries - previously Python was criticized for its lack of libraries to perform statistical analysis and data manipulation, especially relative to the plethora of libraries for R. In recent years Python has begun to catch up with libraries for scientific computing (numpy), data analysis (pandas), and machine learning (scikit-learn). However I personally have found immense difficulty installing and managing packages in Python, even with the use of a package manager such as conda.

At the end of the day, I don’t think it is a debate between learning R vs. Python. Frankly to be a desirable (and therefore highly-compensated) data scientist you should learn both languages. R and Python complement each other, and even R/Python luminaries such as Hadley Wickham and Wes McKinney promote the benefits of both languages:

Since you are not expected to have prior programming experience entering this program and therefore may be learning programming from scratch simultaneously with this class, I cannot rely on you gaining necessary Python skills as we need them. My language of preference is R, and you will see R frequently throughout the winter and spring quarters if you enroll in my sections of Perspectives. Therefore it will be to your benefit to learn how to use R now, and gives you a leg up on your peers in the coming quarters.

RStudio

As previously mentioned, the base R distribution is not the best for developing and writing programs. Instead, we want an integrated development environment (IDE) which will allow us to write and execute code, debug programs, and automate certain tasks. In this course we will use RStudio, perhaps the most popular IDE available for R. Like R, it is open-source, expandable, and provides many useful tools and enhancements over the base R environment.

Git

Git is a version control system originally created for developers to collaborate on large software projects. Git tracks changes in the project over time so that there is always a comprehensive, structured record of the project. Each project is stored in a repository that includes all files that are part of the project. As social scientists, this is more than just computer scripts - this can include data files and reports, as well as our source code.

Git can be used locally by you on a single computer to track changes in a project. You do not need to be connected to the internet to use Git. However if you want to share your work with a larger audience, the easiest solution is to host the repository on a web site for others to download and inspect. You can host a public (open to the world) or private (open to just you or a few individuals) repository. GitHub has become the largest hoster of Git repositories and includes many useful features beyond the standard Git functions.

Chances are that by now you’ve seen or even used GitHub. Professionally, you should know how to use Git and GitHub to manage projects and share code. In this class, we will use Git and GitHub to host our course site, share code, and distribute/collect assignments.

\(\LaTeX\)

  • High-quality typesetting system
  • De facto standard for production of technical and scientific documentation
    • Books
    • Journal articles
  • Free software
  • Renders documents as PDFs
  • Makes typesetting easy

    $$f(x) = \frac{\exp(-\frac{(x - \mu)^2}{2\sigma^2} )}{ \sqrt{2\pi \sigma^2}}$$

    \[f(x) = \frac{\exp(-\frac{(x - \mu)^2}{2\sigma^2} )}{ \sqrt{2\pi \sigma^2}}\]

  • Tables/figures/general typesetting/nice presentations - easier in \(\LaTeX\)
    • Papers
    • Books
    • Dissertations/theses
    • Slides (beamer)
  • Steep learning curve up front, but leads to big dividends later

Markdown

  • Lightweight markup language with plain text formatting syntax
  • Easy to convert to HTML, PDF, and more
  • Used commonly on GitHub documentation, Jupyter Notebooks, R Markdown, and more
  • Simplified syntax compared to \(\LaTeX\) - also less flexibility
  • Publishing formats
    • HTML
    • PDF
    • Websites
    • Slides
    • Dashboards
    • Word/ODT/RTF (eww)

We’ll use Markdown and \(\LaTeX\) in the problem sets to generate the final output for your R Markdown documents. I imagine you’ll also use \(\LaTeX\) extensively in the fall Perspectives course (it’s Dr. Evans’s preferred format), so it helps to learn it right off the bat. Make sure you install the appropriate distribution of \(\LaTeX\) for your computer, as RStudio cannot render R Markdown documents as PDFs without an existing distribution on your computer.

Session Info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  system   x86_64, darwin15.6.0        
##  ui       RStudio (1.1.456)           
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2018-10-25
## Packages -----------------------------------------------------------------
##  package    * version    date       source                              
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.5.0)                      
##  backports    1.1.2      2017-12-13 CRAN (R 3.5.0)                      
##  base       * 3.5.1      2018-07-05 local                               
##  bindr        0.1.1      2018-03-13 CRAN (R 3.5.0)                      
##  bindrcpp   * 0.2.2      2018-03-29 CRAN (R 3.5.0)                      
##  broom      * 0.5.0      2018-07-17 CRAN (R 3.5.0)                      
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.5.0)                      
##  cli          1.0.0      2017-11-05 CRAN (R 3.5.0)                      
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.5.0)                      
##  compiler     3.5.1      2018-07-05 local                               
##  crayon       1.3.4      2017-09-16 CRAN (R 3.5.0)                      
##  datasets   * 3.5.1      2018-07-05 local                               
##  devtools     1.13.6     2018-06-27 CRAN (R 3.5.0)                      
##  digest       0.6.15     2018-01-28 CRAN (R 3.5.0)                      
##  dplyr      * 0.7.6      2018-06-29 cran (@0.7.6)                       
##  emo          0.0.0.9000 2017-10-03 Github (hadley/emo@9f2e0f2)         
##  evaluate     0.11       2018-07-17 CRAN (R 3.5.0)                      
##  fansi        0.3.0      2018-08-13 CRAN (R 3.5.0)                      
##  forcats    * 0.3.0      2018-02-19 CRAN (R 3.5.0)                      
##  ggplot2    * 3.0.0      2018-07-03 CRAN (R 3.5.0)                      
##  ggthemes   * 4.0.0      2018-07-19 CRAN (R 3.5.0)                      
##  glue         1.3.0      2018-07-17 CRAN (R 3.5.0)                      
##  graphics   * 3.5.1      2018-07-05 local                               
##  grDevices  * 3.5.1      2018-07-05 local                               
##  grid         3.5.1      2018-07-05 local                               
##  gtable       0.2.0      2016-02-26 CRAN (R 3.5.0)                      
##  haven        1.1.2      2018-06-27 CRAN (R 3.5.0)                      
##  highr        0.7        2018-06-09 CRAN (R 3.5.0)                      
##  hms          0.4.2      2018-03-10 CRAN (R 3.5.0)                      
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.5.0)                      
##  httpuv       1.4.5      2018-07-19 CRAN (R 3.5.0)                      
##  httr         1.3.1      2017-08-20 CRAN (R 3.5.0)                      
##  jsonlite     1.5        2017-06-01 CRAN (R 3.5.0)                      
##  knitr      * 1.20       2018-02-20 CRAN (R 3.5.0)                      
##  labeling     0.3        2014-08-23 CRAN (R 3.5.0)                      
##  later        0.7.3      2018-06-08 CRAN (R 3.5.0)                      
##  lattice      0.20-35    2017-03-25 CRAN (R 3.5.1)                      
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.5.0)                      
##  lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)                      
##  magrittr     1.5        2014-11-22 CRAN (R 3.5.0)                      
##  memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)                      
##  methods    * 3.5.1      2018-07-05 local                               
##  mime         0.5        2016-07-07 CRAN (R 3.5.0)                      
##  miniUI       0.1.1.1    2018-05-18 CRAN (R 3.5.0)                      
##  modelr       0.1.2      2018-05-11 CRAN (R 3.5.0)                      
##  munsell      0.5.0      2018-06-12 CRAN (R 3.5.0)                      
##  nlme         3.1-137    2018-04-07 CRAN (R 3.5.1)                      
##  patchwork  * 0.0.1      2018-09-06 Github (thomasp85/patchwork@7fb35b1)
##  pillar       1.3.0      2018-07-14 CRAN (R 3.5.0)                      
##  pkgconfig    2.0.2      2018-08-16 CRAN (R 3.5.1)                      
##  plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)                      
##  promises     1.0.1      2018-04-13 CRAN (R 3.5.0)                      
##  purrr      * 0.2.5      2018-05-29 CRAN (R 3.5.0)                      
##  R6           2.2.2      2017-06-17 CRAN (R 3.5.0)                      
##  rcfss      * 0.1.5      2018-05-30 local                               
##  Rcpp         0.12.18    2018-07-23 CRAN (R 3.5.0)                      
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.5.0)                      
##  readxl       1.1.0      2018-04-20 CRAN (R 3.5.0)                      
##  rlang        0.2.1      2018-05-30 CRAN (R 3.5.0)                      
##  rmarkdown    1.10       2018-06-11 CRAN (R 3.5.0)                      
##  rprojroot    1.3-2      2018-01-03 CRAN (R 3.5.0)                      
##  rsconnect    0.8.8      2018-03-09 CRAN (R 3.5.0)                      
##  rstudioapi   0.7        2017-09-07 CRAN (R 3.5.0)                      
##  rvest        0.3.2      2016-06-17 CRAN (R 3.5.0)                      
##  scales       1.0.0      2018-08-09 CRAN (R 3.5.0)                      
##  shiny        1.1.0      2018-05-17 CRAN (R 3.5.0)                      
##  stats      * 3.5.1      2018-07-05 local                               
##  stringi      1.2.4      2018-07-20 CRAN (R 3.5.0)                      
##  stringr    * 1.3.1      2018-05-10 CRAN (R 3.5.0)                      
##  tibble     * 1.4.2      2018-01-22 CRAN (R 3.5.0)                      
##  tidyr      * 0.8.1      2018-05-18 CRAN (R 3.5.0)                      
##  tidyselect   0.2.4      2018-02-26 CRAN (R 3.5.0)                      
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.5.0)                      
##  tools        3.5.1      2018-07-05 local                               
##  utf8         1.1.4      2018-05-24 CRAN (R 3.5.0)                      
##  utils      * 3.5.1      2018-07-05 local                               
##  withr        2.1.2      2018-03-15 CRAN (R 3.5.0)                      
##  xml2         1.2.0      2018-01-24 CRAN (R 3.5.0)                      
##  xtable       1.8-2      2016-02-05 CRAN (R 3.5.0)                      
##  yaml         2.2.0      2018-07-25 CRAN (R 3.5.0)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.