I am a lecturer in the Masters in Computational Social Science program. I earned my PhD in political science from Penn State University in 2015. My research interests focus on judicial politics, state courts, and agenda-setting. Methodologically I am interested in statistical learning and text analysis. I have training and experience in:
I have implemented these methods across a range of software, starting with Stata and eventually making the transition to R and Python. I am not a computer scientist, nor am I a statistician. I am a social scientist who uses statistical methods and computational tools to answer my research questions. That said, I feel fairly well trained in the theory and application of these methods at the graduate level.
Go to https://css18.github.io for the course site. This contains the course objectives, required readings, schedules, slides, etc.
tidyverse packages (e.g. loops, conditional statements, user-defined functions)
This is an extension of the Computational Math/Stats camp. Based on your placement exam results, we will be reviewing some of this material in more depth while omitting other topics entirely.
This class will focus mostly on probability and statistical inference, with some overlap and extension of materials from the camp.
Many classes will have assigned readings. You need to complete these before coming to class. I will assume you have done so and have at least a basic understanding of the material. Classes will be a mix of lecture and live-coding/in-class exercises. If you do not come to class prepared, then there is no point in coming to class.
15 min rule: when stuck, you HAVE to try on your own for 15 min; after 15 min, you HAVE to ask for help.- Brain AMA pic.twitter.com/MS7FnjXoGH
— Rachel Thomas (@math_rachel) August 14, 2016
We will follow the 15 minute rule in this class. If you encounter a problem in your assignments, spend 15 minutes troubleshooting the problem on your own. Make use of Google and StackOverflow to resolve the error. However, if after 15 minutes you still cannot solve the problem, ask for help. We will use GitHub to ask and answer class-related questions.
I am trying to balance two competing perspectives:
The point is that collaboration in this class is good - to a point. You are always, unless otherwise noted, expected to write and submit your own work. You should not blindly copy from your peers. You should not copy large chunks of code from the internet. That said, using the internet to debug programs is fine. Asking a classmate to help you is fine (the key phrase is help you, not do it for you).
The bottom line - if you don’t understand what the program is doing and are not prepared to explain it in detail, you should not submit it.
Students will complete weekly problem sets with a combination of analytical and computational problems. Each problem set is worth 10 points. Final grades will be determined based on cumulative performance across the problem sets. We will follow this basic homework workflow. If you have never used Git before, you’ll quickly learn, as you will use Git in both the Perspectives and CAPP sequences.
R is open-source software. First, this means using it is completely free. Second, open-source software is developed collaboratively, meaning the source code is open to public inspection, modification, and improvement.
R is widely used in the physical and social sciences, as well as in government, non-profits, and the private sector.
Many developers and social scientists write programs in R. As a result, there is also a large support community available to help troubleshoot problematic code. As seen in the RedMonk programming language rankings (which compare languages’ appearances on GitHub [usage] and StackOverflow [support]), R appears near the top of both rankings.
R, like any computing language, relies on programmatic execution of functions. That is, to do anything you must write code. This differs from popular statistical software such as Stata or SPSS, which at their core use a command language but overlay it with drop-down menus that enable a point-and-click interface. While a point-and-click interface is much easier to operate, it has several downsides, chiefly that it makes it difficult to reproduce one’s analysis.
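To illustrate the point about writing code, here is a minimal sketch of a scripted analysis. It assumes only base R and the built-in mtcars data set, which is chosen here purely for illustration and is not taken from the course materials. Because every step is written down as code, rerunning the script repeats the analysis exactly.

```r
# A small, fully scripted analysis (illustrative only).
# Uses the built-in `mtcars` data set, so no external files are needed.

# Step 1: average fuel efficiency (mpg) by number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Step 2: simple linear regression of fuel efficiency on weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

Anyone with a copy of this script can rerun both steps and obtain the same table and coefficients, which is difficult to guarantee when the same steps are carried out through menus.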
C, C++, Python, tensorflow, stan, etc.
graphics package is comprehensive and powerful, additional libraries such as ggplot2 and lattice make R the go-to language for powerful data visualization approaches.
Python was developed in the 1990s as a general-purpose programming language. It emphasizes simplicity over complexity in both its syntax and core functions. As a result, code written in Python is (relatively) easy to read and follow as compared to more complex languages like Perl or Java. As you can see in the above references, Python is just as popular as R, if not more so. Like R, it does many things well, but it is perhaps better in some aspects:
That said, there are also things it does not do as well as R:
matplotlib, pygal, and seaborn), but are still behind R in terms of comprehensiveness and ease of use. Of course, once you wish to create interactive and advanced information visualizations, you can also use more specialized software such as Tableau or D3.
numpy), data analysis (pandas), and machine learning (scikit-learn). However, I personally have found immense difficulty installing and managing packages in Python, even with the use of a package manager such as conda.
At the end of the day, I don’t think it is a debate between learning R vs. Python. Frankly, to be a desirable (and therefore highly compensated) data scientist you should learn both languages. R and Python complement each other, and even R/Python luminaries such as Hadley Wickham and Wes McKinney promote the benefits of both languages:
Python and R are NOT waging war. This is not a helpful characterisation
— Hadley Wickham (@hadleywickham) April 20, 2017
Since you are not expected to have prior programming experience entering this program and therefore may be learning programming from scratch simultaneously with this class, I cannot rely on you gaining necessary Python skills as we need them. My language of preference is R, and you will see R frequently throughout the winter and spring quarters if you enroll in my sections of Perspectives. Therefore, it will be to your benefit to learn how to use R now, and doing so will give you a leg up on your peers in the coming quarters.
As previously mentioned, the base R distribution is not the best for developing and writing programs. Instead, we want an integrated development environment (IDE) which will allow us to write and execute code, debug programs, and automate certain tasks. In this course we will use RStudio, perhaps the most popular IDE available for R. Like R, it is open-source, expandable, and provides many useful tools and enhancements over the base R environment.
Git is a version control system originally created for developers to collaborate on large software projects. Git tracks changes in the project over time so that there is always a comprehensive, structured record of the project. Each project is stored in a repository that includes all files that are part of the project. As social scientists, this is more than just computer scripts - this can include data files and reports, as well as our source code.
Git can be used locally by you on a single computer to track changes in a project. You do not need to be connected to the internet to use Git. However, if you want to share your work with a larger audience, the easiest solution is to host the repository on a website for others to download and inspect. You can host a public (open to the world) or private (open to just you or a few individuals) repository. GitHub has become the largest host of Git repositories and includes many useful features beyond the standard Git functions.
Chances are that by now you’ve seen or even used GitHub. Professionally, you should know how to use Git and GitHub to manage projects and share code. In this class, we will use Git and GitHub to host our course site, share code, and distribute/collect assignments.
Makes typesetting easy
$$f(x) = \frac{\exp(-\frac{(x - \mu)^2}{2\sigma^2} )}{ \sqrt{2\pi \sigma^2}}$$
\[f(x) = \frac{\exp(-\frac{(x - \mu)^2}{2\sigma^2} )}{ \sqrt{2\pi \sigma^2}}\]
Steep learning curve up front, but leads to big dividends later
We’ll use Markdown and \(\LaTeX\) in the problem sets to generate the final output for your R Markdown documents. I imagine you’ll also use \(\LaTeX\) extensively in the fall Perspectives course (it’s Dr. Evans’s preferred format), so it helps to learn it right off the bat. Make sure you install the appropriate distribution of \(\LaTeX\) for your computer, as RStudio cannot render R Markdown documents as PDFs without an existing distribution on your computer.
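The equation typeset above is the normal density. As a quick check on the formula, here is a minimal sketch that writes it out term by term in R and compares it with base R's dnorm(). The helper name normal_density and the example values of x, mu, and sigma are hypothetical, chosen only for illustration.

```r
# Normal density written out term by term, matching the typeset formula:
# f(x) = exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
# `normal_density` is a hypothetical helper name used only for illustration.
normal_density <- function(x, mu = 0, sigma = 1) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
}

# Arbitrary example inputs (assumptions, not from the course materials)
x <- c(-1, 0, 2.5)

# Compare the hand-written version against dnorm(), base R's normal density
by_hand  <- normal_density(x, mu = 1, sigma = 2)
built_in <- dnorm(x, mean = 1, sd = 2)

# TRUE when the two agree up to numerical tolerance
all.equal(by_hand, built_in)
```

In an R Markdown problem set, a chunk like this can sit directly beneath the typeset equation, so the math and the computation appear together in the rendered document.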
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.5.1 (2018-07-02)
## system x86_64, darwin15.6.0
## ui RStudio (1.1.456)
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-10-25
This work is licensed under the CC BY-NC 4.0 Creative Commons License.