R is an extremely powerful and versatile statistical environment. This course gives students an introduction to working with R, including data input and output, data manipulation, basic statistical analysis and visualisation. Topics covered include operating R through RStudio, interfacing R with popular data formats, inferential statistics with weighted and unweighted data, and the visualisation of data and results. Advanced methods such as regression, generalised linear models, some multivariate techniques and missing data handling will be covered as well. This content is enriched with best-practice examples of reproducible research and version control. All sessions will include demonstrations and hands-on exercises. After completing this course, students should be able to confidently conduct many common statistical analyses in R on their own, have a clear idea of how R can be used for reproducible research, and be able to extend their knowledge of R autonomously.

For many years, the main purpose of the R statistical environment was to enable statisticians to exchange ideas and methods. Those days have passed. Today, R has matured into a very powerful and versatile program that can be (and indeed is) used for any kind of statistical analysis. R's key advantage is that it is free: free as in free beer (which saves anyone who masters R the prohibitively high licensing costs of commercial statistical software) and free as in freedom. This latter freedom allows statisticians, data scientists and other practitioners to extend and modify R in any way they see fit. At the time of this writing, R stands at version 3.0.1 and offers more than 6,000 extension packages. While R is extremely popular in the life sciences, natural sciences, engineering and finance, the social sciences have yet to adopt it widely. The reasons for this are manifold, but R's reputation for being hard to learn certainly features prominently among them. Although R's interfaces and usability have improved vastly over the last five years, that image has persisted. This course sets out to correct it.

Because of its length, this course will not only cover introductory topics but also progress quickly to advanced ones. Special emphasis will be put on demonstrating R and RStudio workflows through best-practice examples. This includes a (very brief) introduction to reproducible research (using markdown and pandoc) and version control (using git). Students can either kick back and watch these examples or replicate them on their own computers. In the latter case, working installations of pandoc and git are required.

The course begins with an introduction to the RStudio interface to R and basic syntactic operations. We will then proceed to R's different data types and their peculiarities. R is a functional programming language, and its logic and terminology differ radically from the less powerful but more widespread concepts used in SPSS, Stata or even Excel. Sufficient time will therefore be spent on these subjects to lay strong foundations for what follows.
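As a minimal sketch of what this functional style looks like in practice (the toy data and the helper name `standardise` are made up for illustration and are not taken from the course materials):

```r
# Operations apply to whole vectors at once; no explicit loop is needed.
heights_cm <- c(172, 181, 165)
heights_m  <- heights_cm / 100

# Functions are ordinary objects: they can be stored in variables
# and passed to other functions.
standardise <- function(x) (x - mean(x)) / sd(x)  # hypothetical helper

# Apply the same function to every element of a list.
columns <- list(a = c(1, 2, 3), b = c(10, 20, 30))
z <- lapply(columns, standardise)

print(heights_m)
print(z)
```

Users coming from SPSS or Stata may find it unfamiliar that `lapply()` takes a function as an argument, which is one reason the course spends time on these foundations.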

Once data manipulation has been mastered, attention will turn to applying basic inferential statistics to simple data sets. Methods covered range from descriptive statistics to contingency tables and the associated hypothesis tests. This section also includes basic data visualisation. Here we will also cover importing data sets in the formats typically found in social science contexts: SPSS's sav, Stata's dta and Excel's xls.

In the social sciences, we can hardly ever work with simple data sets. Data are often collected through surveys that need to be corrected for bias, or sampling was carried out at multiple levels. In either case, the analysis must compensate for the resulting changes in variance and inclusion probabilities. R is well equipped to handle all of these cases. Unfortunately, hardly any introductory courses in statistical software cover this topic. To empower social scientists to use their knowledge even in situations where the classical statistical assumption of simple random sampling is violated, we will devote ample time to this topic.
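To illustrate why unequal inclusion probabilities matter, here is a small sketch in base R. The data set, the variable names and the inclusion probabilities are invented for illustration; they are assumptions, not course data.

```r
# Hypothetical toy survey in which group B was heavily oversampled.
survey <- data.frame(
  group  = c("A", "A", "B", "B", "B", "B"),
  income = c(30, 34, 50, 54, 52, 56),
  p_incl = c(0.1, 0.1, 0.4, 0.4, 0.4, 0.4)  # assumed inclusion probabilities
)

# Design weights are the inverse of the inclusion probabilities.
survey$weight <- 1 / survey$p_incl

# The naive mean treats the sample as a simple random sample.
unweighted <- mean(survey$income)

# The weighted mean gives each respondent the share of the
# population they represent.
weighted <- weighted.mean(survey$income, w = survey$weight)

print(c(unweighted = unweighted, weighted = weighted))
```

In this toy example the unweighted mean is 46 and the weighted mean is 39. Note that `weighted.mean()` only corrects the point estimate. Correct standard errors under a complex design need design-based variance estimation, for which add-on packages such as `survey` exist.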

Days 3 and 4 will focus on advanced statistical methodology, covering classical ANOVA, OLS regression and generalised linear models, as well as some multivariate statistical methods such as principal component analysis, exploratory factor analysis and decision trees. Finally, the course will touch on topics that are usually very important but underemphasised: the treatment of missing values and advanced (i.e. beautiful and/or dynamic) visualisations.

Each session in this course will consist of hands-on practical training using demos and in-class assignments, paired with presentation slides and a comprehensive set of lecture notes. The idea is to complement instruction with reference material that students can use to build on what they have learned after WSMT has ended. Most sessions will also contain best-practice examples, so that students can optimise their own workflows.

Note: this is not a methods course, so while we will use a large array of methods, we will spend little to no time on how or why methods work, and instead focus fully on how to use them with R.

To ensure that all students get the most out of the course, a sound understanding of inferential statistics is assumed. If need be, students should review their knowledge before the beginning of the Winter School. Likewise, all students must be confident in using their own computer.

ECTS: Students of this course may complete a small project to obtain 2 ECTS credits. These projects consist of autonomous data analysis, interpretation and reporting, preferably composed as reproducible research, as presented at the beginning of this course. Students can either work with data of their own (if approved) or complete a provided assignment.


Day-to-day schedule

Topics Details

Introduction to R & RStudio; including best practices in reproducible research and version control

Lab session: lecture intertwined with hands-on demos and exercises

Feb 18th

Basic inference statistics with ordinary and weighted data

Lab session: lecture intertwined with hands-on demos and exercises

Feb 19th

Advanced statistical models

Lab session: lecture intertwined with hands-on demos and exercises

Feb 20th

Advanced statistical models

Lab session: lecture intertwined with hands-on demos and exercises

Missing values, Advanced visualization

Lab session: lecture intertwined with hands-on demos and exercises



All required readings are either in the book R in Action or on this website, in the respective session sections.

Day 1

Chapters 1, 2, and 4

Day 2

Chapters 6 and 7, as well as the Complex Samples materials online

Day 3

Chapters 8 and 9

Day 4

Chapters 13 and 14

Day 5

Chapters 15 and 16

Students should practise R each day using the exercises handed out during the sessions. Each day's exercises will typically take 30 to 60 minutes. Students should be prepared to present their solutions to the class on the following day.