1 Basic R

1.1 R help function

There are many helpfiles that explain the functions in R. These helpfiles can be accessed by typing ?functionname, which opens the helpfile for the function called functionname. For example, ?library opens the helpfile for the function library.
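For example, help() is the base R equivalent of ?, and ?? searches all installed helpfiles by keyword:

?library          # open the helpfile for the library function
help("library")   # equivalent to ?library
??"regression"    # search helpfiles for a keyword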

1.2 Packages

Packages provide code written by other users that we can use. Before being able to use packages, we need to install them (tell R to download the code). This only needs to be done once.

Before using the package, we then also need to tell R to load the code. This needs to be done in every session we start.

# install.packages('tidyverse') # install tidyverse package (only run once)
                                # note: anything on a line after # will not be run 
library(tidyverse)              # load the code in the tidyverse package (run every session)

1.3 Loading datasets

There are various ways to get data. Often, data comes in the form of .csv files. We can load such datasets with the read_csv() function in R. This function is provided by the tidyverse package.

For example, if we want to load data from the file beer.csv, we can do so as follows. You can check in RStudio in the “Environment” tab (usually the top right corner) what the loaded data looks like. There, it will show up as a new variable with the name we gave it.

library(tidyverse)  # load the package
# load the data from the .csv, and store it in a variable called `data`
df <- read_csv("beer.csv") 
# df = read_csv("beer.csv") yields the same 

Various packages also provide data directly. If we want to load data from an R package, we can do it like this:

library(ISLR)  # load the package containing the data, here ISLR
data(OJ)       # load the OJ dataset, it will be stored as `OJ` 

Datasets in R packages are usually described in helpfiles. These can be accessed with ?datasetname, so running ?OJ will open the helpfile for the dataset we just loaded.

1.4 Accessing columns and rows

A dataset contains many rows and various columns, where each column has an assigned name. Specific columns in the dataset can be accessed in multiple ways (after loading the data!):

OJ$StoreID  # use its name to access column StoreID
OJ[,3]      # use its index to access column StoreID (it is the 3rd column)

We can also access specific rows (note the comma!):

OJ[1,]      # access first row of OJ 

Note that when running these commands, R will show you the respective output, i.e. it will display the column or row you’re selecting.

1.5 Assigning new variables

We can store any output of a function in new variables. For example, we can store the column we accessed above as a new variable, and then use that variable as input to other functions.

col <- OJ$StoreID 
mean(col) # gives same as mean(OJ$StoreID)

1.6 Functions

R provides many functions, such as the mean() used above. Generally, these functions take inputs and produce some output. We can also define our own functions, which is useful if we want to repeatedly do the same operations for some inputs. For example, we can create our own function to calculate the mean with the code below.

my_own_mean_function <- function(observations){
  n_obs <- length(observations) # number of observations 
  mean <- sum(observations) / n_obs 
  return(mean) 
}

# Run the function 
observations <- c(1,2,3,4) # create vector 
mean_own <- my_own_mean_function(observations)

# General structure
functionname <- function(INPUT1,INPUT2,INPUT3){
  # Do stuff with INPUTS, generating OUTPUT
  return(OUTPUT)
  # note: can only return one OUTPUT variable, but can use c(OUTPUT1,OUTPUT2)

}

1.7 Default values in functions

Various functions in R come with default values. For example, if you type ?pnorm, the helpfile will show pnorm(q, mean = 0, sd = 1, lower.tail = TRUE). This means that to run pnorm, we do not need to provide the parameters mean, sd or lower.tail; they will default to the values specified in the helpfile. However, we can provide them if we want to use pnorm with values different from the defaults.

# These yield the same 
pnorm(2,mean=0,sd=1)
## [1] 0.9772499
pnorm(2)
## [1] 0.9772499
# Set a different value for the mean, rest at default
pnorm(2,mean=1)
## [1] 0.8413447
# Set a different value for the standard deviation, rest at default
pnorm(2,sd=2)
## [1] 0.8413447

1.8 Loops

In R (and most programming languages), we can write loops that, as the name suggests, loop over something. There are various types of loops, but here we are only going to use one type: iterate over a range of values and do something with each value. The code for such a loop is shown below. Pay attention to the parentheses and curly braces.

range <- 1:2 # range from 1 to 2 

for (i in range) {
  print(i)          # do something with the iterated i 
}
## [1] 1
## [1] 2
range <- c(1,2)      # alternative way to define range as a vector

1.9 Useful functions for working with data

The tidyverse package provides several useful functions for working with data. One of these is the read_csv() function we already used to load the data above. That data is still stored in a data frame called df.

To show the first few rows of a data frame, we can use the head() function.

head(df)
## # A tibble: 6 × 11
##   store        upc  week  move price sale  profit brand    packs…¹ items…² units
##   <dbl>      <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>      <dbl>   <dbl> <chr>
## 1    86 1820000016    91    23  3.49 <NA>   19.0  BUDWEIS…       6      12 oz   
## 2    86 1820000784    91     9  3.79 <NA>   28.2  O'DOUL'…       6      12 oz   
## 3    86 1820000834    91     9  3.69 <NA>   22.0  BUDWEIS…       6      12 oz   
## 4    86 1820000987    91    78  3.29 B       5.78 MICHELO…       6      12 oz   
## 5    86 3410000354    91    35  3.69 <NA>   23.0  MILLER …       6      12 oz   
## 6    86 3410000554    91    12  3.69 <NA>   23.0  MILLER …       6      12 oz   
## # … with abbreviated variable names ¹​packsize, ²​itemsize

Often, we want to access only a specific subset of the data. Within the tidyverse package, this can be easily done with the filter() function.

df_subset <- filter(df,week == 91) # select only those rows where week==91
df_subset <- df %>% filter(week==91) # equivalent, but using pipe operator %>%

Another very common task is to calculate statistics at the group level. For example, calculate the average beer price over time within a store, or calculate the average beer price in a week across stores.

# create new dataset that is grouped by week 
df_grouped <- group_by(df,week) 
# Create new dataset that summarizes groups by taking the mean of price within the group
# the new dataset has one observation per group
avg_price_week <- summarize(df_grouped,mean(price))

# We can do exactly the same, but with %>% operators 
avg_price_week <- df %>% group_by(week) %>% summarize(mean(price)) 

# plot it with ggplot 
ggplot(avg_price_week,aes(x=week,y=`mean(price)`)) + geom_line()

These two functions also provide much more functionality, as described in their respective helpfiles.
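For example, filter() accepts several conditions at once, and summarize() can compute several named statistics in one call. A small sketch, reusing df and the column names from above:

# keep rows from week 91 with a price above 3
df_subset2 <- df %>% filter(week == 91, price > 3)
# several named summaries at once; n() counts observations per group
price_stats <- df %>% group_by(week) %>% summarize(meanprice = mean(price), sdprice = sd(price), n_obs = n())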

One thing to note is that in the new data frame avg_price_week, the newly created variable for the mean price is called `mean(price)`. The code below shows how to access it, as well as how to rename the variable.

# access variable, note the ` around the variable name
avg_price_week$`mean(price)`

# Rename variables
avg_price_week <- rename(avg_price_week,meanprice = `mean(price)`)
# avg_price_week <- avg_price_week %>% rename(meanprice=`mean(price)`) # equivalent with %>%

Beyond grouping, filtering, and renaming, the tidyverse also provides various other functions that are useful for working with datasets: select(), arrange(), mutate(), slice(). You can find more details on these in their respective helpfiles; a short sketch follows below.
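A minimal sketch of each, assuming the beer.csv columns shown in the head() output above:

df_small  <- select(df, store, week, price)                # keep only these three columns
df_sorted <- arrange(df, desc(price))                      # sort rows by price, descending
df_new    <- mutate(df, price_per_pack = price / packsize) # create a new column
df_top    <- slice(df, 1:5)                                # keep only the first 5 rows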

2 Random variables and distributions

2.1 Known family of distributions

In week 2 we learned that a distribution is a list of all possible outcomes for a random variable, and an associated probability for each outcome. The latter is defined through either a probability mass function (PMF) or a probability density function (PDF), depending on whether the random variable is discrete or continuous.

Some distributions are particularly useful, and they have been parameterized such that only a handful of parameters are needed to fully describe their PMF or PDF.

Below is a list of commonly used known distributions that we also use in this course. For each distribution, the list shows the parameters and what they are called in R for the functions introduced below.

| Distribution name | R shortname | Parameters | PDF or PMF |
|---|---|---|---|
| Normal | norm | mean (\(\mu\)), sd (\(\sigma\)) | \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)\) |
| Binomial | binom | size (\(n\)), prob (\(p\)) | \(P(X=k)={n\choose k} p^k(1-p)^{n-k}\) |
| Bernoulli | binom with size=1 | prob (\(p\)) | \(P(X=1) = p\) |
| Uniform | unif | min, max | \(f(x) = \frac{1}{\max - \min} \text{ for } x\in[\min,\max]\) |
| Student t | t | df (\(\nu\)) | \(f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}\) |

2.2 Calculating the PDF or PMF

For known families of distributions, we can calculate the PDF or PMF using the dDIST(x,PARS) function, replacing DIST with the respective distribution short name, and PARS with the required parameters. For the distributions above, this looks as follows:

| Distribution name | Evaluate PDF or PMF at \(x=1\) |
|---|---|
| Normal | dnorm(1,mean=0,sd=1) |
| Binomial | dbinom(1,size=20,prob=0.3) |
| Bernoulli | dbinom(1,size=1,prob=0.3) |
| Uniform | dunif(1,min=0.3,max=1.2) |
| Student t | dt(1,df=20) |
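For example, the standard normal PDF evaluated at \(x=1\):

dnorm(1,mean=0,sd=1)
## [1] 0.2419707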

2.3 Cumulative distribution function (CDF)

The cumulative distribution function of a random variable X evaluated at x (recall, lower case is a realization!) gives us the probability that realizations of X smaller than or equal to x occur. Formally, the CDF of a random variable X is given by \[ F(x) = P(X \leq x) \]

In R, we can calculate the CDF with the pDIST(x,PARS) function, again replacing DIST and PARS. For the distributions above, this looks as follows.

| Distribution name | Evaluate CDF at \(x=1\) |
|---|---|
| Normal | pnorm(1,mean=0,sd=1) |
| Binomial | pbinom(1,size=20,prob=0.3) |
| Bernoulli | pbinom(1,size=1,prob=0.3) |
| Uniform | punif(1,min=0.3,max=1.2) |
| Student t | pt(1,df=20) |
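As a quick check of the “area under the PDF” interpretation (see Section 2.5 below), we can numerically integrate the standard normal PDF up to 1 with base R’s integrate() and compare it to pnorm(1):

pnorm(1)                                  # CDF of the standard normal at 1
integrate(dnorm, lower = -Inf, upper = 1) # area under the PDF up to 1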

2.4 Quantile function (inverse of CDF)

The quantile function is defined as the inverse of the CDF. Let \(F(x)\) denote the CDF; then the quantile function is defined by \[Q(p) = F^{-1}(p)\] where the input \(p \in [0,1]\) is a probability.

In R, we can calculate the quantile function with the qDIST(p,PARS) function, again replacing DIST and PARS. For the distributions above, this looks as follows.

| Distribution name | Evaluate quantile function at \(p=1\) |
|---|---|
| Normal | qnorm(1,mean=0,sd=1) |
| Binomial | qbinom(1,size=20,prob=0.3) |
| Bernoulli | qbinom(1,size=1,prob=0.3) |
| Uniform | qunif(1,min=0.3,max=1.2) |
| Student t | qt(1,df=20) |
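Note that the input to qDIST is a probability. For example, the familiar 97.5% quantile of the standard normal:

qnorm(0.975)
## [1] 1.959964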

We can also verify that this is really the inverse. The below calculates \(Q(F(x))=F^{-1}(F(x))=x\).

x <- 2
qnorm(pnorm(x,mean=1,sd=2),mean=1,sd=2)
## [1] 2

2.5 PDF, CDF, and quantile function in one graph

The following graph shows how the PDF, the CDF and the quantile function relate to one another. The PDF is a continuous function over all possible outcomes of the random variable \(X\), with larger values in regions where outcomes are more likely.

The CDF then is the area under the PDF, up to the specified value \(z\) in \(P(X \leq z)\).

Finally, the quantile function inverts the CDF, and gives the answer to the following question: if the CDF equals some value (e.g. \(P(X \leq z) = 0.3\)), what is \(z\)?
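A minimal way to draw these objects yourself, using base R’s curve() for the standard normal:

# PDF and CDF of the standard normal
curve(dnorm(x), from = -3, to = 3, main = "PDF", ylab = "f(x)")
curve(pnorm(x), from = -3, to = 3, main = "CDF", ylab = "F(x)")
# the quantile function answers: for which z is P(X <= z) = 0.3?
z <- qnorm(0.3)
abline(v = z, lty = 2) # mark z in the CDF plot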

2.6 Drawing random numbers

R also allows us to take random draws from these distributions. This can be done with rDIST(n,PARS), where we again replace DIST and PARS according to the distribution, and set n to the number of draws we want to take.

| Distribution name | Take \(n=100\) draws |
|---|---|
| Normal | rnorm(100,mean=0,sd=1) |
| Binomial | rbinom(100,size=20,prob=0.3) |
| Bernoulli | rbinom(100,size=1,prob=0.3) |
| Uniform | runif(100,min=0.3,max=1.2) |
| Student t | rt(100,df=20) |

Because computers do not really understand randomness, these draws are not fully random, but instead taken from a particular sequence (they are “pseudo-random”). This allows us to always reproduce the same draws, which is useful for replicable results (i.e. making sure you always get exactly the same results when you run the code). To this end, we can set a seed with set.seed(k), where we choose some integer k; whenever we set the seed with the same k, we get exactly the same random draws.

set.seed(102)
draws = runif(10,min=0,max=2) 
draws1 = runif(10,min=0,max=2)
identical(draws,draws1) # test whether the two are identical
## [1] FALSE
set.seed(102) # set seed again
draws2 = runif(10,min=0,max=2)
identical(draws,draws2) # test whether the two are identical
## [1] TRUE

3 Simulations for the sampling distribution

In R, we can code up how exactly the sampling distribution of different statistics arises.

There are two approaches:

  1. create a population that we effectively take samples from
  2. assume a population distribution (with true parameters), and use the random number generator (RNG) to sample from it

Below, we use examples of both approaches.

3.1 Mean / average

The central limit theorem (CLT) tells us that the sampling distribution of the average computed from a random sample is approximately normal (if the sample is large enough). We can verify this with the R code below, which uses approach 1.

# Create population with 1000000 individuals from which we take samples
# each observation in this case is a specific number, this could be anything 
# here, I just use runif() to create the population randomly 
population <- runif(1000000,min=0.5,max=1.5) # the total population is 1000000 individuals 

# Use a loop to take 1000 samples of size n=500 from the population 
# for each sample calculate and store the mean
means <- c() # initialize empty vector
for (i in 1:1000) {
  realized_sample <- sample(population,500) # take sample of size 500 from population 
  mean_sample     <- mean(realized_sample)  # calculate mean in sample 
  means <- append(means,mean_sample)        # add calculated mean to list `means` 
}

# vector means now stores means calculated from 1000 samples of the population
# it reflects the sampling distribution
hist(means) # simple way to plot histogram 

Does the resulting histogram look like the uniform distribution, which is the population distribution?

No! The histogram looks like a normal distribution, with the center at 1.0, the mean of the underlying population distribution. This is the Central Limit Theorem in action.
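The CLT also pins down the spread of this sampling distribution: the standard deviation of the simulated means should be close to the population standard deviation divided by \(\sqrt{n}\). We can check this with the objects created above:

sd(means)                  # spread of the simulated sampling distribution
sd(population) / sqrt(500) # CLT prediction: population sd divided by sqrt(n)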

3.2 Confidence intervals

We can also use R to show how we should interpret confidence intervals and how they relate to sampling distributions.

As with any estimator (above, the mean), the confidence interval depends on the sample we have in front of us and, hence, is random. We now simulate this sampling distribution using approach 2: we assume that the population (the true) distribution is \(X_i \sim N(1,4)\), and sample directly from this distribution. Throughout, we use a 95% confidence interval.

ci_lb_distribution <- c() # initialize empty vector to store lower bounds of CI
ci_ub_distribution <- c() # initialize empty vector to store upper bounds of CI
for (i in 1:1000) {       # loop for 1000 samples
  realized_sample <- rnorm(500,mean=1,sd=2) # take sample of size 500 from N(1,4)
  mean_sample     <- mean(realized_sample)  # calculate mean in sample
  sd_sample       <- sd(realized_sample)    # calculate standard deviation in sample 
  # calculate lower bound of confidence interval
  ci_lb_sample <- mean_sample - sd_sample / sqrt(500) * qt(0.025,df=499,lower.tail=FALSE)
  # calculate upper bound of confidence interval
  ci_ub_sample <- mean_sample + sd_sample / sqrt(500) * qt(0.025,df=499,lower.tail=FALSE)
  ci_lb_distribution <- append(ci_lb_distribution,ci_lb_sample)    # store lb
  ci_ub_distribution <- append(ci_ub_distribution,ci_ub_sample)    # store ub
}

# Sampling distribution of lower bound
hist(ci_lb_distribution)

We can also verify the interpretation of the confidence interval: does the confidence interval contain the true value (mean=1) in about 95% of the samples?

# Count how many times calculated confidence interval contains true mean=1
n_in_bounds <- sum(ci_lb_distribution < 1 & ci_ub_distribution > 1) # & is AND operator 
pct_in_bounds <- n_in_bounds / 1000
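Displaying the stored share shows the coverage, which should be close to 0.95:

pct_in_bounds # share of samples whose CI contains the true mean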

3.3 Hypothesis tests

We now simulate performing hypothesis tests to check the interpretation of the type I / type II errors. For this, we again use approach 2 (assuming a population distribution).

The hypothesis test we simulate is for the following hypothesis: \[ H_0: \mu = 0 \] \[ H_A: \mu \neq 0 \]

We assume that we do not know the population variance and instead need to estimate it. You can check on the slides what that means for the test statistic. Throughout, we choose the significance level \(\alpha=0.05\).

Let’s simulate what happens when the null hypothesis is true. The code below takes many samples and, based on each sample, stores whether we reject the null hypothesis in a vector reject_null_store. The sampling is based on approach 2 (see above).

reject_null_store <- c() 
for (i in 1:1000) { # loop for 1000 samples
  # take sample of size 500 under assumption null is true (mean=0)
  realized_sample   <- rnorm(500,mean=0,sd=1) 
  # Calculate sample mean and standard deviation
  mean_sample       <- mean(realized_sample)
  sd_sample         <- sd(realized_sample) 
  # calculate test statistic
  test_statistic    <- (mean_sample - 0) / (sd_sample / sqrt(500)) 
  # Calculate p-value 
  pval              <- 2*pt(abs(test_statistic),df=499,lower.tail=FALSE) 
  # reject if p-value smaller than significance level 
  reject_null_sample <- pval < 0.05
  # Store
  reject_null_store<- append(reject_null_store,reject_null_sample)
}

We can now check how often we reject the null hypothesis, given that it is true. The output below shows that in 5.3% of the samples, we end up rejecting the null hypothesis even though it is true. This is the Type I error rate, and it should be roughly equal to the significance level \(\alpha =0.05\) we decided on above.

sum(reject_null_store) / 1000 * 100 
## [1] 5.3

We can also check what happens if the null hypothesis is indeed false. The code below does a similar simulation, but now assumes that the null hypothesis is false (it assumes that \(\mu = 1\)). The output shows that, in this case, we always reject the null hypothesis, i.e. we never commit a Type II error (failing to reject a false null). You can experiment with different values for the mean and see what happens then.

reject_null_store <- c() 
for (i in 1:1000) { # loop for 1000 samples
  # take sample of size 500 under assumption null is false (mean=1)
  realized_sample   <- rnorm(500,mean=1,sd=1) 
  # Calculate sample mean and sd, then the test statistic
  mean_sample       <- mean(realized_sample)
  sd_sample         <- sd(realized_sample) 
  test_statistic    <- (mean_sample - 0) / (sd_sample / sqrt(500))
  # Calculate p-value 
  pval              <- 2*pt(abs(test_statistic),df=499,lower.tail=FALSE) 
  # reject if p-value smaller than significance level 
  reject_null_sample <- pval < 0.05
  # Store
  reject_null_store<- append(reject_null_store,reject_null_sample)
}

# Check how often we reject 
sum(reject_null_store) / 1000 * 100 
## [1] 100