There are many helpfiles that explain the functions in R. These helpfiles can be accessed by typing ?functionname, which opens the helpfile for the function called functionname. For example, ?library opens the helpfile for the function library.
Packages provide code written by other users that we can use. Before being able to use packages, we need to install them (tell R to download the code). This only needs to be done once.
Before using the package, we then also need to tell R to load the code. This needs to be done in every session we start.
# install.packages('tidyverse') # install tidyverse package (only run once)
# note: anything on a line after # will not be run
library(tidyverse) # load the code in the tidyverse package (run every session)
There are various ways to get data. Often, data comes in the form of .csv files. We can load such datasets with the read_csv() function in R. This function is provided by the tidyverse library.
For example, if we want to load data from the file beer.csv, we can do it as shown below. You can check in RStudio in the “Environment” tab (usually the top right corner) what the loaded data looks like. There, it will show up as a new variable with the name we gave it.
library(tidyverse) # load the package
# load the data from the .csv, and store it in a variable called `df`
df <- read_csv("beer.csv")
# df = read_csv("beer.csv") yields the same
Various packages also provide data directly. If we want to load data from an R package, we can do it like this:
library(ISLR) # load the package containing the data, here ISLR
data(OJ) # load the OJ dataset, it will be stored as `OJ`
Datasets in R packages are usually described in helpfiles. These can be accessed with ?datasetname, so running ?OJ will open the helpfile for the dataset we just loaded.
A dataset contains many rows and various columns, where each column has an assigned name. Specific columns in the dataset can be accessed in multiple ways (after loading the data!):
OJ$StoreID # use its name to access column StoreID
OJ[,3] # use its index to access column StoreID (it is the 3rd column)
We can also access specific rows (note the comma!):
OJ[1,] # access first row of OJ
Note that when running this code, R will show you the respective output, i.e. it will display the column you’re selecting.
We can store any output of a function in new variables. For example, we can store the column we access above as a new variable, and then use that variable as input to other functions.
col <- OJ$StoreID
mean(col) # gives same as mean(OJ$StoreID)
R provides many functions, such as the mean() function used above.
Generally, these functions take inputs and provide some output. We can
also define our own functions, which is useful if we want to repeatedly
do the same operations for some inputs. For example, we can create our
own function to calculate the mean with the code below.
my_own_mean_function <- function(observations){
n_obs <- length(observations) # number of observations
mean <- sum(observations) / n_obs
return(mean)
}
# Run the function
observations <- c(1,2,3,4) # create vector
mean_own <- my_own_mean_function(observations)
# General structure
functionname <- function(INPUT1,INPUT2,INPUT3){
# Do stuff with INPUTS, generating OUTPUT
return(OUTPUT)
# note: can only return one OUTPUT variable, but can use c(OUTPUT1,OUTPUT2)
}
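To illustrate the note about returning multiple values, here is a small sketch (the function name is made up for illustration): a function that returns both the mean and the number of observations by combining them with c().

```r
# Hypothetical example: return two results by combining them into one vector with c()
mean_and_count <- function(observations){
  n_obs <- length(observations)          # number of observations
  mean_obs <- sum(observations) / n_obs  # mean of the observations
  return(c(mean_obs, n_obs))             # combine both outputs with c()
}

results <- mean_and_count(c(1,2,3,4))
results[1] # the mean
results[2] # the number of observations
```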
Various functions in R come with default values. For example, if you type ?pnorm, the helpfile will show pnorm(q, mean = 0, sd = 1, lower.tail = TRUE). This means that to run pnorm, we do not need to provide the parameters mean, sd or lower.tail; they will default to the values specified in the helpfile. However, we can provide them if we want to use pnorm with values different from the default ones.
# Setting lower.tail=FALSE gives the upper tail instead of the default lower tail
pnorm(2,mean=0,sd=1,lower.tail=FALSE)
## [1] 0.02275013
# With all parameters at their defaults
pnorm(2)
## [1] 0.9772499
# Set a different value for the mean, rest at default
pnorm(2,mean=1)
## [1] 0.8413447
# Set a different value for the standard deviation, rest at default
pnorm(2,sd=2)
## [1] 0.8413447
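As a quick sanity check on the lower.tail argument: with lower.tail=FALSE, pnorm returns the upper-tail probability, which is simply one minus the default lower-tail probability.

```r
# The upper tail is the complement of the lower tail
upper <- pnorm(2, lower.tail = FALSE)
lower <- pnorm(2)
upper
## [1] 0.02275013
1 - lower
## [1] 0.02275013
```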
In R (and most programming languages), we can write loops that, as the name suggests, loop over something. There are various types of loops, but here we are only going to use one type: iterate over a range of values and do something with these values. The code for such a loop is shown below. Pay attention to the brackets.
range <- 1:2 # range from 1 to 2
for (i in range) {
print(i) # do something with the iterated i
}
## [1] 1
## [1] 2
range <- c(1,2) # alternative way to define range as a vector
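Loops become useful when we carry information across iterations. As a small sketch (the variable names are just for illustration), the loop below adds up the numbers 1 to 4 in a running total:

```r
total <- 0          # start the running total at zero
for (i in 1:4) {
  total <- total + i # add the current value of i to the running total
}
total
## [1] 10
```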
The tidyverse package provides several useful functions to work with data. One of these is the read_csv() function we already used above to load the data. The data is still stored in a data frame called df.
To show the first few rows of a data frame, we can use the head() function.
head(df)
## # A tibble: 6 × 11
## store upc week move price sale profit brand packs…¹ items…² units
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 86 1820000016 91 23 3.49 <NA> 19.0 BUDWEIS… 6 12 oz
## 2 86 1820000784 91 9 3.79 <NA> 28.2 O'DOUL'… 6 12 oz
## 3 86 1820000834 91 9 3.69 <NA> 22.0 BUDWEIS… 6 12 oz
## 4 86 1820000987 91 78 3.29 B 5.78 MICHELO… 6 12 oz
## 5 86 3410000354 91 35 3.69 <NA> 23.0 MILLER … 6 12 oz
## 6 86 3410000554 91 12 3.69 <NA> 23.0 MILLER … 6 12 oz
## # … with abbreviated variable names ¹packsize, ²itemsize
Often, we want to access only a specific subset of the data. Within the tidyverse package, this can be easily done with the filter() function.
df_subset <- filter(df,week == 91) # select only those rows where week==91
df_subset <- df %>% filter(week==91) # equivalent, but using pipe operator %>%
Another very common task is to calculate statistics on the group level: for example, the average beer price over time within a store, or the average beer price in a week across stores.
# create new dataset that is grouped by week
df_grouped <- group_by(df,week)
# Create new dataset that summarizes groups by taking the mean of price within the group
# the new dataset has one observation per group
avg_price_week <- summarize(df_grouped,mean(price))
# We can do exactly the same, but with %>% operators
avg_price_week <- df %>% group_by(week) %>% summarize(mean(price))
# plot it with ggplot
ggplot(avg_price_week,aes(x=week,y=`mean(price)`)) + geom_line()
These two functions also provide much more functionality, as described in the respective helpfiles.
One thing to note is that in the new data frame avg_price_week, the newly created variable for the mean price is called `mean(price)`. The code below shows how to access it, as well as how to rename the variable.
# access variable, note the ` around the variable name
avg_price_week$`mean(price)`
# Rename variables
avg_price_week <- rename(avg_price_week,meanprice = `mean(price)`)
# avg_price_week <- avg_price_week %>% rename(meanprice=`mean(price)`) # equivalent with %>%
Beyond grouping, filtering, and renaming, the tidyverse also provides various other functions that are useful to work with datasets: select(), arrange(), mutate(), slice(). You can find more details on these in their respective helpfiles.
In week 2 we learned that a distribution is a list of all possible outcomes for a random variable, and an associated probability for each outcome. The latter is defined through either a probability mass function (PMF) or a probability density function (PDF), depending on whether the random variable is discrete or continuous.
Some distributions are particularly useful, and they have been parameterized such that only a handful of parameters are needed to fully describe their PMF or PDF.
Below is a list of commonly used known distributions that we also use in this course. For each distribution, the list shows the parameters and what they are called in R for the functions introduced below.
Distribution name | R shortname | parameters | PDF or PMF |
---|---|---|---|
Normal | norm | mean (\(\mu\)), sd (\(\sigma\)) | \(f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}(\frac{x-\mu}{\sigma})^2\right)\) |
Binomial | binom | size (\(n\)), prob (\(p\)) | \(P(X=k)={n\choose k} p^k(1-p)^{n-k}\) |
Bernoulli | binom with size=1 | prob (\(p\)) | \(P(X=1) = p\) |
Uniform | unif | min, max | \(f(x) = \frac{1}{\max - \min} \text{ for }x\in[\min,\max]\) |
Student t | t | df (\(\nu\)) | \(f(x) = {\Gamma({\nu+1\over2})\over\sqrt{\nu\pi}\Gamma({\nu\over2})}\left(1+\frac{x^2}{\nu} \right)^{-{(\nu+1)\over2}}\) |
For a known family of distributions, we can calculate the PDF or PMF using the dDIST(x,PARS) function, replacing DIST with the respective distribution short name, and PARS with the required parameters. For the distributions above, this looks as follows:
Distribution name | Evaluate PDF or PMF at \(x=1\) |
---|---|
Normal | dnorm(1,mean=0,sd=1) |
Binomial | dbinom(1,size=20,prob=0.3) |
Bernoulli | dbinom(1,size=1,prob=0.3) |
Uniform | dunif(1,min=0.3,max=1.2) |
Student t | dt(1,df=20) |
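As a check that these d-functions really evaluate the PDFs from the table above, we can compute the normal PDF by hand at \(x=1\) with \(\mu=0\) and \(\sigma=1\) and compare it to dnorm():

```r
x <- 1; mu <- 0; sigma <- 1
# normal PDF formula from the table, evaluated by hand
pdf_by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
pdf_by_hand
## [1] 0.2419707
dnorm(x, mean = mu, sd = sigma) # built-in function gives the same value
## [1] 0.2419707
```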
The cumulative distribution function (CDF) of a random variable X evaluated at z (recall, lower case is a realization!) gives us the probability that realizations of X smaller than or equal to z occur. Formally, the CDF of a random variable X is given by \[ F(z) = P(X \leq z) \]
In R, we can calculate the CDF with the pDIST(x,PARS) function, again replacing DIST and PARS. For the distributions above, this looks as follows.
Distribution name | Evaluate CDF at \(x=1\) |
---|---|
Normal | pnorm(1,mean=0,sd=1) |
Binomial | pbinom(1,size=20,prob=0.3) |
Bernoulli | pbinom(1,size=1,prob=0.3) |
Uniform | punif(1,min=0.3,max=1.2) |
Student t | pt(1,df=20) |
The quantile function is defined by the inverse of the CDF. Let \(F(x)\) denote the CDF, then the quantile function is defined by: \[Q(x) = F^{-1}(x)\]
In R, we can calculate the quantile function with the qDIST(x,PARS) function, again replacing DIST and PARS. For the distributions above, this looks as follows.
Distribution name | Evaluate quantile function at \(x=1\) |
---|---|
Normal | qnorm(1,mean=0,sd=1) |
Binomial | qbinom(1,size=20,prob=0.3) |
Bernoulli | qbinom(1,size=1,prob=0.3) |
Uniform | qunif(1,min=0.3,max=1.2) |
Student t | qt(1,df=20) |
We can also verify that the quantile function is really the inverse of the CDF. The code below calculates \(Q(F(x))=F^{-1}(F(x))=x\).
x <- 2
qnorm(pnorm(x,mean=1,sd=2),mean=1,sd=2)
## [1] 2
The following graph shows how the PDF, the CDF and the quantile function relate to one another. The PDF is a continuous function over all possible outcomes of the random variable \(X\), with larger values in regions where outcomes are more likely.
The CDF then is the area under the PDF, up to the specified value \(z\) in \(P(X \leq z)\).
Finally, the quantile function inverts the CDF, and gives the answer to the following question: if the CDF equals some value (e.g. \(P(X \leq z) = 0.3\)), what is \(z\)?
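The statement that the CDF is the area under the PDF can be checked numerically with base R's integrate() function, here for the standard normal at \(z=0.5\):

```r
# Area under the standard normal PDF up to z = 0.5
area <- integrate(dnorm, lower = -Inf, upper = 0.5)$value
area
pnorm(0.5) # the CDF gives the same area, about 0.691
```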
R also allows us to take random draws from these distributions. This can be done with rDIST(n,PARS), where we again replace DIST and PARS according to the distribution, and set n to the number of draws we want to take.
Distribution name | Take \(n=100\) draws |
---|---|
Normal | rnorm(100,mean=0,sd=1) |
Binomial | rbinom(100,size=20,prob=0.3) |
Bernoulli | rbinom(100,size=1,prob=0.3) |
Uniform | runif(100,min=0.3,max=1.2) |
Student t | rt(100,df=20) |
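The r-functions tie back to the p-functions: the fraction of random draws at or below a cutoff should approximate the CDF at that cutoff. A small sketch (cutoff and number of draws chosen just for illustration):

```r
draws <- rnorm(100000, mean = 0, sd = 1) # 100000 draws from N(0,1)
mean(draws <= 1) # fraction of draws at or below 1
pnorm(1)         # theoretical probability, about 0.841
```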
Because computers do not really understand randomness, these draws are not fully random, but instead taken from a particular sequence (so-called “pseudo-random” numbers). This allows us to always reproduce the same draws, which is useful for replicable results (i.e. making sure you always get exactly the same results when you run the code). To this end, we can set a seed with set.seed(k), where we can choose some integer k; whenever we set the seed with the same k, we get exactly the same random draws.
set.seed(102)
draws = runif(10,min=0,max=2)
draws1 = runif(10,min=0,max=2)
identical(draws,draws1) # test whether the two are identical
## [1] FALSE
set.seed(102) # set seed again
draws2 = runif(10,min=0,max=2)
identical(draws,draws2) # test whether the two are identical
## [1] TRUE
In R, we can code up how exactly the sampling distribution of different statistics arises.
There are two approaches: (1) create a large finite population and repeatedly draw samples from it, or (2) assume a population (true) distribution and draw samples directly from that distribution.
In the below we use examples of both.
The Central Limit Theorem tells us that the sampling distribution of the average computed from a random sample is approximately normal (if the sample is large enough). We can verify this with the R code below, which uses approach 1.
# Create population with 1000000 individuals from which we take samples
# each observation in this case is a specific number, this could be anything
# here, I just use runif() to create the population randomly
population <- runif(1000000,min=0.5,max=1.5) # the total population is 1000000 individuals
# Use a loop to take 1000 samples of size n=500 from the population
# for each sample calculate and store the mean
means <- c() # initialize empty list
for (i in 1:1000) {
realized_sample <- sample(population,500) # take sample of size 500 from population
mean_sample <- mean(realized_sample) # calculate mean in sample
means <- append(means,mean_sample) # add calculated mean to list `means`
}
# vector means now stores means calculated from 1000 samples of the population
# it reflects the sampling distribution
hist(means) # simple way to plot histogram
Does the resulting histogram look like the uniform distribution that we used as the population distribution?
No! The histogram looks like a normal distribution, with the center at 1.0, the mean of the underlying population distribution. This is the Central Limit Theorem in action.
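We can also check the spread of the sampling distribution: by the CLT, the standard deviation of the sample means should be close to the population standard deviation divided by \(\sqrt{n}\). The sketch below redoes the simulation and compares the two (the population and sample sizes mirror the code above):

```r
population <- runif(1000000, min = 0.5, max = 1.5) # same population as above
means <- c()
for (i in 1:1000) {
  means <- append(means, mean(sample(population, 500))) # mean of each sample of size 500
}
sd(means)                  # spread of the sampling distribution
sd(population) / sqrt(500) # what the CLT predicts, about 0.0129
```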
We can also use R to show how we should interpret confidence intervals and how they relate to sampling distributions.
As with an estimator (above, the mean), the confidence interval depends on the sample we have in front of us and, hence, is random. We now simulate this sampling distribution using approach 2: we assume that the population (true) distribution is \(X_i \sim N(1,4)\), and sample directly from this distribution. Throughout, we use a 95% confidence interval.
ci_lb_distribution <- c() # initialize empty list to store lower bounds of CI
ci_ub_distribution <- c() # initialize empty list to store upper bounds of CI
for (i in 1:1000) { # loop for 1000 samples
realized_sample <- rnorm(500,mean=1,sd=2) # take sample of size 500 from N(1,4)
mean_sample <- mean(realized_sample) # calculate mean in sample
sd_sample <- sd(realized_sample) # calculate standard deviation in sample
# calculate lower bound of confidence interval
ci_lb_sample <- mean_sample - sd_sample / sqrt(500) * qt(0.025,df=499,lower.tail=FALSE)
# calculate upper bound of confidence interval
ci_ub_sample <- mean_sample + sd_sample / sqrt(500) * qt(0.025,df=499,lower.tail=FALSE)
ci_lb_distribution <- append(ci_lb_distribution,ci_lb_sample) # store lb
ci_ub_distribution <- append(ci_ub_distribution,ci_ub_sample) # store ub
}
# Sampling distribution of lower bound
hist(ci_lb_distribution)
We can also verify the interpretation of the confidence interval: does the confidence interval contain the true value (mean=1) in about 95% of the samples?
# Count how many times the calculated confidence interval contains the true mean=1
n_in_bounds <- sum(ci_lb_distribution < 1 & ci_ub_distribution > 1) # & is the AND operator
pct_in_bounds <- n_in_bounds / 1000
pct_in_bounds # should be close to 0.95
We now simulate performing hypothesis tests to check the interpretation of the type I / type II errors. For this, we again use approach 2 (assuming a population distribution).
The hypothesis test we simulate is for the following hypothesis: \[ H_0: \mu = 0 \] \[ H_A: \mu \neq 0 \]
We assume that we do not know the population variance and instead need to estimate it. You can check on the slides what that means for the test statistic. Throughout, we choose the significance level \(\alpha=0.05\).
Let’s simulate what happens when the null hypothesis is true. The code below takes many samples and, based on each sample, stores whether we reject the null hypothesis in a vector reject_null_store. The sampling is based on approach 2 (see above).
reject_null_store <- c()
for (i in 1:1000) { # loop for 1000 samples
# take sample of size 500 under assumption null is true (mean=0)
realized_sample <- rnorm(500,mean=0,sd=1)
# Calculate sample mean and standard deviation
mean_sample <- mean(realized_sample)
sd_sample <- sd(realized_sample)
# calculate test statistic
test_statistic <- (mean_sample - 0) / (sd_sample / sqrt(500))
# Calculate p-value
pval <- 2*pt(abs(test_statistic),df=499,lower.tail=FALSE)
# reject if p-value smaller than significance level
reject_null_sample <- pval < 0.05
# Store
reject_null_store<- append(reject_null_store,reject_null_sample)
}
We can now check how often we reject the null hypothesis, given that it is true. The output below shows that in 5.3% of the samples, we end up rejecting the null hypothesis even though it is true. This is the Type I error, and the rate should be roughly equal to the significance level \(\alpha = 0.05\) we decided on above.
sum(reject_null_store) / 1000 * 100
## [1] 5.3
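The manual test statistic and p-value computed in the loop above can be cross-checked against R's built-in t.test() function, which performs the same one-sample t-test:

```r
set.seed(1) # fix the sample so both calculations see the same data
realized_sample <- rnorm(500, mean = 0, sd = 1)
# manual calculation, as in the loop above
test_statistic <- (mean(realized_sample) - 0) / (sd(realized_sample) / sqrt(500))
pval_manual <- 2 * pt(abs(test_statistic), df = 499, lower.tail = FALSE)
# built-in equivalent
result <- t.test(realized_sample, mu = 0)
pval_manual
result$p.value # same p-value as the manual calculation
```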
We can also check what happens if the null hypothesis is indeed false. The code below runs a similar simulation, but now assumes that the null hypothesis is false (it assumes that \(\mu = 1\)). The output shows that, in this case, we always reject the null hypothesis, given that it is not true. You can experiment with different values for the mean and see what happens then.
reject_null_store <- c()
for (i in 1:1000) { # loop for 1000 samples
# take sample of size 500 under assumption null is false (mean=1)
realized_sample <- rnorm(500,mean=1,sd=1)
# Calculate test statistic
mean_sample <- mean(realized_sample)
sd_sample <- sd(realized_sample)
test_statistic <- (mean_sample - 0) / (sd_sample / sqrt(500))
# Calculate p-value
pval <- 2*pt(abs(test_statistic),df=499,lower.tail=FALSE)
# reject if p-value smaller than significance level
reject_null_sample <- pval < 0.05
# Store
reject_null_store<- append(reject_null_store,reject_null_sample)
}
# Check how often we reject
sum(reject_null_store) / 1000 * 100
## [1] 100