4 Estimation

4.1 Point Estimation

Point estimation is all about using one single number (statistic) to estimate a population parameter of interest. Generally, we need to differentiate three different terms:

Parameter: $\theta$
Estimator: $\hat{\theta}(X) = g(X_1,\dots,X_n)$
Estimate: $\hat{\theta}(x) = g(x_1,\dots,x_n)$

$\theta$ is the population parameter we want to know. $\hat{\theta}(X)$ is our estimator, which is a function of our random sample. It is itself a random variable because another random sample will lead to a different number. The estimate $\hat{\theta}(x)$ then is our statistic that we compute for given sample; given realizations $(x_1,\dots,x_n)$, this number is not random. Note that we apply the same function $g()$ in the estimator and estimate, but in the estimator the $X_i$ are random variables whereas in the estimate the $x_i$ are realizations.

To highlight the difference between estimate and estimator consider the following example: the population is 1,000,000 consumers and we want to know how many of them purchased our product. Hence, the population parameter we want to know is \[\theta = \frac{\text{no. purchases in population}}{1,000,000}\]

In principle, we could just call all 1,000,000 consumers and ask, but that would cost way too much time (and time is money)! Instead, we are going to take a random sample with $n$ consumers from the population. In this case, our estimator for the mean is \[\hat{\theta}(X) = g(X_1,\dots,X_n) = \frac{\text{no. purchases in sample}}{n} \]

This is what we now do in R.

# Generate population and calculate population parameter
pop = rbinom(1000000,1,0.2) # generates 1,000,000 consumers that bought (=1) or not(=0)
pop_par = mean(pop)  
pop_par

[1] 0.200342

# Take random sample and calculate estimate on that sample
x = sample(pop,100,replace=TRUE)    # randomly takes 100 consumers out of pop 
estimate = mean(x)
estimate

[1] 0.17

# Calculate sampling error 
sampling_error = abs(estimate-pop_par) 
sampling_error

[1] 0.030342

Both the estimate and sampling error will be different if we take another random sample.

x = sample(pop,100,replace=TRUE)
mean(x)

[1] 0.28

4.2 Confidence Intervals

As we just saw, estimators themselves are random variables and subject to variation across different samples. Hence, point estimates are not enough to learn about the population parameter we are interested in. Rather than provide a single point estimate, it will be better to provide a range of plausible values for the parameter of interest. This is the idea of confidence intervals.

A confidence interval for any parameter $\theta$ will always take the form: \[\hat{\theta}\pm \text{(critical value)}\times \text{SD}(\hat{\theta})\] where $\hat{\theta}$ is an estimator of $\theta$, the critical value is a quantile of a normal or t distribution, and $\text{SD}(\hat{\theta})$ is the standard deviation of the estimator.

4.2.1 Types of Confidence Intervals

We will consider four types of $(1-\alpha)100\%$ confidence intervals.

Confidence interval for mean $\mu$, data are normally distributed, variance $\sigma^2$ is known \[\bar{x}\pm z_{\alpha/2}{\sigma\over\sqrt{n}}\]
Confidence interval for mean $\mu$, data are normally distributed, variance $\sigma^2$ is unknown \[\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}\]
Confidence interval for mean $\mu$, data are not normally distributed \[\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}\]
Confidence interval for proportion $p$, data are binary (0s and 1s) \[\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})\over n}\]

Notice that we can compute the critical values in R. If we need $z_{\alpha/2}$, then we must find the value of a standard normal random variable such that $\alpha/2$ percent of the area is to its right. This is can be found using the normal quantile function qnorm( )!

For example, if $\alpha=0.05$ let’s use qnorm( ) to find $z_{0.025}$.

qnorm(0.025, mean=0, sd=1, lower.tail=FALSE)

[1] 1.959964

A summary of the most commonly used normal critical values are provided below.

Confidence Level $(1-\alpha)$	Critical Value $z_{\alpha/2}$
99%	2.576
95%	1.96
90%	1.645

Similarly, if we need to compute $t_{n-1,\alpha/2}$ with $\alpha=0.05$ and our data have $n=50$ observations, we can use the qt( ) function.

qt(0.025, df=50-1, lower.tail=FALSE)

[1] 2.009575

Notice this critical value is larger than $z_{0.025}$ – this comes from the fact that the t-distribution has fatter tails than the normal distribution. But, it also turns out that a t distribution (which only has one parameter $\nu$ called the “degrees of freedom”) converges to a normal distribution as $\nu\to\infty$. We can formally check this using qt( ).

qt(0.025, df=50, lower.tail=FALSE)

[1] 2.008559

qt(0.025, df=100, lower.tail=FALSE)

[1] 1.983972

qt(0.025, df=1000, lower.tail=FALSE)

[1] 1.962339

qt(0.025, df=10000, lower.tail=FALSE)

[1] 1.960201

Notice how these values approach $z_{0.025}\approx1.96$ as the degrees of freedom parameter gets large.

4.2.2 Interpreting confidence intervals

Suppose an analyst tells you that the 95% confidence interval for the population parameter $\theta$ is $(0.5,1.5)$. What can you conclude from this, i.e., what does it mean that we are 95% confident that $\theta$ is in this interval?

What the confidence interval tells us is that if we take many samples from the population and construct the confidence interval for each sample, then in approximately 95% of the samples the constructed confidence interval will include the true population parameter. Note that, same as the estimator, the confidence interval depends on the sample and hence is random.

We can check in R whether this holds using the same population we generated earlier. For this, we are now going to construct a for loop that does the following: (1) take a random sample of size $n=100$ from the population, (2) compute the 95% confidence interval (type 4 above) for the sample, check whether the population parameter lies within the computed confidence interval and store the result in a vector called poppar_in_ci.

R = 1000                                # how many samples to take 
poppar_in_ci = double(R)
for(r in 1:R){
    x = sample(pop,100,replace=TRUE)                   # generate new sample 
    phat = mean(x)                        
    s = sqrt(phat*(1-phat)/100)           # see formula for CI of proportion above
    poppar_in_ci[r] = phat -1.96 * s < pop_par & 
                        phat + 1.96 * s > pop_par
}
mean(poppar_in_ci)

[1] 0.933

The result of this simulation shows that for approximately 95% of our samples, the constructed confidence interval indeed contains the population parameter value.

Note that in this case, we were sampling from a population we first generated. Often, we take samples just from a hypothetical population, where we assume that the samples are drawn from a distribution that captures the population. For example, in chapter 3 we directly made assumptions on the population distribution and sampled from the respective distributions.

4.2.3 Examples

PROBLEM 1: A wine importer needs to report the average percentage of alcohol in bottles of French wine. From experience with previous kinds of wine, the importer believes that alcolhol percentages are normally distributed and the population standard deviation is 1.2%. The importer randomly samples 60 bottles of the new wine and obtains a sample mean $\bar{x}=9.3\%$. Give a 90% confidence interval for the average percentage of alcohol in all bottles of the new wine.

Solution:

From the problem, we know the following. \[\begin{align*} n = 60\\ \sigma=1.2\\ \bar{x} =9.3\\ \alpha=0.10 \end{align*}\] We must first figure out which type of confidence interval to use. Notice that we are trying to estimate the average percentage of alcohol, so our parameter is a mean $\mu$. Moreover, we are told to assume that the data are normally distributed and the population standard deviation $\sigma$ is known. Thefore, our confidence interval will be of the form: \[\bar{x}\pm z_{\alpha/2}{\sigma\over\sqrt{n}}.\] We can now define each object in R and construct the confidence interval.

n = 60
sigma = 1.2
xbar = 9.3
zalpha = qnorm(0.05, mean=0, sd=1, lower.tail=FALSE)
xbar - zalpha*sigma/sqrt(n)

[1] 9.04518

xbar + zalpha*sigma/sqrt(n)

[1] 9.55482

Therefore, we are 90% confident that the true average alcohol content in all new bottles of wine is between 9.05% and 9.55%.

PROBLEM 2: An economist wants to estimate the average amount in checking accounts at banks in a given region. A random sample of 100 accounts gives $\bar{x}=£357.60$ and $s=£140.00$. Give a 95% confidence interval for $\mu$, the average amount in any checking account at a bank in the given region.

Solution:

From the problem, we know the following. \[\begin{align*} n &= 100\\ \bar{x} &=357.60\\ s &= 140\\ \alpha&=0.5 \end{align*}\] Here we are not told whether the data are normally distributed. However, it won’t matter because we only have an estimate of $\sigma$ (remember that among the four types of confidence intervals we considered earlier, there are no differences between case II and III). Therefore, our confidence interval will be of the form: \[\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}.\] We can again define each object in R and construct the confidence interval.

n = 100
xbar = 357.60
s = 140
talpha = qt(0.025, df=n-1, lower.tail=FALSE)
xbar - talpha*s/sqrt(n)

[1] 329.821

xbar + talpha*s/sqrt(n)

[1] 385.379

Therefore, we are 95% confident that the true average account checking account value in the given region is between £329.82 and £385.38.

PROBLEM 3: The EuStockMarkets data set in R provides daily closing prices of four major European stock indices: Germany DAX (Ibis), Switzerland SMI, France CAC, and UK FTSE. Using this data set, produce a 99% confidence interval for the average closing price of the UK FTSE.

Solution:

First, let’s load in the data from R.

data(EuStockMarkets)
head(EuStockMarkets)

DAX	SMI	CAC	FTSE
1628.75	1678.1	1772.8	2443.6
1613.63	1688.5	1750.5	2460.2
1606.51	1678.6	1718.0	2448.2
1621.04	1684.1	1708.1	2470.4
1618.16	1686.6	1723.1	2484.7
1610.61	1671.6	1714.3	2466.8

Now let’s pull the subset of data we care about (i.e., the UK FTSE column).

uk = EuStockMarkets[,4]

Notice that we are not told anything about the true distribution of the data. Therefore, our confidence interval will be of the form: \[\bar{x}\pm t_{n-1,\alpha/2}{s\over\sqrt{n}}.\] Next, let’s compute each component necessary to construct the 99% confidence interval.

n = length(uk)
xbar = mean(uk)
s = sd(uk)
talpha = qt(0.005, df=n-1, lower.tail=FALSE)
xbar - talpha*s/sqrt(n)

[1] 3507.248

xbar + talpha*s/sqrt(n)

[1] 3624.038

Therefore, we are 99% confident that the true closing price for the UK FTSE index is between £3,507.25 and £3,624.04.

Finally, notice that we can use the t.test( ) function to perform the same analysis but with fewer ste