Estimation with Python


This chapter shows how to compute point estimates and confidence intervals. We explore the following cases: the mean, the proportion, and the difference of two proportions.

We start by importing the data in the file data1.csv
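A minimal sketch of the import step, assuming pandas is used and that data1.csv is a plain comma-separated file in the working directory. The helper name `load_data` is hypothetical and only serves the illustration.

```python
import pandas as pd


def load_data(path="data1.csv"):
    """Read a CSV file (or file-like object) into a pandas DataFrame."""
    return pd.read_csv(path)


# Usage, assuming data1.csv sits in the working directory:
# df = load_data()
# df.head()
```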

The mean

The theory

Assume that we're interested in the variable age from the imported data, and we would like to know the following information:

We first assume that the sequence of employees' ages is a random sample, which we write as a sequence of random variables $X_1,\ldots,X_n$.

We also assume that $X_1,\ldots,X_n$ are generated from a Normal distribution with mean $\mu$ and variance $\sigma^2$.

The sample mean $\overline{X}$ is a well-known estimator of the mean $\mu$. Since $\overline{X}$ is itself a random variable, normally distributed with mean $\mu$ and variance $\displaystyle\frac{\sigma^2}{n}$, we can build an interval that quantifies the uncertainty of the estimate. It's called the \textbf{Confidence Interval}.

In this chapter we show how to compute the confidence interval of the mean $\mu$ with level $(1-\alpha)$, $\alpha\in (0,1)$. We denote it by $\mbox{CI}_{1-\alpha}(\mu)$.

If $\sigma^2$ is known, $\mbox{CI}_{1-\alpha}(\mu)$ is expressed as follows: $$ \left(\overline{X} - z_{1-\alpha/2} \displaystyle\frac{\sigma}{\sqrt{n}},\;\overline{X} + z_{1-\alpha/2} \displaystyle\frac{\sigma}{\sqrt{n}}\right)$$

where $$\overline{X}=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n X_i$$ is the sample mean and $z_{1-\alpha/2}$ is the quantile of order $(1-\alpha/2)$ of the standard normal distribution: $$F_Z(z_{1-\alpha/2})=1-\alpha/2$$

where $Z$ is a random variable with standard normal distribution.

If $\sigma^2$ is unknown, $\mbox{CI}_{1-\alpha}(\mu)$ is expressed as follows: $$ \left(\overline{X} - t_{1-\alpha/2,n-1} \displaystyle\frac{S}{\sqrt{n}},\;\overline{X} + t_{1-\alpha/2,n-1} \displaystyle\frac{S}{\sqrt{n}}\right)$$

where $S^2$ is the sample variance: $$S^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n (X_i-\overline{X})^2$$ and $t_{1-\alpha/2,n-1}$ is the quantile of order $(1-\alpha/2)$ of the $t-$distribution with $n-1$ degrees of freedom: $$F_{T_{n-1}}(t_{1-\alpha/2,n-1})=1-\alpha/2$$ where $T_{n-1}$ is a random variable following the $t-$distribution with $n-1$ degrees of freedom.

Practice with Python

We will write two functions. The first one returns $\mbox{CI}_{1-\alpha}(\mu)$ when $\sigma^2$ is known, and the second one returns $\mbox{CI}_{1-\alpha}(\mu)$ when $\sigma^2$ is unknown.

1st function
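One possible version of the first function, written directly from the known-variance formula above. The function name `ci_mean_known_var` and its parameter names are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats


def ci_mean_known_var(x, sigma, alpha=0.05):
    """Confidence interval of level 1 - alpha for the mean, sigma known."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    # Quantile z_{1 - alpha/2} of the standard normal distribution
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * sigma / np.sqrt(n)
    return xbar - half_width, xbar + half_width
```

The interval is returned as a `(lower, upper)` tuple so it can be unpacked directly.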

2nd function
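One possible version of the second function, following the unknown-variance formula above. The name `ci_mean_unknown_var` is an assumption; the sample standard deviation $S$ uses the $n-1$ denominator (`ddof=1`).

```python
import numpy as np
from scipy import stats


def ci_mean_unknown_var(x, alpha=0.05):
    """Confidence interval of level 1 - alpha for the mean, sigma unknown."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s = x.std(ddof=1)  # sample standard deviation S (n - 1 denominator)
    # Quantile t_{1 - alpha/2, n-1} of the t-distribution
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = t * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width
```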

Practice

simulating data
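A small sketch of the simulation step. The values of `mu`, `sigma`, `n`, and the seed are hypothetical choices made for illustration only.

```python
import numpy as np

# Hypothetical parameters chosen only for this illustration
mu, sigma, n = 40.0, 10.0, 200

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
x = rng.normal(loc=mu, scale=sigma, size=n)  # simulated Normal sample
```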

Computing the sample mean

$\mbox{CI}_{1-\alpha}(\mu)$ with known variance and $\alpha=0.05$

Computing the sample standard deviation

$\mbox{CI}_{1-\alpha}(\mu)$ with unknown variance and $\alpha=0.05$

We can also use a function already implemented in the library scipy to compute $\mbox{CI}_{1-\alpha}(\mu)$ when the variance is unknown.
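A sketch of the SciPy route, assuming `scipy.stats.t.interval` is the function meant here. The simulated sample is hypothetical. The confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=40.0, scale=10.0, size=200)  # hypothetical sample

n = x.size
# loc is the sample mean; scale is the standard error S / sqrt(n)
ci_t = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=stats.sem(x))
print(ci_t)
```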

We can also use scipy.stats.norm.interval to compute $\mbox{CI}_{1-\alpha}(\mu)$ with known variance
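A sketch of the same computation with `scipy.stats.norm.interval`. The sample and the "known" value of `sigma` are hypothetical. The scale passed to the function is $\sigma/\sqrt{n}$, not $\sigma$ itself.

```python
import numpy as np
from scipy import stats

sigma = 10.0  # standard deviation assumed known for this illustration
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=40.0, scale=sigma, size=200)  # hypothetical sample

n = x.size
# scale is the standard deviation of the sample mean, sigma / sqrt(n)
ci_z = stats.norm.interval(0.95, loc=x.mean(), scale=sigma / np.sqrt(n))
print(ci_z)
```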

Practice with data

Application: comparing the average of the Daily rate between Men and Women.

The female sample size

The female sample mean

The female standard error

The DailyRate Female $\mbox{CI}(95\%)$

We then compute the DailyRate Female CI(95%).

We now select the DailyRate samples for Men and Women separately.

We can then visualize these Confidence intervals together to see the difference between the DailyRate means.

We will now write a Python function that can compare the Confidence Intervals of the means for a given continuous variable according to groups defined by a categorical variable.
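A sketch of such a function, assuming the data sits in a pandas DataFrame and that the $t$-based interval (unknown variance) is used for each group. The function name `ci_mean_by_group`, the column name `Gender`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ci_mean_by_group(df, value_col, group_col, alpha=0.05):
    """Return n, mean, standard error and t-based CI for each group."""
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        values = values.dropna()
        n = values.size
        mean = values.mean()
        se = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)
        rows.append({group_col: name, "n": n, "mean": mean, "se": se,
                     "lower": mean - t * se, "upper": mean + t * se})
    return pd.DataFrame(rows).set_index(group_col)


# Toy data, invented only to show the call
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "DailyRate": [800, 820, 780, 700, 760, 730],
})
result = ci_mean_by_group(toy, "DailyRate", "Gender")
print(result)
```

Returning one row per group keeps the output easy to plot or compare: two groups whose intervals do not overlap have clearly different means at the chosen level.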

DailyRate and EducationField

DailyRate and Department

The proportion

Assume that we would like to estimate the probability $p$ that a voter votes for candidate A in an election. We randomly select $n$ voters and let $X$ be the number of them who report that they will vote for A.

The probability or the proportion $p$ is then estimated by

$$\widehat{p}=\displaystyle\frac{X}{n}$$

In most cases the sample size $n$ is large, and the probability distribution of $\widehat{p}$ is approximated by a Normal distribution with mean $\mu=p$ and variance $\sigma^2=\displaystyle\frac{p(1-p)}{n}$. For a given $\alpha\in(0,1)$, we can then provide a \textbf{Confidence interval} with level $1-\alpha$:

$$\mbox{CI}_{1-\alpha}(p)=\left(\widehat{p}-z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}(1-\widehat{p})}{n}},\; \widehat{p}+z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}(1-\widehat{p})}{n}}\right)$$

where $z_{1-\alpha/2}$ is the $z-$score associated with $1-\alpha/2$. That is, if $Z$ has a standard normal distribution, then $$F_Z(z_{1-\alpha/2})=1-\alpha/2$$ where $F_Z$ is the CDF of $Z$.

Example: A survey was conducted to estimate the probability $p$ of voting for candidate $A$. Among the 1200 people who participated in the survey, 560 reported that they will vote for candidate A. Find an estimate of $p$ and its 95\%-Confidence Interval.
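The example can be worked out directly from the formula for $\mbox{CI}_{1-\alpha}(p)$. The following is a minimal sketch; the function name `ci_proportion` is an assumption.

```python
import numpy as np
from scipy import stats


def ci_proportion(count, n, alpha=0.05):
    """Point estimate and normal-approximation CI for a proportion."""
    p_hat = count / n
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - half_width, p_hat + half_width)


# Survey figures from the example: 560 out of 1200 respondents
p_hat, (lower, upper) = ci_proportion(560, 1200)
print(round(p_hat, 4), round(lower, 4), round(upper, 4))
```

With these figures the estimate is about 0.467 and the 95% interval is roughly (0.438, 0.495).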

Importing the Function for computing proportion confidence intervals

There are also four other methods to compute the proportion confidence interval:
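A sketch of how several interval methods can be compared on the survey example, assuming the function in question is `proportion_confint` from `statsmodels.stats.proportion`. The text does not name the library or the four methods, so the list below is only a selection of method names that statsmodels accepts.

```python
from statsmodels.stats.proportion import proportion_confint

count, nobs = 560, 1200  # survey figures from the example above

# "normal" is the approximation given by the formula in this section;
# the other names are alternative methods accepted by statsmodels.
methods = ["normal", "wilson", "agresti_coull", "beta", "jeffreys"]
intervals = {
    m: proportion_confint(count, nobs, alpha=0.05, method=m) for m in methods
}
for m, (lower, upper) in intervals.items():
    print(f"{m:>14}: ({lower:.4f}, {upper:.4f})")
```

With a sample this large, the methods are expected to give very similar intervals; differences matter more for small $n$ or for $\widehat{p}$ close to 0 or 1.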

Example: We would like to compare the proportion of frequent travelers across the three departments. We start by computing the contingency table between the variables BusinessTravel and Department.
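A sketch of the contingency table step with `pandas.crosstab`. The toy rows and the category labels (such as `Travel_Frequently`) are hypothetical; with the real data, the same call would be applied to the imported DataFrame.

```python
import pandas as pd

# Toy rows, invented only to show the call
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently", "Travel_Rarely", "Non-Travel",
                       "Travel_Frequently", "Travel_Rarely", "Travel_Rarely"],
    "Department": ["Sales", "Sales", "Human Resources",
                   "Research & Development", "Research & Development",
                   "Human Resources"],
})

# Rows: departments; columns: travel categories; cells: counts
table = pd.crosstab(toy["Department"], toy["BusinessTravel"])
print(table)
```

The count of frequent travelers in each department, together with the row total, gives the pair $(X, n)$ needed for a proportion confidence interval.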

The difference of two proportions

We now observe two independent samples with different sizes $n_1$ and $n_2$. From each sample we estimate a proportion, and we aim to provide a confidence interval for the difference between these proportions. It can be expressed as follows:

$$\mbox{CI}_{1-\alpha}(p_1-p_2)=\left(\widehat{p}_1-\widehat{p}_2-z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}+\displaystyle\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}},\; \widehat{p}_1-\widehat{p}_2+z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}+\displaystyle\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}}\right)$$

We will write the following Python function
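One possible version of this function, written from the formula for $\mbox{CI}_{1-\alpha}(p_1-p_2)$. The name `ci_diff_proportions` and its arguments (counts `x1`, `x2` and sample sizes `n1`, `n2`) are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats


def ci_diff_proportions(x1, n1, x2, n2, alpha=0.05):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z = stats.norm.ppf(1 - alpha / 2)
    # Standard error of the difference of two independent proportions
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se
```

If the resulting interval contains 0, the data do not show a difference between the two proportions at level $1-\alpha$.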

The Confidence interval of the difference between the proportions of frequent travelers in the R&D and HR departments

The Confidence interval of the difference between the proportions of frequent travelers in the R&D and Sales departments