# Estimation with Python

This chapter shows how to compute point estimates and confidence intervals. We explore the following cases:

• Estimation and confidence interval of the mean
• Comparing the confidence intervals of the means by group
• Estimation and confidence interval of the proportion
• Confidence interval of the difference of proportions

We start by importing the data from the file data1.csv.

## The mean

### The theory

Assume that we're interested in the variable age from the imported data, and we would like to know the following:

• the average age of the employees in the survey,
• the probability that a given employee is older than 50,
• the distribution of the employees' ages across the different departments.

We first assume that the sequence of employees' ages is a random sample, which we write as a sequence of random variables $X_1,\ldots,X_n$.

We also assume that $X_1,\ldots,X_n$ are drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$.

It's known that the sample mean $\overline{X}$ is an estimator of the mean $\mu$. Since the sample mean is itself a normally distributed random variable with mean $\mu$ and variance $\displaystyle\frac{\sigma^2}{n}$, we can build an interval that quantifies the error of the estimate. It is called the **confidence interval**.

In this chapter we show how to compute the confidence interval of the mean $\mu$ with level $(1-\alpha)$, $\alpha\in (0,1)$. We denote it by $\mbox{CI}_{1-\alpha}(\mu)$.

If $\sigma^2$ is known, $\mbox{CI}_{1-\alpha}(\mu)$ is expressed as follows: $$\left(\overline{X} - z_{1-\alpha/2} \displaystyle\frac{\sigma}{\sqrt{n}},\;\overline{X} + z_{1-\alpha/2} \displaystyle\frac{\sigma}{\sqrt{n}}\right)$$

where $$\overline{X}=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n X_i$$ is the sample mean and $z_{1-\alpha/2}$ is the quantile of order $(1-\alpha/2)$ of the standard normal distribution: $$F_Z(z_{1-\alpha/2})=1-\alpha/2$$

where $Z$ is a random variable with standard normal distribution.

If $\sigma^2$ is unknown, $\mbox{CI}_{1-\alpha}(\mu)$ is expressed as follows: $$\left(\overline{X} - t_{1-\alpha/2,n-1} \displaystyle\frac{S}{\sqrt{n}},\;\overline{X} + t_{1-\alpha/2,n-1} \displaystyle\frac{S}{\sqrt{n}}\right)$$

where $S^2$ is the sample variance: $$S^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n (X_i-\overline{X})^2$$ and $t_{1-\alpha/2,n-1}$ is the quantile of order $(1-\alpha/2)$ of the $t$-distribution with $n-1$ degrees of freedom: $$F_{T_{n-1}}(t_{1-\alpha/2,n-1})=1-\alpha/2$$ where $T_{n-1}$ is a random variable following the $t$-distribution with $n-1$ degrees of freedom.

### Practice with Python

We will write two functions. The first returns $\mbox{CI}_{1-\alpha}(\mu)$ when $\sigma^2$ is known, and the second returns $\mbox{CI}_{1-\alpha}(\mu)$ when $\sigma^2$ is unknown.

1st function
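The original code cell is not available here, so the following is only a minimal sketch of what such a function could look like. The name `ci_mean_known_var` and its parameters are assumptions made for illustration; the formula is the known-variance interval given in the theory section.

```python
import math
from scipy import stats


def ci_mean_known_var(xbar, sigma, n, alpha=0.05):
    """Confidence interval of the mean when sigma is known.

    xbar  : sample mean
    sigma : known population standard deviation
    n     : sample size
    alpha : 1 - confidence level
    (hypothetical helper, not the chapter's original code)
    """
    # Quantile of order 1 - alpha/2 of the standard normal distribution
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = ci_mean_known_var(xbar=10.0, sigma=2.0, n=100)
print(lower, upper)
```

The interval is symmetric around the sample mean, and its half-width shrinks like $1/\sqrt{n}$.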

2nd function
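As with the first function, the original cell is missing, so this is a sketch under the same assumptions. The hypothetical `ci_mean_unknown_var` replaces $\sigma$ with the sample standard deviation $S$ and the normal quantile with the $t$ quantile with $n-1$ degrees of freedom.

```python
import math
from scipy import stats


def ci_mean_unknown_var(xbar, s, n, alpha=0.05):
    """Confidence interval of the mean when sigma is unknown.

    xbar  : sample mean
    s     : sample standard deviation (computed with n - 1 in the denominator)
    n     : sample size
    alpha : 1 - confidence level
    (hypothetical helper, not the chapter's original code)
    """
    # Quantile of order 1 - alpha/2 of the t-distribution with n - 1 df
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = t * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = ci_mean_unknown_var(xbar=10.0, s=2.0, n=25)
print(lower, upper)
```

For the same inputs this interval is wider than the known-variance one, because the $t$ quantile exceeds the normal quantile for any finite $n$.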

Practice

simulating data

Computing the sample mean

$\mbox{CI}_{1-\alpha}(\mu)$ with known variance and $\alpha=0.05$

Computing the sample standard deviation

$\mbox{CI}_{1-\alpha}(\mu)$ with unknown variance and $\alpha=0.05$
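The practice steps listed above can be sketched in a single block. The true mean, standard deviation, sample size and seed below are arbitrary values chosen for illustration; they are not taken from the original notebook.

```python
import numpy as np
from scipy import stats

# Simulating data (hypothetical parameters)
rng = np.random.default_rng(42)
mu, sigma, n = 40.0, 8.0, 200
x = rng.normal(loc=mu, scale=sigma, size=n)

# Computing the sample mean
xbar = x.mean()

# CI with known variance and alpha = 0.05
alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)
ci_known = (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

# Computing the sample standard deviation (ddof=1 gives the n - 1 denominator)
s = x.std(ddof=1)

# CI with unknown variance and alpha = 0.05
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_unknown = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

print("sample mean:", xbar)
print("CI (known variance):  ", ci_known)
print("CI (unknown variance):", ci_unknown)
```

Note that `ddof=1` matters: NumPy's default `std` divides by $n$, which does not match the definition of $S^2$ used in the theory section.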

We can also use a function already implemented in the library scipy to compute $\mbox{CI}_{1-\alpha}(\mu)$ when the variance is unknown.
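One way to do this is with `scipy.stats.t.interval`; whether this is the exact call used in the original notebook is an assumption. The sketch below checks it against the manual formula on simulated data (the simulation parameters are arbitrary).

```python
import numpy as np
from scipy import stats

# Hypothetical simulated sample
rng = np.random.default_rng(0)
x = rng.normal(loc=40.0, scale=8.0, size=200)
n = x.size
xbar = x.mean()

# stats.sem returns S / sqrt(n), with S computed using ddof=1 by default
se = stats.sem(x)

# The confidence level is passed positionally, followed by the degrees of freedom
ci_scipy = stats.t.interval(0.95, n - 1, loc=xbar, scale=se)

# Manual computation for comparison
t = stats.t.ppf(0.975, df=n - 1)
ci_manual = (xbar - t * se, xbar + t * se)

print(ci_scipy)
print(ci_manual)
```

The confidence level is passed positionally here because the keyword name of that argument has changed between SciPy releases.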

We can also use scipy.stats.norm.interval to compute $\mbox{CI}_{1-\alpha}(\mu)$ with known variance
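A minimal sketch of that call, again on simulated data with arbitrary parameters: `scale` must be the standard error $\sigma/\sqrt{n}$, not $\sigma$ itself.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated sample with a known standard deviation
rng = np.random.default_rng(1)
sigma, n = 8.0, 200
x = rng.normal(loc=40.0, scale=sigma, size=n)
xbar = x.mean()

# 95% interval centred at the sample mean, scaled by the standard error
ci_scipy = stats.norm.interval(0.95, loc=xbar, scale=sigma / np.sqrt(n))

# Manual computation for comparison
z = stats.norm.ppf(0.975)
ci_manual = (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

print(ci_scipy)
print(ci_manual)
```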

### Practice with data

Application: comparing the average DailyRate between men and women.

The female sample size

The female sample mean

The female standard error

The DailyRate Female $\mbox{CI}(95\%)$

We then compute the DailyRate Female CI(95%)

We now select the DailyRate samples for men and women separately

We can then visualize these Confidence intervals together to see the difference between the DailyRate means.

We will now write a Python function that can compare the Confidence Intervals of the means for a given continuous variable according to groups defined by a categorical variable.
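The original function is not shown, so here is one possible sketch. The function name `ci_mean_by_group`, its arguments and the small data frame are illustrative assumptions; only the column names `DailyRate` and `Gender` are suggested by the text. It uses the unknown-variance ($t$) interval within each group.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ci_mean_by_group(df, value_col, group_col, alpha=0.05):
    """Return n, mean and t-based CI bounds of value_col for each group.

    Hypothetical helper: each group needs at least two observations.
    """
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        n = values.count()                    # non-missing observations
        mean = values.mean()
        se = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)
        rows.append({"group": name, "n": n, "mean": mean,
                     "lower": mean - t * se, "upper": mean + t * se})
    return pd.DataFrame(rows).set_index("group")


# Toy data standing in for the real file (values are made up)
toy = pd.DataFrame({
    "Gender": ["Female"] * 4 + ["Male"] * 4,
    "DailyRate": [800, 820, 790, 810, 700, 750, 720, 730],
})
result = ci_mean_by_group(toy, "DailyRate", "Gender")
print(result)
```

With the real data, the same call could be repeated with `EducationField` or `Department` as the grouping column. Intervals that do not overlap suggest a difference between group means, although overlap alone is not a formal test.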

DailyRate and EducationField

DailyRate and Department

## The proportion

Assume that we would like to estimate the probability $p$ that a voter votes for candidate A in an election. We randomly select $n$ voters and let $X$ be the number of them who report that they will vote for A.

The probability or the proportion $p$ is then estimated by

$$\widehat{p}=\displaystyle\frac{X}{n}$$

In most cases the sample size $n$ is large, and the probability distribution of $\widehat{p}$ is approximated by a normal distribution with mean $\mu=p$ and variance $\sigma^2=\displaystyle\frac{p(1-p)}{n}$. We can then provide, for a given $\alpha\in(0,1)$, a **confidence interval** with level $1-\alpha$:

$$\mbox{CI}_{1-\alpha}(p)=\left(\widehat{p}-z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}(1-\widehat{p})}{n}},\; \widehat{p}+z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}(1-\widehat{p})}{n}}\right)$$

where $z_{1-\alpha/2}$ is the $z$-score associated with $1-\alpha/2$. That is, if $Z$ is a random variable with standard normal distribution, then $$F_Z(z_{1-\alpha/2})=1-\alpha/2$$ where $F_Z$ is the CDF of $Z$.

Example: A survey was conducted to estimate the probability $p$ of voting for candidate $A$. Among the 1200 people who participated in the survey, 560 reported that they will vote for candidate A. Find an estimate of $p$ and its 95\% confidence interval.
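The example can be worked directly from the formula for $\mbox{CI}_{1-\alpha}(p)$. The sketch below is a manual computation, not necessarily the approach of the original notebook; the variable names are illustrative.

```python
import math
from scipy import stats

x, n = 560, 1200   # counts from the survey example
alpha = 0.05

# Point estimate of the proportion
p_hat = x / n

# Normal-approximation standard error and quantile
se = math.sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(1 - alpha / 2)

ci = (p_hat - z * se, p_hat + z * se)
print(p_hat, ci)
```

The estimate is about 0.467, with a 95% confidence interval of roughly (0.438, 0.495).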

Importing the function for computing proportion confidence intervals

There are also four other methods to compute the proportion confidence interval:

Example: We would like to compare the proportion of frequent travelers between the three departments. We start by computing the contingency table between the variables BusinessTravel and Department.
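A contingency table of this kind can be built with `pandas.crosstab`. The data frame below is a small made-up stand-in for the real file, and the category labels (`R&D`, `Travel_Frequently`, and so on) are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for data1.csv (values are made up)
toy = pd.DataFrame({
    "Department": ["R&D", "R&D", "R&D", "R&D",
                   "Sales", "Sales", "Sales",
                   "HR", "HR", "HR"],
    "BusinessTravel": ["Travel_Frequently", "Travel_Rarely",
                       "Travel_Rarely", "Travel_Frequently",
                       "Travel_Rarely", "Travel_Frequently", "Non-Travel",
                       "Travel_Rarely", "Travel_Rarely", "Travel_Frequently"],
})

# Counts: one row per department, one column per travel category
table = pd.crosstab(toy["Department"], toy["BusinessTravel"])

# Row proportions: share of each travel category within a department
props = pd.crosstab(toy["Department"], toy["BusinessTravel"],
                    normalize="index")

print(table)
print(props["Travel_Frequently"])
```

The counts in the `Travel_Frequently` column, together with the row totals, are the $X$ and $n$ needed for each department's proportion confidence interval.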

## The difference of two proportions

We now observe two independent samples with sizes $n_1$ and $n_2$, and estimate a proportion from each sample. We aim to provide a confidence interval for the difference between these proportions. It can be expressed as follows:

$$\mbox{CI}_{1-\alpha}(p_1-p_2)=\left(\widehat{p}_1-\widehat{p}_2-z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}+\displaystyle\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}},\; \widehat{p}_1-\widehat{p}_2+z_{1-\alpha/2} \sqrt{\displaystyle\frac{\widehat{p}_1(1-\widehat{p}_1)}{n_1}+\displaystyle\frac{\widehat{p}_2(1-\widehat{p}_2)}{n_2}}\right)$$

We will write the following Python function
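The original function is not reproduced here; a minimal sketch implementing the formula for $\mbox{CI}_{1-\alpha}(p_1-p_2)$ could look as follows. The name `ci_diff_proportions`, its parameters and the example counts are assumptions.

```python
import math
from scipy import stats


def ci_diff_proportions(x1, n1, x2, n2, alpha=0.05):
    """Normal-approximation CI for p1 - p2 from two independent samples.

    x1, x2 : number of successes in each sample
    n1, n2 : sample sizes
    (hypothetical helper, not the chapter's original code)
    """
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference of two independent proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = stats.norm.ppf(1 - alpha / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Made-up counts: 60 out of 200 versus 40 out of 200
lower, upper = ci_diff_proportions(60, 200, 40, 200)
print(lower, upper)
```

If the resulting interval does not contain 0, the data suggest that the two proportions differ at level $\alpha$.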

The confidence interval of the difference between the proportions of frequent travelers in the R&D and HR departments

The confidence interval of the difference between the proportions of frequent travelers in the R&D and Sales departments