Estimation with Python

This chapter shows how to compute point estimates and confidence intervals. We explore the following cases: the mean, the proportion, and the difference of two proportions.

We start by importing the data from the file data1.csv.

The mean

The theory

Assume that we're interested in the variable age from the imported data, and we would like to know the following information:

We first assume that the sequence of employees' ages is a random sample. We write it as a sequence of random variables $X_1, \ldots, X_n$.

We also assume that $X_1, \ldots, X_n$ are generated from a Normal distribution with mean $\mu$ and variance $\sigma^2$.

It is known that the sample mean $\overline{X}$ is an estimator of the mean $\mu$. Since the sample mean is itself a random variable with a normal distribution with mean $\mu$ and variance $\sigma^2/n$, we can build an interval that quantifies the error of the estimation. It's called the \textbf{Confidence Interval}.

We aim in this chapter to show how to compute the confidence interval of the mean $\mu$ with level $(1-\alpha)$, $\alpha \in (0,1)$. We denote it by $CI_{1-\alpha}(\mu)$.

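If $\sigma^2$ is known, $CI_{1-\alpha}(\mu)$ is expressed as follows: $$\left(\overline{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$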

where $\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean and $z_{1-\alpha/2}$ is the quantile of order $(1-\alpha/2)$ of the standard normal distribution: $$F_Z(z_{1-\alpha/2}) = 1-\alpha/2$$

where Z is a random variable with standard normal distribution.

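If $\sigma^2$ is unknown, $CI_{1-\alpha}(\mu)$ is expressed as follows: $$\left(\overline{X} - t_{1-\alpha/2,\,n-1}\frac{S}{\sqrt{n}},\ \overline{X} + t_{1-\alpha/2,\,n-1}\frac{S}{\sqrt{n}}\right)$$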

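where $S^2$ is the sample variance, $$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \overline{X}\right)^2,$$ and $t_{1-\alpha/2,\,n-1}$ is the quantile of order $(1-\alpha/2)$ of the t-distribution with $n-1$ degrees of freedom: $$F_{T_{n-1}}(t_{1-\alpha/2,\,n-1}) = 1-\alpha/2$$ where $T_{n-1}$ is a random variable with a t-distribution with $n-1$ degrees of freedom.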

Practice with Python

We will write two functions. The first returns $CI_{1-\alpha}(\mu)$ when $\sigma^2$ is known, and the second returns $CI_{1-\alpha}(\mu)$ when $\sigma^2$ is unknown.

1st function
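
The original code cell is not available here, so the following is only a minimal sketch of what the first function could look like. The name `ci_mean_known_var` and its parameters are our own choices. Note that it takes the standard deviation $\sigma$ rather than the variance $\sigma^2$.

```python
import numpy as np
from scipy import stats


def ci_mean_known_var(sample, sigma, alpha=0.05):
    """Return (lower, upper) bounds of the (1 - alpha) CI of the mean.

    Assumes a normal sample whose standard deviation `sigma` is known.
    """
    x = np.asarray(sample, dtype=float)
    n = x.size
    xbar = x.mean()
    # Quantile of order 1 - alpha/2 of the standard normal distribution
    z = stats.norm.ppf(1 - alpha / 2)
    half_width = z * sigma / np.sqrt(n)
    return xbar - half_width, xbar + half_width
```

The interval is symmetric around the sample mean, and its half-width shrinks at the rate $1/\sqrt{n}$.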

2nd function
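
As with the first function, the original cell is missing, so this is a hedged sketch under the same assumptions. The name `ci_mean_unknown_var` is hypothetical. The sample standard deviation $S$ uses the $n-1$ denominator (`ddof=1`), and the quantile comes from the t-distribution with $n-1$ degrees of freedom.

```python
import numpy as np
from scipy import stats


def ci_mean_unknown_var(sample, alpha=0.05):
    """Return (lower, upper) bounds of the (1 - alpha) CI of the mean.

    Assumes a normal sample with unknown variance, estimated by S^2.
    """
    x = np.asarray(sample, dtype=float)
    n = x.size
    xbar = x.mean()
    s = x.std(ddof=1)  # sample standard deviation (n - 1 denominator)
    # Quantile of order 1 - alpha/2 of the t-distribution with n - 1 df
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half_width = t_crit * s / np.sqrt(n)
    return xbar - half_width, xbar + half_width
```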

Practice

Simulating data
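
One possible way to simulate a normal sample with NumPy is sketched below. The seed, the sample size and the parameters $\mu = 40$ and $\sigma = 10$ are arbitrary values chosen for illustration, not the ones used in the original notebook.

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
mu, sigma, n = 40.0, 10.0, 200

# A seeded generator makes the simulation reproducible
rng = np.random.default_rng(seed=42)
x = rng.normal(loc=mu, scale=sigma, size=n)

print(x[:5])
```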

Computing the sample mean

$CI_{1-\alpha}(\mu)$ with known variance and $\alpha = 0.05$

Computing the sample standard deviation

$CI_{1-\alpha}(\mu)$ with unknown variance and $\alpha = 0.05$

We can also use a function already implemented in the scipy library to compute $CI_{1-\alpha}(\mu)$ when the variance is unknown.
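
A likely candidate for this function is `scipy.stats.t.interval`, although the original cell is not shown. The sketch below applies it to a simulated sample (hypothetical parameters). The confidence level is passed positionally because its keyword name changed between SciPy versions.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated sample, for illustration only
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=40.0, scale=10.0, size=50)

n = x.size
# loc is the sample mean; scale is the standard error S / sqrt(n)
ci_low, ci_high = stats.t.interval(
    0.95, df=n - 1, loc=x.mean(), scale=stats.sem(x)
)
print(ci_low, ci_high)
```

`stats.sem` uses the $n-1$ denominator by default, so the result agrees with the formula $\overline{X} \pm t_{1-\alpha/2,\,n-1}\, S/\sqrt{n}$.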

We can also use scipy.stats.norm.interval to compute $CI_{1-\alpha}(\mu)$ with known variance.
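
A short sketch of that call, again on a simulated sample with hypothetical parameters. Here `sigma` plays the role of the known standard deviation, and the scale passed to the function is the standard error $\sigma/\sqrt{n}$.

```python
import numpy as np
from scipy import stats

# Hypothetical simulated sample with known sigma, for illustration only
sigma = 10.0
rng = np.random.default_rng(seed=0)
x = rng.normal(loc=40.0, scale=sigma, size=50)

n = x.size
ci_low, ci_high = stats.norm.interval(
    0.95, loc=x.mean(), scale=sigma / np.sqrt(n)
)
print(ci_low, ci_high)
```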

Practice with data

Application: comparing the average DailyRate between men and women.

The female sample size

The female sample mean

The female standard error

The DailyRate Female CI(95%)

We then compute the 95% confidence interval of the female DailyRate mean.

We now select the DailyRate samples for men and women separately.

We can then visualize these Confidence intervals together to see the difference between the DailyRate means.

We will now write a Python function that can compare the Confidence Intervals of the means for a given continuous variable according to groups defined by a categorical variable.
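
The original function is not reproduced here. As a rough sketch, such a function could return one t-based interval per group in a table (plotting is left out). The name `ci_mean_by_group`, the `Gender` column and the toy data are assumptions made for illustration; only `DailyRate` comes from the text.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ci_mean_by_group(df, value_col, group_col, alpha=0.05):
    """Return a table with n, mean and (1 - alpha) CI bounds per group."""
    rows = []
    for level, values in df.groupby(group_col)[value_col]:
        x = values.dropna().to_numpy(dtype=float)
        n = x.size
        mean = x.mean()
        se = x.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
        rows.append({
            group_col: level,
            "n": n,
            "mean": mean,
            "lower": mean - t_crit * se,
            "upper": mean + t_crit * se,
        })
    return pd.DataFrame(rows).set_index(group_col)


# Toy data standing in for the real file (values are simulated)
rng = np.random.default_rng(seed=1)
toy = pd.DataFrame({
    "Gender": np.repeat(["Female", "Male"], 30),
    "DailyRate": rng.normal(loc=800, scale=400, size=60),
})
result = ci_mean_by_group(toy, "DailyRate", "Gender")
print(result)
```

Groups whose intervals do not overlap suggest different means, while overlapping intervals are inconclusive on their own.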

DailyRate and EducationField

DailyRate and Department

The proportion

Assume that we would like to estimate the probability $p$ that a voter votes for candidate A in an election. We randomly select $n$ people and let $X$ be the number of people who report that they will vote for A.

The probability or the proportion $p$ is then estimated by

$$\hat{p} = \frac{X}{n}$$

In most cases the sample size $n$ is large, and the probability distribution of $\hat{p}$ is approximated by a Normal distribution with mean $\mu = p$ and variance $\sigma^2 = \frac{p(1-p)}{n}$. We can then provide, for a given $\alpha \in (0,1)$, a \textbf{Confidence interval} with level $1-\alpha$:

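$$CI_{1-\alpha}(p) = \left(\hat{p} - z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$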

where $z_{1-\alpha/2}$ is the z-score associated with $1-\alpha/2$. That is, if $Z$ is a random variable with a standard normal distribution, then $F_Z(z_{1-\alpha/2}) = 1-\alpha/2$, where $F_Z$ is the CDF of $Z$.

Example: A survey was conducted to estimate the probability $p$ of voting for candidate A. Among the 1200 people who participated in the survey, 560 reported that they will vote for candidate A. Find an estimate of $p$ and its 95\% confidence interval.
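
The example can be worked out directly from the formula for $CI_{1-\alpha}(p)$. The sketch below uses only the Python standard library (`statistics.NormalDist` for the normal quantile); the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

n = 1200      # number of survey participants
x = 560       # participants who report they will vote for A
alpha = 0.05

p_hat = x / n
# Quantile of order 1 - alpha/2 of the standard normal distribution
z = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)

print(f"p_hat = {p_hat:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The estimate is $\hat{p} \approx 0.467$, with an interval of roughly $(0.438, 0.495)$. The whole interval lies below $0.5$.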

Importing the function for computing proportion confidence intervals

There are also four other methods to compute the proportion confidence interval:

Example: We would like to compare the proportion of frequent travelers across the three departments. We start by computing the contingency table between the variables BusinessTravel and Department.
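
A contingency table of this kind can be built with `pandas.crosstab`. The sketch below uses a tiny made-up table in place of the real data, and the category labels (such as `Travel_Frequently`) are assumptions about how the values are coded.

```python
import pandas as pd

# Made-up rows standing in for the real data; labels are assumed
toy = pd.DataFrame({
    "BusinessTravel": [
        "Travel_Frequently", "Travel_Rarely", "Non-Travel",
        "Travel_Frequently", "Travel_Rarely", "Travel_Rarely",
        "Travel_Frequently", "Non-Travel",
    ],
    "Department": [
        "Sales", "Sales", "Sales",
        "R&D", "R&D", "R&D",
        "HR", "HR",
    ],
})

# Rows: travel categories; columns: departments; cells: counts
table = pd.crosstab(toy["BusinessTravel"], toy["Department"])

# Proportion of frequent travelers within each department
prop_frequent = table.loc["Travel_Frequently"] / table.sum(axis=0)

print(table)
print(prop_frequent)
```

The column totals give the sample sizes $n$ per department, and the `Travel_Frequently` row gives the counts $X$ needed for each $\hat{p}$.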

The difference of two proportions

We now observe two independent samples with different sizes $n_1$ and $n_2$. We estimate a proportion from each sample, and we aim to provide a confidence interval for the difference between these proportions. It can be expressed as follows:

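$$CI_{1-\alpha}(p_1 - p_2) = \left(\hat{p}_1 - \hat{p}_2 - z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\ \hat{p}_1 - \hat{p}_2 + z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right)$$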

We will write the following Python function
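
The original function is not shown, so here is a minimal sketch that follows the formula for $CI_{1-\alpha}(p_1 - p_2)$ term by term. The name `ci_diff_proportions` and its arguments (counts `x1`, `x2` and sample sizes `n1`, `n2`) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_diff_proportions(x1, n1, x2, n2, alpha=0.05):
    """Return (lower, upper) bounds of the (1 - alpha) CI of p1 - p2.

    Uses the normal approximation for two independent samples.
    """
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference of two independent proportions
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Usage with made-up counts: 50 out of 100 versus 40 out of 100
print(ci_diff_proportions(50, 100, 40, 100))
```

If the resulting interval contains 0, the data are compatible with equal proportions at the chosen level.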

The confidence interval of the difference between the proportions of frequent travelers in the R&D and HR departments

The confidence interval of the difference between the proportions of frequent travelers in the R&D and Sales departments