Hypothesis testing is a data analysis method used to test one hypothesis (called the null hypothesis, $H_0$) against another (the alternative hypothesis, $H_1$).
We will use the random sample $X_1,\ldots, X_n$ (the data) to decide between the hypotheses $H_0$ and $H_1$ at a fixed level of significance $\alpha$, which is the probability of rejecting $H_0$ when $H_0$ is in fact true:
$$\alpha = \mathbb{P}\left(H_0\mbox{ rejected } \mid H_0 \mbox{ true}\right)$$
Any hypothesis testing procedure should provide the following:
The test statistic $T$: a random variable computed from the random sample, whose probability distribution is known when $H_0$ is true.
The observed statistic $t_\mbox{obs}$ of $T$, computed from the observed sample $x_1,\ldots,x_n$.
There are two types of Hypothesis testing procedures: parametric and non-parametric testing.
Parametric hypothesis testing is a testing procedure used when the hypothesis is based on comparing a population parameter to given values.
Non-parametric hypothesis testing is used when the hypothesis is not based on a population parameter. It's about testing one assumption against its opposite.
Parametric tests are generally more powerful and reliable than non-parametric tests when their distributional assumptions hold.
The hypothesis is stated in terms of the parameters of the population distribution.
We will see in this chapter how to perform Hypothesis testing in the following cases:
We would like to test the null hypothesis $$ H_0: \; \mu=\mu_0$$ versus $$ H_1: \; \mu\not=\mu_0$$ where $\mu_0$ is a given value, using a random sample $X_1,\ldots,X_n$ assumed to be generated from a Normal distribution with mean $\mu$ and unknown variance $\sigma^2$.
The test statistic of the test is $$T=\sqrt{n}\,\displaystyle\frac{\overline{X}-\mu}{S}$$ where $\overline{X}$ is the sample mean and $S$ is the sample standard deviation.
Under $H_0$, $T$ is equal to $$T=\sqrt{n}\,\displaystyle\frac{\overline{X}-\mu_0}{S}$$ and follows a $t$-distribution with $n-1$ degrees of freedom.
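As a minimal sketch of the rejection rule (the values $n=15$ and $\alpha=0.05$ are illustrative assumptions), the two-sided test rejects $H_0$ when $|t_\mbox{obs}|$ exceeds the $(1-\alpha/2)$ quantile of the $t$-distribution with $n-1$ degrees of freedom:

```python
from scipy.stats import t

# Two-sided rejection rule: reject H0 when |t_obs| > t_{1-alpha/2, n-1}.
# Assumed values for illustration: n = 15 observations, alpha = 0.05.
n, alpha = 15, 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # critical value of the t-distribution
print(round(t_crit, 3))  # 2.145
```

Equivalently, $H_0$ is rejected whenever the p-value of the test is below $\alpha$.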
Let's generate a random sample of size 15 from a Normal distribution with mean $\mu=-2$ and standard deviation $\sigma=2$ (hence variance $\sigma^2=4$).
import numpy as np
np.random.seed(7654567)
x=np.random.normal(size=15,loc=-2,scale=2)
x
array([-3.89395351, -2.55250403, -2.39746865, 0.98703995, -2.8607237 , -1.39481847, -0.74237156, -4.63886095, -3.81258843, -2.89188048, -1.50048335, -3.31429764, -3.28882943, -0.96256977, -2.0684448 ])
We assume that the random sample x
is generated from a Normal probability distribution with unknown mean and variance. We now test the following hypothesis
$$ H_0: \; \mu=-2$$versus $$ H_1: \; \mu\not=-2$$
from scipy.stats import ttest_1samp
test_mean=ttest_1samp(x, -2, alternative='two-sided')
The output above shows that $t_\mbox{obs}$ is
test_mean.statistic
-0.948081719359044
and the pvalue is
test_mean.pvalue
0.3591669622802579
Since this p-value is greater than 0.05 (5%), we fail to reject the null hypothesis $H_0$.
How are the test statistic and the p-value computed?
m=np.mean(x)
m
-2.3555169880044056
std_error=np.std(x)/np.sqrt(len(x)-1)  # equivalent to np.std(x, ddof=1)/np.sqrt(len(x))
std_error
0.3749855953817513
Under $ H_0: \; \mu=-2$, the test statistics is then
(m+2)/std_error
-0.948081719359044
test_mean.statistic
-0.948081719359044
The p-value is then computed as follows (since the observed statistic is negative, the two-sided p-value $2\,\mathbb{P}(|T|\geq |t_\mbox{obs}|)$ reduces to twice the CDF evaluated at $t_\mbox{obs}$):
from scipy import stats
X = stats.t(len(x)-1)
2*X.cdf(test_mean.statistic)
0.3591669622802579
And the p-value is
test_mean.pvalue
0.3591669622802579
We can also perform the following hypothesis testing. It's called the lower-tail alternative test:
$$ H_0: \; \mu=-2$$versus $$ H_1: \; \mu<-2$$
test_mean1=ttest_1samp(x, -2, alternative='less')
test_mean1.statistic
-0.948081719359044
test_mean1.pvalue
0.17958348114012895
It's computed as follows
X.cdf(test_mean.statistic)
0.17958348114012895
We can also perform the following hypothesis testing. It's called the upper-tail alternative test:
$$ H_0: \; \mu=-2$$versus $$ H_1: \; \mu>-2$$
test_mean2=ttest_1samp(x, -2, alternative='greater')
test_mean2.statistic
-0.948081719359044
test_mean2.pvalue
0.8204165188598711
The p-value is computed as follows
1-X.cdf(test_mean.statistic)
0.8204165188598711
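The three alternatives are related: for the same data, the 'less' and 'greater' p-values sum to one, and the two-sided p-value is twice the smaller of the two. A quick check on a simulated sample (the seed and sample size are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Relations between the three alternatives of the one-sample t-test:
# p_less + p_greater = 1 and p_two_sided = 2 * min(p_less, p_greater).
rng = np.random.default_rng(3)           # assumed seed, for illustration
x = rng.normal(loc=-2, scale=2, size=15)
p_two = ttest_1samp(x, -2, alternative='two-sided').pvalue
p_less = ttest_1samp(x, -2, alternative='less').pvalue
p_greater = ttest_1samp(x, -2, alternative='greater').pvalue
print(np.isclose(p_less + p_greater, 1.0))            # True
print(np.isclose(p_two, 2 * min(p_less, p_greater)))  # True
```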
We have two types of hypothesis testing comparing two means:
Paired (dependent) samples, where the two measurements are taken on the same subjects. Example: compare the blood pressure of a group of patients before and after a drug treatment.
Independent samples, where the two measurements come from unrelated groups. Example: compare the salaries of a sample of men and a sample of women employees.
We are going to test the hypothesis that the two means are equal. Example: we compare the grades between Quiz 1 and Quiz 2.
import pandas as pd
df=pd.read_csv('student_grades.csv')
df
Student | Section | Quiz 1 | Quiz 2 | Midterm 1 | |
---|---|---|---|---|---|
0 | 1 | Section 1 | 17.5 | 13.4575 | 22.5 |
1 | 2 | Section 1 | 16.5 | 14.2125 | 25.0 |
2 | 3 | Section 2 | 12.5 | 14.6150 | 23.5 |
3 | 4 | Section 2 | 10.5 | 15.2100 | 25.0 |
4 | 5 | Section 2 | 5.5 | 12.2225 | 24.5 |
5 | 6 | Section 1 | 14.0 | 14.5875 | 23.5 |
6 | 7 | Section 2 | 11.0 | NaN | 20.5 |
7 | 8 | Section 2 | 15.0 | 18.7500 | 25.0 |
8 | 9 | Section 2 | 17.0 | 16.5000 | 25.0 |
9 | 10 | Section 1 | 14.0 | 19.2500 | 19.5 |
10 | 11 | Section 2 | 11.0 | 10.9625 | 24.5 |
11 | 12 | Section 1 | 10.5 | 7.3650 | 19.5 |
12 | 13 | Section 1 | 16.5 | 18.2450 | 25.0 |
13 | 14 | Section 1 | 14.5 | 18.8375 | 21.5 |
14 | 15 | Section 1 | 14.5 | 11.1100 | 25.0 |
15 | 16 | Section 1 | 8.0 | 8.7475 | 24.0 |
16 | 17 | Section 1 | 14.5 | 15.9125 | 21.0 |
17 | 18 | Section 2 | 14.0 | 8.8975 | 16.5 |
18 | 19 | Section 1 | 11.0 | 4.1425 | 21.0 |
19 | 20 | Section 2 | 17.0 | 14.9500 | 25.0 |
20 | 21 | Section 1 | 18.5 | 13.6250 | 24.0 |
21 | 22 | Section 2 | 14.0 | 13.5025 | 23.0 |
22 | 23 | Section 1 | 10.5 | 17.7025 | 21.5 |
23 | 24 | Section 1 | 10.0 | 15.5025 | 24.0 |
24 | 25 | Section 2 | 11.0 | 12.8750 | 20.0 |
25 | 26 | Section 2 | 15.0 | 12.2075 | 25.0 |
26 | 27 | Section 2 | 16.0 | 19.0650 | 24.5 |
27 | 28 | Section 2 | 19.5 | 20.0000 | 24.5 |
28 | 29 | Section 1 | 15.0 | 11.9200 | 24.0 |
29 | 30 | Section 1 | 18.0 | 14.4950 | 25.0 |
30 | 31 | Section 2 | 14.5 | 16.5650 | 25.0 |
31 | 32 | Section 2 | 18.5 | 18.8250 | 24.0 |
32 | 33 | Section 1 | 14.0 | 12.1900 | 25.0 |
33 | 34 | Section 2 | 11.5 | 10.4950 | 23.5 |
34 | 35 | Section 2 | 10.0 | 10.5725 | 24.0 |
35 | 36 | Section 2 | 20.0 | 18.5000 | 25.0 |
36 | 37 | Section 1 | 11.0 | 14.9400 | 25.0 |
37 | 38 | Section 1 | 15.0 | 14.7175 | 23.5 |
38 | 39 | Section 2 | 18.5 | 19.5050 | 23.5 |
39 | 40 | Section 1 | NaN | 20.0000 | 25.0 |
40 | 41 | Section 2 | 9.5 | 13.4625 | 20.0 |
41 | 42 | Section 1 | 14.0 | 11.8775 | 20.5 |
42 | 43 | Section 2 | 3.0 | NaN | 13.0 |
43 | 44 | Section 2 | 13.0 | 20.0000 | 25.0 |
44 | 45 | Section 1 | 16.0 | 18.5700 | 25.0 |
45 | 46 | Section 1 | 10.5 | 12.8325 | 23.0 |
46 | 47 | Section 1 | 16.5 | 15.4725 | 21.0 |
47 | 48 | Section 2 | 10.5 | 18.2550 | 23.5 |
48 | 49 | Section 1 | 17.5 | 17.8750 | 25.0 |
49 | 50 | Section 2 | 13.5 | 9.8100 | 23.5 |
50 | 51 | Section 1 | 14.5 | 19.5050 | 23.5 |
51 | 52 | Section 1 | 16.5 | 11.7875 | 21.5 |
We remove the missing values from the data
df=df.dropna()
df
Student | Section | Quiz 1 | Quiz 2 | Midterm 1 | |
---|---|---|---|---|---|
0 | 1 | Section 1 | 17.5 | 13.4575 | 22.5 |
1 | 2 | Section 1 | 16.5 | 14.2125 | 25.0 |
2 | 3 | Section 2 | 12.5 | 14.6150 | 23.5 |
3 | 4 | Section 2 | 10.5 | 15.2100 | 25.0 |
4 | 5 | Section 2 | 5.5 | 12.2225 | 24.5 |
5 | 6 | Section 1 | 14.0 | 14.5875 | 23.5 |
7 | 8 | Section 2 | 15.0 | 18.7500 | 25.0 |
8 | 9 | Section 2 | 17.0 | 16.5000 | 25.0 |
9 | 10 | Section 1 | 14.0 | 19.2500 | 19.5 |
10 | 11 | Section 2 | 11.0 | 10.9625 | 24.5 |
11 | 12 | Section 1 | 10.5 | 7.3650 | 19.5 |
12 | 13 | Section 1 | 16.5 | 18.2450 | 25.0 |
13 | 14 | Section 1 | 14.5 | 18.8375 | 21.5 |
14 | 15 | Section 1 | 14.5 | 11.1100 | 25.0 |
15 | 16 | Section 1 | 8.0 | 8.7475 | 24.0 |
16 | 17 | Section 1 | 14.5 | 15.9125 | 21.0 |
17 | 18 | Section 2 | 14.0 | 8.8975 | 16.5 |
18 | 19 | Section 1 | 11.0 | 4.1425 | 21.0 |
19 | 20 | Section 2 | 17.0 | 14.9500 | 25.0 |
20 | 21 | Section 1 | 18.5 | 13.6250 | 24.0 |
21 | 22 | Section 2 | 14.0 | 13.5025 | 23.0 |
22 | 23 | Section 1 | 10.5 | 17.7025 | 21.5 |
23 | 24 | Section 1 | 10.0 | 15.5025 | 24.0 |
24 | 25 | Section 2 | 11.0 | 12.8750 | 20.0 |
25 | 26 | Section 2 | 15.0 | 12.2075 | 25.0 |
26 | 27 | Section 2 | 16.0 | 19.0650 | 24.5 |
27 | 28 | Section 2 | 19.5 | 20.0000 | 24.5 |
28 | 29 | Section 1 | 15.0 | 11.9200 | 24.0 |
29 | 30 | Section 1 | 18.0 | 14.4950 | 25.0 |
30 | 31 | Section 2 | 14.5 | 16.5650 | 25.0 |
31 | 32 | Section 2 | 18.5 | 18.8250 | 24.0 |
32 | 33 | Section 1 | 14.0 | 12.1900 | 25.0 |
33 | 34 | Section 2 | 11.5 | 10.4950 | 23.5 |
34 | 35 | Section 2 | 10.0 | 10.5725 | 24.0 |
35 | 36 | Section 2 | 20.0 | 18.5000 | 25.0 |
36 | 37 | Section 1 | 11.0 | 14.9400 | 25.0 |
37 | 38 | Section 1 | 15.0 | 14.7175 | 23.5 |
38 | 39 | Section 2 | 18.5 | 19.5050 | 23.5 |
40 | 41 | Section 2 | 9.5 | 13.4625 | 20.0 |
41 | 42 | Section 1 | 14.0 | 11.8775 | 20.5 |
43 | 44 | Section 2 | 13.0 | 20.0000 | 25.0 |
44 | 45 | Section 1 | 16.0 | 18.5700 | 25.0 |
45 | 46 | Section 1 | 10.5 | 12.8325 | 23.0 |
46 | 47 | Section 1 | 16.5 | 15.4725 | 21.0 |
47 | 48 | Section 2 | 10.5 | 18.2550 | 23.5 |
48 | 49 | Section 1 | 17.5 | 17.8750 | 25.0 |
49 | 50 | Section 2 | 13.5 | 9.8100 | 23.5 |
50 | 51 | Section 1 | 14.5 | 19.5050 | 23.5 |
51 | 52 | Section 1 | 16.5 | 11.7875 | 21.5 |
from scipy.stats import ttest_rel
test_diffmean1=ttest_rel(df['Quiz 1'],df['Quiz 2'])
test_diffmean1.statistic
-1.1156139485204906
test_diffmean1.pvalue
0.2701415971643917
We have tested here the following hypothesis:
$$ H_0:\,\mbox{ the averages of the grades in Q1 and Q2 are equal}$$versus $$ H_1:\,\mbox{ the averages of the grades in Q1 and Q2 are different}$$
We used a paired t-test, and since the p-value is greater than 0.05 (5%, the given level of significance), we conclude that $H_0$ can't be rejected.
We can also test whether the mean $\mu_1$ of the Quiz 1 grades is lower than the mean $\mu_2$ of the Quiz 2 grades:
$$ H_0: \, \mu_1\geq \mu_2$$versus $$ H_1:\, \mu_1<\mu_2 $$
test_diffmean1a=ttest_rel(df['Quiz 1'],df['Quiz 2'],alternative='less')
test_diffmean1a.statistic
-1.1156139485204906
test_diffmean1a.pvalue
0.13507079858219584
Conclusion: $H_0$ can't be rejected, since the p-value is greater than 0.05.
test_diffmean1b=ttest_rel(df['Quiz 1'],df['Quiz 2'],alternative='greater')
test_diffmean1b.pvalue
0.8649292014178042
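A paired t-test is equivalent to a one-sample t-test applied to the differences; a small sketch on simulated data (not the grades file) illustrates this:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# The paired t-test on (before, after) equals the one-sample t-test
# of H0: mean difference = 0. Data below are simulated for illustration.
rng = np.random.default_rng(1)
before = rng.normal(12, 3, size=20)
after = before + rng.normal(0.5, 1, size=20)
paired = ttest_rel(before, after)
one_sample = ttest_1samp(before - after, 0)
print(np.isclose(paired.statistic, one_sample.statistic))  # True
print(np.isclose(paired.pvalue, one_sample.pvalue))        # True
```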
We will now compare the mean of Quiz 1 between Section 1 and Section 2 using an independent two-sample t-test.
x1=df['Quiz 1'].loc[df['Section ']=='Section 1']
x1
0 17.5 1 16.5 5 14.0 9 14.0 11 10.5 12 16.5 13 14.5 14 14.5 15 8.0 16 14.5 18 11.0 20 18.5 22 10.5 23 10.0 28 15.0 29 18.0 32 14.0 36 11.0 37 15.0 41 14.0 44 16.0 45 10.5 46 16.5 48 17.5 50 14.5 51 16.5 Name: Quiz 1, dtype: float64
x2=df['Quiz 1'].loc[df['Section ']=='Section 2']
x2
2 12.5 3 10.5 4 5.5 7 15.0 8 17.0 10 11.0 17 14.0 19 17.0 21 14.0 24 11.0 25 15.0 26 16.0 27 19.5 30 14.5 31 18.5 33 11.5 34 10.0 35 20.0 38 18.5 40 9.5 43 13.0 47 10.5 49 13.5 Name: Quiz 1, dtype: float64
from scipy.stats import ttest_ind
test_diffmean2=ttest_ind(x1,x2)
test_diffmean2.statistic
0.4200033340360798
test_diffmean2.pvalue
0.6763969243532744
Since the p-value is greater than 0.05, we can't reject the hypothesis that both sections have the same mean grade in Quiz 1.
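One caveat: by default ttest_ind assumes that both groups have equal variances; when that assumption is doubtful, Welch's t-test (equal_var=False) is the common alternative. A sketch on simulated groups (the seed and group sizes are illustrative assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind

# ttest_ind pools the variances by default; equal_var=False gives
# Welch's t-test, which does not assume equal variances.
rng = np.random.default_rng(2)               # assumed seed, for illustration
g1 = rng.normal(14, 3, size=26)
g2 = rng.normal(14, 3, size=23)
pooled = ttest_ind(g1, g2)                   # pooled-variance t-test
welch = ttest_ind(g1, g2, equal_var=False)   # Welch's t-test
print(round(pooled.pvalue, 3), round(welch.pvalue, 3))
```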
Assume that we would like to test the following Hypothesis: $$H_0\,: \,\,p=p_0,\;\; \mbox{ versus }H_1\,:\,\, p\not=p_0$$
where $p$ is a population proportion and $p_0$ is a given value of $p$. The null hypothesis $H_0$ and the alternative $H_1$ will be tested from given data on a random sample $X_1,\ldots, X_n$ with a Bernoulli distribution. The random variables $X_1,\ldots, X_n$ are binary variables with values $1$ and $0$, where each random variable $X_i$ takes the value 1 with probability $p$.
In most cases the data reported in this type of test is the number $X$ of successes among the $n$ trials. This latter random variable $X$ is defined as follows: $$ X=X_1+\ldots+X_n=\sum_{k=1}^n X_k$$ and has a Binomial distribution with size $n$ and probability $p$.
The probability $p$ is often estimated using the random variable $\widehat{p}$:
$$\widehat{p}=\displaystyle\frac{X}{n}.$$ When $n$ is large ($n\geq 30$), the random variable $\widehat{p}$ approximately follows a Normal distribution:
$$\widehat{p}\sim \mathcal{N}\left(p, \displaystyle\frac{p(1-p)}{n}\right).$$Hence the random variable $Z=\sqrt{n}\displaystyle\frac{\widehat{p}-p}{\sqrt{\widehat{p}(1-\widehat{p})}}$ follows approximately a standard normal distribution.
The random variable $Z$ will be the test statistic of the proportion hypothesis test.
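The normal approximation can be checked by simulation; a sketch with assumed values $n=200$ and $p=0.45$:

```python
import numpy as np

# Simulate the Z statistic under H0 many times and check that its
# mean and standard deviation are close to 0 and 1 (standard normal).
rng = np.random.default_rng(0)   # assumed seed, for illustration
n, p = 200, 0.45                 # assumed values, for illustration
phat = rng.binomial(n, p, size=100_000) / n
z = np.sqrt(n) * (phat - p) / np.sqrt(phat * (1 - phat))
print(round(z.mean(), 2), round(z.std(), 2))
```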
Example: According to the Washington Post, nearly 45% of all Americans are born with brown eyes, although their eyes don't necessarily stay brown. A random sample of 80 adults found 32 with brown eyes. Is there sufficient evidence at the .01 level to indicate that the proportion of brown-eyed adults differs from the proportion of Americans who are born with brown eyes?
In this example the sample size is $n=80$; the random sample is $X_1,\ldots,X_n$, where each $X_i$ represents an American adult: $X_i$ is $1$ if this adult has brown eyes and $0$ otherwise. The random variable $X$ is the number of Americans with brown eyes among the 80 surveyed adults. The observed value of $X$ is 32.
We are testing here the following two hypotheses: $$H_0\,:\,\, p=.45\quad \mbox{ versus }\quad H_1\,:\,\, p\not=.45$$
Under the hypothesis $H_0$ ($p=.45$), the random variable $Z$ is equal to: $$Z=\sqrt{80}\displaystyle\frac{\widehat{p}-.45}{\sqrt{.4\times .6}}$$
The observed value of $\widehat{p}$ from the data is: $$\widehat{p}_{\mbox{obs}}=\displaystyle\frac{32}{80}=.4$$
Then the observed value of $Z$ is $$Z_{\mbox{obs}}=\sqrt{80}\displaystyle\frac{.4-.45}{\sqrt{.4\times .6}}=-0.913$$
Since the alternative hypothesis is $H_1\,:\, p\not=.45$, the test is called a two-sided test and its p-value is computed as follows:
$$\begin{array}{rcl} \mbox{p-value}& = & \mathbb{P}\left(|Z|\geq |z_\mbox{obs}| \mid H_0 \mbox{ true }\right)\\ & = & \mathbb{P}\left(|Z|\geq 0.913 \mid p=.45\right) \\ & = & 2\times (1-F_Z(0.913)) \\ & = & 0.361 \end{array}$$where $F_Z$ is the cumulative distribution function (CDF) of the standard normal distribution.
Since the level of significance of the test is $1\%=0.01$ and the p-value is greater than $1\%$, we do not reject the null hypothesis $H_0$.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
stat, pvalue=proportions_ztest(count=32, nobs=80, value=.45, alternative='two-sided')
stat
-0.9128709291752767
pvalue
0.3613104285261789
phat=32/80
phat
0.4
Zscore=np.sqrt(80)*(phat-.45)/np.sqrt(.4*.6)
Zscore
-0.9128709291752768
from scipy.stats import norm,t
2*(1-norm.cdf(np.abs(Zscore),loc=0,scale=1))
0.361310428526179
The alternative hypothesis $H_1$ can also be one of the following:
alternative='smaller'
alternative='larger'
In the case of alternative='smaller'
we proceed as follows:
stat, pvalue=proportions_ztest(count=32, nobs=80, value=.45, alternative='smaller')
stat
-0.9128709291752767
pvalue
0.18065521426308945
The pvalue is computed as follows
norm.cdf(stat,0,1)
0.18065521426308945
In the case of alternative='larger'
we proceed as follows:
stat, pvalue=proportions_ztest(count=32, nobs=80, value=.45, alternative='larger')
stat
-0.9128709291752767
pvalue
0.8193447857369105
The pvalue is computed as follows
1-norm.cdf(stat,0,1)
0.8193447857369105
Example: Pizza-Hut claims that 90% of its orders are delivered within 10 minutes of the time the order is placed. A sample of 100 orders revealed that 82 were delivered within the promised time. At the 10% significance level, can we conclude that less than 90% of the orders are delivered within the promised time?
We are testing in this example the following hypotheses: $$H_0\,:\,\, p=.9\quad \mbox{ versus }\quad H_1\,:\,\, p<.9$$
Here the sample size is $n=100$ and the observed value of $X$ is 82.
stat, pvalue=proportions_ztest(count=82,nobs=100,value=.9,alternative='smaller')
stat
-2.0823168251814157
pvalue
0.018656770036782476
The pvalue is computed as follows:
norm.cdf(stat,0,1)
0.018656770036782476
Since the p-value is smaller than the 10% significance level, we reject $H_0$ and conclude that less than 90% of the orders are delivered within the promised time.
Example: Of a sample of 361 owners of retail service and business firms that had gone into bankruptcy, 105 reported having no professional assistance prior to opening the business. It is claimed that at most 25% of all members of this population had no professional assistance before opening the business. Test the aforementioned claim at $\alpha=0.01$.
We're testing in this example the following hypotheses: $$H_0\,:\,\, p=.25\quad \mbox{ versus }\quad H_1\,:\,\, p<.25$$
Here the sample size is $n=361$ and the observed value of $X$ is 105.
stat, pvalue=proportions_ztest(count=105,nobs=361,value=.25,alternative='smaller')
stat
1.709349971523915
pvalue
0.9563069289956099
Since the p-value is much greater than $\alpha=0.01$, we can't reject $H_0$.
Here we have two samples, each summarized by a proportion, and we want to test whether the proportion in one of the underlying populations is greater than, less than, or different from the proportion in the other.
Example: we want to compare two different populations to see whether their proportions of success differ.
We use a two-sample z-test to check whether the sample allows us to reject the null hypothesis $H_0\,:\,\,p_A=p_B$.
We will use the following test statistic:
$$Z=\displaystyle\frac{\widehat{p}_A-\widehat{p}_B}{S_p}$$
Where $S_p$, called the pooled standard error, is computed as follows: $$ S_p=\sqrt{\widehat{p}(1-\widehat{p})\times \left(\displaystyle\frac{1}{n_A}+\displaystyle\frac{1}{n_B}\right)}$$ and $$\widehat{p}=\displaystyle\frac{n_A\widehat{p}_A+n_B\widehat{p}_B}{n_A+n_B}$$ is the pooled proportion.
Let $z_\mbox{obs}$ be the observed value of $Z$ under $H_0$. The test statistic $Z$ approximately follows the standard normal distribution, and the p-value associated with the two-sided alternative is computed as follows: $$\mbox{p-value}=\mathbb{P}\left(|Z|\geq |z_\mbox{obs}|\mid H_0\mbox{ true }\right)$$
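The formulas above can be collected into a small helper function (a sketch only; the statsmodels proportions_ztest call used in this section performs the same computation). Here it is applied to the sample counts of the example that follows:

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Pooled two-sided z-test for H0: p_A = p_B (sketch of the formulas above)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                      # pooled proportion
    sp = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))   # pooled standard error
    z = (p_a - p_b) / sp
    pvalue = 2 * (1 - norm.cdf(abs(z)))                     # two-sided p-value
    return z, pvalue

z, pv = two_proportion_ztest(410, 500, 379, 400)
print(round(z, 4))  # -5.7802
```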
In Python we proceed as follows:
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
from scipy.stats import norm,t
sample_success_a, sample_size_a = (410, 500)
sample_success_b, sample_size_b = (379, 400)
We create Python arrays for the number of successes and for the sample sizes.
successes = np.array([sample_success_a, sample_success_b])
samples = np.array([sample_size_a, sample_size_b])
successes
array([410, 379])
samples
array([500, 400])
stat, pvalue = proportions_ztest(count=successes, nobs=samples, alternative='two-sided')
stat
-5.7802476568050825
pvalue
7.459074060078635e-09
Let's now see how the test statistic and the p-value are computed:
p_pooled=successes.sum()/samples.sum()
p_pooled
0.8766666666666667
Sp=np.sqrt(p_pooled*(1-p_pooled)*(np.sum(1/samples)))
Sp
0.02205787841112558
zscore = (successes[0]/samples[0]-successes[1]/samples[1])/Sp
zscore
-5.7802476568050825
stat
-5.7802476568050825
pv=2*(1-norm.cdf(np.abs(zscore),0,1))
pv
7.459074025106815e-09
pvalue
7.459074060078635e-09
Example: Let's consider the Titanic data.
import numpy as np
import pandas as pd
import scipy.stats.distributions as dist
Importing the data
df= pd.read_csv('titanic.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.shape
(1309, 12)
We then encode the variable Survived:
df['Survived'].value_counts()
0 815 1 494 Name: Survived, dtype: int64
df['Survived'] = df['Survived'].map({1:'Survived',0:'Not Survived'})
df['Survived'].value_counts()
Not Survived 815 Survived 494 Name: Survived, dtype: int64
We select the Sex and Survived variables.
df1=df[['Sex','Survived']]
df1.shape
(1309, 2)
df1=df1.dropna()
df1.shape
(1309, 2)
tab=pd.crosstab(df1.Survived,df1.Sex)
tab
Sex | female | male |
---|---|---|
Survived | ||
Not Survived | 81 | 734 |
Survived | 385 | 109 |
tab=np.array(tab)
tab
array([[ 81, 734], [385, 109]], dtype=int64)
tab[1]
array([385, 109], dtype=int64)
counts=tab[1]
counts
array([385, 109], dtype=int64)
nobs=tab.sum(axis=0)
nobs
array([466, 843], dtype=int64)
The probability of surviving by gender:
counts/nobs
array([0.82618026, 0.12930012])
Comparing both probabilities
stat, pvalue = proportions_ztest(count=counts, nobs=nobs, alternative='two-sided')
stat
24.905334404307723
pvalue
6.513244629920008e-137
The p-value is practically zero, so we reject $H_0$: the survival probability differs significantly between female and male passengers.
Exercise: Compare the survival probability between other groups of passengers.
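As a starting point, the same pattern applies to other groupings such as passenger class. The counts below are hypothetical placeholders; on the real data they would come from a crosstab such as pd.crosstab(df['Survived'], df['Pclass']):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts for illustration only: survivors and totals for
# two passenger classes (replace with counts from the Titanic crosstab).
survived = np.array([200, 180])   # hypothetical survivors in class 1 and 3
totals = np.array([323, 709])     # hypothetical passengers in class 1 and 3
stat, pvalue = proportions_ztest(count=survived, nobs=totals, alternative='two-sided')
print(pvalue < 0.05)  # True: these two proportions differ significantly
```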