Descriptive Statistics with Python

This chapter shows how to perform a first data analysis using Python. We first describe the types of variables found in a collected dataset. We then show which statistics to calculate to describe the data well, such as the mean, median, and mode. We finish the chapter by studying the pairwise relationships between variables: we show how to compute correlation coefficients and draw the correlation heat map, and we present the $\chi^2$ test to assess whether two categorical variables are related.

Learning outcomes:

Table of Contents

Type of variables

Discrete Variable

Continuous Variable

Measuring central tendency

Mean: $\overline{x}=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n x_i$

The mode is the most frequently occurring value in a group of observations.

The median is the midpoint or middle value in a group of observations. It is also called the 50th percentile.
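The three measures above can be sketched with pandas. The values below are made up for illustration and are not the chapter's dataset; only the column name HourlyRate is borrowed from it.

```python
import pandas as pd

# Made-up hourly rates, for illustration only (not the chapter's data)
rates = pd.Series([40, 52, 52, 61, 75, 88, 94], name="HourlyRate")

mean_rate = rates.mean()      # sum of the values divided by their number
median_rate = rates.median()  # middle value of the sorted observations
mode_rate = rates.mode()      # most frequent value(s); a Series, because ties are possible

print(mean_rate, median_rate, mode_rate.tolist())
```

Note that `mode` returns a Series rather than a single number, since several values can share the highest frequency.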

The median is also called the 50% quantile or the 2nd quartile.

We can also compute the 1st quartile, or 25% quantile,

and the 3rd quartile, or 75% quantile.

Finally, we can compute the five-number summary, which is composed of the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum.
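A minimal sketch of the quartiles and the five-number summary, assuming pandas and the same made-up values as a stand-in for a real column. pandas interpolates linearly between observations by default, so other quantile conventions can give slightly different quartiles.

```python
import pandas as pd

# Made-up hourly rates, for illustration only (not the chapter's data)
rates = pd.Series([40, 52, 52, 61, 75, 88, 94], name="HourlyRate")

q1 = rates.quantile(0.25)  # 1st quartile (25% quantile)
q3 = rates.quantile(0.75)  # 3rd quartile (75% quantile)

# Five-number summary: min, 1st quartile, median, 3rd quartile, max
five_numbers = rates.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_numbers)
```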

Measuring dispersion

The range is the difference between the maximum and the minimum.

The interquartile range, the difference between the 3rd and the 1st quartiles:

The variance measures the spread of the observations around the mean. The sample variance is $$\sigma^2=\displaystyle\frac{1}{n-1}\sum_{i=1}^n (x_i-\overline{x})^2$$

The standard deviation is the square root of the variance.
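The dispersion measures above can be sketched as follows, again on made-up values. By default, pandas `var` and `std` divide by $n-1$, which matches the variance formula given above.

```python
import pandas as pd

# Made-up hourly rates, for illustration only (not the chapter's data)
rates = pd.Series([40, 52, 52, 61, 75, 88, 94], name="HourlyRate")

value_range = rates.max() - rates.min()            # maximum minus minimum
iqr = rates.quantile(0.75) - rates.quantile(0.25)  # 3rd quartile minus 1st quartile
variance = rates.var()                             # divides by n - 1 (ddof=1 is the default)
std_dev = rates.std()                              # square root of the variance

print(value_range, iqr, variance, std_dev)
```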

We can also compute most of these measures with a single call.

describe can also be applied to the whole dataset.
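As a sketch, `describe` can be called on one column or on a whole DataFrame. The small DataFrame here is made up and only reuses two column names from the chapter.

```python
import pandas as pd

# Made-up data, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "HourlyRate": [40, 52, 52, 61, 75, 88, 94],
    "YearsAtCompany": [1, 3, 5, 2, 10, 7, 4],
})

summary_one = df["HourlyRate"].describe()  # count, mean, std, min, quartiles, max
summary_all = df.describe()                # same statistics for every numeric column

print(summary_one)
print(summary_all)
```

`describe` does not report the mode, the range, or the interquartile range directly, but the last two follow from the values it returns.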

Measuring the relationship between two variables

In statistics, we often look for variables that are related to each other. After the one-dimensional description of the variables (also called flat sorting of the data), we explore the pairwise relationships between the variables.

Between two continuous variables

To determine the relationship between two continuous variables, we use the correlation coefficient, often denoted by $\rho$. It is a number in $[-1,1]$ and can be interpreted as follows: values close to $1$ indicate a strong increasing linear relationship, values close to $-1$ indicate a strong decreasing linear relationship, and values close to $0$ indicate no linear relationship.

Before computing $\rho$, we should first draw a plot called the scatter plot:

We will check the relationship between HourlyRate and YearsAtCompany. We first draw the scatter plot, which requires installing the matplotlib library.

We then import matplotlib.
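A minimal sketch of the scatter plot, assuming matplotlib is installed (for example with `pip install matplotlib`). The data are made up; with the real dataset, the two columns would be taken from the loaded DataFrame instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "HourlyRate": [40, 52, 52, 61, 75, 88, 94],
    "YearsAtCompany": [1, 3, 5, 2, 10, 7, 4],
})

fig, ax = plt.subplots()
ax.scatter(df["HourlyRate"], df["YearsAtCompany"])  # one point per observation
ax.set_xlabel("HourlyRate")
ax.set_ylabel("YearsAtCompany")
ax.set_title("YearsAtCompany against HourlyRate")
# In a notebook the figure renders inline; in a script, call plt.show() to display it.
```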

The correlation coefficient between these two variables is displayed in the following matrix:

and $\rho$ can be extracted as follows:
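One way to obtain the matrix and extract $\rho$ is sketched here with the pandas `corr` method, which computes the Pearson correlation by default. The data are made up, so the resulting value says nothing about the chapter's dataset.

```python
import pandas as pd

# Made-up data, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "HourlyRate": [40, 52, 52, 61, 75, 88, 94],
    "YearsAtCompany": [1, 3, 5, 2, 10, 7, 4],
})

# 2 x 2 correlation matrix: ones on the diagonal, rho off the diagonal
cormat2 = df[["HourlyRate", "YearsAtCompany"]].corr()
print(cormat2)

# Extract rho by row and column label
rho = cormat2.loc["HourlyRate", "YearsAtCompany"]
print(rho)
```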

Interpretation: there is no evidence of a linear relationship between the variables HourlyRate and YearsAtCompany.

We can also compute all the pairwise correlations between the variables of the data at the same time. ONLY CORRELATIONS BETWEEN CONTINUOUS VARIABLES HAVE A STATISTICAL INTERPRETATION.

Problem: Check the relationship between the professional experience variables. We are interested only in the following variables: YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, and YearsWithCurrManager.

Solution: We first compute the correlation matrix, rounded to 2 decimal places.
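A sketch of this step on made-up values; only the four column names come from the chapter. The name `cormat` follows the text further down, and `years` is a hypothetical helper list.

```python
import pandas as pd

# Hypothetical helper: the four experience variables named in the problem
years = [
    "YearsAtCompany",
    "YearsInCurrentRole",
    "YearsSinceLastPromotion",
    "YearsWithCurrManager",
]

# Made-up data, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "YearsAtCompany": [1, 3, 5, 2, 10, 7, 4],
    "YearsInCurrentRole": [0, 2, 4, 2, 7, 5, 3],
    "YearsSinceLastPromotion": [0, 1, 0, 2, 6, 1, 0],
    "YearsWithCurrManager": [0, 2, 3, 1, 8, 6, 2],
})

# Pairwise Pearson correlations, rounded to 2 decimal places
cormat = df[years].corr().round(2)
print(cormat)
```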

It is very common to represent the correlation matrix with a graph called a heat map. To make this visualization, we first need to install seaborn.

We then import the seaborn library along with the matplotlib library.

We then draw the correlation matrix cormat created above as a heat map.
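A minimal sketch of the heat map, assuming seaborn and matplotlib are installed (for example with `pip install seaborn`). The block rebuilds a correlation matrix from made-up data so that it runs on its own. The colour map and the fixed $[-1,1]$ colour scale are choices made for this sketch, not requirements.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "YearsAtCompany": [1, 3, 5, 2, 10, 7, 4],
    "YearsInCurrentRole": [0, 2, 4, 2, 7, 5, 3],
    "YearsSinceLastPromotion": [0, 1, 0, 2, 6, 1, 0],
    "YearsWithCurrManager": [0, 2, 3, 1, 8, 6, 2],
})
cormat = df.corr().round(2)

fig, ax = plt.subplots()
# annot=True writes each coefficient in its cell; vmin/vmax pin the colour scale to [-1, 1]
sns.heatmap(cormat, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation heat map")
# In a notebook the figure renders inline; in a script, call plt.show() to display it.
```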

Between two discrete variables

The relationship between two discrete variables is measured using contingency tables.

A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with $r$ levels and one with $k$ levels, then we have an $r\times k$ contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. $T_{ij}$ is the number of observations that have level $i$ for the first variable and level $j$ for the second variable.

The contingency table of the variables Attrition and Gender can be computed using the crosstab function from the pandas library.

We can add margins to the contingency table.
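The two steps above can be sketched with `pd.crosstab`. The ten observations are made up for illustration and do not come from the chapter's dataset.

```python
import pandas as pd

# Made-up observations, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "No"],
    "Gender": ["Male", "Female", "Male", "Male", "Female",
               "Female", "Male", "Male", "Male", "Female"],
})

# Contingency table: rows are Attrition levels, columns are Gender levels
tab = pd.crosstab(df["Attrition"], df["Gender"])
print(tab)

# margins=True appends the row and column totals under the label "All"
tab_margins = pd.crosstab(df["Attrition"], df["Gender"], margins=True)
print(tab_margins)
```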

We will now explore the statsmodels library, which supports a variety of approaches for analyzing contingency tables, including methods for assessing independence.

In probabilistic terms, the lack of relationship between two discrete variables is expressed as the independence of the two variables:

Two discrete random variables $A$ and $B$ are independent if for all $i$, $j$ $$\underbrace{\mathbb{P}(A=i,B=j)}_{P_{ij}}=\underbrace{\mathbb{P}(A=i)}_{P_{i+}}\times \underbrace{\mathbb{P}(B=j)}_{P_{+j}}$$

We then import the necessary libraries.

We estimate the marginal probabilities of the variable Gender.

We compute the fitted contingency table.

The entries of the previous table are called the expected counts: $E_{ij}$ is the expected count for the cell in the $i^\mbox{th}$ row and $j^\mbox{th}$ column. Under independence, $E_{ij}$ can be computed as follows: $$E_{ij}=\displaystyle\frac{T_{i+}\times T_{+j}}{N}$$ where $T_{i+}=\sum_j T_{ij}$, $T_{+j}=\sum_i T_{ij}$, and $N$ is the sample size.

We then compute the Pearson residuals: $$r_{ij}=\displaystyle\frac{T_{ij}-E_{ij}}{\sqrt{E_{ij}}}$$
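To make the two formulas concrete, here is a sketch that computes $E_{ij}$ and $r_{ij}$ by hand with NumPy on made-up counts. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up counts, for illustration only (not the chapter's dataset)
tab = pd.DataFrame(
    [[50, 70], [10, 20]],
    index=pd.Index(["No", "Yes"], name="Attrition"),
    columns=pd.Index(["Female", "Male"], name="Gender"),
)

T = tab.to_numpy(dtype=float)
row_totals = T.sum(axis=1, keepdims=True)  # T_{i+}, shape (r, 1)
col_totals = T.sum(axis=0, keepdims=True)  # T_{+j}, shape (1, k)
N = T.sum()                                # sample size

E = row_totals @ col_totals / N            # E_ij = T_{i+} * T_{+j} / N
resid = (T - E) / np.sqrt(E)               # Pearson residuals r_ij

expected = pd.DataFrame(E, index=tab.index, columns=tab.columns)
residuals = pd.DataFrame(resid, index=tab.index, columns=tab.columns)
print(expected)
print(residuals.round(3))
```

For these counts, $N=150$ and for example $E_{11}=120\times 60/150=48$.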

When the variables are independent, the Pearson residuals are expected to be close to zero, with absolute values rarely exceeding 2.

To decide whether the variables are independent or not, we compute the $\chi^2$ statistic and the corresponding p-value.

The variables Attrition and Gender are both nominal variables, so we consider the test measuring the association between nominal variables.

The $\chi^2$ statistic of the test

It is given by $\sum_{ij}r_{ij}^2$.

The corresponding number of degrees of freedom is equal to $(r-1)\times(k-1)$, where $r$ and $k$ are the numbers of levels of the two variables.

We consider the p-value

Since the p-value is higher than 0.05, we do not reject the hypothesis of independence: there is no evidence of a relationship between the variables Attrition and Gender.

To finish this analysis, we draw the mosaic plot implemented in the statsmodels library.
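A minimal sketch of the mosaic plot, assuming statsmodels and matplotlib are installed and using the `mosaic` function from `statsmodels.graphics.mosaicplot`. The observations are made up for illustration.

```python
import pandas as pd
from statsmodels.graphics.mosaicplot import mosaic

# Made-up observations, for illustration only (not the chapter's dataset)
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "No"],
    "Gender": ["Male", "Female", "Male", "Male", "Female",
               "Female", "Male", "Male", "Male", "Female"],
})

# Each tile's area is proportional to the count of one (Gender, Attrition) combination
fig, rects = mosaic(df, ["Gender", "Attrition"])
# In a notebook the figure renders inline; in a script, display it with matplotlib's plt.show().
```

When the two variables are independent, the tiles split at roughly the same height in every column.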