Probability distributions with Python

Introduction

For any random experiment, we try to define a numerical mapping that represents the outcomes of the experiment. For example, when we flip a coin, the outcomes are either heads or tails, and we can define a mapping $X$ with values in $\{0,1\}$, where $X=0$ if we obtain tails and $X=1$ if we obtain heads.

Such a mapping $X$ is then called a numerical random variable.

Definition. A random variable $X$ is a mapping from a set $\Omega$, the set of outcomes of the random experiment, to a subset of the real numbers: $$X\,:\,\Omega \longrightarrow \mathbb{R},\;\;\; \omega \longmapsto X(\omega)\in \mathbb{R}$$

There are two types of random variables: discrete and continuous. The following examples illustrate both types.

Random variables

Bernoulli random variables

We flip a coin and the outcomes are either heads or tails. The values of $X$ are either $0$ or $1$, and $X$ is then called a Bernoulli random variable. We will be interested in the probability of observing heads, $\mathbb{P}(X=1)$, and the probability of observing tails, $\mathbb{P}(X=0)$.

If we set $p=\mathbb{P}(X=1)$, we say that $X$ is a Bernoulli random variable with parameter $p$.

Rolling a die

When we roll a die, the outcomes are the numbers $1,2,3,4,5,6$. We can then define a random variable $X$ with values in $\{1,2,\ldots,6\}$, where we will be interested in the probabilities: $$\mathbb{P}(X=i)=p_i,\;\;\forall \,i=1,\ldots,6$$

Poisson random variables

Assume that we are interested in the number of customers arriving at a store during one hour. Let's denote this number by $X$. It's a random number, and it can be any number in the set of natural numbers $\mathbb{N}$.

$X$ is then a discrete random variable with values in $\mathbb{N}$. One can show that, under standard modeling assumptions, there exists a positive constant $\lambda>0$ such that for all $n\in \mathbb{N}$: $$\mathbb{P}(X=n)=e^{-\lambda} \displaystyle\frac{\lambda^n}{n!}$$

Lifetime random variables

In this example, we're interested in the lifetime of electronic components. A positive random variable $X$ can represent this random experiment, where the event $\{X>t\}$, for any positive $t$, means that the electronic component is still working after time $t$. The event $\{X<t\}$ means that the electronic component stopped working before time $t$.

The random variable $X$ is then a continuous random variable, and we will be interested in computing, for any positive numbers $a<b$, the following probability: $$\mathbb{P}(a<X<b)$$

One can show that, under an exponential model, there exists a positive constant $\lambda>0$ such that for all $a<b$: $$\mathbb{P}(a<X<b)=\displaystyle\int_a^b \lambda e^{-\lambda x}dx$$

What should we know about a random variable?

Any random variable is characterized by its probability distribution that can be described by the cumulative distribution function (CDF) $F_X$: $$\begin{array}{lll} F_X & : & \mathbb{R} \longrightarrow [0,1] \\ & & t\longmapsto F_X(t)=\mathbb{P}(X\leq t) \end{array}$$

If $X$ is discrete, we define from $F_X$ the probability (mass) function: $$\forall\, t,\;\; f_X(t)=F_X(t)-\lim_{x\rightarrow t,\, x<t} F_X(x)=\mathbb{P}(X\leq t)-\mathbb{P}(X<t)=\mathbb{P}(X=t)$$

If $X$ is a continuous random variable, we can define from $F_X$ the probability density function (PDF) $f_X$: it is the derivative of $F_X$, $f_X=F_X'$.

The probability density function can be used to derive the probability that $X$ falls in an interval $(a,b)$: $$\mathbb{P}(X\in (a,b))=\mathbb{P}(a<X<b)=\displaystyle\int_a^b f_X(t)dt$$

A random variable is fully determined once we know either its CDF or its PDF. We will also write code that imitates the random experiment producing the outcomes of $X$. We call this process simulating the random variable.

We will now see, with the SciPy library, Python functions that help us compute the CDF and the PDF, and simulate numbers from the probability distribution of a random variable.

Probability distributions with Python

We first need to install SciPy.

SciPy is a free and open-source library for scientific computing that is built on top of NumPy. We have already seen that NumPy has the capability of drawing samples from many common distributions (type help(np.random) in the Python interpreter), but SciPy has the added capability of computing the probability of observing events, and it can perform computations directly on the probability mass/density functions.
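As a minimal setup sketch (assuming SciPy is installed via pip), the imports used throughout this notebook are:

```python
# Install once from a shell:
#   pip install scipy
import numpy as np
from scipy import stats

# SciPy exposes each distribution as a "frozen" object with
# pmf/pdf, cdf, mean, var and rvs (random sampling) methods.
m = stats.bernoulli(0.5).mean()
print(m)
```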

Bernoulli distribution

We define the Bernoulli random variable

The Probability function

$\mathbb{P}(X=1)$

$\mathbb{P}(X=0)$

For any $x\not \in \{0,1\}$, $\mathbb{P}(X=x)=0$
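The three probabilities above can be computed with the pmf method of a frozen Bernoulli distribution; the value $p=0.4$ below is an assumed example (the same value used for the CDF plot later):

```python
from scipy import stats

X = stats.bernoulli(0.4)   # Bernoulli random variable with p = 0.4

print(X.pmf(1))   # P(X = 1) = p
print(X.pmf(0))   # P(X = 0) = 1 - p
print(X.pmf(2))   # P(X = x) = 0 for any x outside {0, 1}
```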

The cumulative distribution function $F_X(x)=\mathbb{P}(X\leq x)$

Let's draw the CDF of a Bernoulli random variable. We first generate random numbers in $\mathbb{R}$.

We sort the data

We then calculate the CDF values. We compute here $F_X(x)$ when $X$ follows a Bernoulli distribution with parameter $p=0.4$.
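The three steps above (generate points, sort them, evaluate the CDF) can be sketched as follows; the grid bounds and sample size are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

X = stats.bernoulli(0.4)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 2, size=200))   # random points in R, sorted
y = X.cdf(x)                                # F_X: 0 below 0, then 0.6, then 1

plt.step(x, y, where="post")                # the CDF is a step function
plt.xlabel("x")
plt.ylabel("$F_X(x)$")
plt.savefig("bernoulli_cdf.png")
```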

The mean of $X$, $\mathbb{E}(X)=p$

The variance of $X$, $\mbox{Var}(X)=p(1-p)$
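The mean and variance formulas above can be checked directly on the frozen distribution (again with the example value $p=0.4$):

```python
from scipy import stats

p = 0.4
X = stats.bernoulli(p)

print(X.mean())   # E(X) = p
print(X.var())    # Var(X) = p(1 - p)
```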

We can also generate random numbers according to the Bernoulli distribution.
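A sketch of the sampling step, using the rvs method (the seed is fixed only to make the output reproducible):

```python
from scipy import stats

X = stats.bernoulli(0.4)
samples = X.rvs(size=1000, random_state=0)   # 1000 draws, each 0 or 1

print(samples[:10])
print(samples.mean())   # by the law of large numbers, close to p = 0.4
```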

Poisson distribution

A random variable $X$ with Poisson distribution, with parameter $\lambda>0$, is a discrete random variable with values in $\mathbb{N}$, with probability function: for all $k\in \mathbb{N}$, $$p(k)=\mathbb{P}(X=k)=e^{-\lambda} \displaystyle\frac{\lambda^k}{k!}$$

$\mathbb{E}(X)=\lambda$ and $\mbox{Var}(X)=\lambda$

We consider a random variable $X$ with Poisson distribution with parameter $\lambda=1.4$.
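In SciPy, this random variable can be built as a frozen stats.poisson object:

```python
from scipy import stats

lam = 1.4
X = stats.poisson(lam)       # frozen Poisson(1.4) random variable

print(X.pmf(0))              # P(X = 0) = e^{-lambda}
print(X.mean(), X.var())     # both equal lambda
```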

Let's draw the probability function of a Poisson random variable
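One possible sketch of this plot, evaluating the pmf on the first few integers (the range 0–9 is an arbitrary choice that covers almost all the mass for $\lambda=1.4$):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

X = stats.poisson(1.4)
k = np.arange(0, 10)    # the pmf is non-zero only on integers

plt.stem(k, X.pmf(k))   # vertical bars: P(X = k)
plt.xlabel("k")
plt.ylabel("$P(X=k)$")
plt.savefig("poisson_pmf.png")
```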

Cumulative distribution function

Let's draw the CDF then
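A sketch of the CDF plot; as $X$ is discrete, its CDF is a step function:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

X = stats.poisson(1.4)
x = np.linspace(-1, 8, 400)

plt.step(x, X.cdf(x), where="post")
plt.xlabel("x")
plt.ylabel("$F_X(x)$")
plt.savefig("poisson_cdf.png")
```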

Generate Poisson random numbers
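Sampling works exactly as in the Bernoulli case, via rvs (seed fixed for reproducibility):

```python
from scipy import stats

X = stats.poisson(1.4)
samples = X.rvs(size=1000, random_state=0)

print(samples[:10])
print(samples.mean())   # close to lambda = 1.4
```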

Let's estimate the probability $\mathbb{P}(X<3)$ and compare it with the theoretical value. We will call this an empirical estimate of $\mathbb{P}(X<3)$.
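One way to sketch the empirical estimate: draw many samples and take the fraction falling below 3 (the sample size 10000 is an assumed choice):

```python
import numpy as np
from scipy import stats

X = stats.poisson(1.4)
samples = X.rvs(size=10000, random_state=0)

phat = np.mean(samples < 3)   # fraction of draws strictly below 3
print(phat)
```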

The theoretical value is:
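Since $X$ takes integer values, $\mathbb{P}(X<3)=\mathbb{P}(X\leq 2)=F_X(2)$, which SciPy computes directly:

```python
from scipy import stats

X = stats.poisson(1.4)
# X is integer-valued, so P(X < 3) = P(X <= 2) = F_X(2)
print(X.cdf(2))
```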

We will now show that the empirical estimate approximates the theoretical probability with good accuracy as the number of generations increases.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = stats.poisson(1.4)
x = np.arange(100, 10001, 100)                   # numbers of generations
phat = [np.mean(X.rvs(size=n) < 3) for n in x]   # empirical P(X < 3) for each n

plt.plot(x, phat)
plt.axhline(y=X.cdf(2.999), color='red', linestyle='--')
plt.xlabel('Number of generations')
```

Exercise

The following table gives the Python commands corresponding to some other discrete probability distributions:

| Name | Python command | Parameters |
| --- | --- | --- |
| Geometric $p$ | stats.geom(p) | $p$ is the probability |
| Binomial $(n,p)$ | stats.binom(n, p) | $n$ is the size and $p$ is the probability |
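As a quick sketch of how these two commands are used (the parameter values below are hypothetical, chosen only for illustration):

```python
from scipy import stats

G = stats.geom(0.3)        # Geometric with p = 0.3
B = stats.binom(10, 0.5)   # Binomial with n = 10, p = 0.5

print(G.mean())   # E(G) = 1/p
print(B.pmf(5))   # P(B = 5)
```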

Normal distribution

A random variable $X$ follows a Normal distribution if its probability density function is defined as follows: $$f(x)=\displaystyle\frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2}\left(\displaystyle\frac{x-\mu}{\sigma}\right)^2\right]$$

where $\mu\in \mathbb{R}$ and $\sigma>0$.

$X$ is a continuous random variable with values in $\mathbb{R}$. We write $X\sim \mathcal{N}(\mu,\sigma^2)$.

$\mathbb{E}(X)=\mu$ and $\mbox{Var}(X)=\sigma^2$

$\mathbb{P}\left(\mu-1.96\sigma \leq X\leq \mu+1.96\sigma\right)=.95$

$Z=\displaystyle\frac{X-\mu}{\sigma}$ follows the Normal distribution $\mathcal{N}(0,1)$.

We consider a random variable with Normal distribution, mean $\mu=1.4$ and variance $\sigma^2=2^2$.
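In SciPy this is stats.norm; note that the scale argument is the standard deviation $\sigma$, not the variance. A sketch that also checks the 95% interval stated above:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

mu, sigma = 1.4, 2.0
X = stats.norm(loc=mu, scale=sigma)   # scale is sigma, NOT sigma^2

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
plt.plot(x, X.pdf(x))                 # the bell curve
plt.xlabel("x")
plt.ylabel("$f_X(x)$")
plt.savefig("normal_pdf.png")

# numerical check of P(mu - 1.96 sigma <= X <= mu + 1.96 sigma) = 0.95
prob = X.cdf(mu + 1.96 * sigma) - X.cdf(mu - 1.96 * sigma)
print(prob)
```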

Exponential distribution

A random variable $X$ follows an Exponential distribution if its probability density function is defined as follows: $$f(x)=(1/\lambda)\, \exp\left(-x/\lambda \right),$$ if $x\geq 0$, and $f(x)=0$, if $x< 0$, where $\lambda>0$ is called the scale parameter.

$X$ is a continuous random variable with values in $\mathbb{R}^*_+$. We write $X\sim \mathcal{E}(\lambda)$.

$\mathbb{E}(X)=\lambda$ and $\mbox{Var}(X)=\lambda^2$.

The cumulative distribution function of an exponential random variable is expressed as follows: $$F_X(x)=1-e^{-x/\lambda}$$ when $x>0$, and $F_X(x)=0$ when $x\leq 0$.

Let's draw the curve of the CDF of $X$.
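A sketch of this plot, assuming the scale value $\lambda=2$ for illustration; SciPy's stats.expon uses the same scale convention as the formulas above:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from scipy import stats

lam = 2.0                    # assumed scale parameter, for illustration
X = stats.expon(scale=lam)

x = np.linspace(0, 10, 400)
plt.plot(x, X.cdf(x))        # F_X(x) = 1 - exp(-x / lam) for x > 0
plt.xlabel("x")
plt.ylabel("$F_X(x)$")
plt.savefig("expon_cdf.png")
```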

We simulate 1000 random numbers from a Normal distribution with mean $\mu=0$ and standard deviation $\sigma=20$.

We compute the CDF of each element in x
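These two steps can be sketched as follows (seed fixed for reproducibility); as a side note, for a continuous random variable the values $F_X(X)$ are uniformly distributed on $(0,1)$:

```python
import numpy as np
from scipy import stats

X = stats.norm(loc=0, scale=20)         # mu = 0, sigma = 20
x = X.rvs(size=1000, random_state=0)    # 1000 simulated values
cdf_values = X.cdf(x)                   # F_X evaluated at each simulated point

print(cdf_values[:5])
print(cdf_values.mean())                # close to 0.5, as F_X(X) is uniform on (0, 1)
```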