For any random experiment, we try to define a numerical mapping that represents the outcomes of the experiment. For example, when we flip a coin, the outcomes are heads or tails, and we define a mapping $X$ with values in $\{0,1\}$, where $X=0$ if we obtain tails and $X=1$ if we obtain heads.
Such a mapping $X$ is then called a numerical random variable.
Definition A random variable $X$ is a function from the set $\Omega$ of outcomes of the random experiment to a subset of the real numbers: $$X\,:\,\Omega \longrightarrow \mathbb{R}\;\;\; \omega \longmapsto X(\omega)\in \mathbb{R}$$
There are two types of random variables: discrete and continuous. Here are some examples.
We flip a coin and the outcomes are either heads or tails, so the values of $X$ are either $0$ or $1$. $X$ is called a Bernoulli random variable. We will be interested in the probability of observing heads, $\mathbb{P}(X=1)$, and the probability of observing tails, $\mathbb{P}(X=0)$.
If we set $p=\mathbb{P}(X=1)$, we say that $X$ is a Bernoulli random variable with parameter $p$.
When we roll a die, the outcomes are the numbers $1,2,3,4,5,6$. We can then define a random variable $X$ with values in $\{1,2,\ldots,6\}$, where we will be interested in the probabilities: $$\mathbb{P}(X=i)=p_i,\;\;\forall \,i=1,\ldots,6$$
Assume that we are interested in the number of customers arriving at a store during one hour. Let's denote this number by $X$. It is a random number, and it can be any number in the set of integers $\mathbb{N}$.
$X$ is then a discrete random variable with values in $\mathbb{N}$. It can be shown that there exists a positive constant $\lambda>0$ such that for all $n\in \mathbb{N}$, we have $$\mathbb{P}(X=n)=e^{-\lambda} \displaystyle\frac{\lambda^n}{n!}$$
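As a quick sketch of this formula, we can compute these probabilities directly (the rate $\lambda=2$ here is a hypothetical value chosen for illustration):

```python
import math

def poisson_pmf(n, lam):
    # P(X = n) = e^(-lambda) * lambda^n / n!
    return math.exp(-lam) * lam ** n / math.factorial(n)

# hypothetical rate: on average 2 customers per hour
p3 = poisson_pmf(3, 2.0)   # probability of exactly 3 customers in one hour
```

Note that these probabilities sum to $1$ over all $n\in\mathbb{N}$, as required for a probability distribution.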
In this example, we are interested in the lifetime of electronic components. A positive random variable $X$ can represent this random experiment, where the event $\{X>t\}$, for any positive $t$, means that the electronic component is still working after time $t$, and the event $\{X<t\}$ means that the component stopped working before time $t$.
The random variable $X$ is then a continuous random variable, and we will be interested in computing, for any positive numbers $a<b$, the following probability: $$\mathbb{P}(a<X<b)$$
It can be shown that there exists a positive constant $\lambda>0$ such that for all $a<b$: $$\mathbb{P}(a<X<b)=\displaystyle\int_a^b \lambda e^{-\lambda x}dx$$
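This integral has a closed form, $e^{-\lambda a}-e^{-\lambda b}$. A minimal check of that identity, comparing numerical integration against the closed form ($\lambda$, $a$, $b$ are illustrative values):

```python
import math
from scipy import integrate

lam = 0.5        # hypothetical rate constant
a, b = 1.0, 3.0

# numerical integration of the density lambda * e^(-lambda x) over (a, b)
numeric, _ = integrate.quad(lambda x: lam * math.exp(-lam * x), a, b)

# closed form of the same integral: e^(-lambda a) - e^(-lambda b)
closed = math.exp(-lam * a) - math.exp(-lam * b)
```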
Any random variable is characterized by its probability distribution that can be described by the cumulative distribution function (CDF) $F_X$: $$\begin{array}{lll} F_X & : & \mathbb{R} \longrightarrow [0,1] \\ & & t\longmapsto F_X(t)=\mathbb{P}(X\leq t) \end{array}$$
If $X$ is discrete, we define from $F_X$ the probability function: $$\forall\, t\;\; f_X(t)=F_X(t)-\lim_{x\rightarrow t,\, x<t} F_X(x)=\mathbb{P}(X\leq t)-\mathbb{P}(X<t)=\mathbb{P}(X=t)$$
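In other words, $\mathbb{P}(X=t)$ is the size of the jump of $F_X$ at $t$. A minimal check of this, using SciPy's Poisson distribution (the parameter and point are chosen for illustration; for an integer-valued $X$, the left limit $F_X(t^-)$ equals $F_X(t-1)$):

```python
from scipy import stats

X = stats.poisson(1.4)

t = 2
# F(t) - F(t-) = P(X = t); for integer-valued X, F(t-) = F(t - 1)
jump = X.cdf(t) - X.cdf(t - 1)
```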
If $X$ is a continuous random variable, we can define from $F_X$ the probability density function (pdf) $f_X$: it is the derivative of $F_X$, $f_X=F_X'$.
The probability density function can be used to derive the probability that $X$ falls in an interval $(a,b)$: $$\mathbb{P}(X\in (a,b))=\mathbb{P}(a<X<b)=\displaystyle\int_a^b f_X(t)dt$$
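A minimal check of this relation, using the standard normal distribution as an illustrative example: the integral of the density over $(a,b)$ should equal $F_X(b)-F_X(a)$.

```python
from scipy import stats, integrate

X = stats.norm(loc=0, scale=1)   # standard normal, as an example

a, b = -1.0, 1.0
area, _ = integrate.quad(X.pdf, a, b)   # integral of the density over (a, b)
via_cdf = X.cdf(b) - X.cdf(a)           # same probability from the CDF
```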
A random variable is fully determined if we know either its CDF or its PDF. We will also write code that imitates the random experiment providing the outcomes of $X$. We call this process simulating the random variable.
We will now see, with the SciPy library, Python functions that help us compute the CDF and the PDF, and simulate numbers from the probability distribution of a random variable.
We first install SciPy:
!pip install scipy
SciPy is a free and open-source library for scientific computing built on top of NumPy. We have already seen that NumPy can draw samples from many common distributions (type help(np.random)
in the Python interpreter), but SciPy has the added capability of computing the probability of observing events, and it can perform computations directly on the probability mass/density functions.
import numpy as np
from scipy import stats
We define a Bernoulli random variable with parameter $p=0.3$
X= stats.bernoulli(.3)
The Probability function
$\mathbb{P}(X=1)$
X.pmf(1)
0.3
$\mathbb{P}(X=0)$
X.pmf(0)
0.7
For any $x\not \in \{0,1\}$, $\mathbb{P}(X=x)=0$
X.pmf(4)
0.0
X.pmf(-2)
0.0
The cumulative distribution function $F_X(x)=\mathbb{P}(X\leq x)$
X.cdf(-1)
0.0
X.cdf(0)
0.7
X.cdf(0.2)
0.7
X.cdf(1)
1.0
X.cdf(10)
1.0
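The values above trace out the piecewise-constant CDF of a Bernoulli random variable: $0$ for $t<0$, then $1-p$ on $[0,1)$, then $1$ for $t\geq 1$. As a sketch, we can write it down explicitly and compare it with SciPy:

```python
from scipy import stats

def bernoulli_cdf(t, p):
    # piecewise CDF: 0 for t < 0, 1 - p for 0 <= t < 1, 1 for t >= 1
    if t < 0:
        return 0.0
    if t < 1:
        return 1.0 - p
    return 1.0

X = stats.bernoulli(0.3)
```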
Let's draw the CDF of a Bernoulli random variable. We will first generate random numbers in $\mathbb{R}$.
import numpy as np
data = np.random.random(1000)
data[1:10]
array([0.62485778, 0.12170412, 0.25287054, 0.78623599, 0.34879936, 0.74056271, 0.42610447, 0.2546339 , 0.63981476])
data.min()
0.0005284937096231568
data.max()
0.9998493971559735
data = -3*(2*data - 1)   # map values from (0,1) to (-3,3)
data.min()
-2.999096382935841
data.max()
2.996829037742261
data[:8]
array([-2.05663816, -0.74914665, 2.26977529, 1.48277674, -1.71741591, 0.90720382, -1.44337625, 0.4433732 ])
We sort the data
x = np.sort(data)
x[:8]
array([-2.99909638, -2.99394417, -2.97917872, -2.97467941, -2.97431385, -2.97152315, -2.96997004, -2.95919471])
We then calculate the CDF values. We compute here $F_X(x)$ when $X$ follows a Bernoulli distribution with parameter $p=0.4$
import scipy
y = scipy.stats.bernoulli.cdf(x,.4)
import matplotlib.pyplot as plt
plt.step(x, y,'r*', where='post')
plt.xlabel('x')
Text(0.5, 0, 'x')
The mean of $X$, $\mathbb{E}(X)=p$
X.mean()
0.3
The variance of $X$, $\mbox{Var}(X)=p(1-p)$
X.var()
0.21
We can also generate random numbers according to the Bernoulli distribution
X.rvs()
1
X.rvs(20)
array([1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
z=X.rvs(100)
z.mean()
0.37
z.var()
0.23309999999999995
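By the law of large numbers, these empirical estimates approach $p$ and $p(1-p)$ as the sample size grows. A sketch with a larger sample (the seed is arbitrary, chosen only for reproducibility):

```python
from scipy import stats

p = 0.3
z = stats.bernoulli(p).rvs(100_000, random_state=0)
phat = z.mean()   # empirical estimate of p = 0.3
vhat = z.var()    # empirical estimate of p * (1 - p) = 0.21
```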
A random variable $X$ with Poisson distribution, with parameter $\lambda>0$, is a discrete random variable with values in $\mathbb{N}$, with probability function: for all $k\in \mathbb{N}$, $$p(k)=\mathbb{P}(X=k)=e^{-\lambda} \displaystyle\frac{\lambda^k}{k!}$$
$\mathbb{E}(X)=\lambda$ and $\mbox{Var}(X)=\lambda$
from scipy import stats
We consider $X$ a random variable with Poisson distribution with parameter $\lambda=1.4$
X= stats.poisson(1.4)
X.pmf(3)
0.11277701150929471
X.pmf(4)
0.039471954028253146
Let's draw the probability function of a Poisson random variable
x=np.arange(0,20)
y = scipy.stats.poisson.pmf(x,1.4)
y
array([2.46596964e-01, 3.45235750e-01, 2.41665025e-01, 1.12777012e-01, 3.94719540e-02, 1.10521471e-02, 2.57883433e-03, 5.15766866e-04, 9.02592015e-05, 1.40403202e-05, 1.96564483e-06, 2.50172979e-07, 2.91868475e-08, 3.14319896e-09, 3.14319896e-10, 2.93365237e-11, 2.56694582e-12, 2.11395538e-13, 1.64418752e-14, 1.21150659e-15])
plt.bar(x, y)
plt.xticks(x)
plt.show()
X.mean()
1.4
X.var()
1.4
Cumulative distribution function
X.cdf(2)
0.8334977381226298
X.cdf(5)
0.9967988507880886
Let's draw the CDF then
data = np.random.random(1000)
data=7*data
x=np.sort(data)
y = scipy.stats.poisson.cdf(x,1.4)
plt.step(x, y,'r*', where='post')
plt.xlabel('x')
Text(0.5, 0, 'x')
Generate Poisson random numbers
X.rvs()
5
X.rvs(20)
array([0, 0, 2, 1, 2, 2, 3, 0, 0, 2, 4, 2, 1, 1, 1, 3, 0, 1, 2, 2])
z=X.rvs(100)
z.mean()
1.47
z.var()
1.5090999999999999
Let's estimate the probability $\mathbb{P}(X<3)$ and compare it with the theoretical value. We will call this an empirical estimate of $\mathbb{P}(X<3)$
z<3
array([False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, True, True, False, True, False, True, False, True, True, True, True, False, True, True, True, True, True, True, True, True, False, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, False, True, True, True, True, True])
zz=z<3
zz.sum()
83
phat=zz.sum()/len(zz)
phat
0.83
The theoretical value is
X.cdf(2.999)
0.8334977381226298
We will now show that the empirical estimate approximates the theoretical probability with good accuracy as the number of generated samples increases
x=np.arange(1,1001)
z=X.rvs(1000)
zz=z<3
phat=zz.cumsum()/x
plt.plot(x, phat)
plt.axhline(y=X.cdf(2.999), color='red', linestyle='--')
plt.xlabel('Number of generations')
Exercise
The following table provides the Python command corresponding to each of these discrete probability distributions
Name | Python command | parameters |
---|---|---|
Geometric $p$ | stats.geom(p) | $p$ is the probability |
Binomial $(n,p)$ | stats.binom(n, p) | $n$ is the size and $p$ is the probability |
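As a starting point for the exercise, here is a minimal sketch of how these commands are used (the parameter values are illustrative):

```python
from scipy import stats

# Geometric(p): number of trials until the first success
G = stats.geom(0.25)
pg = G.pmf(1)            # P(G = 1) = p = 0.25

# Binomial(n, p): number of successes among n independent trials
B = stats.binom(10, 0.5)
pb = B.pmf(5)            # P(B = 5) = C(10, 5) / 2^10
```

Both objects also provide mean(), var(), cdf() and rvs(), exactly as for the Bernoulli and Poisson variables above.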
A random variable $X$ has a Normal distribution if its probability density function is defined as follows: $$f(x)=\displaystyle\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2}\left(\displaystyle\frac{x-\mu}{\sigma}\right)^2\right]$$
where $\mu\in \mathbb{R}$ and $\sigma>0$.
$X$ is a continuous random variable with values in $\mathbb{R}$. We write $X\sim \mathcal{N}(\mu,\sigma^2)$.
$\mathbb{E}(X)=\mu$ and $\mbox{Var}(X)=\sigma^2$
$\mathbb{P}\left(\mu-1.96\sigma \leq X\leq \mu+1.96\sigma\right)=.95$
$Z=\displaystyle\frac{X-\mu}{\sigma}$ follows the Normal distribution $\mathcal{N}(0,1)$.
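A minimal numerical check of this standardization property, with illustrative parameters $\mu=1.4$ and $\sigma=2$: $\mathbb{P}(X\leq t)$ should equal $\mathbb{P}(Z\leq (t-\mu)/\sigma)$ for the standard normal $Z$.

```python
from scipy import stats

mu, sigma = 1.4, 2.0
X = stats.norm(loc=mu, scale=sigma)
Z = stats.norm(0, 1)

t = 3.0
lhs = X.cdf(t)                   # P(X <= t)
rhs = Z.cdf((t - mu) / sigma)    # P(Z <= (t - mu)/sigma)
```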
We consider a Normal random variable with mean $\mu=1.4$ and variance $\sigma^2=2^2=4$
import math
from scipy import stats
X= stats.norm(loc=1.4,scale=2)
X.cdf(3)
0.7881446014166034
X.mean()
1.4
X.var()
4.0
X.pdf(3)
0.14484577638074136
X.pdf(1.4)
0.19947114020071635
1/math.sqrt(2*math.pi*4)
0.19947114020071635
X.cdf(-10)
5.99037140106353e-09
X.cdf(0)
0.24196365222307303
X.cdf(1.4)
0.5
X.cdf(1.4-1.96*2).round(3)
0.025
X.cdf(1.4+1.96*2).round(3)
0.975
X.rvs()
-1.4972065404834658
data = X.rvs(1000)
import numpy as np
data = np.random.normal(size=1000,loc=0,scale=2)
x = np.sort(data)
x[0:8]
array([-7.00527363, -6.40776959, -5.53323088, -5.22003149, -5.07977012, -4.95514965, -4.87125955, -4.83750203])
import scipy
y = scipy.stats.norm.cdf(x,1.4,2)
y[0:8]
array([1.31911847e-05, 4.73304266e-05, 2.63527680e-04, 4.66453618e-04, 5.97889450e-04, 7.42561820e-04, 8.57427691e-04, 9.08096829e-04])
import matplotlib.pyplot as plt
plt.step(x, y)
plt.xlabel('x')
plt.ylabel('CDF of X')
Text(0, 0.5, 'CDF of X')
A random variable $X$ has an Exponential distribution if its probability density function is defined as follows: $$f(x)=(1/\lambda)\, \exp\left(-x/\lambda \right),$$ if $x\geq 0$, and $f(x)=0$ if $x< 0$, where $\lambda>0$ is called the scale parameter.
$X$ is a continuous random variable with values in $\mathbb{R}^*_+$. We write $X\sim \mathcal{E}(\lambda)$.
$\mathbb{E}(X)=\lambda$ and $\mbox{Var}(X)=\lambda^2$.
import math
from scipy import stats
X= stats.expon(scale=2.2)
X.mean()
2.2
X.var()
4.840000000000001
2.2**2
4.840000000000001
X.pdf(2)
0.1831319643314241
X.pdf(-1)
0.0
The cumulative distribution function of an exponential random variable is expressed as following: $$F_X(x)=1-e^{-x/\lambda}$$ when $x>0$ and $F_X(x)=0$, when $x\leq 0$.
X.cdf(3)
0.7442708400868994
1-math.exp(-3/2.2)
0.7442708400868994
math.exp(0)
1.0
X.pdf(0)
0.45454545454545453
1/2.2
0.45454545454545453
X.pdf(3)
0.11624052723322756
(1/2.2)*math.exp(-3/2.2)
0.11624052723322756
X.rvs()
2.2496880342664096
Let's draw the curve of the CDF of $X$
import numpy as np
We simulate 1000 random numbers from a Normal distribution with mean $\mu=0$ and standard deviation $\sigma=20$.
data = np.random.normal(size=1000,loc=0,scale=20)
x=np.sort(data)
x.min()
-56.402034243025184
x.max()
58.70478274607458
We compute the CDF of each element in x
y = scipy.stats.expon.cdf(x,scale=2.2)
import matplotlib.pyplot as plt
plt.step(x, y)
plt.xlabel('x')
plt.ylabel('CDF of X')
Text(0, 0.5, 'CDF of X')