The Pandas Library


Description from the Pandas documentation:

Here are just a few of the things that Pandas does well:

Series and DataFrames

First install Pandas from the command prompt, then import it into Python:

pip install pandas

The Pandas Series

The Series data structure in Pandas is a one-dimensional labeled array.

Creating a Pandas Series:

From a list

A Series should contain values of a homogeneous type
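As a minimal sketch of building a Series from a list (the values and labels below are invented for illustration), the following shows the default integer index, a custom index, and what happens when the types are mixed:

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
s = pd.Series([10, 20, 30, 40])  # default index 0, 1, 2, 3

# The same values with explicit labels instead of the default index.
s_labeled = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Mixed Python types are accepted, but the Series falls back to the
# generic object dtype, which is why homogeneous types are preferred.
mixed = pd.Series([1, "two", 3.0])

print(s)
print(s_labeled["b"])
print(mixed.dtype)
```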

From a dictionary

We create a Series whose index labels are the dictionary keys
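A short sketch of the dictionary case, assuming a small invented dictionary of names and ages:

```python
import pandas as pd

# Hypothetical data: the keys become the index, the values become the data.
ages = {"Alice": 34, "Bob": 45, "Chloe": 29}
s = pd.Series(ages)

print(s)
print(s["Bob"])  # label-based access using a dictionary key
```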

From a NumPy array

I'm using linspace to create an array of evenly spaced numbers over a specified interval: 15 numbers between 0 and 10

The array must be one-dimensional
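The NumPy case can be sketched as follows. The 2-D array at the end is a made-up example, flattened with ravel() before conversion because a Series only accepts one-dimensional data:

```python
import numpy as np
import pandas as pd

# 15 evenly spaced numbers between 0 and 10 (both endpoints included).
arr = np.linspace(0, 10, 15)
s = pd.Series(arr)
print(s)

# A hypothetical 2-D array must be flattened to one dimension first.
matrix = np.arange(6).reshape(2, 3)
flat = pd.Series(matrix.ravel())
print(flat)
```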

Pandas DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. You can create a DataFrame from:
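To illustrate the "dict of Series" view described above, here is a minimal sketch that builds a DataFrame from two Series sharing the same index. The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical columns; each Series becomes one column of the DataFrame.
index = [1, 2, 4]
data = {
    "Age": pd.Series([41, 49, 37], index=index),
    "Department": pd.Series(
        ["Sales", "Research & Development", "Sales"], index=index
    ),
}
df = pd.DataFrame(data)

print(df)
print(df.dtypes)  # columns can hold different types
```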

Reading the data.

Sample data: HR Employee Attrition and Performance. You can get it from here and add it to your working directory:

Importing the xlsx file, using the variable EmployeeNumber as the index

pd.read_excel(io="path to your excel data file", index_col="name of the column containing the row numbers or indexes")

Types of the variables

A preview of the data (the first 3 rows)

Name of the columns in the imported data.

The preview of the variable Attrition
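The inspection steps above can be sketched on a small stand-in DataFrame. The rows below are invented and only mimic the shape of the HR data; with the real file, df would come from pd.read_excel instead:

```python
import pandas as pd

# Hypothetical stand-in for the imported HR data (values are invented).
df = pd.DataFrame(
    {
        "EmployeeNumber": [1, 2, 4, 5, 7],
        "Age": [41, 49, 37, 33, 27],
        "Attrition": ["Yes", "No", "Yes", "No", "No"],
        "Department": [
            "Sales",
            "Research & Development",
            "Research & Development",
            "Research & Development",
            "Sales",
        ],
    }
).set_index("EmployeeNumber")

print(df.dtypes)               # type of each variable
print(df.head(3))              # preview: the first 3 rows
print(df.columns)              # names of the columns
print(df["Attrition"].head())  # preview of a single variable
```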

Data Manipulation

Selecting some variables from the original data and displaying a preview.

Creating a new variable: converting Age from years to months.

Deleting the newly created variable

Extracting some observations from one specific variable
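A minimal sketch of these three steps, using invented rows and a hypothetical column name AgeInMonths for the new variable:

```python
import pandas as pd

# Hypothetical rows standing in for the HR data.
df = pd.DataFrame(
    {
        "Age": [41, 49, 37],
        "Department": ["Sales", "Research & Development", "Sales"],
        "HourlyRate": [94, 61, 92],
    }
)

# Select some variables by passing a list of column names.
subset = df[["Age", "Department"]]
print(subset.head())

# Create a new variable: Age in months (AgeInMonths is an assumed name).
df["AgeInMonths"] = df["Age"] * 12
months = df["AgeInMonths"].tolist()

# Delete the newly created variable.
df = df.drop(columns="AgeInMonths")
print(df.columns)
```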

Extracting some rows from the whole dataframe

Selecting specific rows by the index variable EmployeeNumber

What's the YearsAtCompany of the row with EmployeeNumber equal to 94?
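The row selections above can be sketched with iloc (position-based) and loc (label-based). The rows, including the values for EmployeeNumber 94, are invented for illustration:

```python
import pandas as pd

# Hypothetical rows indexed by EmployeeNumber (values are invented).
df = pd.DataFrame(
    {
        "EmployeeNumber": [1, 2, 4, 94, 100],
        "Age": [41, 49, 37, 29, 36],
        "YearsAtCompany": [6, 10, 0, 5, 7],
    }
).set_index("EmployeeNumber")

first_ages = df["Age"].iloc[:3]          # first three observations of one variable
some_rows = df.iloc[1:4]                 # rows at positions 1, 2 and 3
by_label = df.loc[[1, 4, 100]]           # rows whose EmployeeNumber is 1, 4 or 100
years_94 = df.loc[94, "YearsAtCompany"]  # one cell: row label 94, one column

print(first_ages)
print(some_rows)
print(by_label)
print(years_94)
```

Note that loc uses the index labels (here the employee numbers), while iloc uses integer positions, so df.loc[94] and df.iloc[94] are not interchangeable.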

Frequency of the variable Department

A barplot of the variable Department

Creating a pie chart

Frequency of the variable Attrition

Frequency in percentage
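Frequency tables like the ones above are typically obtained with value_counts. A minimal sketch on invented data, where normalize=True returns proportions that are then scaled to percentages:

```python
import pandas as pd

# Hypothetical data (invented values).
df = pd.DataFrame(
    {
        "Department": [
            "Research & Development", "Sales", "Research & Development",
            "Human Resources", "Research & Development", "Sales",
            "Research & Development", "Sales",
        ],
        "Attrition": ["No", "Yes", "No", "No", "No", "Yes", "No", "No"],
    }
)

dept_counts = df["Department"].value_counts()       # absolute frequencies
attrition_counts = df["Attrition"].value_counts()
attrition_pct = df["Attrition"].value_counts(normalize=True) * 100

print(dept_counts)
print(attrition_counts)
print(attrition_pct)
```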

Compute the average of the variable HourlyRate

What's the overall satisfaction of the employees?

Let us change the levels of the satisfaction variable by first creating a dictionary

Computing percentages

Sorting by frequencies (it's the default option)

Turning off the default sorting option, so that the bars are ordered by category instead

Selecting observations of specific interest: those with either "Low" or "Very High" job satisfaction
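The recoding and sorting steps can be sketched as below. The column name JobSatisfaction, the codes 1 to 4, and the sample values are assumptions made for illustration; the labels follow the categories named in this section:

```python
import pandas as pd

# Hypothetical satisfaction codes (invented values).
df = pd.DataFrame({"JobSatisfaction": [4, 4, 4, 4, 3, 3, 3, 2, 2, 1]})

# Dictionary mapping the numeric codes to readable levels (assumed coding).
levels = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}
order = ["Low", "Medium", "High", "Very High"]

# Recode, then store as an ordered categorical so the levels keep their order.
df["JobSatisfaction"] = pd.Categorical(
    df["JobSatisfaction"].map(levels), categories=order, ordered=True
)

# Percentages, sorted by frequency (the default behaviour).
by_freq = df["JobSatisfaction"].value_counts(normalize=True) * 100

# Percentages with sorting turned off: rows follow the category order.
by_category = df["JobSatisfaction"].value_counts(normalize=True, sort=False) * 100

print(by_freq)
print(by_category)
```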

Let's then remove the categories or levels that we won't use

The categories 'Medium' and 'High' won't be displayed
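A minimal sketch of the filtering step, again assuming a categorical column named JobSatisfaction with invented values. Filtering rows does not remove the unused levels by itself, so they are dropped explicitly:

```python
import pandas as pd

order = ["Low", "Medium", "High", "Very High"]

# Hypothetical data (invented values).
df = pd.DataFrame(
    {
        "Age": [30, 40, 35, 50, 20, 65, 45],
        "JobSatisfaction": pd.Categorical(
            ["Low", "Low", "Medium", "Very High", "Very High", "Very High", "High"],
            categories=order,
            ordered=True,
        ),
    }
)

# Keep only the observations with "Low" or "Very High" satisfaction.
subset = df[df["JobSatisfaction"].isin(["Low", "Very High"])].copy()

# Drop the levels that no longer occur ("Medium" and "High").
subset["JobSatisfaction"] = subset["JobSatisfaction"].cat.remove_unused_categories()

print(subset["JobSatisfaction"].value_counts(sort=False))
```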

The Low satisfaction group

and the Very High satisfaction group

The average of the Age of each group
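The two groups and their average ages can be sketched either with boolean masks or with a single groupby. The data below is invented and JobSatisfaction is an assumed column name:

```python
import pandas as pd

# Hypothetical data restricted to the two groups of interest.
df = pd.DataFrame(
    {
        "Age": [30, 40, 50, 20, 65],
        "JobSatisfaction": ["Low", "Low", "Very High", "Very High", "Very High"],
    }
)

# One DataFrame per group, using boolean masks.
low = df[df["JobSatisfaction"] == "Low"]
very_high = df[df["JobSatisfaction"] == "Very High"]
print(low["Age"].mean(), very_high["Age"].mean())

# The same averages in one step with groupby.
mean_age = df.groupby("JobSatisfaction")["Age"].mean()
print(mean_age)
```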

Comparing densities

By Department

We can normalize it

We can compare it with the whole sample
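One plausible reading of the department comparison is a cross-tabulation of satisfaction group by Department, normalized within each group and set against the department shares in the whole sample. The sketch below assumes that reading and uses invented data:

```python
import pandas as pd

# Hypothetical whole sample (invented values).
df = pd.DataFrame(
    {
        "Department": [
            "Sales", "Sales", "Research & Development", "Research & Development",
            "Research & Development", "Human Resources", "Sales",
            "Research & Development",
        ],
        "JobSatisfaction": [
            "Low", "Very High", "Low", "Very High",
            "Very High", "Medium", "High", "High",
        ],
    }
)

subset = df[df["JobSatisfaction"].isin(["Low", "Very High"])]

# Counts of each Department within each satisfaction group.
counts = pd.crosstab(subset["JobSatisfaction"], subset["Department"])

# Normalized by row: each group's shares sum to 1.
shares = pd.crosstab(
    subset["JobSatisfaction"], subset["Department"], normalize="index"
)

# Department shares in the whole sample, for comparison.
overall = df["Department"].value_counts(normalize=True)

print(counts)
print(shares)
print(overall)
```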