How to Use pandas DataFrames in Python to Analyze and Manipulate Data

If you want to analyze data in Python, you will want to familiarize yourself with pandas, because it makes data analysis much easier. The DataFrame is the main data structure you will interact with. Here’s how to use it.
What is pandas?
Pandas is a Python library that is popular in data science and data analysis. It organizes data into DataFrames and offers many operations you can perform on that data. It was initially developed at AQR Capital Management and was open-sourced in the late 2000s.
To install pandas from PyPI:
pip install pandas
It is best to work with pandas in a Jupyter notebook or another interactive Python session. IPython is fine for quick data exploration in the terminal, but Jupyter saves a record of your calculations, which is useful when you return to a data set days or weeks later and struggle to remember what you did. I created my own notebook of code examples that you can examine on my GitHub page. That is where the screenshots come from.
What is a DataFrame?
A DataFrame is the main data structure you work with in pandas. Like a spreadsheet or a relational database table, it organizes data into rows and columns. Each column is labeled with a header name. The concept is similar to data frames in R, another programming language popular in statistics and data science. DataFrame columns may contain both text and numeric data, including integers and floating-point numbers. Columns can also hold time series data.
How to create a dataframe
Assuming you have already installed pandas, you can create a small DataFrame from scratch.
I will create columns representing a linear function that could be used for regression analysis later. First, I will create the x axis, or independent variable, from a NumPy array:
import numpy as np
x = np.linspace(-10,10)
Then I will create the y column, or dependent variable, as a simple linear function:
y = 2*x + 5
I will now import pandas and create the DataFrame.
import pandas as pd
As with NumPy, shortening the pandas name makes it easier to type.
The pandas DataFrame constructor takes a dictionary mapping column names to lists of the actual data. I will create a DataFrame called df with columns named x and y. The data will be the NumPy arrays I created earlier.
df = pd.DataFrame({'x':x,'y':y})
Importing a DataFrame
Although it is possible to create DataFrames from scratch, it is more common to import data from another source. Since DataFrames are tabular, spreadsheets are a popular source. The top row of the spreadsheet becomes the column names.
To read in an Excel spreadsheet, use the read_excel method:
df = pd.read_excel('/path/to/spreadsheet.xls')
Being an open source fan, I tend to gravitate toward LibreOffice Calc rather than Excel, but pandas can import other file types as well. The .csv format is widely used, and I can export my data in that format.
df = pd.read_csv('/path/to/data.csv')
A handy feature is the ability to read from the clipboard. It is ideal for smaller data sets when I want to perform calculations more advanced than a spreadsheet allows:
df = pd.read_clipboard()
Examining a DataFrame
Now that you’ve created a DataFrame, the next step is to examine the data.
One way to do so is to get the first five rows of the DataFrame with the head method:
df.head()
If you have used the head command on Linux or other Unix-like systems, this is similar. And if you know the tail command, pandas has a similar method that gets the last rows of a DataFrame:
df.tail()
You can use slicing to display a specific subset of rows. As with Python lists, the endpoint is excluded. To see rows 1 and 2:
df[1:3]
As with the head command in Linux, you can display an exact number of rows with a numeric argument. To see the first 10 rows:
df.head(10)
The tail method works the same way.
df.tail(10)
More interesting is examining existing data sets. A popular one for demonstrations is the Titanic passenger data set, available on Kaggle. Many other statistical libraries, such as Seaborn, let you load example data sets so that you do not have to download them. Pandas DataFrames are also commonly used to feed data into these libraries, for example to create a plot or calculate a linear regression.
Once you have downloaded the data, you will need to import it:
titanic = pd.read_csv('data/Titanic-Dataset.csv')
Let’s look at the head again:
titanic.head()
We can also see all the columns with the columns attribute:
titanic.columns
Pandas offers many methods for getting information about a data set. The describe method produces descriptive statistics for all numeric columns in the DataFrame.
titanic.describe()
The first statistic is the count, followed by the mean, or average. Then comes the standard deviation, a measure of how tightly or widely the data is spread around the mean. Then come the minimum value, the lower quartile or 25th percentile, the median or 50th percentile, the upper quartile or 75th percentile, and the maximum value. The last five values constitute legendary statistician John Tukey’s “five-number summary.” You can quickly see how your data is distributed using these numbers.
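To see where those numbers come from, you can compute the same quantiles yourself with the quantile method. A minimal sketch, reusing the small df built earlier:

```python
import numpy as np
import pandas as pd

x = np.linspace(-10, 10)
df = pd.DataFrame({'x': x, 'y': 2 * x + 5})

# The quartiles reported by describe() are just quantiles of the column
summary = df['y'].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# The mean and standard deviation are available directly
print(df['y'].mean(), df['y'].std())
```

Because y is linear in x, the median and mean both come out to 5 here.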
To access a column by itself, follow the DataFrame’s name with the column name in square brackets ([]).
For example, to display the column with the names of the passengers:
titanic['Name']
Because the list is so long, it will be truncated by default. To see the full list of names, use the to_string method:
titanic['Name'].to_string()
You can also turn off truncation entirely. To turn it off for DataFrames with a large number of rows:
pd.set_option('display.max_rows', None)
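Note that this option affects every DataFrame you print afterward, so it is worth knowing how to undo it. A small sketch:

```python
import pandas as pd

# Show every row (no truncation) -- affects all subsequent output
pd.set_option('display.max_rows', None)

# ... inspect your data ...

# Restore the default truncation when you're done
pd.reset_option('display.max_rows')
```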
You can also use other methods when selecting a column. To see descriptive statistics for a single column:
titanic['Age'].describe()
You can also access individual statistics:
titanic['Age'].mean()
titanic['Age'].median()
Adding and deleting columns
Not only can you examine columns, you can also add new ones. You can add a column and fill it with values, as you would with a Python list, but you can also transform existing data and store the result in a new column.
Let’s go back to the original DataFrame we created, df. We can perform operations on each element of a column. For example, to square the x column:
df['x']**2
We can create a new column with these values:
df['x2'] = df['x']**2
To delete a column, you can use the drop method:
df.drop('x2',axis=1)
The axis argument tells pandas to operate on columns instead of rows.
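One subtlety worth noting: drop returns a new DataFrame rather than modifying df in place, so assign the result back if you want the column gone. A minimal sketch, reusing the df from above:

```python
import numpy as np
import pandas as pd

x = np.linspace(-10, 10)
df = pd.DataFrame({'x': x, 'y': 2 * x + 5})
df['x2'] = df['x'] ** 2

# drop returns a copy; df itself is unchanged afterward
df.drop('x2', axis=1)
print('x2' in df.columns)   # still True

# Assign the result back to actually remove the column
df = df.drop('x2', axis=1)
print('x2' in df.columns)   # now False
```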
Perform operations on columns
As mentioned above, you can perform operations on columns, including mathematical and statistical operations.
We can add our x and y columns:
df['x'] + df['y']
You can select multiple columns with double brackets.
To see the names and ages of the Titanic passengers:
titanic[['Name','Age']]
The column names must be separated by a comma character (,).
You can also query DataFrames, much like SQL queries. To see the rows of passengers who were over 30 when they boarded the ill-fated liner, you can use a Boolean selection inside the brackets:
titanic[titanic['Age'] > 30]
It is like the SQL statement:
SELECT * FROM titanic WHERE Age > 30
You can also make the selection using .loc before the brackets:
titanic.loc[titanic['Age'] > 30]
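Boolean selections can also be combined with & (and) and | (or); each condition needs its own parentheses because of Python’s operator precedence. A small sketch on a toy DataFrame (the column names mirror the Titanic data set, but the rows here are invented):

```python
import pandas as pd

# Toy stand-in for the Titanic data -- names, ages, and sexes are made up
passengers = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dan'],
    'Age':  [42, 25, 35, 61],
    'Sex':  ['female', 'male', 'female', 'male'],
})

# Passengers over 30 AND female -- note the parentheses around each condition
over_30_women = passengers[(passengers['Age'] > 30) & (passengers['Sex'] == 'female')]
print(over_30_women['Name'].tolist())   # ['Alice', 'Carol']
```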
Let’s make a bar plot of where the Titanic’s passengers embarked. We can make our own subset of the DataFrame with the three boarding points: Southampton, England; Cherbourg, France; and Queenstown, Ireland (now Cobh).
embarked = titanic['Embarked'].value_counts()
This creates a new Series with the number of people who embarked at each port. But we have a problem: the labels are just single letters standing for the port names. Let’s replace them with the full port names. The rename method takes a dictionary mapping old names to new ones.
embarked = embarked.rename({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
With the labels renamed, we can make our bar chart. It’s easy with pandas:
embarked.plot(kind='bar')
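In a Jupyter notebook the chart appears inline; in a plain script you will need matplotlib to display or save it. A sketch, assuming matplotlib is installed (the counts below are illustrative, not taken from the data set):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Illustrative port counts standing in for the Titanic 'Embarked' tallies
embarked = pd.Series({'Southampton': 644, 'Cherbourg': 168, 'Queenstown': 77})

embarked.plot(kind='bar')
plt.tight_layout()
plt.savefig('embarked.png')    # or plt.show() in an interactive session
```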
This should help you get started exploring data sets with pandas. Pandas is one of the reasons Python has become so popular with statisticians, data scientists, and anyone else who needs to explore data.