How to Use pandas DataFrames in Python to Analyze and Manipulate Data


If you want to analyze data in Python, you'll want to become familiar with pandas, because it makes data analysis much easier. The DataFrame is the main data format you'll interact with. Here's how to use it.

What is pandas?

The official pandas website.

Pandas is a Python module that is popular in data science and data analysis. It provides a way to organize data in DataFrames and offers many operations that you can perform on that data. It was initially developed at AQR Capital Management and was open-sourced in the late 2000s.

To install pandas from PyPI:

        pip install pandas

It is best to work with pandas in a Jupyter notebook or another interactive Python session. IPython is ideal for casual data exploration in the terminal, but Jupyter saves a record of your calculations, which is useful when you return to a data set days or weeks later and struggle to remember what you did. I created my own notebook of code examples that you can examine on my GitHub page. This is where the screenshots come from.

What is a DataFrame?

A DataFrame is the main data structure you work with in pandas. Like a spreadsheet or a relational database table, it organizes data in rows and columns. Each column is labeled with a header name. The concept is similar to data frames in R, another popular programming language in statistics and data science. DataFrame columns can contain both text and numeric data, including integers and floating-point numbers. Columns can also contain time series data.
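To make the idea of mixed column types concrete, here is a minimal sketch. The names, ages, heights, and dates are made-up sample values, not data from this article.

```python
import pandas as pd

# Hypothetical sample data: one text, one integer, one float, and one date column.
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "height_m": [1.65, 1.70, 1.78],
    "joined": pd.to_datetime(["2021-01-15", "2022-06-01", "2023-03-20"]),
})

# Each column keeps its own data type.
print(people.dtypes)
```

Printing dtypes is a quick way to confirm that pandas read each column as the type you expected.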

How to create a dataframe

Assuming that you have already installed pandas, you can create a small DataFrame from other objects.

I will create columns representing a linear function that could be used for regression analysis later. First, I will create the x axis, or the independent variable, from a NumPy array:

        import numpy as np
        x = np.linspace(-10, 10)

Then I will create the Y column or the dependent variable as a simple linear function:

        y = 2*x + 5
    

I will now import pandas and create the dataframe.

        import pandas as pd
    

As with NumPy, shortening the pandas name makes typing easier.

The pandas DataFrame constructor takes a dictionary that maps column names to the actual data. I will create a DataFrame called "df" with columns labeled "x" and "y". The data will be the NumPy arrays that I created earlier.

        
        df = pd.DataFrame({'x': x, 'y': y})
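Put together, the steps above can be run as one self-contained snippet. The shape check at the end is an addition for illustration; it relies on np.linspace producing 50 evenly spaced points when no count is given.

```python
import numpy as np
import pandas as pd

# Independent variable: 50 evenly spaced points from -10 to 10 (linspace's default count).
x = np.linspace(-10, 10)

# Dependent variable: a simple linear function of x.
y = 2 * x + 5

# Dictionary keys become column names; the arrays become the column data.
df = pd.DataFrame({"x": x, "y": y})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)
```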

Importing a DataFrame

Although it is possible to create a DataFrame from scratch, it is more common to import data from another source. Since the data is tabular, spreadsheets are a popular source. The top row of the spreadsheet becomes the column names.

To read in an Excel spreadsheet, use the read_excel function:

        
        df = pd.read_excel('/path/to/spreadsheet.xls')

Being an open source fan, I tend to gravitate toward LibreOffice Calc rather than Excel, but I can also import other types of files. The .csv format is widely used, and I can export my data in that format.

        
        df = pd.read_csv('/path/to/data.csv')

A handy feature is the ability to read from the clipboard. It is ideal for small data sets when I want more advanced calculations than I can get in a spreadsheet:

        
        df = pd.read_clipboard()
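The way the top row turns into column names can be checked without a file on disk. This is a minimal sketch that feeds read_csv an in-memory string; the city names and population figures are hypothetical sample values.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "city,population\nOslo,700000\nBergen,290000\n"

# read_csv accepts any file-like object, so StringIO works in place of a path.
cities = pd.read_csv(io.StringIO(csv_text))

print(cities.columns.tolist())  # the first row of the CSV became the column names
print(cities)
```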

Examining a DataFrame

Now that you’ve created a dataframe, the next step is to examine the data.

One way to do so is to get the first five rows of the DataFrame with the head method:

        df.head()
Pandas head of the "df" DataFrame, displaying the x and y columns.

If you have ever used the head command on Linux or other Unix-like systems, this is similar. If you know the tail command, there is a similar method in pandas that gets the last rows of a DataFrame:

        
        df.tail()
Pandas tail (last five rows) of the df DataFrame.

You can use array slicing to display a specific subset of rows. As with Python lists, the end of the slice is excluded, so this shows the rows at positions 1 and 2:

        df[1:3]
    
DataFrame array slice.
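Slice boundaries are easy to get wrong, so here is a small sketch on a hypothetical five-row DataFrame. It also uses iloc, the explicit position-based indexer, which is not used elsewhere in this article.

```python
import pandas as pd

# Hypothetical five-row DataFrame with the default 0..4 index.
letters = pd.DataFrame({"letter": list("abcde")})

# Row positions start at 0 and the end of the slice is excluded.
first_slice = letters[1:3]        # rows at positions 1 and 2
second_slice = letters.iloc[1:4]  # rows at positions 1, 2, and 3

print(first_slice)
print(second_slice)
```

To include the row at position 3, the slice has to end at 4.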

With the head command in Linux, you can display an exact number of lines by passing a numeric argument. You can do the same in pandas. To see the first 10 rows:

        df.head(10)
    
DataFrame head displaying the first 10 rows.

The tail method works the same way.

        df.tail(10)
    

It is more interesting to examine existing data sets. A popular one for demonstrations is the passenger data set from the Titanic. It is available on Kaggle. Many other statistical libraries, such as Seaborn and Pingouin, let you load example data sets so that you do not have to download them. Pandas DataFrames are also the main way to feed data into these libraries, for example to create a plot or calculate a linear regression.

With downloaded data, you will need to import it:

        titanic = pd.read_csv('data/Titanic-Dataset.csv')
    

Let's look at the head again:

        titanic.head()
    
Pandas Head of Titanic Passengers data set.

We can also see all the column names with the columns attribute:

        titanic.columns
    
Pandas columns of the Titanic passenger data set.

Pandas offers many methods for getting information about a data set. The describe method provides descriptive statistics for all the numeric columns in the DataFrame.

        titanic.describe()
    
Descriptive statistics of the Titanic data set.

After the count of non-missing values comes the mean, or average. Next is the standard deviation, which measures how tightly or widely the values are spread around the mean. Then come the minimum value, the lower quartile or 25th percentile, the median or 50th percentile, the upper quartile or 75th percentile, and the maximum value. These last five values make up the "five-number summary" of legendary statistician John Tukey. You can quickly see how your data is distributed using these numbers.
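To see where these numbers come from, the sketch below runs describe on a tiny hypothetical Series of ages (not the real Titanic values) that includes one missing entry. Missing values are left out of the count and the other statistics.

```python
import pandas as pd

# Hypothetical ages with one missing value; not taken from the Titanic data.
ages = pd.Series([22, 38, 26, 35, None, 54], name="Age")

summary = ages.describe()
print(summary)

# Individual entries can be looked up by label.
print(summary["count"], summary["mean"], summary["50%"])
```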

To access a column by itself, follow the name of the DataFrame with the name of the column in square brackets ([]).

For example, to display the column with the names of the passengers:

        titanic['Name']
    
Passenger names of the Titanic data set.

Because the list is so long, it will be truncated by default. To see the full list of names, use the to_string method.

        titanic['Name'].to_string()
    

You can also turn off truncation entirely. To turn it off for columns with a large number of rows:

        pd.set_option('display.max_rows', None)
    

You can also use other methods on a selected column. To see descriptive statistics for a single column:

        titanic['Age'].describe()
    
Pandas descriptive statistics of the age column of the Titanic passenger data set.

You can also compute individual statistics:

        titanic['Age'].mean()
        titanic['Age'].median()
Mean and median age of passengers in the Titanic data set.

Adding and deleting columns

Not only can you examine columns, but you can also add new ones. You can add a column and fill it with values, as you would with a Python list, but you can also transform existing data and store the result in a new column.

Let's go back to the original DataFrame that we created, df. We can perform operations on each element of a column. For example, to square the x column:

        df['x']**2
    
Pandas DataFrame x column squared.

We can create a new column with these values:

        df['x2'] = df['x']**2
    

To delete a column, you can use the drop method:

        df.drop('x2',axis=1)
    

The axis argument tells pandas to operate on columns instead of rows.
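One detail worth knowing is that drop returns a new DataFrame and leaves the original untouched unless you assign the result back. The following sketch uses a small stand-in for df with hypothetical values.

```python
import pandas as pd

# Small stand-in for the df used above (hypothetical values).
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
df["x2"] = df["x"] ** 2

# drop returns a new DataFrame; df itself still has the x2 column.
trimmed = df.drop("x2", axis=1)
print(df.columns.tolist())
print(trimmed.columns.tolist())

# Assign the result back to make the removal stick.
# columns="x2" is an equivalent spelling of ("x2", axis=1).
df = df.drop(columns="x2")
print(df.columns.tolist())
```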

Performing operations on columns

As mentioned above, you can carry out operations on columns. You can carry out mathematical and statistical operations on them.

We can add our X and Y columns:

        df['x'] + df['y']
    
Pandas df dataframe x column plus y column.

You can select multiple columns with double brackets.

To see the names and ages of the Titanic passengers:

        titanic[['Name','Age']]
Name and Age columns of the Titanic pandas DataFrame.

The column names must be separated by a comma (,).

You can also query DataFrames, much like SQL queries. To see the rows of passengers who were over 30 when they boarded the ill-fated liner, you can use a Boolean selection inside the brackets:

        titanic[titanic['Age'] > 30]
    
Pandas Titanic Dataframe showing rows of passengers over 30 years old.

It's like the SQL statement:

        SELECT * FROM titanic WHERE Age > 30
    

You can make the same row selection using .loc before the brackets:

        titanic.loc[titanic['Age'] > 30]
Pandas .loc selection of Titanic passengers over 30.
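It can help to look at the Boolean selection in two steps: the comparison produces a True/False value per row, and the brackets keep only the True rows. The sketch below uses a small hypothetical passenger table rather than the real data set. It also shows that .loc accepts a column label as a second argument.

```python
import pandas as pd

# Hypothetical passengers; names and ages are placeholders, not real Titanic rows.
passengers = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Age": [22.0, 38.0, None, 54.0],
})

# Step 1: the comparison yields one True/False per row (a missing age compares as False).
mask = passengers["Age"] > 30
print(mask)

# Step 2: the mask keeps only the rows where it is True.
over_30 = passengers[mask]
print(over_30)

# .loc can filter rows and pick a column in one step.
names_over_30 = passengers.loc[mask, "Name"]
print(names_over_30)
```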

Let's make a bar plot of where the Titanic's passengers embarked. We can make our own subset of the DataFrame with the three embarkation points: Southampton, England; Cherbourg, France; and Queenstown, Ireland (now Cobh).

        embarked = titanic['Embarked'].value_counts()
    

This will create a new Series with the number of people who embarked at each port. But we have a problem. The labels are just single letters standing for the port names. Let's replace them with the full port names. The rename method takes a dictionary that maps the old names to the new ones.

        embarked = embarked.rename({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
    

With the labels renamed, we can make our bar plot. It's easy with pandas:

        embarked.plot(kind='bar')
Bar plot of the ports where passengers embarked on the Titanic.
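The counting and renaming steps can be tried on a handful of rows before running them on the full data set. This is a minimal sketch with hypothetical port codes; the plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Hypothetical port codes; the real data set has one code per passenger.
embarked_codes = pd.Series(["S", "C", "S", "Q", "S", "C"], name="Embarked")

# value_counts returns a Series: index = distinct codes, values = how often each occurs,
# sorted from most to least frequent.
counts = embarked_codes.value_counts()

# rename with a dictionary relabels the index entries.
counts = counts.rename({"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"})
print(counts)

# counts.plot(kind="bar")  # would draw one bar per port (requires matplotlib)
```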


This should help you start exploring Pandas data sets. Pandas is one of the reasons why Python has become so popular with statisticians, data scientists and anyone who needs to explore data.
