How to Iterate Over a Pandas Dataframe | by Marcello Politi | May, 2023



Row-major vs Column-major, Pandas best practices

If you have some experience in data science, you have surely developed algorithms on tabular data. Common challenges of this kind include Titanic - Machine Learning from Disaster and Boston Housing.

Data represented in tabular form (such as CSV files) can be stored in row-major or column-major format. In computing, row-major order and column-major order are methods for storing multidimensional arrays in linear storage such as random access memory. Depending on the paradigm a format was designed around, there are best practices to follow to optimize read and write times. Unfortunately, data scientists very often use libraries such as pandas in the wrong way and waste valuable time.

Row-major format means that consecutive rows of a table are stored consecutively in memory. So if I am reading row i, accessing row i+1 will be a very fast operation.

Formats that follow the column-major paradigm, such as Parquet, instead store the values of each column consecutively.

In machine learning, the rows are often the data samples and the columns are the features. So we will use a CSV file if we need to access samples quickly, and Parquet if we often need to access features (e.g. to calculate statistics).
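The two layouts can be made concrete with a tiny array. The sketch below is only an illustration: it uses numpy (introduced later in this article) and a made-up 2x3 table, and flattens it in both orders to show which values end up next to each other.

```python
import numpy as np

# A made-up table: 2 rows (samples) x 3 columns (features).
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Row-major ("C" order): the values of each row are adjacent.
row_major = table.ravel(order="C").tolist()

# Column-major ("F" order): the values of each column are adjacent.
col_major = table.ravel(order="F").tolist()

print(row_major)  # [1, 2, 3, 4, 5, 6]
print(col_major)  # [1, 4, 2, 5, 3, 6]
```

In the row-major list, a whole sample is read in one contiguous run; in the column-major list, a whole feature is.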

Figure: row-major and column-major order. Source: https://en.wikipedia.org/wiki/Row-_and_column-major_order#/media/File:Row_and_column_major_order.svg

Pandas

Pandas is a library widely used in data science, especially when dealing with tabular data. It is built around the DataFrame, a tabular representation of data. The DataFrame, however, follows the column-major paradigm.

So iterating over a DataFrame row by row, as is often done, is very slow. Let’s look at an example: we import the BostonHousing dataset as a DataFrame and iterate over it.

import pandas as pd
import time
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
df.head()

In this first experiment, we iterate through the columns of the DataFrame (df.columns) and then access all the elements in each column, and calculate the time it takes to finish the process.

# iterating df by column
start = time.time()
for col in df.columns:
    for item in df[col]:
        pass
print(time.time() - start, "seconds")

#OUTPUT: 0.0021004676818847656 seconds

In this second experiment, we instead iterate over the rows of the DataFrame with the df.iloc indexer, which returns the contents of an entire row.

# iterating df by row
n_rows = len(df)
start = time.time()
for i in range(n_rows):
    for item in df.iloc[i]:
        pass
print(time.time() - start, "seconds")

#OUTPUT : 0.059470415115356445 seconds

As you can see, the second experiment takes far longer than the first: about 0.059 seconds versus 0.002 seconds, roughly 28 times slower. Our dataset here is very small, but if you try with your own larger working dataset, you will notice this difference becoming more and more pronounced.
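One likely contributor to this gap, which the experiment above does not measure directly, is that a column already exists as a homogeneous Series, whereas every df.iloc[i] call has to assemble a new Series out of values taken from different columns. A minimal sketch, using a small hypothetical frame (the column names and values are made up):

```python
import pandas as pd

# Hypothetical frame with one integer, one float and one string column.
toy = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28], "name": ["A", "B"]})

# A column is homogeneous, so it keeps its own dtype.
print(toy["age"].dtype, toy["fare"].dtype)

# A row mixes types, so pandas builds a new Series with a common dtype.
row = toy.iloc[0]
print(type(row).__name__, row.dtype)  # Series object
```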

Numpy

Fortunately, the numpy library comes to our rescue. With numpy we can specify the memory order we want to use; by default, row-major order is used.

So what we can do is convert the pandas DataFrame to a numpy array and iterate over the latter row by row. Let’s look at some experiments.

We first convert the DataFrame to a numpy array.
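As a small sketch of how the order can be chosen and inspected in numpy (the array shape here is arbitrary and not part of the experiments in this article):

```python
import numpy as np

a_c = np.zeros((3, 4))             # default: row-major ("C" order)
a_f = np.zeros((3, 4), order="F")  # explicit column-major ("F" order)

# strides = bytes to step to reach the next row / the next column.
print(a_c.flags["C_CONTIGUOUS"], a_c.strides)  # True (32, 8)
print(a_f.flags["F_CONTIGUOUS"], a_f.strides)  # True (8, 24)
```

With 8-byte floats, the row-major array moves 8 bytes to reach the next value in the same row, while the column-major array moves 8 bytes to reach the next value in the same column.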

df_np = df.to_numpy()
n_rows, n_cols = df_np.shape
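One thing worth checking after the conversion is the dtype of the result: to_numpy returns a single array with one common dtype for all columns. The sketch below uses two small hypothetical frames (not the BostonHousing data) to show the difference between an all-numeric table and one with a text column.

```python
import pandas as pd

# Hypothetical frames; column names and values are made up.
numeric = pd.DataFrame({"rooms": [6.5, 6.4], "age": [65, 78]})
mixed = pd.DataFrame({"rooms": [6.5, 6.4], "town": ["a", "b"]})

print(numeric.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)    # object
```

An object array holds generic Python objects, so the conversion is most useful when the columns are all numeric.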

Now let’s iterate the data by column, and calculate the time.

# iterating numpy by columns
start = time.time()
for j in range(n_cols):
    for item in df_np[:, j]:
        pass
print(time.time() - start, "seconds")

#OUTPUT : 0.002185821533203125 seconds

Now the same thing, iterating by rows.

# iterating numpy by row
start = time.time()
for i in range(n_rows):
    for item in df_np[i]:
        pass
print(time.time() - start, "seconds")

#OUTPUT : 0.0023500919342041016 seconds

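We see that with numpy, row-wise iteration is about 25 times faster than with df.iloc (0.0024 seconds versus 0.059 seconds), while column-wise iteration takes about the same time as in pandas. Moreover, the difference between the two numpy experiments is minimal.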
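To repeat the four experiments without downloading anything, here is a minimal sketch on synthetic data. The helper name time_loop, the table size and the random values are all assumptions made for illustration, and the timings will vary from machine to machine.

```python
import time
import numpy as np
import pandas as pd

def time_loop(chunks):
    """Return the seconds needed to visit every item of every row/column."""
    start = time.perf_counter()
    for chunk in chunks:
        for item in chunk:
            pass
    return time.perf_counter() - start

# Synthetic stand-in for a larger dataset (the size is arbitrary).
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((2_000, 20)))
big_np = big.to_numpy()

# Generators are consumed inside time_loop, so access time is included.
timings = {
    "pandas by column": time_loop(big[c] for c in big.columns),
    "pandas by row": time_loop(big.iloc[i] for i in range(len(big))),
    "numpy by column": time_loop(big_np[:, j] for j in range(big_np.shape[1])),
    "numpy by row": time_loop(big_np[i] for i in range(big_np.shape[0])),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} seconds")
```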

In this article, we introduced the difference between the row-major and column-major paradigms when dealing with tabular data, and pointed out a common mistake made by many data scientists using pandas: iterating over a DataFrame row by row. In absolute terms, the time difference here is small because we used a small dataset. But be careful: the bigger the dataset, the bigger this difference becomes, and you might lose a lot of time just reading the data. As a solution, when you need to iterate row by row, try converting the DataFrame to a numpy array first.




