To compare the performance of the aforementioned libraries on DataFrame operations, let us assume the following: all eligible merchandise of Store1 will be discounted by 20%, all eligible merchandise of Store2 will be discounted by 30%, and the discounted prices will be saved in a new DataFrame. We use the word “eligible” because, as discussed above, items with a Discountability of 0 cannot be discounted. In short, we will compare execution times for applying a function to the rows of a DataFrame, which is a prevalent task.
Code implementations are shown in sections B.1-B.8, and the performance comparison is shown in section B.9.
B.1 Pandas with pyarrow
In our first experiment for DataFrame operations, we will harness the capabilities of Apache Arrow, given its recent interoperability with Pandas 2.0. As shown in the first line of the code below, we convert a Pandas DataFrame to a pyarrow Table, which is an efficient way to represent columnar data in memory. Each column is stored separately, which allows for efficient compression and data queries. Then, the Store1, Store2, and Discountability columns are passed to the function scale_columns(), which scales the columns by the appropriate discount (0.2 or 0.3) and the mask value (0 or 1) of the Discountability column. The scaled columns are returned by the function as a tuple. Finally, the Table result_table is converted to a Pandas DataFrame.
Key point: Multiplications inside the function scale_columns() are implemented using the pyarrow.compute function multiply(), and every multiplication has to go through it. For example, if we replaced pc.multiply(0.2, mask_col) with the plain expression 0.2 * mask_col, we would get an error, because pyarrow arrays do not support Python’s arithmetic operators. Finally, we use the subtract() function of pyarrow.compute for subtraction.
B.2 With the Pandas method apply()
Our second experiment will use the Pandas DataFrame apply() method that performs row-wise operations. The function scale_columns() performs the scaling in the code below. Note that the actual scaling is done inside the nested function discount_store().
Key point: The nested function returns a Pandas Series built from a dictionary whose keys are the column names and whose values are the scaled store prices. You may wonder why we return this type of object. The reason is that the Pandas apply() method itself assembles a result indexed by the column names, so handing it an object with the same structure facilitates the computation. For educational purposes, the GitHub directory of my code (link at the end of the article) provides two implementations using apply(): the one shown here, and a second one where the nested function returns the scaled values as a tuple. The second implementation is more computationally intensive because the returned type has a different structure from what apply() has to assemble.
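A sketch of the apply() version (sample values are hypothetical; scale_columns() and the nested discount_store() follow the text):

```python
import pandas as pd

# Sample data (hypothetical values for illustration).
df = pd.DataFrame({
    "Store1": [100.0, 200.0, 300.0],
    "Store2": [80.0, 150.0, 250.0],
    "Discountability": [1, 0, 1],
})

def scale_columns(frame):
    def discount_store(row):
        mask = row["Discountability"]
        # Return a Series keyed by column name -- the same structure
        # that apply() assembles its result from.
        return pd.Series({
            "Store1": row["Store1"] - 0.2 * mask * row["Store1"],
            "Store2": row["Store2"] - 0.3 * mask * row["Store2"],
        })
    return frame.apply(discount_store, axis=1)

result_df = scale_columns(df)
```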
B.3 With Pandas itertuples()
itertuples() is a fast Pandas row iterator that yields one namedtuple per row, with the column names as fields and the row’s values as the corresponding entries. The code that scales the store prices and computes the discounted prices using itertuples() is shown below.
Key point: The Pandas itertuples() method is often faster than Pandas apply(), particularly for larger datasets such as ours. The reason is that apply() invokes a Python function and constructs a Series object for every row, while itertuples() simply yields lightweight namedtuples, avoiding that per-row overhead and staying memory-efficient.
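An itertuples() sketch under the same assumptions (hypothetical sample values; column names from the text):

```python
import pandas as pd

# Sample data (hypothetical values for illustration).
df = pd.DataFrame({
    "Store1": [100.0, 200.0, 300.0],
    "Store2": [80.0, 150.0, 250.0],
    "Discountability": [1, 0, 1],
})

store1, store2 = [], []
# Each row arrives as a lightweight namedtuple with the columns as fields.
for row in df.itertuples(index=False):
    store1.append(row.Store1 - 0.2 * row.Discountability * row.Store1)
    store2.append(row.Store2 - 0.3 * row.Discountability * row.Store2)

result_df = pd.DataFrame({"Store1": store1, "Store2": store2})
```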
B.4 Pandas Vectorized Operations
And now we have come to the most elegant and fastest way to compute the discounted store prices: vectorization! Vectorized operations apply to entire arrays at once instead of iterating over rows with loops. The implementation is shown below.
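A vectorized sketch (sample values are hypothetical; column arithmetic operates on whole Series at once):

```python
import pandas as pd

# Sample data (hypothetical values for illustration).
df = pd.DataFrame({
    "Store1": [100.0, 200.0, 300.0],
    "Store2": [80.0, 150.0, 250.0],
    "Discountability": [1, 0, 1],
})

mask = df["Discountability"]
# Whole-column arithmetic: no Python-level loop over rows.
result_df = pd.DataFrame({
    "Store1": df["Store1"] * (1 - 0.2 * mask),
    "Store2": df["Store2"] * (1 - 0.3 * mask),
})
```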
B.5 With Numba
Numba is a just-in-time (JIT) compiler for Python that, at runtime, translates Python code into optimized machine code. It is instrumental in optimizing operations involving loops and NumPy arrays.
The @numba.njit line shown below is a decorator that tells Numba to compile the function that follows into machine code.
B.6 With Dask
Similar to NumPy, Dask offers vectorized operations, with the additional advantage that these operations can be applied in a parallel and distributed manner.
Additional advantages of Dask include: (a) lazy evaluation, i.e., Dask arrays and operations are built into a task graph and executed only when the result is requested, for example by calling compute(); (b) out-of-core processing, i.e., it can process datasets that do not fit into memory; (c) integration with Pandas.
B.7 With Polars
Polars offers vectorized operations that leverage the speed and memory efficiency of the Rust language. Similar to Dask, it offers lazy evaluation, out-of-core processing, and integration with Pandas. Implementation is shown below.
B.8 With RAPIDS.ai cuDF
Finally, we use cuDF to implement the price-discounting function via vectorized operations. The cuDF library is built on top of CUDA, so its vectorized operations leverage the computational strength of GPUs for faster processing. One handy feature of cuDF is its support for CUDA kernels: optimized functions that harness the parallel nature of GPUs. The cuDF implementation is shown below.
B.9 Execution Time Comparison of Function Application
As with the timing of DataFrame creation, timeit was used to measure the execution time of applying the price-discounting function and saving the result in a new DataFrame. Below are the execution times, ranked from fastest to slowest.
- With RAPIDS.ai, cuDF: 0.008894977000011295
- With Polars: 0.009557300014421344
- With pyarrow Table: 0.028865800006315112
- With Pandas 2.0 vectorized operations: 0.0536642000079155
- With Dask: 0.6413181999814697
- With Numba: 0.9497981000000095
- With Pandas 2.0 itertuples(): 1.0882243000087328
- With Pandas 2.0 apply(): 197.63155489997007
Not surprisingly, the GPU-enabled RAPIDS.ai cuDF achieves the fastest time, followed very closely by the lightning-fast Polars library. The execution times of the pyarrow Table and Pandas 2.0 vectorized operations follow closely too: cuDF is only about 4.6 times faster than these two on average. Dask, Numba, and the Pandas method itertuples() show inferior but reasonable performance (approximately 72, 107, and 122 times slower than cuDF, respectively). Finally, Pandas apply() performs outstandingly worse than all the other methods, which is also not surprising given that it works row by row, a relatively slow approach for large datasets.
Key points: (a) Generally speaking, apply() is an excellent way to implement functions on Pandas DataFrames for small to medium-sized datasets, but when dealing with large datasets, as in our case, it is best to look at other approaches. (b) If you decide to use apply() in Pandas 2.0, make sure it is applied to a DataFrame created the standard way and not with pyarrow in the back end; otherwise, a very significant datatype-conversion overhead will slow your computations considerably.