Pandas 2.0: A Game-Changer for Data Scientists?


Being built on top of numpy made it hard for pandas to handle missing values in a hassle-free, flexible way, since numpy does not support null values for some data types.

For instance, integers are automatically converted to floats, which is not ideal:

Missing Values: Conversion to float. Snippet by Author.
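A minimal sketch reproducing the behavior the snippet shows (the points column and its values here are illustrative):

```python
import pandas as pd

# A small integer column: it stays int64 as long as there are no missing values
df = pd.DataFrame({"points": [12, 7, 19, 3]})
print(df["points"].dtype)   # int64

# Introducing a single None silently upcasts the whole column to float64
df_na = pd.DataFrame({"points": [12, 7, None, 3]})
print(df_na["points"].dtype)   # float64
```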

Note how points automatically changes from int64 to float64 after the introduction of a single None value.

There is nothing worse for a data flow than incorrect data types, especially within a data-centric AI paradigm.

Erroneous data types directly impact data preparation decisions, cause incompatibilities between different chunks of data, and even when they pass silently, they can compromise operations and produce nonsensical results.

As an example, at the Data-Centric AI Community, we’re currently working on a project around synthetic data for data privacy. One of the features, NOC (number of children), has missing values and is therefore automatically converted to float when the data is loaded. Then, when passing the data into a generative model as a float, we might get output values as decimals such as 2.5. Unless you’re a mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5 children is not OK.

In pandas 2.0, we can leverage dtype_backend='numpy_nullable', where missing values are accounted for without any dtype changes, so we can keep our original data types (int64 in this case):

Leveraging ‘numpy_nullable’, pandas 2.0 can handle missing values without changing the original data types. Snippet by Author.
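A minimal sketch of what the snippet shows, using an in-memory CSV with illustrative column names and values:

```python
import io

import pandas as pd

csv_data = "name,points\nAnna,12\nBruno,7\nCarla,\nDiego,3\n"  # Carla's points are missing

# Default behavior: the missing value forces points to float64
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default["points"].dtype)   # float64

# pandas 2.0 nullable backend: the gap is stored as <NA> and the dtype stays integer
df_nullable = pd.read_csv(io.StringIO(csv_data), dtype_backend="numpy_nullable")
print(df_nullable["points"].dtype)  # Int64
```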

It might seem like a subtle change, but under the hood it means that pandas can now natively use Arrow’s implementation for dealing with missing values. This makes operations much more efficient, since pandas doesn’t have to implement its own null-handling for each data type.
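If pyarrow is installed, the Arrow-backed dtypes can also be requested explicitly; a small sketch using the same illustrative CSV as above:

```python
import io

import pandas as pd  # pyarrow must be installed for the "pyarrow" backend

csv_data = "name,points\nAnna,12\nBruno,7\nCarla,\nDiego,3\n"

# Arrow-backed dtypes: missing values are handled by Arrow and the integer dtype is preserved
df_arrow = pd.read_csv(io.StringIO(csv_data), dtype_backend="pyarrow")
print(df_arrow["points"].dtype)  # int64[pyarrow]
```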


