# Beyond Numpy and Pandas: Unlocking the Potential of Lesser-Known Python Libraries

## Introducing Xarray

Xarray is a Python library that extends the features and functionalities of NumPy, giving us the possibility to work with labeled arrays and datasets.

As the project's website puts it:

Xarray makes working with labeled multi-dimensional arrays in Python simple, efficient, and fun!

And also:

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.

In other words, it extends the functionality of NumPy arrays by adding labels or coordinates to the array dimensions. These labels provide metadata and enable more advanced analysis and manipulation of multi-dimensional data.
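To make the three kinds of labels concrete, here is a minimal sketch of a `DataArray` that carries dimension names, coordinates, and attributes. The values and the attribute names (`units`, `long_name`) are made up for illustration; they are not required by Xarray.

```python
import numpy as np
import xarray as xr

# A tiny 2 x 3 array with illustrative (made-up) labels
da = xr.DataArray(
    np.zeros((2, 3)),
    dims=["latitude", "longitude"],                       # dimension names
    coords={"latitude": [10.0, 20.0],                     # coordinate values per dimension
            "longitude": [0.0, 5.0, 10.0]},
    attrs={"units": "degC", "long_name": "air temperature"},  # free-form metadata
)

print(da.dims)                       # names of the dimensions
print(da.coords["latitude"].values)  # coordinate values along 'latitude'
print(da.attrs["units"])             # metadata attached to the array
```

The underlying data is still a plain NumPy array (available as `da.values`); the labels sit on top of it.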

For example, in NumPy, arrays are accessed using integer-based indexing.

In Xarray, instead, each dimension can have a label associated with it, making it easier to understand and manipulate the data based on meaningful names.

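For example, instead of accessing data with `arr[0, 1, 2]`, we can use `arr.isel(x=0, y=1, z=2)` in Xarray, where `x`, `y`, and `z` are dimension names, or `arr.sel(x=0, y=1, z=2)` when 0, 1, and 2 are coordinate values along those dimensions rather than integer positions.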

This makes the code much more readable!
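As a small sketch of the difference between positional and label-based access, the snippet below builds a 3-D array and fetches the same element in three ways. The dimension names and coordinate values are hypothetical and chosen only for the example.

```python
import numpy as np
import xarray as xr

# A small 3-D array; the coordinate values are made up for illustration
arr = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=["x", "y", "z"],
    coords={"x": [10, 20], "y": [0.0, 0.5, 1.0], "z": ["a", "b", "c", "d"]},
)

by_position = arr[0, 1, 2]              # NumPy-style positional indexing
by_name = arr.isel(x=0, y=1, z=2)       # integer positions, given by dimension name
by_label = arr.sel(x=10, y=0.5, z="c")  # coordinate values instead of positions

print(by_position.item(), by_name.item(), by_label.item())  # prints: 6 6 6
```

With `isel()` and `sel()` the order of the arguments no longer matters, which is where most of the readability gain comes from.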

So, let’s see some features of Xarray.

## Some features of Xarray in action

As usual, to install it:

`$ pip install xarray`

FEATURE ONE: WORKING WITH LABELED COORDINATES

Suppose we want to create some data related to temperature and we want to label these with coordinates like latitude and longitude. We can do it like so:

```python
import xarray as xr
import numpy as np

# Create temperature data
temperature = np.random.rand(100, 100) * 20 + 10

# Create coordinate arrays for latitude and longitude
latitudes = np.linspace(-90, 90, 100)
longitudes = np.linspace(-180, 180, 100)

# Create an Xarray data array with labeled coordinates
da = xr.DataArray(
    temperature,
    dims=['latitude', 'longitude'],
    coords={'latitude': latitudes, 'longitude': longitudes}
)

# Access data using labeled coordinates
subset = da.sel(latitude=slice(-45, 45), longitude=slice(-90, 0))
```

And if we print the subset we get:

```python
# Print data
print(subset)

>>>
<xarray.DataArray (latitude: 50, longitude: 25)>
array([[13.45064786, 29.15218061, 14.77363206, ..., 12.00262833,
        16.42712411, 15.61353963],
       [23.47498117, 20.25554247, 14.44056286, ..., 19.04096482,
        15.60398491, 24.69535367],
       [25.48971105, 20.64944534, 21.2263141 , ..., 25.80933737,
        16.72629302, 29.48307134],
       ...,
       [10.19615833, 17.106716  , 10.79594252, ..., 29.6897709 ,
        20.68549602, 29.4015482 ],
       [26.54253304, 14.21939699, 11.085207  , ..., 15.56702191,
        19.64285595, 18.03809074],
       [26.50676351, 15.21217526, 23.63645069, ..., 17.22512125,
        13.96942377, 13.93766583]])
Coordinates:
  * latitude   (latitude) float64 -44.55 -42.73 -40.91 ... 40.91 42.73 44.55
  * longitude  (longitude) float64 -89.09 -85.45 -81.82 ... -9.091 -5.455 -1.818
```

So, let’s see the process step-by-step:

1. We’ve created the temperature values as a NumPy array.
2. We’ve defined the latitude and longitude values as NumPy arrays.
3. We’ve stored all the data in an Xarray array by calling the `DataArray()` constructor.
4. We’ve selected a subset of the latitudes and longitudes with the method `sel()`, which picks the values whose coordinates fall within the given slices (both endpoints included).

The result is also easily readable, so labeling is really helpful in a lot of cases.
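Label-based selection is not limited to slices. As a minimal sketch, assuming the same kind of grid as above, `sel()` can also look up the grid point closest to a requested coordinate via `method="nearest"`. The query point (45.0, 9.0) is an arbitrary choice for illustration.

```python
import numpy as np
import xarray as xr

# Same kind of grid as in the example above
latitudes = np.linspace(-90, 90, 100)
longitudes = np.linspace(-180, 180, 100)
temperature = np.random.rand(100, 100) * 20 + 10

da = xr.DataArray(
    temperature,
    dims=["latitude", "longitude"],
    coords={"latitude": latitudes, "longitude": longitudes},
)

# 45.0 and 9.0 are not exact grid values, so an exact lookup would raise a KeyError;
# method="nearest" returns the closest grid point instead
point = da.sel(latitude=45.0, longitude=9.0, method="nearest")

print(float(point.latitude), float(point.longitude))  # coordinates actually selected
print(float(point))                                   # temperature at that grid point
```

The result is a zero-dimensional `DataArray` that still remembers which latitude and longitude it came from.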

FEATURE TWO: HANDLING MISSING DATA

Suppose we’re collecting data related to temperatures during the year. We want to know if we have some null values in our array. Here’s how we can do so:

```python
import xarray as xr
import numpy as np
import pandas as pd

# Create temperature data with missing values
temperature = np.random.rand(365, 50, 50) * 20 + 10
temperature[0:10, :, :] = np.nan  # Set the first 10 days as missing values

# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)

# Create an Xarray data array with missing values
da = xr.DataArray(
    temperature,
    dims=['time', 'latitude', 'longitude'],
    coords={'time': times, 'latitude': latitudes, 'longitude': longitudes}
)

# Count the number of missing values along the time dimension
missing_count = da.isnull().sum(dim='time')

# Print missing values
print(missing_count)

>>>
<xarray.DataArray (latitude: 50, longitude: 50)>
array([[10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       ...,
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10],
       [10, 10, 10, ..., 10, 10, 10]])
Coordinates:
  * latitude   (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
  * longitude  (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
```

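And so we find that every grid point has 10 missing values along the time dimension, one for each of the first 10 days.

Also, if we take a closer look at the code, we can see that Xarray offers Pandas-like methods such as `isnull()` and `sum()`: chained as `isnull().sum(dim='time')`, as in this case, they count the missing values along the time dimension.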
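Counting missing values is usually only the first step. The sketch below, on a smaller array than the one above to keep it quick, shows a few other methods that Xarray provides for dealing with them: `count()`, `fillna()`, and `dropna()`. Filling with the per-grid-point mean is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small array (30 days on a 4 x 4 grid) whose first 10 days are missing
temperature = np.random.rand(30, 4, 4) * 20 + 10
temperature[0:10, :, :] = np.nan
times = pd.date_range("2023-01-01", periods=30, freq="D")

da = xr.DataArray(
    temperature,
    dims=["time", "latitude", "longitude"],
    coords={"time": times},
)

# Number of valid (non-missing) values per grid point
valid_count = da.count(dim="time")

# Replace missing values with each grid point's mean over time
# (mean() skips NaN by default for floating-point data)
filled = da.fillna(da.mean(dim="time"))

# Drop the days on which every value is missing
trimmed = da.dropna(dim="time", how="all")

print(int(valid_count.min()), trimmed.sizes["time"], int(filled.isnull().sum()))
```

Which of these is appropriate depends on the analysis: dropping keeps only observed data, while filling preserves the array's shape.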

FEATURE THREE: HANDLING AND ANALYZING MULTI-DIMENSIONAL DATA

Once our arrays are labeled, handling and analyzing multi-dimensional data becomes very tempting. So, why not try it?

For example, suppose we’re still collecting data related to temperatures at certain latitudes and longitudes.

We may want to calculate the mean, the max, and the min temperatures over time. We can do it like so:

```python
import xarray as xr
import numpy as np
import pandas as pd

# Create synthetic temperature data
temperature = np.random.rand(365, 50, 50) * 20 + 10

# Create time, latitude, and longitude coordinate arrays
times = pd.date_range('2023-01-01', periods=365, freq='D')
latitudes = np.linspace(-90, 90, 50)
longitudes = np.linspace(-180, 180, 50)

# Create an Xarray dataset
ds = xr.Dataset(
    {
        'temperature': (['time', 'latitude', 'longitude'], temperature),
    },
    coords={
        'time': times,
        'latitude': latitudes,
        'longitude': longitudes,
    }
)

# Perform statistical analysis on the temperature data
mean_temperature = ds['temperature'].mean(dim='time')
max_temperature = ds['temperature'].max(dim='time')
min_temperature = ds['temperature'].min(dim='time')

# Print values
print(f"mean temperature:\n {mean_temperature}\n")
print(f"max temperature:\n {max_temperature}\n")
print(f"min temperature:\n {min_temperature}\n")

>>>
mean temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[19.99931701, 20.36395016, 20.04110699, ..., 19.98811842,
        20.08895803, 19.86064693],
       [19.84016491, 19.87077812, 20.27445405, ..., 19.8071972 ,
        19.62665953, 19.58231185],
       [19.63911165, 19.62051976, 19.61247548, ..., 19.85043831,
        20.13086891, 19.80267099],
       ...,
       [20.18590514, 20.05931149, 20.17133483, ..., 20.52858247,
        19.83882433, 20.66808513],
       [19.56455575, 19.90091128, 20.32566232, ..., 19.88689221,
        19.78811145, 19.91205212],
       [19.82268297, 20.14242279, 19.60842148, ..., 19.68290006,
        20.00327294, 19.68955107]])
Coordinates:
  * latitude   (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
  * longitude  (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

max temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[29.98465531, 29.97609171, 29.96821276, ..., 29.86639343,
        29.95069558, 29.98807808],
       [29.91802049, 29.92870312, 29.87625447, ..., 29.92519055,
        29.9964299 , 29.99792388],
       [29.96647016, 29.7934891 , 29.89731136, ..., 29.99174546,
        29.97267052, 29.96058079],
       ...,
       [29.91699117, 29.98920555, 29.83798369, ..., 29.90271746,
        29.93747041, 29.97244906],
       [29.99171911, 29.99051943, 29.92706773, ..., 29.90578739,
        29.99433847, 29.94506567],
       [29.99438621, 29.98798699, 29.97664488, ..., 29.98669576,
        29.91296382, 29.93100249]])
Coordinates:
  * latitude   (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
  * longitude  (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0

min temperature:
<xarray.DataArray 'temperature' (latitude: 50, longitude: 50)>
array([[10.0326431 , 10.07666029, 10.02795524, ..., 10.17215336,
        10.00264909, 10.05387097],
       [10.00355858, 10.00610942, 10.02567816, ..., 10.29100316,
        10.00861792, 10.16955806],
       [10.01636216, 10.02856619, 10.00389027, ..., 10.0929342 ,
        10.01504103, 10.06219179],
       ...,
       [10.00477003, 10.0303088 , 10.04494723, ..., 10.05720692,
        10.122994  , 10.04947012],
       [10.00422182, 10.0211205 , 10.00183528, ..., 10.03818058,
        10.02632697, 10.06722953],
       [10.10994581, 10.12445222, 10.03002468, ..., 10.06937041,
        10.04924046, 10.00645499]])
Coordinates:
  * latitude   (latitude) float64 -90.0 -86.33 -82.65 ... 82.65 86.33 90.0
  * longitude  (longitude) float64 -180.0 -172.7 -165.3 ... 165.3 172.7 180.0
```

And we obtained what we wanted, also in a clearly readable way.

And again, as before, to calculate the max, min, and mean values of temperatures we’ve used Xarray’s own reduction methods, which mirror their NumPy and Pandas counterparts but take a dimension name instead of an axis number.
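Because the time axis is labeled with real dates, the same idea extends to grouped statistics. The sketch below, on a smaller 5 x 5 grid chosen only to keep the example light, computes a monthly mean with `groupby()`, a median over time, and a spatial mean for each day.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of daily data on a small (illustrative) 5 x 5 grid
times = pd.date_range("2023-01-01", periods=365, freq="D")
temperature = np.random.rand(365, 5, 5) * 20 + 10

ds = xr.Dataset(
    {"temperature": (["time", "latitude", "longitude"], temperature)},
    coords={
        "time": times,
        "latitude": np.linspace(-90, 90, 5),
        "longitude": np.linspace(-180, 180, 5),
    },
)

# Mean temperature per calendar month at each grid point
monthly_mean = ds["temperature"].groupby("time.month").mean(dim="time")

# Median over the whole year at each grid point
median_temperature = ds["temperature"].median(dim="time")

# Mean over the whole grid for each day (reducing two dimensions at once)
daily_grid_mean = ds["temperature"].mean(dim=["latitude", "longitude"])

print(monthly_mean.dims, median_temperature.dims, daily_grid_mean.dims)
```

Here `"time.month"` refers to the month component of the `time` coordinate, so the result has a new `month` dimension with 12 entries in place of `time`.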