# Practical Tips for Improving Exploratory Data Analysis | by Radmila M. | Aug, 2023

I learnt how to apply this tip while working on a research paper on wind energy analysis and prediction [1]. During the EDA for this project, I needed a summary matrix that would reflect all the relationships between the wind parameters, in order to find which of them influence each other most strongly. The first idea that came to my mind was to build a ‘good old’ correlation matrix of the kind I had seen in many Data Science / Data Analysis projects.

As you know, a correlation matrix is used to quantify and summarize linear relationships between variables. In the following code snippet, the `corrcoef` function was used on the feature columns of Wind Power Generated Data. Here I also applied the `heatmap` function from Seaborn to plot the correlation matrix array as a heat map:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
correlation_matrix = np.corrcoef(data[cols].values.T)
hm = sns.heatmap(correlation_matrix,
                 cbar=True, annot=True, square=True, fmt='.3f',
                 annot_kws={'size': 15},
                 cmap='Blues',
                 yticklabels=['P', 'Ws', 'Power_curve', 'Wa'],
                 xticklabels=['P', 'Ws', 'Power_curve', 'Wa'])

# save the figure
plt.savefig('image.png', dpi=600, bbox_inches='tight')
plt.show()
```

Looking at the resulting heat map, we can conclude that wind speed and active power are strongly correlated. Still, I think many people will agree that this kind of visualization is not easy to interpret, because all it shows is numbers.
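For reference, every cell of such a matrix is a Pearson correlation coefficient. The sketch below computes it by hand in plain Python; the helper name `pearson_r` and the toy values are made up for illustration and are not taken from `T1.csv`.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up toy values (assumption), only to show the calculation
wind_speed = [3.0, 5.0, 7.0, 9.0, 11.0]
power = [50.0, 400.0, 1100.0, 2200.0, 3300.0]
direction = [200.0, 40.0, 310.0, 90.0, 180.0]

r_speed_power = pearson_r(wind_speed, power)
r_speed_direction = pearson_r(wind_speed, direction)
print(round(r_speed_power, 3), round(r_speed_direction, 3))
```

A value close to 1 or -1 signals a strong linear relationship, while a value near 0 signals a weak one, which is exactly what the annotated heat map cells report.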

A good alternative to the correlation matrix would be the scatterplot matrix, which allows you to visualize pairwise correlations between different features of a data set in one place. In this case, `sns.pairplot` should be used:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
sns.pairplot(data[cols], height=2.5)
plt.tight_layout()

# save the figure
plt.savefig('image2.png', dpi=600, bbox_inches='tight')
plt.show()
```

By looking at the scatterplot matrix, one can quickly eyeball how the data is distributed and whether it contains outliers. However, the main drawback of this kind of chart is duplication: because every pair of features is plotted, each relationship appears twice, once on either side of the diagonal.
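To put a number on that duplication, here is a small sketch that counts the off-diagonal panels of the full matrix against the pairs that actually carry distinct information. It uses only the standard library and the same four column names as above.

```python
from itertools import combinations, permutations

cols = ['P', 'Ws', 'Power_curve', 'Wa']

# Every off-diagonal panel of the full scatterplot matrix (ordered pairs)
all_panels = list(permutations(cols, 2))

# Pairs that carry distinct information (unordered pairs)
unique_pairs = list(combinations(cols, 2))

print(len(all_panels), len(unique_pairs))  # 12 6
```

Half of the off-diagonal panels are mirror images of the other half, which is the space a combined chart can reuse for something else.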

In the end, I decided to combine the above graphs into one, where the lower left part will contain scatter plots of the selected parameters, and the upper right part will contain bubbles of different sizes and colours: larger circles mean that the studied parameters have a stronger linear correlation. The diagonal of the matrix will display the distribution of each feature: a narrow peak here would indicate that this particular parameter does not change too much, while other features change.

The code for building this summary matrix is given below. Here the map consists of three parts — `fig.map_lower`, `fig.map_diag`, `fig.map_upper`:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# build the matrix
def correlation_dots(*args, **kwargs):
    # args[0] and args[1] are the two columns of the current panel
    corr_r = args[0].corr(args[1], 'pearson')
    ax = plt.gca()
    ax.set_axis_off()
    # the stronger the linear correlation, the larger the bubble
    marker_size = abs(corr_r) * 3000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.5,
               cmap='Blues', vmin=-1, vmax=1,
               transform=ax.transAxes)

sns.set(style='white', font_scale=1.6)
fig = sns.PairGrid(data[cols], aspect=1.4, diag_sharey=False)
fig.map_lower(sns.regplot)         # lower left: scatter plots with a fitted line
fig.map_diag(sns.histplot)         # diagonal: distribution of each feature
fig.map_upper(correlation_dots)    # upper right: correlation bubbles

# save the figure
plt.savefig('image3.jpg', dpi=600, bbox_inches='tight')
plt.show()
```

The summary matrix combines the advantages of the two previously studied diagrams — its lower (left) part imitates the scatterplot matrix, and its upper (right) fragment graphically reflects the numerical results of the correlation matrix.

From time to time I have to present the results of EDA to colleagues and clients, so visualization is a key assistant for me in this task. I always try to add various elements to the diagrams, such as arrows and notes, to make them even more attractive and readable.

Let’s go back to the EDA implementation case for the wind project discussed above. When it comes to wind energy, one of the most important parameters is the power curve. The power curve of a wind turbine (or an entire wind farm) is a graph showing the amount of electricity generated at various wind speeds. It is important to note that turbines do not operate at low wind speeds. Their start-up is associated with a cut-in speed, which is usually in the range of 2.5–5 m/s. At speeds between 12 and 15 m/s, the nominal power is reached. Finally, each turbine has an upper limit on the wind speed at which it can safely operate. Once this limit, the cut-out speed, is reached, the turbine stops producing electricity until the wind speed drops back into the operating range.
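The three regions described above can be sketched as a simple piecewise function. This is only an idealized illustration: the function name, the default cut-in, rated and cut-out speeds, the rated power, and the cubic ramp between cut-in and rated speed are all assumptions chosen for the example, not values from the dataset or from a manufacturer.

```python
def idealized_power(wind_speed, cut_in=3.0, rated_speed=12.5,
                    cut_out=25.0, rated_power=3600.0):
    """Idealized turbine output (kW) for a wind speed in m/s.

    All default parameters are illustrative assumptions.
    """
    # Below cut-in or at/above cut-out the turbine produces nothing
    if wind_speed < cut_in or wind_speed >= cut_out:
        return 0.0
    # Between rated and cut-out speed the output stays at nominal power
    if wind_speed >= rated_speed:
        return rated_power
    # Between cut-in and rated speed: simplified cubic ramp (assumption)
    fraction = (wind_speed ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)
    return rated_power * fraction


for ws in (2.0, 8.0, 15.0, 26.0):
    print(ws, round(idealized_power(ws), 1))
```

Real power curves are smoother near the rated speed, so a sketch like this is useful mainly for reasoning about where the cut-in, nominal and cut-out regions begin and end.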

The studied dataset includes both the theoretical power curve (which is a typical curve from the manufacturer without any outliers) and the actual curve obtained if we plot wind power versus speed. The latter usually contains many points outside the ideal theoretical shape which might be caused by turbine failure, incorrect SCADA measurements, or unscheduled maintenance.
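One possible way to flag such points programmatically is to compare each measurement with the theoretical value at the same time step. The sketch below is an assumption-laden illustration: the helper `flag_outliers`, the 30% tolerance, the 100 kW floor and the sample values are all hypothetical, and a real analysis would need thresholds tuned to the turbine.

```python
def flag_outliers(actual, theoretical, tolerance=0.3, min_power=100.0):
    """Return indices where actual power deviates strongly from theory.

    tolerance and min_power are illustrative assumptions.
    """
    flagged = []
    for i, (p, p_theo) in enumerate(zip(actual, theoretical)):
        # Skip near-zero theoretical output, where relative error is unstable
        if p_theo < min_power:
            continue
        # Flag points whose absolute deviation exceeds the relative tolerance
        if abs(p - p_theo) > tolerance * p_theo:
            flagged.append(i)
    return flagged


# Made-up sample values in kW (assumption)
actual = [0.0, 480.0, 1500.0, 1805.0, 3590.0]
theoretical = [0.0, 500.0, 1550.0, 3600.0, 3600.0]

print(flag_outliers(actual, theoretical))  # [3]
```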

Now we will create an image that displays both types of power curve, first without any additional items except a legend:

```python
import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)

# build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')
plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# save the figure
plt.savefig('image4.png', dpi=600, bbox_inches='tight')
plt.show()
```

As you can see, the graph needs an explanation, since it does not contain any additional details.

But what if we add lines to highlight the three main areas of the graph with cut-in, nominal and cut-out speeds marked, as well as a note with arrow to show one of the outliers?

Let’s check what the graph will look like in this case:

```python
import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)

# build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')

# add vertical lines, text notes and arrow
plt.vlines(x=3.05, ymin=10, ymax=350, lw=3, color='black')
plt.text(1.1, 355, r"cut-in", fontsize=15)
plt.vlines(x=12.5, ymin=3000, ymax=3500, lw=3, color='black')
plt.text(13.5, 2850, r"nominal", fontsize=15)
plt.vlines(x=24.5, ymin=3080, ymax=3550, lw=3, color='black')
plt.text(21.5, 2900, r"cut-out", fontsize=15)
plt.annotate('outlier!', xy=(18.4, 1805), xytext=(21.5, 2050),
             arrowprops={'color': 'red'})
plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# save the figure
plt.savefig('image4_2.png', dpi=600, bbox_inches='tight')
plt.show()
```

When analysing wind data, we often want to have comprehensive information about the potential of wind energy. Therefore, in addition to the dynamics of wind energy, it is necessary to have a graph showing how the wind speed depends on the wind direction.

To illustrate the changes in wind power, the following code can be used:

```python
import pandas as pd
import matplotlib.pyplot as plt

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)

# resample 10-min data into hourly time measurements
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
fig = plt.figure(figsize=(10, 8))
group_data = (data.set_index('Date/Time')).resample('H')['P'].sum()

# plot wind power dynamics
group_data.plot(kind='line')
plt.ylabel('Power')
plt.xlabel('Date/Time')
plt.title('Power generation (resampled to 1 hour)')

# save the figure
plt.savefig('wind_power.png', dpi=600, bbox_inches='tight')
plt.show()
```

Below is the resulting plot:

As one might notice, the profile of wind power dynamics has quite a complex, irregular shape.
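To make the resampling step less of a black box, here is a plain-Python sketch of what summing 10-minute readings into hourly bins amounts to. The timestamps and power values are synthetic (an assumption for illustration); each reading is assigned to the hour in which it starts, which mirrors how an hourly `resample(...).sum()` labels its bins by default.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Synthetic 10-minute readings covering two hours (made-up values)
start = datetime(2018, 1, 1, 0, 0)
readings = [(start + timedelta(minutes=10 * i), 100.0 + i) for i in range(12)]

# Sum the readings that fall into the same hour
hourly = defaultdict(float)
for timestamp, power in readings:
    hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
    hourly[hour_start] += power

for hour_start, total in sorted(hourly.items()):
    print(hour_start, total)
```

Note that with six readings per hour, the hourly sum is six times the mean power of that hour, so the shape of the curve is what matters here rather than the absolute values.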

A windrose, or polar rose plot, is a special diagram for representing the distribution of meteorological data, typically wind speeds by direction [3]. There is a simple module, `windrose`, for the `matplotlib` library that makes it easy to build this sort of visualization, e.g.:

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from windrose import WindroseAxes

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)
wd = data['Wa']
ws = data['Ws']

# plot normalized wind rose in a form of a stacked histogram
ax = WindroseAxes.from_ax()
ax.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax.set_legend()

# save the figure
plt.savefig('windrose.png', dpi=600, bbox_inches='tight')
plt.show()
```

Looking at the wind rose map, one can notice that there are two main wind directions — north-east and south-west.
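A quick numerical cross-check of the dominant directions is to bin the wind direction values into compass sectors and count them. The sketch below uses eight 45° sectors and assumes the usual meteorological convention (0° is north, angles increase clockwise); the helper name `sector` and the sample directions are made up for illustration.

```python
from collections import Counter

SECTORS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']


def sector(direction_deg):
    """Map a direction in degrees to one of eight 45-degree compass sectors."""
    # Shift by half a sector so that 'N' covers 337.5..22.5 degrees
    index = int(((direction_deg % 360) + 22.5) // 45) % 8
    return SECTORS[index]


# Made-up sample directions in degrees (assumption)
directions = [40.0, 50.0, 220.0, 230.0, 60.0, 200.0, 355.0, 45.0]

counts = Counter(sector(d) for d in directions)
print(counts.most_common(2))  # [('NE', 4), ('SW', 2)]
```

Applied to the real wind direction column, the same counting would show whether the two lobes seen on the wind rose really dominate the data.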

But how can we merge these two images into a single one? The most obvious option is to use `add_subplot`. However, because of the peculiarities of the `windrose` library, this is not a straightforward task:

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# importing windrose makes the 'windrose' projection available to matplotlib
from windrose import WindroseAxes

# read data
data = pd.read_csv('T1.csv')
print(data)

# rename columns to make their titles shorter
data.rename(columns={'LV ActivePower (kW)': 'P',
                     'Wind Speed (m/s)': 'Ws',
                     'Theoretical_Power_Curve (KWh)': 'Power_curve',
                     'Wind Direction (°)': 'Wa'},
            inplace=True)
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
fig = plt.figure(figsize=(10, 8))

# plot both plots as subplots
ax1 = fig.add_subplot(211)
group_data = (data.set_index('Date/Time')).resample('H')['P'].sum()
group_data.plot(kind='line', ax=ax1)
ax1.set_ylabel('Power')
ax1.set_xlabel('Date/Time')
ax1.set_title('Power generation (resampled to 1 hour)')

ax2 = fig.add_subplot(212, projection='windrose')
wd = data['Wa']
ws = data['Ws']
ax2.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax2.set_legend()

# save the figure
plt.savefig('image5.png', dpi=600, bbox_inches='tight')
plt.show()
```

In this case, the result looks like this:

The major downside here is that the two subplots differ in size, and because of that we have a lot of white empty space around the windrose chart.

To make things easier, I recommend taking a different approach: using the `Python Imaging Library` (PIL) [4], which needs only a handful of lines of code:

```python
import numpy as np
from PIL import Image

# list images that need to be merged
list_im = ['wind_power.png', 'windrose.png']
imgs = [Image.open(i) for i in list_im]

# resize all images to match the smallest
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

# for a vertical stacking - we use vstack
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = Image.fromarray(imgs_comb)

# save the figure
imgs_comb.save('image5_2.png', dpi=(600, 600))
```

Here the output looks a bit prettier because the two images have the same size: the code picks the smallest one and rescales the others to match it.

By the way, `PIL` also supports horizontal stacking. For instance, let’s compare and contrast the ‘silent’ and ‘talkative’ power curve charts:

```python
import numpy as np
from PIL import Image

list_im = ['image4.png', 'image4_2.png']
imgs = [Image.open(i) for i in list_im]

# pick the image which is the smallest, and resize the others to match it
# (can be arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

# for a horizontal stacking - we use hstack
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# save that beautiful picture
imgs_comb = Image.fromarray(imgs_comb)
imgs_comb.save('image4_merged.png', dpi=(600, 600))
```
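Both stacking snippets resize every image to the exact shape of the smallest one, which can stretch an image whose aspect ratio differs from it. One possible variation for vertical stacking, sketched here under that assumption, is to scale all images to a common width while keeping each aspect ratio. The helper `sizes_for_vertical_stack` and the sample sizes are hypothetical; each resulting `(width, height)` tuple could then be passed to `Image.resize` in place of `min_shape`.

```python
def sizes_for_vertical_stack(sizes):
    """Scale (width, height) pairs to a common width, keeping aspect ratios."""
    # The narrowest image defines the shared width
    target_width = min(width for width, _ in sizes)
    return [(target_width, round(height * target_width / width))
            for width, height in sizes]


# Made-up pixel sizes of two saved figures (assumption)
sizes = [(6000, 4800), (4800, 4800)]

print(sizes_for_vertical_stack(sizes))  # [(4800, 3840), (4800, 4800)]
```

Equal widths are all that vertical stacking requires, so the heights are free to differ; for horizontal stacking the same idea applies with widths and heights swapped.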

In this post I shared three tips on how to make the EDA process easier. I hope you found this advice useful and will start applying it to your own data tasks, too.

These tips perfectly match the formula that I always try to apply while doing the EDA: customize → itemize → optimize.

Well, you may ask, why on earth does this matter? I can say that actually it matters, because:

• It is very important to customize your charts to the particular needs that you face right now. For instance, instead of creating tons of infographics, think about how you can combine several of them into just one, as we did when creating the summary matrix, which combines the strengths of both scatterplot and correlation charts.
• All of your charts should speak for themselves. Thus, you need to know how to itemize the important details on a chart to make it informative and readable. Compare how big the difference is between the ‘silent’ and the ‘talkative’ power curves.
• And finally, every data specialist should learn how to optimize the EDA process to make things faster (and life easier). If you have to merge two images into one, you do not necessarily have to use the `add_subplot` option every time.

What else? I can definitely say that the EDA is a very creative and interesting step in working with data (not to mention that it is also super important).

Let your infographics shine like diamonds, and don’t forget to enjoy the process!

1. Paper “Data-driven applications for wind energy analysis and prediction: The case of “La Haute Borne” wind farm”. https://doi.org/10.1016/j.dche.2022.100048