Unleashing the Power of Prompt Engineering for Data Scientists


If one word could sum up the last 150 years, it would be “automation”. The world has moved from making things by hand to assembly lines and, while crafts are still highly valuable, “mass production” has become inseparable from “automation”.

The mechanization and automation of work keep increasing, and they have permeated many fields, not only those directly involved in producing goods, such as manufacturing or agriculture.

If we take a look at software, for example, the first thing we should see is automation. When I was learning Python about three years ago, a mentor reviewing my first project told me: “Federico, developing software means automating stuff!”. And, if you’re asking, then yes: my first project was a mess (as is everything we do for the first time!).

Anyway, here’s the truth: human beings have evolved with a clear goal: automating stuff. This can be related to automating boring stuff or “hard work”. It doesn’t matter; the point is the direction towards automation.

In this scenario, prompt engineering is just the latest tool that can help us automate stuff. On the code side, this means automating the automatable: software development is automation at its core, and using prompt engineering means pushing automation even further.

The truth is that software development has its boring tasks too, even when, for example, we’ve already written classes that we can import (but that still need to be modified a little).

Think of that for a moment: as a Data Scientist, how many prototypes per week do you develop? And how much time do you have to develop them?

You create a prototype and then what? The project specifications change, the customer changes their mind, your boss is not satisfied… well, you name it.

So, why should we spend a lot of effort on low-value but time-consuming tasks rather than automate them? This, to me, is the core idea of prompt engineering.

So, let’s see how prompt engineering can affect Data Scientists, and then let’s see how we can create useful and efficient prompts.

Every new technology has some pros and some cons, and so does prompt engineering. First, let’s see the pros and then the cons.

Prompt Engineering for Data Scientists: pros

  1. Faster learning. If you’re a beginner in Data Science (and in Software Development in general), you’ll find a tool like ChatGPT very beneficial because it’s like having a senior developer available 24/7. Still, it shouldn’t be trusted as an oracle, primarily because it still makes errors. I’ve written a dedicated article on how to effectively start coding in the era of ChatGPT, if you’re interested.
  2. Faster prototypes. In my opinion, one of the most important parts of the job of a Data Scientist is prototyping. In fact, very often we need to give a fast answer based on data (often with little available data and, very often, with dirty data). So, a prototype is what can give a sense of the answer a customer needs, giving us time to: a) ask for and get more data, b) ask for and get more specifications, c) clean the data, and d) do the necessary research.
  3. Faster debugging and error management. We have to be honest: debugging and error management in software development are more a curse than a delight. This is also true when we develop software for Machine/Deep Learning algorithms. ChatGPT is a great tool for debugging and error management: with the right prompt, it can find errors and bugs in a matter of seconds, saving us a lot of time and effort. Just a quick reminder: since ChatGPT (and similar tools) work in the cloud and may use our prompts to train their models, remember not to paste code containing sensitive information, because it can get you in trouble in case of a data breach.
  4. Faster research. An important part of the job of a Data Scientist is doing research. We absolutely need to research a lot of stuff to solve our problems, like info on particular libraries and their usage, info related to the domain knowledge of the problem we’re facing, etc. Well, a good prompt is generally useful to let us grab the info we need. The only thing to remember is that we always need to verify the correctness of the output by digging deeper on the internet or in books. Especially with code, it’s always important to read the documentation: otherwise, the risk is copying and pasting the code without actually understanding it.
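The reminder in point 3 about sensitive information can be made concrete with a small pre-processing step. What follows is only a minimal sketch, not a complete safeguard: the `redact` helper and its two regular expressions are hypothetical names and patterns introduced here for illustration, and they would need to be extended for the kinds of secrets that appear in your own code.

```python
import re

# Hypothetical patterns (assumptions for this sketch); extend them as needed.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SECRET": re.compile(r"(?i)\b(api_key|password|token)\b(\s*=\s*)(['\"]).*?\3"),
}


def redact(snippet):
    """Replace quoted secret values and e-mail addresses with placeholders."""
    # Keep the variable name, the "=" and the quotes; hide only the value.
    snippet = PATTERNS["SECRET"].sub(r"\1\2\3<REDACTED>\3", snippet)
    snippet = PATTERNS["EMAIL"].sub("<EMAIL>", snippet)
    return snippet


code = 'api_key = "sk-12345"\nsend_report(to="alice@example.com")'
print(redact(code))
```

Running the helper on a snippet before pasting it into a chat keeps the structure of the code readable while hiding the values that should not leave your machine. A manual read-through is still advisable, because simple patterns like these will miss many cases.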

Prompt Engineering for Data Scientists: cons

  1. Possibility of losing your job. Yes, we have to say it: AI tools can make us lose our jobs. It seems a contradiction: the demand for data professionals has been increasing in recent months, yet tools like ChatGPT may replace us. Well, let’s tell the truth: this possibility is still far off because AI tools need the supervision of an expert, as we’ve also discussed in the pros. Sure, you can ask for some code and some data analysis, but if you don’t know how to use the code, what do you do with it? So, yes: prompt engineering may lead to job loss for some data professionals, but it’s a matter of years, not months.
  2. Possibility of forgetting how to code. This is an actual problem. If we rely too much on prompt engineering rather than writing code ourselves, we can forget how to code. You know: coding is a matter of practice, and it needs everyday practice. Sure, it’s like riding a bicycle: you never forget how to do it. But relying too much on prompts rather than on writing code can atrophy your muscles because you’ve become too comfortable. So, use tools like ChatGPT but don’t rely only on them: strive to write code as much as you can. Because I know you love to code, so don’t leave too much of it to machines.
  3. Possibility of not learning new things. The beauty of working in IT, especially in the data field, is that new topics and technologies are born nearly every day. This is one of the most important reasons why I changed my career to work in IT: because I love to always learn new things, and I’d like this to be recognized as a good thing (yes, there are fields/companies where self-improvement is not considered a good thing). But if you rely only on the answers you get from your prompts and copy and paste the code (or the info you get), you won’t learn new things. On the code side, at the very least you need to ask for explanations and then read the documentation. And then, if possible, try the code in different scenarios so that you can get comfortable with the code you received. This is the same approach we need when we ask for general information: always dig deeper by reading other sources (as should always be the case: but let’s just keep this in mind!).
  4. Possibility of implementing “wordy code”. The fact that a piece of code just works is not a good reason to leave it as is, as you know. Code quality is also important for those who will read that snippet in the future (maybe you). Sometimes, due to different factors (often, poorly written prompts), tools like ChatGPT can give us “wordy code” with unnecessary parts. And here comes the importance of a professional: don’t blindly copy and paste it. If you do copy and paste it, then once you’ve verified that it works, strive to improve it as much as you can.

Now, with these pros and cons in mind, let’s see some effective prompts we can create in ChatGPT to improve and speed up our jobs as Data Scientists.

Example n°1: prototype

Suppose we are studying a binary classification problem. We want to compare three ML models using ROC curves, but we don’t actually remember the code to use. Also, our repositories are a little bit messy and we don’t want to waste much time.

To speed up our job in such situations, we can give ChatGPT a prompt like the following: “Create, in Python, a binary classification problem with synthetic data. Standardize the data using the standard scaler, then split the data into the train and the test set. Then, fit the train set with a random forest, a KNN, and a SVM classifiers. Then, plot ROC curve to compare the models.” And these are the results:

Certainly! I can help you create a binary classification problem using synthetic data in Python and apply the standard scaler for data standardization. Then, we can split the data into train and test sets. Finally, we’ll fit the train set with three classifiers: Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), and plot the ROC curve to compare the models.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

# Generate synthetic data for binary classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=42)

# Standardize the data using StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Fit the train set with classifiers
rf_classifier = RandomForestClassifier(random_state=42)
knn_classifier = KNeighborsClassifier()
svm_classifier = SVC(probability=True, random_state=42)

rf_classifier.fit(X_train, y_train)
knn_classifier.fit(X_train, y_train)
svm_classifier.fit(X_train, y_train)

# Generate predictions for the test set
rf_probs = rf_classifier.predict_proba(X_test)[:, 1]
knn_probs = knn_classifier.predict_proba(X_test)[:, 1]
svm_probs = svm_classifier.predict_proba(X_test)[:, 1]

# Calculate false positive rate, true positive rate,
# and area under the curve for ROC curve
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
rf_auc = auc(rf_fpr, rf_tpr)

knn_fpr, knn_tpr, _ = roc_curve(y_test, knn_probs)
knn_auc = auc(knn_fpr, knn_tpr)

svm_fpr, svm_tpr, _ = roc_curve(y_test, svm_probs)
svm_auc = auc(svm_fpr, svm_tpr)

# Plot the ROC curve
plt.figure()
plt.plot(rf_fpr, rf_tpr, label=f'Random Forest (AUC = {rf_auc:.2f})')
plt.plot(knn_fpr, knn_tpr, label=f'KNN (AUC = {knn_auc:.2f})')
plt.plot(svm_fpr, svm_tpr, label=f'SVM (AUC = {svm_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

The provided code generates a synthetic binary classification problem using scikit-learn’s make_classification function. It then applies the StandardScaler to standardize the data and splits it into a train set and a test set.

Three classifiers are then trained on the train set: Random Forest, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). For evaluation, the classifiers generate probabilities for the test set using the predict_proba method.

The next step involves plotting the Receiver Operating Characteristic (ROC) curve to compare the performance of the classifiers. The False Positive Rate (FPR) and True Positive Rate (TPR) are calculated using the roc_curve function, and the Area Under the Curve (AUC) is computed using the auc function.

Finally, the ROC curve is plotted using matplotlib, with each classifier’s AUC displayed in the legend. The curve represents the trade-off between the true positive rate and the false positive rate for each classifier, allowing visual comparison of their performance.

By examining the ROC curve and comparing the AUC values, you can assess the relative performance of the Random Forest, KNN, and SVM classifiers for the given binary classification problem.

Now, if we copy and paste this code into a Jupyter Notebook we get:

The ROC curve resulting from the above code. Image by Author.
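One detail is worth checking in generated code like this: the scaler is fit on the full dataset before the split, so statistics from the test set leak into the preprocessing. The listing also repeats the same lines once per model. The following is a minimal sketch of an alternative, assuming we only need the AUC values rather than the plot. It splits first, wraps the scaler and each classifier in a scikit-learn pipeline, and loops over a dictionary of models. The names `models` and `auc_scores` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same synthetic problem as in the generated code.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=42)

# Split first, so the test set stays unseen until evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),
}

auc_scores = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training data only.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    probs = pipeline.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, probs)

for name, score in auc_scores.items():
    print(f"{name}: AUC = {score:.3f}")
```

On a clean synthetic dataset the difference in the scores is likely to be small, but the habit matters on real data, and the loop makes it easy to add or remove a model.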

We get what we wanted, so let’s analyze the prompt:

  1. It’s specific. I specified: a) the language (Python), b) the type of problem (binary classification), c) the process to follow (data creation, data normalization, data split), and d) the ML models to use.
  2. It has everything needed. There are no extra words that may create confusion. The prompt goes straight to the point.
  3. It’s short. It’s only 3 lines. Longer prompts may lead to incorrect outputs or lead the model to add information we may not need. If you have to write a “long prompt”, my advice is to subdivide the “big task” into smaller tasks and enter several short prompts in sequence. To be clear, the above prompt could be subdivided into sequential prompts like so: a) “create a binary classification dataset with Python, normalize the data, and split them into the train and the test set”, b) “fit the train set with random forest, SVM, and KNN classifiers. Then, plot the ROC curve”.
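The ingredients listed in point 1 (language, type of problem, process, models) can also be kept as separate pieces and assembled when needed. This is just a sketch of that idea: `build_prompt` is a hypothetical helper written for this example, not part of any library, and the wording of the template is an assumption.

```python
def build_prompt(language, problem, steps, models):
    """Assemble a short, specific prompt from its four ingredients."""
    step_text = ", then ".join(steps)
    model_text = ", ".join(models)
    return (
        f"In {language}, create a {problem} problem. "
        f"Follow these steps: {step_text}. "
        f"Use these models: {model_text}."
    )


prompt = build_prompt(
    language="Python",
    problem="binary classification",
    steps=["generate synthetic data", "standardize it",
           "split it into train and test sets"],
    models=["random forest", "KNN", "SVM"],
)
print(prompt)
```

Keeping the pieces separate makes it easy to change one ingredient (for example, the list of models) without rewriting the whole prompt, and it forces us to notice when one of the four is missing.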

Example n°2: research

Suppose we know that polynomial regression generally overfits, but we want proof because we’ve never seen it happen. The prompt could be something like this: “I know that polynomial regression models generally overfit: can you show me something, creating some data in Python?”

And here’s what we get:

Certainly! I can generate some random data and demonstrate how polynomial regression can lead to overfitting. Here’s an example in Python:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate random data
np.random.seed(0)
X = np.linspace(-3, 3, 100)
y = 0.5 * X**2 + X + np.random.normal(0, 1, 100)

# Reshape the input data
X = X.reshape(-1, 1)

# Plot the original data
plt.scatter(X, y, color='b', label='Data')

# Fit polynomial regression models of different degrees
degrees = [1, 4, 15]
colors = ['r', 'g', 'm']
for degree, color in zip(degrees, colors):
    # Create polynomial features
    polynomial_features = PolynomialFeatures(degree=degree)
    X_poly = polynomial_features.fit_transform(X)

    # Fit the polynomial regression model
    model = LinearRegression()
    model.fit(X_poly, y)

    # Predict the values
    y_pred = model.predict(X_poly)

    # Plot the fitted curve
    plt.plot(X, y_pred, color=color, linewidth=2, label=f'Degree {degree}')

# Add labels and title to the plot
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression - Overfitting Example')
plt.legend(loc='upper left')

# Display the plot
plt.show()

This code generates random data points with a quadratic relationship and adds some noise to the y values. It then fits polynomial regression models of different degrees (1, 4, and 15 in this example) to the data and plots the resulting curves.

You will observe that as the degree of the polynomial increases, the model starts to fit the noise in the data rather than the underlying trend. This is an example of overfitting. The linear model (degree 1) is too simple to capture the underlying quadratic relationship, while the high-degree polynomial (degree 15) tries to fit every data point precisely, resulting in a highly oscillating curve.

It’s important to note that the example I provided is a simple illustration of overfitting in polynomial regression. In practice, you would typically use more realistic and complex datasets to evaluate and address overfitting issues.

Now, let’s see what we get in a Jupyter Notebook:

A plot showing the difference in fitting the data created with the above code. Image by Author.

So, this plot shows that, given some quadratically distributed data:

  • A linear model can’t capture the trend.
  • The 4-degree and the 15-degree polynomials “capture the trend too well”.

Now, tell me: does it mean that the two polynomials overfit?

Stop reading for a second and think about that.

Well, yes and no, but primarily no. In fact, as we know, to claim that a model overfits we need to:

  1. Define metrics to evaluate our model.
  2. Calculate them on both the train and the test sets and find a big numerical difference between the two (typically, much better results on the train set).
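The two steps above can be sketched as a small helper. This is a minimal illustration, assuming R² as the metric and a single train/test split. The function name `r2_gap`, the seed, and the choice to standardize inside a pipeline are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def r2_gap(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (train R2, test R2, train minus test)."""
    model.fit(X_train, y_train)
    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, model.predict(X_test))
    return r2_train, r2_test, r2_train - r2_test


# Noisy quadratic data, mirroring the example above (seed chosen arbitrarily).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(0, 1, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gaps = {}
for degree in (1, 4, 15):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    gaps[degree] = r2_gap(model, X_train, X_test, y_train, y_test)
    r2_train, r2_test, gap = gaps[degree]
    print(f"degree {degree:>2}: train R2={r2_train:.3f}  "
          f"test R2={r2_test:.3f}  gap={gap:.3f}")
```

A large positive gap (train much better than test) is the numerical sign of overfitting. How large is "large" depends on the problem, so the helper only reports the numbers and leaves the judgment to us.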

So, this prompt is not satisfactory and we have to ask for more. For example, we could ask: “Good, so use the same data and the same polynomials. Then, normalize the data, split the dataset into train and test sets, and calculate R² for both sets”.

And we get:

Certainly! Here’s an updated version of the code that normalizes the data, splits it into train and test sets, and calculates the R² score for both sets:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Generate random data
np.random.seed(0)
X = np.linspace(-3, 3, 100)
y = 0.5 * X**2 + X + np.random.normal(0, 1, 100)

# Reshape the input data
X = X.reshape(-1, 1)

# Normalize the data
X_normalized = (X - np.mean(X)) / np.std(X)
y_normalized = (y - np.mean(y)) / np.std(y)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y_normalized, test_size=0.2, random_state=0
)

# Plot the original data
plt.scatter(X_normalized, y_normalized, color='b', label='Data')

# Fit polynomial regression models of different degrees
degrees = [1, 4, 15]
colors = ['r', 'g', 'm']
for degree, color in zip(degrees, colors):
    # Create polynomial features
    polynomial_features = PolynomialFeatures(degree=degree)
    X_poly_train = polynomial_features.fit_transform(X_train)
    X_poly_test = polynomial_features.transform(X_test)

    # Fit the polynomial regression model
    model = LinearRegression()
    model.fit(X_poly_train, y_train)

    # Predict the values for train and test sets
    y_pred_train = model.predict(X_poly_train)
    y_pred_test = model.predict(X_poly_test)

    # Calculate R² scores
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)

    # Plot the fitted curve
    plt.plot(X_normalized, model.predict(polynomial_features.transform(X_normalized)),
             color=color, linewidth=2,
             label=f'Degree {degree} (Train R²={r2_train:.2f}, Test R²={r2_test:.2f})')

# Add labels and title to the plot
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression - Overfitting Example (Normalized)')
plt.legend(loc='upper left')

# Display the plot
plt.show()

In this updated code, the data is first normalized by subtracting the mean and dividing by the standard deviation. Then, the data is split into a train set (80% of the data) and a test set (20% of the data) using train_test_split() from scikit-learn.

The polynomial regression models are fitted using the normalized training data, and the R² scores are calculated for both the train and test sets using r2_score() from scikit-learn. The R² scores are then displayed in the legend of each fitted curve.

By normalizing the data and calculating R² scores on both the train and test sets, you can evaluate the performance of the polynomial regression models and assess whether they overfit or generalize well to unseen data.

In this case, we get:

A plot showing the difference in fitting the data created with the above code. Image by Author.

So, what did we get? Well… we can’t see clear signs of overfitting, because R² slightly increases for the 4-degree polynomial on the test set while it slightly decreases for the 15-degree polynomial.

Here the mistake is that we’ve continued the chat we started. Let’s remember that an LLM generates each answer conditioned on the whole conversation so far, so it tends to stay anchored to what it has already produced. So, sometimes, if you just keep prompting you get the same answer (or something very similar), even if it returns some “apologies for the mistake”.

Also, I don’t want any plot. So, in this case, we’d better create another chat. Let’s try to be more specific now: “Create a 2-degree data set in Python. Normalize the data, and split it into the train and the test set. Fit the data with: linear regression, 4-degree polynomial, and 15-degree polynomial models. Print r-squared for each model for both train and test sets.”

And we get:

To create a 2-degrees dataset in Python, normalize it, and split it into train and test sets, you can use the numpy and sklearn libraries. Here’s an example of how you can achieve this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create the dataset
X = np.random.rand(100, 1) # Independent variable
y = 3 * X**2 + np.random.randn(100, 1) # Dependent variable (quadratic relationship)

# Normalize the data
X_normalized = (X - np.mean(X)) / np.std(X)
y_normalized = (y - np.mean(y)) / np.std(y)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y_normalized, test_size=0.2, random_state=42)

# Fit the linear regression model
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

# Fit the 4-degree polynomial regression model
poly_features = PolynomialFeatures(degree=4)
X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)
poly_regression = LinearRegression()
poly_regression.fit(X_poly_train, y_train)

# Fit the 15-degree polynomial regression model
poly_features = PolynomialFeatures(degree=15)
X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)
poly_regression_15 = LinearRegression()
poly_regression_15.fit(X_poly_train, y_train)

# Calculate R-squared for train and test sets
linear_train_r2 = linear_regression.score(X_train, y_train)
linear_test_r2 = linear_regression.score(X_test, y_test)

poly_train_r2 = poly_regression.score(X_poly_train, y_train)
poly_test_r2 = poly_regression.score(X_poly_test, y_test)

poly_15_train_r2 = poly_regression_15.score(X_poly_train, y_train)
poly_15_test_r2 = poly_regression_15.score(X_poly_test, y_test)

# Print the results
print("Linear Regression R-squared (Train):", linear_train_r2)
print("Linear Regression R-squared (Test):", linear_test_r2)

print("4-Degree Polynomial Regression R-squared (Train):", poly_train_r2)
print("4-Degree Polynomial Regression R-squared (Test):", poly_test_r2)

print("15-Degree Polynomial Regression R-squared (Train):", poly_15_train_r2)
print("15-Degree Polynomial Regression R-squared (Test):", poly_15_test_r2)

In this example, we generate a random dataset with a quadratic relationship between the independent variable X and the dependent variable y. The data is then normalized using z-score normalization. Next, the dataset is split into a train set (80%) and a test set (20%) using train_test_split from the sklearn.model_selection module.

We then fit three different models: linear regression, 4-degree polynomial regression, and 15-degree polynomial regression. For polynomial regression, we use the PolynomialFeatures class from sklearn.preprocessing to create polynomial features up to the desired degree. The models are fitted using the normalized training data.

Finally, we calculate the R-squared value for both the train and test sets for each model using the score method of the respective models. The R-squared value represents the proportion of the variance in the dependent variable that can be explained by

Good, so we get what we wanted and with some explanations. Now, if we try it we get:

ValueError: X has 16 features, but LinearRegression
is expecting 5 features as input.

Oops! As we said, errors can happen. In these cases, we can simply continue the chat like so: “I get: ValueError: X has 16 features, but LinearRegression is expecting 5 features as input.”

And now it returns:

Apologies for the error. It seems that there was an issue with the polynomial features transformation. To fix the problem, we need to ensure that we apply the polynomial transformation on the normalized X data, rather than the original X data. Here’s the corrected code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create the dataset
X = np.random.rand(100, 1) # Independent variable
y = 3 * X**2 + np.random.randn(100, 1) # Dependent variable (quadratic relationship)

# Normalize the data
X_normalized = (X - np.mean(X)) / np.std(X)
y_normalized = (y - np.mean(y)) / np.std(y)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y_normalized, test_size=0.2, random_state=42)

# Fit the linear regression model
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

# Fit the 4-degree polynomial regression model
poly_features = PolynomialFeatures(degree=4)
X_poly_train = poly_features.fit_transform(X_train)
X_poly_test = poly_features.transform(X_test)
poly_regression = LinearRegression()
poly_regression.fit(X_poly_train, y_train)

# Fit the 15-degree polynomial regression model
poly_features_15 = PolynomialFeatures(degree=15)
X_poly_train_15 = poly_features_15.fit_transform(X_train)
X_poly_test_15 = poly_features_15.transform(X_test)
poly_regression_15 = LinearRegression()
poly_regression_15.fit(X_poly_train_15, y_train)

# Calculate R-squared for train and test sets
linear_train_r2 = linear_regression.score(X_train, y_train)
linear_test_r2 = linear_regression.score(X_test, y_test)

poly_train_r2 = poly_regression.score(X_poly_train, y_train)
poly_test_r2 = poly_regression.score(X_poly_test, y_test)

poly_15_train_r2 = poly_regression_15.score(X_poly_train_15, y_train)
poly_15_test_r2 = poly_regression_15.score(X_poly_test_15, y_test)

# Print the results
print("Linear Regression R-squared (Train):", linear_train_r2)
print("Linear Regression R-squared (Test):", linear_test_r2)

print("4-Degree Polynomial Regression R-squared (Train):", poly_train_r2)
print("4-Degree Polynomial Regression R-squared (Test):", poly_test_r2)

print("15-Degree Polynomial Regression R-squared (Train):", poly_15_train_r2)
print("15-Degree Polynomial Regression R-squared (Test):", poly_15_test_r2)

Now the code should work correctly, applying the polynomial transformation on the normalized data and fitting the models accordingly.
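A note on the error itself, since it helps when reviewing generated code. Comparing the two listings, the normalization lines are identical. What changed is that the 15-degree features now live in their own variables (`X_poly_train_15`, `X_poly_test_15`) instead of overwriting `X_poly_train` and `X_poly_test`. In the first version, the 4-degree model was therefore scored on the 15-degree features. A minimal sketch of the shapes involved, using random stand-in data (the variable names `X_poly_4` and `X_poly_15` are introduced only for this illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 80 training rows with one feature, as in the 80/20 split of 100 samples.
X_train = np.random.default_rng(0).random((80, 1))

X_poly_4 = PolynomialFeatures(degree=4).fit_transform(X_train)
X_poly_15 = PolynomialFeatures(degree=15).fit_transform(X_train)

# One input feature expands to degree + 1 columns (bias term included).
print(X_poly_4.shape)   # (80, 5)
print(X_poly_15.shape)  # (80, 16)
```

The 5 and 16 match the numbers in the error message: a model fitted on 5 columns was asked to score data with 16.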

Now, if we test the corrected code, we get:

Linear Regression R-squared (Train): 0.5287297254229243
Linear Regression R-squared (Test): 0.4627146369316367

4-Degree Polynomial Regression R-squared (Train): 0.5895312356444804
4-Degree Polynomial Regression R-squared (Test): 0.4601670183046247

15-Degree Polynomial Regression R-squared (Train): 0.6256705123172766
15-Degree Polynomial Regression R-squared (Test): 0.26972190687190367

So, here we can see some slight overfitting of the 4-degree model and clear overfitting of the 15-degree model. In fact, the higher the degree of the polynomial, the higher the probability that it overfits the data. Let’s think about that: we have data with a 2-degree (quadratic) relationship and we’re trying to fit it with a 15-degree polynomial. So, it’s easy for the 15-degree polynomial to capture every small fluctuation in the data, leading to overfitting.
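One caveat: the generated code sets no random seed for the data, so the exact R² values change at every run, and a single train/test split can be lucky or unlucky. A way to make the comparison more stable is to seed the data and average the scores over several splits with cross-validation. The following is a sketch under those assumptions. The seed, the 5 folds, and the pipeline layout are choices made for this example, not something taken from the chat output.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Seeded quadratic data, similar in spirit to the generated example.
rng = np.random.default_rng(42)
X = rng.random((100, 1))
y = 3 * X.ravel() ** 2 + rng.normal(0, 1, 100)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    # 5-fold cross-validation, keeping the train scores as well.
    cv = cross_validate(model, X, y, cv=5, scoring="r2",
                        return_train_score=True)
    results[degree] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"degree {degree:>2}: mean train R2={results[degree][0]:.3f}  "
          f"mean test R2={results[degree][1]:.3f}")
```

If the gap between the mean train and mean test R² grows with the degree across folds, the conclusion about overfitting rests on more than one random split.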


