How to Predict Player Churn, with Some Help From ChatGPT | by Christian Galea | Jun, 2023

These curves are also useful for determining the threshold to use in the final application. For example, if we want to minimize the number of false positives, we can select a threshold at which the model attains higher precision, and then check what the corresponding recall would be.
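To make this trade-off concrete, here is a minimal sketch (with made-up probabilities and labels, not the actual model's outputs) of how precision and recall change as the threshold moves:

```python
import numpy as np

# Hypothetical predicted churn probabilities and true labels (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3])

def precision_recall_at(threshold, y_true, y_prob):
    """Compute precision and recall for a given classification threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Raising the threshold trades recall for precision (fewer false positives).
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t, y_true, y_prob)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, moving the threshold from 0.3 to 0.7 lifts precision to 1.0 while recall falls, which is exactly the trade-off read off the curves.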

The importance of each feature for the best model obtained can also be viewed, which is perhaps one of the more interesting results. This is computed using permutation importance via AutoGluon. P-values are also shown to determine the reliability of the result:

Feature Importance Table. Image by author.

Perhaps unsurprisingly, the most important feature is EndType (showing what caused the level to end, such as a win or a loss), followed by MaxLevel (the highest level played by a user, with higher numbers indicating that a player is quite engaged and active in the game).

On the other hand, UsedMoves (the number of moves performed by a player) is practically useless, and StartMoves (the number of moves available to a player) could actually harm performance. This also makes sense, since the number of moves used and the number of moves available to a player by themselves aren’t highly informative; a comparison between them would probably be much more useful.
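For readers curious about the mechanics behind the table above, permutation importance can be sketched in a few lines: shuffle one feature's values and measure how much the model's score drops. This is a toy stand-in (random data and a hand-written "model"), not the AutoGluon implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0; feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def model_predict(X):
    """Stand-in for a trained model (here, a simple rule on feature 0)."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(X, y, predict, n_repeats=10):
    """Importance = average drop in accuracy when a feature's values are shuffled."""
    baseline = np.mean(predict(X) == y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-label link
            drops.append(baseline - np.mean(predict(Xp) == y))
        importances.append(np.mean(drops))
    return np.array(importances)

imp = permutation_importance(X, y, model_predict)
# imp[0] is large (shuffling the informative feature hurts); imp[1] is ~0.
```

A feature whose shuffling barely changes the score, like UsedMoves here, contributes little on its own; a near-zero or negative importance is the signature of an uninformative or harmful feature.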

We could also have a look at the estimated probabilities of each class (either 1 or 0 in this case), which are used to derive the predicted class (by default, the class having the highest probability is assigned as the predicted class):

Table with original values, Shapley values, and predicted values. Image by author.
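As a quick illustration of how the predicted class is derived from these probabilities, here is a minimal sketch with made-up values (not the actual model output):

```python
import numpy as np

# Hypothetical class-probability output for three players (columns: class 0, class 1).
proba = np.array([[0.85, 0.15],
                  [0.40, 0.60],
                  [0.55, 0.45]])

# Default rule: the predicted class is the one with the highest probability.
pred_default = proba.argmax(axis=1)   # [0, 1, 0]

# Alternatively, any threshold on the churn probability can be used instead.
threshold = 0.4
pred_custom = (proba[:, 1] >= threshold).astype(int)   # [0, 1, 1]
```

Note how the third player flips from "keeps playing" to "churns" once the threshold drops below their churn probability of 0.45.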

Explainable AI is becoming ever more important to understand model behaviour, which is why tools like Shapley values are increasing in popularity. These values represent the contribution of a feature to the probability of the predicted class. For instance, in the first row, we can see that a RollingLosses value of 36 decreases the probability of the predicted class (class 0, i.e. that the person will keep playing the game) for that player.

Conversely, this means that the probability of the other class (class 1, i.e. that a player churns) is increased. This makes sense, because higher values of RollingLosses indicate that the player has lost many levels in succession and is thus more likely to stop playing the game due to frustration. On the other hand, low values of RollingLosses generally improve the probability of the negative class (i.e. that a player will not stop playing).
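To make the idea more concrete, here is a toy computation of exact Shapley values for a hypothetical two-feature scoring function (the function, feature values, and baseline are all invented for illustration; real tools approximate this efficiently for large models):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for a small number of features.

    'Absent' features are replaced by their baseline (reference) values;
    each feature's value is its weighted marginal contribution over all subsets.
    """
    n = len(x)

    def v(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += w * (v(set(S) | {i}) - v(set(S)))
        phis.append(phi)
    return phis

# Hypothetical scoring function standing in for the model's churn probability:
# more consecutive losses -> higher churn probability; more playtime -> lower.
def predict(z):
    rolling_losses, play_time = z
    return 0.01 * rolling_losses + 0.3 / (1.0 + 0.1 * play_time)

x = [36.0, 2.0]         # player with many recent losses and little playtime
baseline = [5.0, 10.0]  # "average" player used as the reference
phi = shapley_values(predict, x, baseline)
# phi[0] > 0: a RollingLosses of 36 pushes the churn probability up.
```

A useful sanity check is that the Shapley values always sum to the difference between the prediction for this player and the prediction for the baseline.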

As mentioned, a number of models are trained and evaluated, following which the best one is then selected. It is interesting to see that the best model in this case is LightGBM, which is also one of the fastest:

Information on the models trained. Image by author.

At this point, we can try improving the performance of the model. Perhaps one of the easiest ways is to select the ‘Optimize for quality’ option, and see how far we can go. This option configures several parameters that are known to generally improve performance, at the expense of a potentially slower training time. The following results were obtained (which you can also view here):

Evaluation Metrics when using the ‘Optimize for quality’ option. Image by author.

Again focusing on the ROC AUC metric, performance improved from 0.675 to 0.709. This is quite a nice increase for such a simple change, although still far from ideal. Is there something else that we can do to improve performance further?

As discussed earlier, we can do this using feature engineering. This involves creating new features from existing ones, which are able to capture stronger patterns and are more highly correlated with the variable to be predicted.

In our case, the features in the dataset have a fairly narrow scope, since the values pertain to a single record (i.e. the information on a level played by the user). Hence, it might be very useful to get a more global outlook by summarizing records over time. In this way, the model would have knowledge of the historical trends of a user.

For instance, we could determine how many extra moves were used by the player, thereby providing a measure of the difficulty experienced; if few extra moves were needed, then the level might have been too easy; on the other hand, a high number might mean that the level was too hard.

It would also be a good idea to check if the user is immersed and engaged in playing the game, by checking the amount of time spent playing it over the last few days. If the player has not played the game much, it might mean that they’re losing interest and may stop playing soon.

Useful features vary across different domains, so it is important to try to find any information pertaining to the task at hand. For example, you could find and read research papers, case studies, and articles, or seek the advice of companies or professionals who have worked in the field and are thus experienced and well-versed in the most common features, their relationships with each other, any potential pitfalls, and which new features are most likely to be useful. These approaches help reduce trial and error and speed up the feature engineering process.

Given the recent advances in Large Language Models (LLMs) (for example, you may have heard of ChatGPT…), and given that the process of feature engineering might be a bit daunting for inexperienced users, I was curious to see if LLMs could be at all useful in providing ideas on what features could be created. I did just that, with the following output:

ChatGPT’s answer when asking about what new features can be created to predict player churn more accurately. The reply is actually quite useful. Image by author.

ChatGPT’s reply is actually quite good, and it also points to a number of time-based features like those discussed above. Of course, keep in mind that we might not be able to implement all of the suggested features if the required information is not available. Moreover, ChatGPT is well known to be prone to hallucination, and as such may not provide fully accurate answers.

We could get more relevant responses from ChatGPT, for example by specifying the features that we’re already using or by employing prompt engineering techniques, but this is beyond the scope of this article and is left as an exercise to the reader. Nevertheless, LLMs could be considered as an initial step to get things going, although it is still highly recommended to seek more reliable information from papers, professionals, and so on.

On the Actable AI platform, new features can be created using the fairly well-known SQL programming language. For those less acquainted with SQL, approaches such as utilizing ChatGPT to automatically generate queries may prove useful. However, in my limited experimentation, the reliability of this method can be somewhat inconsistent.

To ensure accurate computation of the intended output, it is advisable to manually examine a subset of the results to verify that the desired output is being computed correctly. This can easily be done by checking the table that is displayed after the query is run in SQL Lab, Actable AI’s interface to write and run SQL code.

Here’s the SQL code I used to generate the new columns, which should help give you a head start if you would like to create other features:

-- NOTE: the window frame bounds below are representative reconstructions;
-- adjust them to the SQL dialect supported by the platform.
SELECT *,
       SUM("PlayTime") OVER UserLevelWindow AS "time_spent_on_level",
       (a."Max_Level" - a."Min_Level") AS "levels_completed_in_last_7_days",
       COALESCE(CAST("total_wins_in_last_14_days" AS DECIMAL)
                / NULLIF("total_losses_in_last_14_days", 0), 0.0) AS "win_to_lose_ratio_in_last_14_days",
       COALESCE(SUM("UsedCoins") OVER User1DayWindow, 0) AS "UsedCoins_in_last_1_days",
       COALESCE(SUM("UsedCoins") OVER User7DayWindow, 0) AS "UsedCoins_in_last_7_days",
       COALESCE(SUM("UsedCoins") OVER User14DayWindow, 0) AS "UsedCoins_in_last_14_days",
       COALESCE(SUM("ExtraMoves") OVER User1DayWindow, 0) AS "ExtraMoves_in_last_1_days",
       COALESCE(SUM("ExtraMoves") OVER User7DayWindow, 0) AS "ExtraMoves_in_last_7_days",
       COALESCE(SUM("ExtraMoves") OVER User14DayWindow, 0) AS "ExtraMoves_in_last_14_days",
       AVG("RollingLosses") OVER User7DayWindow AS "RollingLosses_mean_last_7_days",
       AVG("MaxLevel") OVER PastWindow AS "MaxLevel_mean"
FROM (
    SELECT *,
           MAX("Level") OVER User7DayWindow AS "Max_Level",
           MIN("Level") OVER User7DayWindow AS "Min_Level",
           SUM(CASE WHEN "EndType" = 'Lose' THEN 1 ELSE 0 END) OVER User14DayWindow AS "total_losses_in_last_14_days",
           SUM(CASE WHEN "EndType" = 'Win' THEN 1 ELSE 0 END) OVER User14DayWindow AS "total_wins_in_last_14_days",
           SUM("PlayTime") OVER User7DayWindow AS "PlayTime_cumul_7_days",
           SUM("RollingLosses") OVER User7DayWindow AS "RollingLosses_cumul_7_days",
           SUM("PlayTime") OVER UserPastWindow AS "PlayTime_cumul"
    FROM "game_data_levels"
    WINDOW User7DayWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                              RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW),
           User14DayWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                               RANGE BETWEEN INTERVAL '14' DAY PRECEDING AND CURRENT ROW),
           UserPastWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) AS a
WINDOW UserLevelWindow AS (PARTITION BY "UserID", "Level" ORDER BY "ServerTime"),
       PastWindow AS (ORDER BY "ServerTime"
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
       User1DayWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                          RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND CURRENT ROW),
       User7DayWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                          RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW),
       User14DayWindow AS (PARTITION BY "UserID" ORDER BY "ServerTime"
                           RANGE BETWEEN INTERVAL '14' DAY PRECEDING AND CURRENT ROW)
ORDER BY "ServerTime";

In this code, ‘windows’ are created to define the range of time to consider, such as the last day, last week, or last two weeks. The records falling within that range will then be used during the feature computations, which are mainly intended to provide some historical context as to the player’s journey in the game. The full list of features is as follows:

  • time_spent_on_level: the time spent by a user playing the level. Gives an indication of level difficulty.
  • levels_completed_in_last_7_days: The number of levels completed by a user in the last 7 days (1 week). Gives an indication of level difficulty, perseverance, and immersion in game.
  • total_wins_in_last_14_days: the total number of times a user has won a level
  • total_losses_in_last_14_days: the total number of times a user has lost a level
  • win_to_lose_ratio_in_last_14_days: Ratio of the number of wins to the number of losses (total_wins_in_last_14_days/total_losses_in_last_14_days)
  • UsedCoins_in_last_1_days: the number of used coins within the previous day. Gives an indication of the level difficulty, and willingness of a player to spend in-game currency.
  • UsedCoins_in_last_7_days: the number of used coins within the previous 7 days (1 week)
  • UsedCoins_in_last_14_days: the number of used coins within the previous 14 days (2 weeks)
  • ExtraMoves_in_last_1_days: The number of extra moves used by a user within the previous day. Gives an indication of level difficulty.
  • ExtraMoves_in_last_7_days: The number of extra moves used by a user within the previous 7 days (1 week)
  • ExtraMoves_in_last_14_days: The number of extra moves used by a user within the previous 14 days (2 weeks)
  • RollingLosses_mean_last_7_days: The average number of cumulative losses by a user over the last 7 days (1 week). Gives an indication of level difficulty.
  • MaxLevel_mean: the mean of the maximum level reached across all users.
  • Max_Level: The maximum level reached by a player in the last 7 days (1 week). In conjunction with MaxLevel_mean, it gives an indication of a player’s progress with respect to the other players.
  • Min_Level: The minimum level played by a user in the last 7 days (1 week)
  • PlayTime_cumul_7_days: The total time played by a user in the last 7 days (1 week). Gives an indication to the player’s immersion in the game.
  • PlayTime_cumul: The total time played by a user (since the first available record)
  • RollingLosses_cumul_7_days: The total number of rolling losses over the last 7 days (1 week). Gives an indication of the level of difficulty.
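As a rough illustration of what these per-user time windows compute, here is an equivalent of the 14-day loss count in pandas. This is a toy sketch with invented data, not part of the Actable AI pipeline:

```python
import pandas as pd

# Small hypothetical log of levels played, mirroring the columns in the SQL query.
df = pd.DataFrame({
    "UserID":     [1, 1, 1, 1],
    "ServerTime": pd.to_datetime(["2023-01-01", "2023-01-05",
                                  "2023-01-10", "2023-01-20"]),
    "EndType":    ["Lose", "Lose", "Win", "Lose"],
}).sort_values(["UserID", "ServerTime"])

df["is_loss"] = (df["EndType"] == "Lose").astype(int)

# Equivalent of User14DayWindow: per-user sum of losses over the previous
# 14 days, up to and including the current record.
df["total_losses_in_last_14_days"] = (
    df.set_index("ServerTime")
      .groupby("UserID")["is_loss"]   # PARTITION BY "UserID"
      .rolling("14D").sum()           # 14-day trailing window
      .values
)
```

Each row thus summarises the player's recent history rather than a single level, which is precisely the "more global outlook" the new features aim to provide.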

It is important that only the past records are used when computing the value of a new feature in a particular row. In other words, the use of future observations must be avoided, since the model will obviously not have access to any future values when deployed in production.
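Here is a small pandas sketch (with invented data) of the difference between a leaky window and a leakage-safe one:

```python
import pandas as pd

df = pd.DataFrame({
    "UserID":   [1, 1, 1, 2, 2],
    "PlayTime": [10, 20, 30, 5, 15],
})

# WRONG: a centered rolling mean looks at future records, which the model
# will never have at prediction time.
df["bad"] = df.groupby("UserID")["PlayTime"].transform(
    lambda s: s.rolling(3, center=True, min_periods=1).mean())

# RIGHT: a trailing window uses only the current and past records.
df["good"] = df.groupby("UserID")["PlayTime"].transform(
    lambda s: s.rolling(3, min_periods=1).mean())
```

On the first row of user 1, the "bad" feature already averages in the next record (value 15 instead of 10): exactly the kind of future information that inflates offline metrics but is unavailable in production.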

Once satisfied with the features created, we can then save the table as a new dataset, and run a new model that should (hopefully) attain better performance.

Time to see whether the new columns are of any use. We can repeat the same steps as before, the only difference being that we now use the new dataset containing the additional features. The same settings are used to enable a fair comparison with the original model, with the following results (which can also be viewed here):

Evaluation Metrics using the new columns. Image by author.

The ROC AUC value of 0.918 is much improved compared with the original value of 0.675. It’s even better than the model optimized for quality (0.709)! This demonstrates the importance of understanding your data and creating new features that are able to provide richer information.

It would now be interesting to see which of our new features were actually the most useful; again, we could check the feature importance table:

Feature importance table of the new model. Image by author.

It looks like the total number of losses in the last two weeks is quite important, which makes sense: the more often a player loses, the more likely they are to become frustrated and stop playing.

The average maximum level across all users also seems to be important, which again makes sense because it can be used to determine how far off a player is from the majority of other players — much higher than the average indicates that a player is well immersed in the game, while values that are much lower than the average could indicate that the player is still not well motivated.

These are only a few simple features that we could have created. There are other features that we can create, which could improve performance further. I will leave that as an exercise to the reader to see what other features could be created.

Training a model optimized for quality with the same time limit as before did not improve performance. However, this is perhaps understandable because a greater number of features is being used, so more time might be needed for optimisation. As can be observed here, increasing the time limit to 6 hours indeed improves performance to 0.923 (in terms of the AUC):

Evaluation metric results when using the new features and optimizing for quality. Image by author.

It should also be noted that some metrics, such as the precision and recall, are still quite poor. However, this is because a classification threshold of 0.5 is assumed, which may not be optimal. Indeed, this is also why we focused on the AUC, which can give a more comprehensive picture of the performance if we were to adjust the threshold.
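To see why the AUC is threshold-free while precision and recall are not, consider this toy sketch (invented scores): a model that ranks players perfectly but outputs poorly calibrated probabilities attains an AUC of 1.0, yet zero recall at the default 0.5 threshold:

```python
import numpy as np

# Hypothetical scores that rank all churners above all non-churners,
# but are poorly calibrated: every probability is below 0.5.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.45])

def roc_auc(y_true, y_prob):
    """AUC as the probability a random positive outscores a random negative."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

auc = roc_auc(y_true, y_prob)  # 1.0: the ranking is perfect
recall_at_half = ((y_prob >= 0.5) & (y_true == 1)).sum() / (y_true == 1).sum()  # 0.0
```

Lowering the threshold (here to anything below 0.3) would recover all the churners, which is why the threshold-dependent metrics alone can be misleading.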

The performance in terms of the AUC of the trained models can be summarised as follows:

│ Model                                                  │ AUC (ROC) │
│ Original features                                      │ 0.675     │
│ Original features + optim. for quality                 │ 0.709     │
│ Engineered features                                    │ 0.918     │
│ Engineered features + optim. for quality + longer time │ 0.923     │

It’s no use having a good model if we can’t actually use it on new data. Machine learning platforms may offer the ability to generate predictions on future unseen data with a trained model. For example, the Actable AI platform provides an API that allows the model to be used on data outside of the platform, as well as options to export the model or to insert raw values and get an instant prediction.

However, it is crucial to periodically test the model on future data, to determine if it is still performing as expected. Indeed, it may be necessary to re-train the models with the newer data. This is because the characteristics (e.g. feature distributions) may change over time, thereby affecting the accuracy of the model.
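One simple way to monitor such changes in feature distributions is the population stability index (PSI); this sketch uses synthetic data and hypothetical feature values purely for illustration:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI: how much a feature's distribution has shifted between a
    reference sample (e.g. training data) and a new sample (production)."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

rng = np.random.default_rng(42)
train_playtime = rng.normal(30, 5, 5000)  # hypothetical training distribution
same = rng.normal(30, 5, 5000)            # production data, behaviour unchanged
shifted = rng.normal(20, 5, 5000)         # behaviour changed (e.g. a new policy)

psi_same = population_stability_index(train_playtime, same)        # near 0
psi_shifted = population_stability_index(train_playtime, shifted)  # large
```

A common rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a large shift warranting re-training.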

For example, a new policy may be introduced by a company that then affects customer behaviours (be it positively or negatively), but the model may be unable to take the new policy into account if it does not have access to any features reflecting the new change. If there are such drastic changes but no features that could inform the model are available, then it could be worth considering the use of two models: one trained and used on the older data, and another trained and used with the newer data. This would ensure that the models are specialised to operate on data with different characteristics that may be hard to capture with a single model.

In this article, a real-world dataset containing information on each level played by a user in a mobile app was used to train a classification model that can predict whether a player will stop playing the game in two weeks’ time.

The whole processing pipeline was considered, from EDA to model training to feature engineering. Discussions on the interpretation of the results and on how we could improve upon them were provided, going from an ROC AUC of 0.675 to one of 0.923 (where 1.0 is the maximal value).

The new features that were created are relatively simple, and there certainly exist many more features that could be considered. Moreover, techniques such as feature normalisation and standardisation could also be considered. Some useful resources can be found here and here.

With regards to the Actable AI platform, I may of course be a bit biased, but I do think that it helps simplify some of the more tedious processes that need to be done by data scientists and machine learning experts, with the following desirable aspects:

  • The core machine-learning library is open-source, so anyone with good programming knowledge can verify that it is safe to use; it can also be used directly by anyone who knows Python
  • For those who do not know Python or are not familiar with coding, the GUI offers a way to use a number of analytics and visualisations with little fuss
  • It’s not too difficult to start using the platform (it does not overwhelm the user with too much technical information that may dissuade less knowledgeable people from using it)
  • Free tier allows running of analytics on datasets that are publicly available
  • A vast number of tools are available (apart from classification considered in this article)

That said, there are a few drawbacks, and several aspects could be improved, such as:

  • Free tier does not allow running ML models on private data
  • User interface looks a bit dated
  • Some visualisations can be unclear and sometimes hard to interpret
  • App can be slow to respond at times
  • A threshold other than 0.5 can’t be used when computing and displaying the main results
  • No support for imbalanced data
  • Some knowledge of data science and machine learning is still needed to extract the most out of the platform (although this is probably true of other platforms too)

In other future articles, I will consider using other platforms to determine their strengths and weaknesses, and thereby which use cases best fit each platform.

Until then, I hope this article was an interesting read! Please feel free to leave any feedback or questions that you may have!
