Create and Explore the Landscape of Roles and Salaries in Data Science | by Erdogan Taskesen | Jun, 2023

The data science salary data set is derived from [1] and is also openly available as a Kaggle competition [2]. The data set contains 11 features for 4134 samples. The samples are collected worldwide and updated weekly, covering 2020 up to the present (early 2023). The data set is published in the public domain and is free to use. Let’s load the data and have a look at the variables.

# Import library
import datazets as dz
# Get the data science salary data set
df = dz.get('')

# The features are as follows:

# 'work_year' > The year the salary was paid.
# 'experience_level' > The experience level in the job during the year.
# 'employment_type' > Type of employment: part-time, full-time, contract or freelance.
# 'job_title' > Name of the role.
# 'salary' > Total gross salary amount paid.
# 'salary_currency' > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd' > Converted salary in USD.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio' > Remote work: less than 20%, partially, more than 80%
# 'company_location' > Country of the employer's main office.
# 'company_size' > Average number of people that worked for the company during the year.

# Selection of only European countries
# countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
# df['europe'] = np.isin(df['company_location'], countries_europe)

A summary of the top job titles together with the salary distributions is shown in Figure 1. The top two panels show worldwide statistics, whereas the bottom two panels cover Europe only. Although such graphs are informative, they show averages, and they do not reveal how location, experience level, remote work, country, and so on relate to each other in a particular context. For example: is the salary of an entry-level data engineer who works remotely for a small company similar to that of an experienced data engineer with other properties? Such questions are better answered with the analysis shown in the next sections.

Figure 1. The top-ranked job titles. The two top panels are worldwide statistics whereas the bottom two panels are for Europe. (image by author)
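The kind of summary behind Figure 1 can be approximated with a plain pandas group-by. The snippet below is a minimal sketch on a handful of made-up rows (the values are illustrative and not taken from the data set); the helper name top_job_titles is hypothetical and is not part of any library used in this article.

```python
import pandas as pd

# Toy rows that mimic three columns of the salary data set (values are made up).
toy = pd.DataFrame({
    'job_title': ['Data Engineer', 'Data Scientist', 'Data Engineer',
                  'Data Analyst', 'Data Scientist', 'Data Engineer'],
    'company_location': ['US', 'DE', 'NL', 'US', 'FR', 'GB'],
    'salary_in_usd': [120000, 90000, 80000, 70000, 85000, 95000],
})

def top_job_titles(frame, n=3):
    """Return sample count and median salary for the n most frequent job titles."""
    summary = (frame.groupby('job_title')['salary_in_usd']
                    .agg(count='count', median_salary='median')
                    .sort_values('count', ascending=False))
    return summary.head(n)

# Worldwide summary.
worldwide = top_job_titles(toy)

# Same summary restricted to a few European country codes.
countries_europe = ['DE', 'NL', 'FR', 'GB']
europe = top_job_titles(toy[toy['company_location'].isin(countries_europe)])

print(worldwide)
print(europe)
```

The median is used here instead of the mean because salary distributions tend to be skewed; swapping in 'mean' gives the averages instead.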


The data science salary data set is a mixed data set containing continuous and categorical variables. We will perform an unsupervised analysis and create the data science landscape. But before doing any preprocessing, we need to remove redundant features such as salary_currency and salary to prevent multicollinearity issues. In addition, we exclude the variable salary_in_usd from the data set and store it as a target variable y, because we do not want the grouping to be driven by the salary itself. Based on the clustering, we can then investigate whether any of the detected groupings relate to salary. The cleaned data set contains 8 features for the same 4134 samples.

# Store salary in separate target variable.
y = df['salary_in_usd']

# Remove redundant variables
df.drop(labels=['salary_currency', 'salary', 'salary_in_usd'], inplace=True, axis=1)

# Make the categorical variables easier to understand.
# The codes are matched as exact values, so no regular expressions are needed.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-level', 'MI':'Junior Mid-level', 'SE':'Intermediate Senior-level', 'EX':'Expert Executive-level / Director'})
df['employment_type'] = df['employment_type'].replace({'PT':'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'})
df['company_size'] = df['company_size'].replace({'S':'Small (less than 50)', 'M':'Medium (50 to 250)', 'L':'Large (>250)'})
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote', 50:'Partially remote', 100:'>80% remote'})
df['work_year'] = df['work_year'].astype(str)

# (4134, 8)
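
Because y is kept outside the feature matrix but shares its index, it can be joined back to any grouping afterwards. The snippet below is a minimal sketch of that idea; the salaries and the cluster labels are made up, and the name cluster_labels is hypothetical (in the article, the labels would come from the unsupervised analysis in the later sections).

```python
import pandas as pd

# Made-up salaries and made-up cluster labels for five samples.
y = pd.Series([50000, 60000, 120000, 130000, 125000], name='salary_in_usd')
cluster_labels = pd.Series([0, 0, 1, 1, 1], name='cluster')

# Both series share the same index, so salaries can be grouped by cluster label.
salary_per_cluster = y.groupby(cluster_labels).median()

print(salary_per_cluster)
```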

The next step is to bring all variables onto the same scale. To do this, we perform one-hot encoding and take care of the multicollinearity that we can unknowingly introduce along the way. When we transform a categorical variable into multiple one-hot variables, we introduce a linear dependency that allows us to perfectly predict one feature from the other features of the same categorical column (that is, the sum of the one-hot encoded features is always one). This is called the dummy trap, and we can prevent it by breaking the chain of linearity, simply by dropping one column. The df2onehot package contains a dummy trap protection feature. It is slightly more advanced than simply dropping one one-hot column per category, because it only removes a one-hot column if the chain of linearity is not yet broken by other cleaning actions, such as requiring a minimum number of samples per one-hot feature or removing the False state in boolean features.
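
The dummy trap can be made concrete with a small numerical check. The snippet below is a minimal sketch that uses pandas.get_dummies and NumPy rather than df2onehot, on a made-up categorical column: with an intercept column added, the full one-hot encoding is rank-deficient, and dropping one column restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy categorical column (values are made up for illustration).
toy = pd.DataFrame({'company_size': ['S', 'M', 'L', 'M', 'S', 'L']})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy['company_size'], dtype=float)

# Every row sums to one, so the columns add up to the intercept column.
row_sums = full.sum(axis=1)

# Design matrix with an intercept column in front of the one-hot columns.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping one one-hot column breaks the linear dependency.
reduced = pd.get_dummies(toy['company_size'], dtype=float, drop_first=True)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```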

# Import library
from df2onehot import df2onehot

# One hot encoding and removing any multicollinearity to prevent the dummy trap.
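# Arguments follow the settings described in the text below (y_min=5, remove_multicollinearity=True).
# Assumption: df2onehot returns a dictionary and the encoded data frame is stored under the key 'onehot'.
dfhot = df2onehot(df,
                  remove_multicollinearity=True,
                  y_min=5)['onehot']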

#       work_year_2021  ...  company_size_Small (less than 50)
# 0              False  ...                              False
# 1              False  ...                              False
# 2              False  ...                              False
# 3              False  ...                              False
# 4              False  ...                              False
# ...              ...  ...                                ...
# 4129           False  ...                              False
# 4130            True  ...                              False
# 4131           False  ...                               True
# 4132           False  ...                              False
# 4133            True  ...                              False

# [4134 rows x 115 columns]

In our case, we remove one-hot encoded features that contain fewer than 5 samples (y_min=5) and remove multicollinearity to prevent the dummy trap (remove_multicollinearity=True). This results in 115 one-hot encoded features for the same 4134 samples.
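
The effect of a minimum-sample threshold can be illustrated in plain pandas. The snippet below is a minimal sketch and not the df2onehot implementation: the job titles and counts are made up, and the variable min_samples is a hypothetical stand-in for the role that y_min=5 plays above.

```python
import pandas as pd

# Made-up job titles: two frequent categories and one rare category.
toy = pd.DataFrame({
    'job_title': ['Data Engineer'] * 6 + ['Data Scientist'] * 5 + ['Head of Data'] * 2
})

# One-hot encode the categorical column.
onehot = pd.get_dummies(toy['job_title'])

# Count the positive samples per one-hot column and keep the columns
# that reach the threshold.
min_samples = 5
counts = onehot.sum(axis=0)
kept = onehot.loc[:, counts >= min_samples]

print(kept.columns.tolist())
```

Rare categories are dropped because a one-hot feature that is set for only a handful of samples carries little information for a clustering and mostly adds sparse dimensions.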
