What Exactly Does a Data Scientist Do? | by Matt Chapman | Jun, 2023


My honest reflections after working in 3 different Data Science teams (hint: there’s a lot more PowerPoint than you think)

Image by Hermansyah on Unsplash

Data Scientists have been called many things:

  • “A Data Scientist is a statistician who lives in San Francisco”
  • “Professional modellers, but not like that”
  • “I get paid to Google Stack Overflow”
  • “I sell magic to executives”

Or, my personal favourite:

  • “Data Science is statistics on a Mac”

As this smorgasbord of job descriptions shows, it can be really hard to get a clear picture of what a Data Scientist role actually involves day-to-day. Lots of the existing articles out there, while excellent, date from 2012–2020, and in a field that evolves as fast as Data Science these can quickly become outdated.

In this article, my aim is to peel back the proverbial covers and give a personal insight into life as a Data Scientist in 2023.

By drawing on my experiences of working in 3 different Data Science teams, I’ll try to help three types of people:

  1. Aspiring Data Scientists: I’ll give a realistic insight into what the job involves, so you can make a more informed decision about whether it’s for you and which skills to work on
  2. Data Scientists: I’ll try to spark new ideas for things to try in your team and/or give you a way to answer the question “So what is it you actually do?”
  3. People who work with (or want to hire) Data Scientists: I’ll help you get to know what the heck we actually do (and, perhaps more importantly, what we don’t do)

The Head of AI at a large tech company once told me that the biggest misconception he encounters about Data Scientists is that we’re always building deep learning models and doing “fancy AI stuff.”

Now don’t get me wrong — Data Science can get very fancy indeed, but it encompasses a lot more than Artificial Intelligence and its flashy use cases. Equating Data Science with AI is sort of like assuming that lawyers spend all their days shouting “I object!” in court; there’s a lot more that goes on behind the scenes.

There’s more to it than “fancy AI stuff”

One of my favourite descriptions of Data Science comes from Jacqueline Nolis, a Principal Data Scientist based in Seattle. Nolis divides Data Science into three streams:

  1. Business Intelligence: “taking data that the company has and getting it in front of the right people”
  2. Decision Science: “taking data and using it to help a company make a decision”
  3. Machine Learning: “taking data science models and putting them continuously into production,” although I would probably take a broader view and include the actual development of ML models.

Different companies will emphasise different streams, and even within these streams the methods and goals will vary. For example:

  • If you’re a Data Scientist working in Decision Science, your day-to-day tasks could include anything from running A/B tests to solving linear programming problems.
  • If you’re a Data Scientist who spends most of their time building ML models, those could be either product-focused (e.g., building a recommendation algorithm which will be incorporated into an app) or business-operations-focused (e.g., building a pricing or forecasting model, used to improve commercial operations in the company’s backend).
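To make the A/B testing example a little more concrete, here’s a minimal sketch of one common way to analyse a simple test: a two-sided, two-proportion z-test on conversion rates. The function name and the visitor and conversion numbers are made up for illustration, and a real analysis would usually also think about sample size planning and how long the test runs.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Cannot test: pooled conversion rate is 0 or 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 5,000 visitors per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these invented numbers the p-value comes out well below 0.05, so you would typically conclude that variant B converts better than variant A.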

Personally, one of the things that I find most enjoyable about Data Science is getting to dip my toes in all three of these areas, and so in the Data Science roles I’ve done, I’ve always tried to make sure there’s lots of variety. It’s a good way to try and build the “jack of all trades, master of one” mindset that I’ve previously advocated for as a way to frame your career as a Data Scientist.

Image by Teemu Paananen on Unsplash

Ah, PowerPoint. If you thought Data Scientists were spared from it, how wrong you were.

Making and presenting slides is a key part of any Data Scientist role because your models ain’t goin’ anywhere if you can’t communicate their value. As Andrew Young puts it:

Over the years, I have seen many PhD-holding data scientists spend weeks or months building highly effective machine learning pipelines that (theoretically) will deliver real-world value. Unfortunately, these fruits of labor can die on the vine if they fail to effectively communicate the value of their work

In my team, we place a lot of emphasis on stakeholder communication and so PowerPoint tends to feature quite heavily in our day-to-day jobs.

For every project, we build a master slide deck which different team members can add to, and then we select relevant slides from this deck whenever it’s time to present to stakeholders. Where necessary, we try to create multiple versions of the key slides so that we’re able to tailor our messages to different audiences, who have different levels of technical expertise.

If I’m being honest, I actually don’t mind spending time in PowerPoint (please don’t cancel me), as I find that making slides is a great way to distill your key ideas. It helps me remember big picture questions like: (1) what problem am I solving, (2) how does my solution compare to the baseline one, and (3) what are the dependencies and timelines.

It’s commonly said that data science is 80% preparing data…

… and 20% complaining about preparing data.

And I’m not just talking about companies where Data Science is the “new thing.”

Even in established companies with established datasets, data preparation and validation can take a substantial amount of time. At the very least, you’ll likely find that datasets are (1) stored on different platforms, (2) published at different cadences, or (3) in need of substantial wrangling to get into the right format. Even once your models are in production, you need to be continually checking that your datasets aren’t drifting, breaking or missing information.
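As a rough illustration of what “continually checking” can look like, here’s a minimal sketch that flags a new batch of a numeric column if it has too many missing values or if its mean has moved a long way from a reference batch. The function names, thresholds and numbers are all hypothetical, and this is a deliberately crude check (it also assumes the reference batch isn’t constant); real monitoring setups tend to use more robust drift tests.

```python
from statistics import mean, stdev


def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def mean_shift(reference, current):
    """How far the current mean has moved, in reference standard deviations."""
    ref = [v for v in reference if v is not None]
    cur = [v for v in current if v is not None]
    return abs(mean(cur) - mean(ref)) / stdev(ref)


def check_batch(reference, current, max_missing=0.05, max_shift=3.0):
    """Return a list of human-readable issues found in the current batch."""
    issues = []
    if missing_rate(current) > max_missing:
        issues.append("too many missing values")
    if mean_shift(reference, current) > max_shift:
        issues.append("mean has drifted")
    return issues


# Hypothetical batches of the same numeric column
reference = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
current = [15.1, None, 14.8, 15.3, None, 15.0]
print(check_batch(reference, current))
```

Both thresholds (`max_missing` and `max_shift`) are arbitrary defaults that you would tune per dataset.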

And don’t even get me started on user-input data.

In one of my old jobs, we had an online form where users were required to input their address, and our users used 95 different ways of spelling “Barcelona”: I’m talking everything from “barcalona” to “BARÇA” and “Barna.”

The moral of the story: don’t have free-text fields unless you want to spend your coming weeks crying over regex documentation.
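If you do end up with a free-text field, fuzzy matching can sometimes save you from the regex. Here’s a minimal sketch of one way to clean up city names using only Python’s standard library (`unicodedata` and `difflib`). The list of canonical cities, the nickname table and the 0.8 cutoff are assumptions for illustration, not what we actually used.

```python
import difflib
import unicodedata

# Hypothetical reference data: the spellings we want to end up with
CANONICAL_CITIES = ["barcelona", "madrid", "valencia"]
# Hypothetical nicknames that are too short to fuzzy-match reliably
ALIASES = {"barca": "barcelona", "barna": "barcelona"}


def strip_accents(text):
    """Remove diacritics, e.g. 'Ç' becomes 'C'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_city(raw, cutoff=0.8):
    """Map a free-text city to a canonical spelling, or None if nothing is close."""
    key = strip_accents(raw).strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    matches = difflib.get_close_matches(key, CANONICAL_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["barcalona", "BARÇA", "Barna", "  Barcelona ", "Paris"]:
    print(repr(raw), "->", normalise_city(raw))
```

Anything that comes back as `None` still needs a human to look at it, so this reduces the crying rather than eliminating it.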

Image by Christina @ wocintechchat.com on Unsplash

One of the things I love most about Data Science is the fact that it involves continual learning.

I’ve always dreaded the idea of getting stuck in a job where I just do the same things all the time, and I’m thankful to say that Data Science is not one of those careers. As a Data Scientist, you’ll discover that there’s no such thing as a “standard” project. All of them require a slightly bespoke approach, so you’ll always need to adapt your existing knowledge and learn new things.

And I’m not just talking about “formal” learning like attending conferences or doing online courses.

More likely, you’ll spend a substantial amount of your days doing “micro-learning” by reading coding documentation, Towards Data Science articles, and Stack Overflow answers. If you’re interested in how I approach the task of continual learning and staying up-to-date, I talk about this in a bit more depth in one of my recent articles.

Image by Marvin Meyer on Unsplash

Data Scientists don’t exist in a bubble.

We’re embedded in teams, and to work effectively you have to be able to work together. I really like the way that Megan Lieu puts this:

The biggest disappointment I had when I finally became a data scientist was learning that it’s not just heads-down work all day.

“I can’t wait to not talk to anyone, build models and just do technical data science-y things by myself all the time!”

Much to my introverted horror, I realized I not only had to collaborate with, but also actually TALK to business and external stakeholders everyday

While I feel a little less strongly than Megan (I’m more of an extrovert by nature), I too was initially surprised by how team-based the role can often be. In my role, “collaboration” means things like: having daily stand-ups to discuss tasks and blockers, doing regular pair-programming sessions to debug and optimise code, and having well-balanced discussions (read: arguments) about the merits of different technical approaches.

All in all, I reckon I spend about 50–70% of my time working solo and the rest of the time doing pair or group work, although the exact ratio will depend a lot on your company and level of seniority.

Thank you for reading this small insight into my life as a Data Scientist.

I hope you’ve found it helpful, and please feel free to reach out if you fancy a chat 🙂





