Intro to Data Analytics: The “Google Method” | by Col Jung | Jun, 2023


Ask, Analyse & Act

11 min read

10 hours ago

Image by Adam Satria (Unsplash)

Data analytics at large companies is messy.

The data is frequently elusive, hiding in various corners of the company, requiring soft skills and a Sherlock-like investigation to locate and integrate it together.

There’s the formidable fortress of data governance that protects the company and its customers, but complicates your aim to get your grubby hands on said data. Again, negotiation and soft skills matter here.

As if all that wasn’t challenging enough, you’ll need to muster those technical chops for effective analytics and master the art of data storytelling to sell the impact of your data insights.

Reality of skillsets needed. Image by author

Having worked as a data scientist and engineer at one of Australia’s largest banks over the past five years, I’ve discovered that online courses don’t adequately prepare students for the realities of enterprise analytics.

I aim to kill three birds with this article.

For those…

  1. With little industry experience — I’ll share my own end-to-end experiences of typical data analysis projects at my bank, warts and all.
  2. Learning data analytics — we’ll perform a basic pivot table analysis in Excel that concludes with visualised insights. Dataset included!
  3. Looking for a data analysis process to follow — I’ll employ Google’s own six-stage process for the contemporary data analyst: Ask, Prepare, Process, Analyse, Share and Act.
From descriptive analysis to predictive modelling. Image by author

Let’s dive in.

Unlike academia, analyses in industry is driven by a business need.

What problem is the stakeholder trying to solve? Your analysis will typically enable the company to do something better, faster or cheaper.

My bank: In my industry of banking, here are some typical questions thrown our way.

  • How can we better identify fraudulent transactions?
  • What factors contribute to customer churn?
  • What are the key drivers behind customer satisfaction?
  • What are the key factors influencing credit card spending?
  • How can we optimise our loan approval process so that it’s faster yet still timely?

We’ll work with business stakeholders to ask the right questions, define the problem and formulate a plan of attack.

We’ll leverage some domain-expertise and consider the whole problem in order to capture the situation right.

And we’ll keep an open line of communication with these stakeholders throughout the project.

Skills needed: project and stakeholder management; domain-expertise.

Exercise

Suppose you work for a small online retailer that sells various merchandise to Aussie and New Zealand customers. Your boss wants to understand the sales performance of umbrellas:

“Which months do they sell the best?”

“Where in Australia and NZ are they most popular?

Great! Some meaty questions for you to dive into.

The business need now motivates a data need.

We need to prepare the data for the analytics to drive the insights that answer the business question. (Whew, that was a mouthful!)

  • What data do we need? Where does it lie? Is it operational data or analytical data? Which source systems do the operational data lie in? Does the analytical data sit in a data warehouse or data lake?
  • What tools can we use to access the data? You might use something like Teradata Studio for a SQL database hosted by warehouse vendor Teradata. You might use Azure Synapse Analytics for big data hosted on Microsoft Azure Cloud. You might use Excel or Power BI for smaller tables, particularly those small enough to download to your local machine. What if you need to join datasets from all these sources? Maybe a data science workbench like Dataiku.
  • What governance processes do we need to navigate? Who owns the data? What approvals do we need to access it, particularly if the data contains sensitive customer information? How will you guarantee its security? How will the lineage of the data transformed during the lifecycle of this project be tracked? (…a massive contributor to enterprise tech debt).

On governance, data discovery projects at my bank typically requires buy-in from a multitude of stakeholders, all with their own interests and KPIs:

  • The Business — obviously! They commissioned me in the first place…
  • Architecture — do the patterns needed for this project align with the strategic technology direction of the bank? Or is my work largely an ad hoc tactical solution that will cease to become valuable later? Will my team be contributing to technical debt?
  • Data Platforms — do we have the data and tooling services to achieve the aims of the project? Does it align with our funding constraints?
  • Governance — what paperwork is needed to access and use the data? Are there privacy and ethical considerations?
  • Risk — what data risks are present? What are the controls in response? How will sensitive data be protected?

Skills needed: understand company’s enterprise data architecture; data modelling; experience with different platforms and tools; ability to align various stakeholder interests.

Exercise (continued)

Suppose the aggregated sales data for the online business is stored in a CRM platform and downloadable to your laptop here.

Here’s what it looks like in Excel.

Image by author

Ah, the fun part.

Data wrangling!

(Half-joking here…)

During the data processing stage, you’ll clean the data and make it ready for analysis.

This includes addressing data quality issues — such as duplicate and missing data — and decide on how to deal with things like irrelevant data and outliers.

My bank: In large organisations, data quality issues are almost impossible to eliminate.

There’s simply too much data created by too many people and source systems.

Data warehouses often boast a significant amount of technical debt from bespoke data pipelines that are built for specific projects, creating redundant data assets that are not reusable.

In the 2020’s, the solution organisations are adopting worldwide is moving from a project-oriented mindset to a product-oriented mindset where data is treated as a quality reusable product.

Meanwhile data lakes run into bottleneck issues due to funnelling an enormous amount of work on an overburdened centralised data team who are data engineering experts but not data-domain experts.

However, the benefit of having professional data engineers building ETL pipelines for data analysts and data scientists is that a significant amount of data problems are identified and fixed.

Skills needed: data wrangling.

Exercise (continued)

Time for you to prepare your spreadsheet above for analysis.

Since you’re looking at the popularity of umbrella sales (i.e. volume sold is what matters), I suggest you hide columns like the Sale Price and Manufacturing Cost.

Image by author

You can then filter the sales for just Umbrellas…since that’s what your boss cares about!

Image by author

Oops, we can already see some data quality issues, which you can directly fix in the spreadsheet.

Ideally, data quality issues are programmatically addressed with code, as manual human efforts are unreliable for anything beyond small datasets.

Thankfully, the dataset for this exercise is small.

Image by author

After addressing these issues, your dataset is ready for analysis.

Time for the fun bit.

Certainly the bit that all the online courses focus on — the analytics!

The primary goal here is to find the patterns, relationships and trends that solve the business problem.

You’ll make sense of the data and try to unlock the value that sits latent within.

My bank: We have pockets of analytics excellence, which includes data science squads that perform advanced analytics and predictive modelling. This includes machine learning prototyping with traditional Python notebooks, or newer no-code ML platforms that democratise advanced capabilities across the organisation.

How the world of AI, big data and data science intersect. Image by author

By far, however, the majority of the organisation sit at a data maturity level closer towards descriptive analytics, which means tools like Excel and Power BI.

Skills needed: analytics chops!

Exercise (continued)

To get a better sense of what’s happening in the data, you can slice and dice the dataset and its fields in useful ways by leveraging pivot tables.

The idea is you’ll only select the variables you’re interested in and create visualisations to drive insights.

To create a pivot table in Excel, select the source data and Insert a PivotTable next to it.

Image by author

You’ll start with a blank PivotTable, which you can populate by dragging and dropping fields as shown below.

Image by author

You’re interested in measuring Umbrella sales numbers, so drag the:

  • Product field into Filters, so you can filter by Umbrellas;
  • Units Sold field into Values, as that’s your main measure.
  • Year and Month Name into Rows, so you know the Umbrella sales by year and month.

You now have your PivotTable!

Image by author

Time for the grand finale — visualising your PivotTable in a chart.

To create it, select your PivotTable and Insert a PivotChart. I’d suggest picking a Line chart, but feel free to try others in order to best represent your data.

Image by author

Aha, the magic moment…

Insights!

Image by author

Your PivotChart clearly tells an interesting data story.

In 2022, umbrella sales spiked in April, reached the first pinnacle around May to June, dropped a little bit in July, then reached its ultimate peak in August before dropping.

In any case, May to August saw significantly higher sales than the rest of the year.

This addresses the first business problem of working out which times of the year boast the best sales.

By dragging and dropping fields in our PivotTable, you can create another PivotChart that answers the second business problem of which parts of Australia and NZ drives the best sales.

Image by author

We can clearly see that the majority of Umbrella sales actually come from New Zealand. This actually sheds further insight onto the first business problem — May and August happens to be the country’s wet seasons!

In short, you cleaned the data, created some pivot tables to aggregate the data and then crafted some charts to visually show that our Kiwi friends were ultimately responsible for generating the majority of umbrella sales during just a few wet months of the year!

Nice!

If your insights don’t inform some sort of downstream decision-making, were they insights at all?

Alright, that’s a bit harsh, but the point I’m trying to make is a lot of analytics insights don’t get used, typically because they’re:

  • Not shared by analysts to stakeholders or
  • Not articulated well or
  • Not actioned by the decision-makers.

In this stage, your job is to share your findings with the aid of effective visuals that put data storytelling at the forefront.

Skills needed: data visualisation; stakeholder management.

Exercise (continued)

Arrange a meeting to present your findings to your boss. He’ll need to know just how valuable your Kiwi customers are!

In the finale of the end-to-end data analytics process, your communications and storytelling skills reign supreme.

Many data scientists argue that these are the most important skills, as without it your work will have little impact on stakeholders and shall fade into insignificance.

The reason data storytelling is important is as follows:

  • Your data doesn’t speak for itself. Sorry!
  • Humans are wired to believe stories.
  • Stories create audience empathy, which generates an oxytocin response that results in their trust in you.
  • Once you’ve established this connection, you can sell them whatever you want.
  • Bottom line: Sell stories, not data.

My bank: Unfortunately, many analytics and data science projects aren’t ultimately actioned. This can be due to a number of reasons:

  • No buy-in from stakeholders. Your results aren’t strong enough — and this could come from the poor communication of results.
  • Insights difficult to action. For instance, it might be exorbitantly expensive to action certain insights, or it doesn’t align with current business priorities.
  • If you disagree with any inaction, it is up to you to assemble the right support from colleagues and management to push ahead for action.

Skills needed: effective communication and storytelling; stakeholder management; people soft skills.

Exercise (continued)

At a meeting with your boss, you string together a strong narrative that leverages some of the interactive Excel PivotCharts you crafted. Your boss is convinced and commits to building a stronger business in New Zealand while ensuring there that umbrella inventories are stocked up for next year’s wet season.

Congratulations on persuading decision-makers to take action on the insights from your analytics!

Online data analysis courses are commonly light on the various soft skills required to succeed as an effective data analyst.

That’s why I like Google’s six-stage framework:

  • Ask. What’s the business problem?
  • Prepare. How do we get the data we need?
  • Process. Reliable insights can only come from clean data.
  • Analysis. Time to show off your technical chops!
  • Share. Show off your work to stakeholders.
  • Act. Impact comes from downstream decisions informed by your work.

In this article, I delved into my personal industry experiences, shedding light on the considerations and challenges inherent in each stage.

For the aspiring data analysts among you, we consolidated some vital skills, such as proficiently wrangling small datasets in Excel into pivot tables and crafting some visualisations to expose reportable insights.

Enjoyed this story? Get an email when I post similar articles.

Find me on Linkedin, Twitter & YouTube.

Gain unlimited access to Medium here and directly support my writing.

  • AI Revolution: Fast-paced Intro to Machine Learning — here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War — here
  • Generative AI Art: Midjourney & Stable Diffusion Explained — here
  • Power of Data Storytelling — Sell Stories, Not Data — here
  • From Data Warehouses & Data Lakes to Data Mesh — here
  • Data Warehouses & Data Modelling — a Quick Crash Course — here
  • From Data Lakes to Data Mesh: A Guide to Latest Architecture — here
  • Data Products: Building a Strong Foundation for Analytics — here
  • Cloud Computing 101: Harness Cloud for Your Business — here
  • Intro to Data Analysis: The Google Method — here
  • Power BI — From Data Modelling to Stunning Reports — here
  • Machine Learning versus Mechanistic Modelling — here
  • Popular Machine Learning Performance Metrics Explained — here
  • Future of Work: Is Your Career Safe in Age of AI — here
  • Beyond ChatGPT: Search for a Truly Intelligence Machine — here
  • Regression: Predict House Prices using Python — here
  • Classification: Predict Employee Churn using Python — here
  • Python Jupyter Notebooks versus Dataiku DSS — here





Source link

Leave a Comment