Great Applied (Data) Science, or a definition of done


What helps solve real-life problems end-to-end, from business requirements to convincing presentation of results

Advanced data science work in industry is sometimes also known as “applied science,” reflecting the reality that it’s about more than just data and that many former academics work in the field. I find that “applied science” has different expectations than research science. So I wrote up what, in my experience, helps produce great applied science work. I use this as a “definition of done” for data science work, but many points will also benefit analysts, engineers, and other technical roles.

Great applied scientists solve valuable real-world problems end-to-end by finding clever uses of data and models. Sometimes the first step is discovering the most impactful business problem that is likely to admit a feasible scientific solution; sometimes the business problem is well understood, and scientific work begins with formulating a well-defined technical problem statement.

In either case, successful scientific work begins with understanding a real-world problem. Scientists need to understand complex business challenges well enough to translate them into technical formulations that can be solved in finite time. They cut through ambiguity and create appropriate structural assumptions to enable solutions.

Successful scientific work then finds technically appropriate and pragmatic solutions: this can mean state-of-the-art deep learning, but very senior scientific work can also consist of a few clever SQL queries. Great scientists know how to choose the right tool for the job.

Great scientists understand that it is easy to get stuck on bad technical approaches. To avoid this, they structure their work incrementally: they are able to break a large problem into smaller sub-problems; they validate individual approaches through the production of intermediate results, and they actively elicit feedback from peers.

Great science work incorporates feedback, because good inductive biases dramatically speed up learning. But great scientists also know how to spot an empirical question when they see one, and to insist on using data to answer it.

Great scientific work means documenting the steps required to reproduce a solution, and presenting the results in an audience-appropriate manner. And it includes seeing to it that results are used, whether that’s a change in software, a strategic decision, or a published paper. Only then has the valuable real-world problem been solved, end-to-end.

The following four principles underpin these recommendations for applied science work in industry:

  1. Ownership: Our job is to solve ambiguous problems end-to-end.
  2. Efficient curiosity: We like to learn. Ideally more efficiently than through brute-force experimentation.
  3. Measure twice, cut once: In exploratory work, explicit planning prevents getting lost.
  4. Iterative results: Frequent feedback reduces ambiguity.

The following sections offer concrete suggestions that in my experience improve scientific work, structured along the scientific process.

1. The role of Applied Science

Applied science work is inherently social. To produce great work, applied scientists need to work well in teams.

Working with People

A lot of science work requires collaborating with others: understanding previous work, finding relevant datasets, asking for explanations, communicating progress to your stakeholders, convincing your teammates to support your project. Ultimately, you are responsible for delivering a result, and managing the collaboration is part of the job. This may mean convincing a teammate to review your Pull Request, or finding a way to get your data engineering needs prioritized in another team’s backlog. When this process gets stuck, you can escalate, and your lead can help clarify priorities, but initiating that escalation is your responsibility.

Follow-through

As a part of science work, teams often brainstorm ideas in group settings or one-on-one. These sessions can be extremely valuable and help produce much higher-quality work than any one individual could on their own. But if brainstorming sessions are not directly connected to follow-up work, they are often a waste of time. To avoid that, follow up proactively on discussed ideas and tasks: if the team agreed that something should be done, do it (and communicate results). If it’s too much to do immediately, write a ticket for the ideas (and communicate the ticket). In brainstorming sessions, it is often very helpful to ask what the concrete take-aways are and who is responsible for next steps. Anyone can do this, and anyone can share their notes with the team.

Transparency and trust

Not everything goes according to plan, and the nature of scientific work is that most experiments fail. It is expected that plans do not work out as hoped. This makes scientific work a high-variance activity: even if you can accurately predict the “expected value” of your work, surprises are likely to happen.

When something doesn’t go as planned, be fully transparent. Most importantly, inform your team immediately when something is amiss. This allows more efficient coordination of roadmaps and deliverables. In return, expect trust: failed experiments are part of our daily work.

The most annoying of experimental outcomes is the “inconclusive” result: An experiment didn’t exactly fail, but it didn’t succeed, either. Those, too, are part of scientific work, and they, too, deserve presentation and sharing: can we hypothesize why the scientific problem couldn’t be solved? If we were to start over again, what would we do differently?

When working in teams, it is natural not to understand everything in a presentation, a conversation, or a ticket. It is problematic, however, to “nod along” — when others assume you have understood something that you didn’t, this will likely lead to misunderstandings and misalignment in work outcomes. As scientists, “why” should be our favorite question: when something is unclear, keep asking until you feel there is mutual understanding. If in doubt, rephrasing something in your own words is a powerful tool to check you really understood the right thing.

Properly understanding business problems is often “messier” than pure research science: concepts aren’t well-defined, quantities aren’t measurable, stakeholder objectives are misaligned. But misunderstanding the problem to be solved leads to disappointment, and diminishes the value of applied science work — no scientific sophistication can later save this.

Keep asking questions until you fully understand the business problem. In some cases, your questions may lead to a sharpening or even shift of the business problem formulation.

When formulating a business problem in technical terms, we often need to make some base assumptions: we choose a specific definition of a concept, we ignore edge cases, we decide which potential side-effects are out of scope. Take note of these assumptions so that you can revisit them later.

As a result of making assumptions, you may end up solving the wrong problem, because a plausible assumption turned out to be inaccurate. The most effective way to prevent this is to work incrementally and to frequently verify that increments are moving in the right direction. Taking this to an extreme, building a mockup solution (e.g. a quick spreadsheet) can often help generate valuable business feedback.

Before starting deep modeling work, double-check that your approach will solve the right business problem, and keep verifying this along the way. If in doubt, go for frequent, smaller iterations.

Once you start digging into data and models, it’s easy to get lost. Writing down a clear research roadmap helps avoid this. Experienced scientists often consciously or unconsciously follow a clear structure of hypotheses in their work, breaking an ambiguous problem into sequentially solvable sub-problems. I recommend writing a full draft of the hypothesis structure, and getting feedback on it, before writing the first line of code.

The important thing is that you write down some form of plan, and that you orient yourself in it as you make progress. One format I’ve found helpful is a mind map or bullet-point list that imposes a hypothesis tree structure explicitly. This is a rough “algorithm” for creating one:

  1. Brainstorm a few different approaches to your problem; don’t forget to look for existing solutions from other teams, in open source, and in published material. Write them down as “candidates.” The idea here is breadth-first: collect many rough ideas.
  2. Roughly estimate the effort needed to “validate” an approach: this is not the effort needed to solve the problem using a given approach, but the effort needed to find out whether an approach will very likely work or not.
  3. Order your approaches by estimated validation effort.
  4. Starting at the lowest validation effort approach, brainstorm how to validate the approach, and to ultimately solve your (sub-)problem using it. Recursively continue down your tree (i.e. for every sub-problem, re-start at 1). Repeat this for the levels of the tree as necessary.

Using this method, you ideally end up with a well-structured, prioritized plan for what to try first, and what to do when an idea works (follow that branch) and when it doesn’t (continue work on the next branch below the current one). The “leaves” of your tree should ideally be relatively easy-to-test, answerable-by-data questions. The structure of your plan should also make it easier to describe progress, and to get feedback on interim results.
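
To make the tree idea concrete, here is a minimal sketch in Python. It is only one possible way to write such a plan down: the class name `Hypothesis`, the function `plan`, the example problem, and all effort estimates are made up for illustration. The only thing it takes from the steps above is the ordering rule: at every level, candidates are sorted by estimated validation effort, cheapest first.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One node in the hypothesis tree (all names here are illustrative)."""
    name: str
    # Rough guess of the effort needed to *validate* the idea, not to fully solve with it.
    validation_effort_days: float = 0.0
    children: list["Hypothesis"] = field(default_factory=list)


def plan(node: Hypothesis, depth: int = 0) -> list[str]:
    """Return an indented work plan, cheapest-to-validate branches first."""
    if depth == 0:
        label = node.name  # the root is the problem itself, so it carries no estimate
    else:
        label = f"{node.name} (~{node.validation_effort_days:g}d to validate)"
    lines = ["  " * depth + label]
    for child in sorted(node.children, key=lambda h: h.validation_effort_days):
        lines.extend(plan(child, depth + 1))
    return lines


# Hypothetical example problem with three candidate approaches.
root = Hypothesis("Reduce forecast error for new products", children=[
    Hypothesis("Train a deep sequence model", 10),
    Hypothesis("Reuse the existing model from another team", 2, children=[
        Hypothesis("Does it beat our current baseline on a holdout set?", 1),
        Hypothesis("Does their training data cover our categories?", 0.5),
    ]),
    Hypothesis("Borrow sales history from similar products", 3),
])

print("\n".join(plan(root)))
```

Reading the printed plan top to bottom gives the order of work: start with the first leaf, and if a branch fails, move on to the next branch at the same level. A plain bullet list in a document does the same job; the code only makes the ordering rule explicit.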

The most elusive advice of all: great scientists have an uncanny intuition about which approaches might work and which do not even warrant closer consideration. This sometimes leads to the impression that certain people “just make everything work.” It is often more accurate to say that these people know what not to try, so they spend most of their time on productive ideas in the first place. This matters because the universe of potential hypotheses is massive, and intuition prunes the search space in your hypothesis tree. Of course, building this level of intuition is difficult and takes a career.

Intuition is social

When you find someone whose intuition you trust, ask them for advice on what approaches to follow. Ask them to justify that advice. Try to understand how they reason about problem-solution-mappings, beyond the immediate technical question.

Building intuition especially benefits from interactive learning: consider pair-programming days with your peers, and explain concepts to each other. Try to meet in person and not just remotely: I, at least, haven’t yet found a full replacement for a whiteboard or two and being in the same room.

Strong fundamentals help

Invest in understanding the fundamentals: you should build mental models of how things work. These need to be “correct enough,” yet simple enough to be applicable to real-world situations. You should be able to switch between “black box thinking” at the architecture level and understanding the inner workings of the black boxes when it comes to details. To make this more concrete: when dealing with image or text data, the idea of using “embeddings” is a good intuition that allows you to quickly build a mental architecture of potential models. But to accurately judge the feasibility of such approaches, you should fully understand how embeddings are trained, and how the resulting vectors encode information.

Curiosity

Be curious about similar, but different, problems. Think about how they are similar, and how they are different. Think about how the solutions to your problem may or may not apply to these similar problems. Some examples: experimentation on substitutable products relates to experimentation on social networks (“spillover effects”). Fashion pricing relates to airline pricing (“perishable goods”). Product/entity matching relates to music copyright enforcement (“coarse + precise matching steps”).

Reflect on your previous work: when you had to try something out, because you didn’t have a strong intuition, what can you learn from your experiment? Are there general truths to be learned from the experiment that can help you improve your intuition?

Actively seek criticism of your approaches: whether as part of a “Research Roadmap Review” or as part of your reflection on a finished project, dissenting opinions can help you sharpen your intuitions and uncover blind spots.

Clean code is a special challenge in the exploratory, experimental work that is so typical of applied science. But it is just as important there: clean code avoids mistakes, partly because it forces hygiene, partly because readers of your code will be better able to spot errors, and partly because it makes it easier for you to iterate on ideas when the first experiments inevitably fail. Variable names are much more important than most university courses suggest. Encapsulation in functions, classes, and modules can help you navigate varying levels of detail and abstraction.

Premature “productionization,” however, can also slow you down: until the solution is clear, it should be easy to replace parts of your approach.

Write code with a reader in mind

When you’re writing analysis notebooks, write them for a reader, not just for yourself. Explain what you are doing. Use good variable names. Clean up plots. Markdown cells were invented for a reason.

Think about DRY code. This is especially challenging when doing exploratory work typical of applied science. When you catch yourself copy/pasting code from previous investigations, it’s probably a good time to refactor.

When exploratory work is performed with a reader in mind, it can be reviewed as a Pull Request much like any other piece of code. In fact, all necessary steps for the final analytical answer should be reviewed by a second pair of eyes. Do your reviewers a favor and remove (or clearly mark) purely exploratory code before submitting for a review.

Documentation

Organizing and updating a central knowledge base is one of the most ubiquitous problems in tech organizations that I know of. I am not aware of simple solutions. But I know that investments in good documentation pay off in the long run. For central knowledge, there should be one (and only one!) central source. This document should be the source of truth: if the code doesn’t do what the documentation says, the code is wrong (not the documentation). This requires frequent and easy updating of documentation: badly-written, but correct and complete documentation is infinitely better than well-written, but outdated documentation. Investing in documentation is promotable work, and I believe in its impact.

Presentations are an opportunity to take a big step back from your work and to reflect on what it means in the grand scheme of things. This is true for a final presentation of results, but perhaps even more true for presentations of interim results.

Every time you present results, think about your audience’s expectations. For every point you make (for every slide; for every section in a text), you should answer an implicit “so what?” for the audience. Different audiences will have different expectations here: A senior business leader may be most interested in a straightforward narrative that captures the essence of your findings and can easily be shared with other senior leaders. Your manager may be most interested in understanding how a given problem will be solved, and when. A colleague may be most interested in what they can learn for their own work from your approach. A stakeholder or customer will want to know what new decisions or actions your work enables them to take.

I find that many scientists are tempted to follow their discovery process in their presentation, starting at the first experiment. I strongly advise against this, because it often leads to losing people’s attention before you even get to the interesting part: instead, start with the original business question you’re trying to answer, and with your best answer to the original question. Then describe your high-level approach, and explain why you think your answer is the best one you can give. Anticipate what questions your presentation raises, and have answers prepared for them. Most importantly, answer the question “so what?” for each point you make.

Interesting experiments that did not ultimately contribute to your answer belong in an appendix — they may help an in-depth discussion, but are not required for the main presentation.

Corollary: preparing materials for presentations often feels like busywork. I have found that, on the contrary, frequently producing presentable interim results helps maintain focus and mental clarity, because you force yourself to take several steps back. Producing clear plots and narratives is extremely helpful for staying focused on the end goal and reaching clear conclusions; optimizing slide layouts is not. Therefore, for interim results, form follows function: what matters is that the take-away messages are communicated clearly, not that the design is perfect. For example, it is perfectly acceptable to hand-draw a figure on paper and present a photo of it.

Clear visualization

All plots should be self-explanatory, and have a clear message. I strongly recommend following a few basic rules even in plots you create just for yourself.

  1. Label your axes with self-explanatory descriptions: use words, not just letters.
  2. Use clear chart titles that explain what is shown and the main message (again, “so what?”).
  3. Reduce the data you show to the necessary: e.g. the data may contain a “Dummy” category, which is clearly not intended to be useful. Don’t let this clutter visual space in your plots.
  4. When displaying many series of data differentiated only by color, make sure that the colors are clearly distinguishable and that the legend is easy to read (bonus points for a color-blind-friendly palette).
  5. Visualization helps understand patterns in data. If a plot merely shows a chaotic cloud of points, it can probably be removed (unless you want to prove that particular point).
  6. Log-scales can often help clean up plots of (positive) count data.

Clear numeric results

Whenever presenting numeric results (e.g. in tables):

  1. Optimize for, and present, an appropriate success metric. Many applied scientists spend too little time on this: know when to use RMSE, MAPE, or MAE; when to evaluate on a log scale; and when to prefer F1, ROC, or the area under the precision-recall curve.
  2. Almost all real-world problems are about weighted success metrics, yet most ML courses hardly cover the topic: a sales-forecast success metric, for example, may need to be weighted by prices, inventory-value, or package dimensions, depending on the use case.
  3. If success means estimating counterfactuals (“what if” analysis), make that explicit and explain clearly how your success metric captures such counterfactuals. (Natural) experiments are a popular choice.
  4. Provide reasonable benchmarks for any number you present. Often, identifying the “right” benchmark requires hard thinking — but it’s always worth it. You fit a fancy ML model? How much better is it than linear regression? You built a forecast for next week? How much better is it than assuming that next week is equal to this week? You are presenting A/B-test results? How much is the uplift relative to our monthly revenue, or relative to the last improvement?
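
As a small illustration of points 2 and 4, the following Python sketch compares a forecast against the naive “next week equals this week” benchmark, once unweighted and once weighted by price. All numbers, the variable names, and the helper `weighted_mae` are invented for this example; the choice of price as the weight is just one of the options mentioned above, not a recommendation.

```python
def weighted_mae(actual, predicted, weights):
    """Weighted mean absolute error: sum(w * |a - p|) / sum(w)."""
    if not (len(actual) == len(predicted) == len(weights)):
        raise ValueError("actual, predicted, and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_errors = (w * abs(a - p) for a, p, w in zip(actual, predicted, weights))
    return sum(weighted_errors) / total_weight


# Hypothetical data: units sold for three products.
last_week = [100, 20, 5]
this_week = [110, 18, 9]
model_forecast = [108, 21, 6]   # made-up model output for this week
price = [2.0, 15.0, 120.0]      # weights: errors on expensive products count more
equal = [1.0, 1.0, 1.0]         # equal weights give the ordinary, unweighted MAE

naive_forecast = last_week      # benchmark: "this week equals last week"

for label, weights in [("unweighted", equal), ("price-weighted", price)]:
    model_error = weighted_mae(this_week, model_forecast, weights)
    naive_error = weighted_mae(this_week, naive_forecast, weights)
    improvement = 1 - model_error / naive_error
    print(f"{label}: model {model_error:.2f}, naive {naive_error:.2f}, "
          f"improvement {improvement:.0%}")
```

With these invented numbers, the model halves the unweighted error of the naive benchmark, but the price-weighted improvement is only about 23%, because the model's largest relative miss is on the most expensive product. Which of the two figures is the honest one depends on the use case, which is exactly why the weighting deserves explicit thought.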

Thank you for reading this far! I’d love to hear your feedback: What resonated? Where does your experience differ? And what else helps solve valuable real-world problems end-to-end?


