A Modest Introduction to Analytical Stream Processing
By Scott Haines


Foundations are the unshakable, unbreakable base upon which structures are placed. When it comes to building a successful data architecture, the data is the central tenet of the entire system and the principal component of that foundation.

Given the common way in which data now flows into our data platforms via stream processing platforms like Apache Kafka and Apache Pulsar, it is critical to ensure we (as software engineers) provide hygienic capabilities and frictionless guardrails that reduce the problem space related to data quality “after” data has entered these fast-flowing data networks. This means that establishing API-level contracts around our data’s schema (types and structure), field-level availability (nullable, etc.), and field-type validity (expected ranges, etc.) becomes the critical underpinning of our data foundation, especially given the decentralized, distributed streaming nature of today’s modern data systems.
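A rough sketch of what such a contract can look like in code, using hypothetical field names and ranges (none of this comes from a real schema): required fields are plain types, field-level availability is an explicit Option, and field-type validity is a simple predicate applied before an event is allowed into the stream.

// A hypothetical API-level data contract expressed in Scala types.
final case class CoffeeOrderEvent(
  userId: String,            // required field: structure and type are fixed
  storeId: Option[String],   // field-level availability: nullability is explicit
  numberOfItems: Int         // field-type validity: the range is checked below
)

object CoffeeOrderContract {
  // Reject events that violate the contract before they enter the stream.
  def validate(event: CoffeeOrderEvent): Either[String, CoffeeOrderEvent] =
    if (event.userId.trim.isEmpty) Left("userId must be a non-empty string")
    else if (event.numberOfItems < 1 || event.numberOfItems > 100) Left("numberOfItems must be between 1 and 100")
    else Right(event)
}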

However, to get to the point where we can even begin to establish blind-faith — or high-trust — data networks, we must first establish intelligent system-level design patterns.

Building Reliable Streaming Data Systems

As software and data engineers, building reliable data systems is literally our job, and that means data downtime should be measured like any other component of the business. You’ve probably heard of the terms SLAs, SLOs, and SLIs at one point or another. In a nutshell, these acronyms are associated with the contracts, promises, and actual measures by which we grade our end-to-end systems. As the service owners, we will be held accountable for our successes and failures, but a little upfront effort goes a long way. The metadata captured to ensure things are running smoothly from an operations perspective can also provide valuable insights into the quality and trust of our data-in-flight, and it reduces the level of effort required to solve problems with data-at-rest.

Adopting the Owner’s Mindset

For example, Service Level Agreements (SLAs) between your team, or organization, and your customers (both internal and external) are used to create a binding contract with respect to the service you are providing. For data teams, this means identifying and capturing metrics (KPMs — key performance metrics) based on your Service Level Objectives (SLOs). The SLOs are the promises you intend to keep based on your SLAs; they can be anything from a promise of near-perfect (99.999%) service uptime (API or JDBC) to something as simple as a promise of 90-day data retention for a particular dataset. Lastly, your Service Level Indicators (SLIs) are the proof that you are operating in accordance with those service-level contracts, and they are typically presented in the form of operational analytics (dashboards) or reports.
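As a tiny, illustrative sketch (the 99.9% target and the request counters are made up for this example), an SLO is just a promise expressed as a number, and the SLI is the measurement you compare against it:

// A hypothetical availability SLO checked against an observed SLI.
object AvailabilitySli {
  // The promise (SLO): 99.9% of requests in a reporting window succeed.
  val sloTarget: Double = 0.999

  // The indicator (SLI): the observed success ratio for that window.
  def availability(successfulRequests: Long, totalRequests: Long): Double =
    if (totalRequests == 0L) 1.0 else successfulRequests.toDouble / totalRequests

  // The report: are we keeping the promise we made in the SLA?
  def meetsSlo(successfulRequests: Long, totalRequests: Long): Boolean =
    availability(successfulRequests, totalRequests) >= sloTarget
}

In practice those counters would come from your metrics pipeline and feed the dashboards and reports that act as your SLIs.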

Knowing where we want to go can help establish the plan to get there. This journey begins at the ingest point, and with the data itself. Specifically, with the formal structure and identity of each data point. Given the observation that “more and more data is making its way into the data platform through stream processing platforms like Apache Kafka,” it helps to have compile-time guarantees, backwards compatibility, and fast binary serialization for the data being emitted into these data streams. Data accountability can be a challenge in and of itself. Let’s look at why.
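As a hedged sketch of that ingest point (the topic name, event shape, and naive delimited-string encoding are all assumptions for this example), the snippet below publishes a typed event to Kafka as raw bytes. In practice, a binary format like Protobuf or Avro, ideally backed by a schema registry, is what actually delivers the compile-time guarantees and backwards compatibility.

import java.nio.charset.StandardCharsets
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A hypothetical, strongly typed event. In a real pipeline this would likely be a
// generated Protobuf or Avro class rather than a hand-rolled case class.
final case class CustomerOrderEvent(userId: String, item: String, timestamp: String)

object OrderEventProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

  private val producer = new KafkaProducer[String, Array[Byte]](props)

  // Naive encoding for illustration only; a schema-aware serializer would
  // normally produce these bytes.
  private def encode(e: CustomerOrderEvent): Array[Byte] =
    s"${e.userId}|${e.item}|${e.timestamp}".getBytes(StandardCharsets.UTF_8)

  def publish(event: CustomerOrderEvent): Unit =
    producer.send(new ProducerRecord[String, Array[Byte]]("customer.orders.v1", event.userId, encode(event)))
}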

Managing Streaming Data Accountability

Streaming systems operate 24 hours a day, 7 days a week, 365 days a year. This can complicate things if the right up-front effort isn’t applied to the problem, and one of the problems that tends to rear its head from time to time is corrupt data, aka data problems in flight.

There are two common ways to reduce data problems in flight. First, you can introduce gatekeepers at the edge of your data network that negotiate and validate data using traditional Application Programming Interfaces (APIs). Second, you can create and compile helper libraries, or Software Development Kits (SDKs), to enforce the data protocols and enable distributed writers (data producers) into your streaming data infrastructure. You can even use both strategies in tandem.

Data Gatekeepers

The benefit of adding gateway APIs at the edge (in front) of your data network is that you can enforce authentication (can this system access this API?), authorization (can this system publish data to a specific data stream?), and validation (is this data acceptable or valid?) at the point of data production. The diagram in Figure 1–1 below shows the flow of the data gateway.

Figure 1–1: A Distributed Systems Architecture showing authentication and authorization layers at a Data Intake Gateway. Flowing from left to right, approved data is published to Apache Kafka for downstream processing. Image Credit by Scott Haines

The data gateway service acts as the digital gatekeeper (bouncer) to your protected (internal) data network. Its main role is to control, limit, and even restrict unauthenticated access at the edge (see APIs/Services in Figure 1–1 above) by authorizing which upstream services (or users) are allowed to publish data (commonly handled through the use of service ACLs), coupled with a provided identity (think service identity and access: IAM; web identity and access: JWT; and our old friend OAuth).

The core responsibility of the gateway service is to validate inbound data so that potentially corrupt, or generally bad, data is never published. If the gateway is doing its job correctly, only “good” data will make its way along and into the data network, which is the conduit of event and operational data to be digested via stream processing. In other words:

“This means that the upstream system producing data can fail fast when producing data. This stops corrupt data from entering the streaming or stationary data pipelines at the edge of the data network and is a means of establishing a conversation with the producers regarding exactly why, and how things went wrong in a more automatic way via error codes and helpful messaging.”
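To make that fail-fast behavior concrete, here is a rough sketch of the kind of check a gateway might run before anything gets published downstream. The payload shape, field names, and messages are assumptions, chosen to line up with the hypothetical 400 response shown later in this article.

final case class ApiError(code: Int, message: String)

object DataGatewayValidation {
  private val Iso8601Pattern = """\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.*"""

  // Validate a flat payload at the edge. Returning Left stops corrupt data
  // from ever reaching the stream and gives the producer an actionable error.
  def validate(payload: Map[String, String]): Either[ApiError, Map[String, String]] = {
    val userId = payload.get("userId").map(_.trim).filter(_.nonEmpty)
    val timestamp = payload.get("timestamp").filter(_.matches(Iso8601Pattern))

    (userId, timestamp) match {
      case (Some(_), Some(_)) => Right(payload)
      case (None, _)          => Left(ApiError(400, "The event data is missing the userId."))
      case (_, None)          => Left(ApiError(400, "The timestamp is invalid (expected a string with ISO8601 formatting)."))
    }
  }
}

Only a Right result would be handed off to the stream producer; a Left gets rendered back to the caller as an error response like the one shown below.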

Using Error Messages to Provide Self-Service Solutions

The difference between a good and a bad experience comes down to how much effort is required to pivot from bad to good. We’ve all probably worked with, or on, or heard of, services that just fail with no rhyme or reason (a null pointer exception throwing a random 500).

For establishing basic trust, a little bit goes a long way. For example, getting back an HTTP 400 from an API endpoint with the following message body (seen below)

{
  "error": {
    "code": 400,
    "message": "The event data is missing the userId, and the timestamp is invalid (expected a string with ISO8601 formatting). Please view the docs at http://coffeeco.com/docs/apis/customer/order#required-fields to adjust the payload."
  }
}

provides a reason for the 400, and empowers engineers sending data to us (as the service owners) to fix a problem without setting up a meeting, blowing up the pager, or hitting up everyone on Slack. When you can, remember that everyone is human, and we love closed-loop systems!

Pros and Cons of the API for Data

This API approach has its pros and cons.

The pros are that most programming languages work out of the box with HTTP (or HTTP/2) transport protocols — or can with the addition of a tiny library — and JSON is just about as universal a data exchange format as they come these days.

On the flip side (cons), one can argue that for any new data domain, there is yet another service to write and manage, and without some form of API automation, or adherence to an open specification like OpenAPI, each new API route (endpoint) ends up taking more time than necessary.

In many cases, failure to update data ingestion APIs in a “timely” fashion, compounding issues with scaling and/or API downtime, random failures, or just people not communicating provides the necessary rationale for folks to bypass the “stupid” API and instead attempt to publish event data directly to Kafka. While APIs can feel like they are getting in the way, there is a strong argument for keeping a common gatekeeper, especially after data quality problems like corrupt events, or accidentally mixed events, begin to destabilize the streaming dream.

To flip this problem on its head (and remove it almost entirely), good documentation, change management (CI/CD), and general software development hygiene, including actual unit and integration testing, enable fast feature and iteration cycles that don’t reduce trust.

Ideally, the data itself (schema/format) could dictate the rules of its own data-level contract by enabling field-level validation (predicates), producing helpful error messages, and acting in its own self-interest. Hey, with a little route- or data-level metadata, and some creative thinking, the API could automatically generate self-defining routes and behavior.
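A speculative sketch of that idea, with entirely hypothetical field names and rules: each field carries its own predicate and error hint as metadata, so the API (or a code generator) could derive both the validation and the error messages directly from the schema.

// Metadata-driven, field-level validation: the schema dictates its own contract.
final case class FieldRule(name: String, required: Boolean, predicate: String => Boolean, hint: String)

object SelfDescribingContract {
  val rules: Seq[FieldRule] = Seq(
    FieldRule(name = "userId", required = true, predicate = (s: String) => s.trim.nonEmpty, hint = "must be a non-empty string"),
    FieldRule(name = "timestamp", required = true, predicate = (s: String) => s.matches("""\d{4}-\d{2}-\d{2}T.*"""), hint = "must be ISO8601 formatted"),
    FieldRule(name = "eventType", required = false, predicate = (s: String) => s.forall(_.isLetter), hint = "must contain only letters")
  )

  // Collect every violation so the producer gets one helpful message rather
  // than a generic failure.
  def validate(payload: Map[String, String]): Seq[String] =
    rules.flatMap { rule =>
      payload.get(rule.name) match {
        case None if rule.required                 => Some(s"${rule.name} is missing")
        case Some(value) if !rule.predicate(value) => Some(s"${rule.name} ${rule.hint}")
        case _                                     => None
      }
    }
}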

Lastly, gateway APIs can be seen as centralized troublemakers, since each failure by an upstream system to emit valid data (e.g. blocked by the gatekeeper) causes valuable information (event data, metrics) to be dropped on the floor. The problem of blame here also tends to go both ways, as a bad deployment of the gatekeeper can blind an upstream system that isn’t set up to handle retries in the event of gateway downtime (even if only for a few seconds).

Putting aside all the pros and cons, using a gateway API to stop the propagation of corrupt data before it enters the data platform means that when a problem occurs (because they always do), the surface area of the problem is reduced to a given service. This sure beats debugging a distributed network of data pipelines, services, and the myriad final data destinations and upstream systems just to figure out that bad data is being published directly by “someone” at the company.

If we were to cut out the middleman (the gateway service), then the capability to govern the transmission of “expected” data falls into the lap of “libraries,” in the form of specialized SDKs.

SDKs are libraries (or micro-frameworks) that are imported into a codebase to streamline an action, activity, or otherwise complex operation. They are also known by another name: clients. Take the example from earlier about using good error messages and error codes. That process is necessary to inform a client that its prior action was invalid, but it can be advantageous to add appropriate guardrails directly into an SDK to reduce the surface area of any potential problems. For example, let’s say we have an API set up to track customers’ coffee-related behavior through event tracking.

Reducing User Error with SDK Guardrails

A client SDK can theoretically include all the tools necessary to manage the interactions with the API server, including authentication and authorization; as for validation, if the SDK does its job, the validation issues go out the door. The following code snippet shows an example SDK that could be used to reliably track customer events.

import com.coffeeco.data.sdks.client._
import com.coffeeco.data.sdks.client.protocol._

Customer.fromToken(token)
  .track(
    eventType = Events.Customer.Order,
    status = Status.Order.Initialized,
    data = Order.toByteArray
  )

With some additional work (aka the client SDK), the problem of data validation or event corruption can just about go away entirely. Additional problems can be managed within the SDK itself, for example how to retry sending a request when the server is offline. Rather than having all requests retry immediately, or in some loop that floods a gateway load balancer indefinitely, the SDK can take smarter actions like employing exponential backoff. See “The Thundering Herd Problem” for a dive into what goes wrong when things go, well, wrong!
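For reference, here is a bare-bones sketch of what exponential backoff with jitter can look like inside such an SDK. The attempt count, delays, and the sendRequest call in the usage comment are hypothetical; a production client would also distinguish retryable from non-retryable failures.

import scala.util.{Failure, Random, Success, Try}

object RetryWithBackoff {
  // Retry op up to maxAttempts times, doubling the delay each attempt and
  // adding random jitter so a fleet of clients doesn't retry in lockstep.
  def retry[T](maxAttempts: Int, baseDelayMs: Int)(op: () => Try[T]): Try[T] = {
    def loop(attempt: Int): Try[T] =
      op() match {
        case success @ Success(_)                                => success
        case failure @ Failure(_) if attempt >= maxAttempts - 1  => failure
        case Failure(_) =>
          val delayMs = (baseDelayMs.toLong << attempt) + Random.nextInt(baseDelayMs)
          Thread.sleep(delayMs)
          loop(attempt + 1)
      }
    loop(0)
  }
}

// Hypothetical usage: wrap the SDK's underlying HTTP call.
// RetryWithBackoff.retry(maxAttempts = 5, baseDelayMs = 100)(() => Try(sendRequest(event)))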


