The Docker Compose of ETL: Meerschaum Compose | by Bennett Meares | Jun, 2023


An exampe Meerschaum Compose file.
An example Meerschaum Compose project for ETL on weather data.

Pipes

The humble pipe is Meerschaum’s abstraction for incremental ETL. Pipes have input and output connectors and store parameters to configure the behavior of their syncing processes. This may be as simple as a SQL query or may include custom keys for use in your plugins.

Meerschaum pipes created by Meerschaum Compose.
Pipes from the above Compose project displayed by the web UI

Because pipes’ metadata are stored alongside their tables, they’re easily editable (whether via edit pipes or on the web UI), which facilitates prototyping. But this dynamic nature introduces the same problem described at the beginning of this article: in order to scale development, a Compose file is needed to define a project’s components in a way that can be easily version-controlled.

According to the Meerschaum Compose specification, pipes are defined in a list under the keys sync:pipes. Each item defines the keys and parameters needed to construct the pipe, like a blueprint for what you expect the pipes in the database to reflect.

For example, the following snippet would define a pipe that would sync a table weather from a remote PostgreSQL database (defined below as sql:source) to a local SQLite file (sql:dest in this project).

sync:
pipes:
- connector: "sql:source"
metric: "weather"
target: "weather"
columns:
datetime: "timestamp"
station: "station"
parameters:
fetch:
backtrack_minutes: 1440
query: |-
SELECT timestamp, station, temperature
FROM weather

config:
meerschaum:
instance: "sql:dest"
connectors:
sql:
source: "postgresql://user:pass@host:5432/db"
dest: "sqlite:////tmp/dest.db"

This example would incrementally update a table named weather using the datetime axis timestamp for range bounding (1 day backtracking), and this column plus the ID column station together would make up a composite primary key used for de-duplication.

The URI is written literally just as an example; if you are committing a compose file, either reference an environment variable (e.g. $SECRET_URI) or your host Meerschaum configuration (e.g. MRSM{meerschaum:connectors:sql:source}).

Connectors

First, a quick refresher on Meerschaum connectors: you can define connectors through several ways, the most popular of which being through environment variables. Suppose you define your connection secrets in an environment file:

export MRSM_SQL_REMOTE='postgresql://user:pass@host:5432/db'
export MRSM_FOO_BAR='{
"user": "abc",
"password": "def"
}'

The first environment variable MRSM_SQL_REMOTE would define the connector sql:remote. If you sourced this file, you could verify this connector with the command mrsm show connectors sql:remote.

The second variable is an example of how to define a custom FooConnector, which you could create using the @make_connector decorator in a plugin. Custom connectors are a powerful tool, but for now, here’s the basic structure:

from meerschaum.connectors import make_connector, Connector

@make_connector
class FooConnector(Connector):
REQUIRED_ATTRIBUTES = ['username', 'password']

def fetch(pipe, **kwargs):
docs = []
return docs

So we’ve just reviewed how to define connectors in our host environment. Let’s see how to make these host connectors available in a Meerschaum project. In the compose file, all of the connectors we need for our project are defined under config:meerschaum:connectors. Use the MRSM{} syntax to reference the keys from your host environment and pass them into the project.

config:
meerschaum:
instance: "sql:app"
connectors:
sql:
app: MRSM{meerschaum:connectors:sql:remote}
foo:
bar: MRSM{meerschaum:connectors:foo:bar}

Plugins

Meerschaum is easily extendable via plugins, which are Python modules. Plugins may fetch data, implement custom connectors, and/or extend Meerschaum (e.g. custom actions, flags, API endpoints, etc.).

Meerschaum supports multiple plugins directories (via MRSM_PLUGINS_DIR), which may be set under the plugins_dir key in mrsm-compose.yaml (the default is a directory plugins).

Storing your plugins within a Compose project makes it clear how you expect your plugins to be used. For example, the Compose file within the MongoDBConnector project demonstrates how the custom connector is used as both a connector and as an instance.

Package Management

When you first start using Meerschaum Compose, the first thing you’ll notice is that it will start installing a fair amount of Python packages. Don’t worry about your environment ― everything is installed into virtual environments within your project’s root subdirectory (a bit ironic, right?). You can install your plugins’ dependencies with mrsm compose init.

To share packages between projects, set the key root_dir in mrsm-compose.yml to a new path. Deleting this root directory will effectively uninstall all of the packages that Compose downloaded, keeping your host environment intact.



Source link

Leave a Comment