Real-time data pipelines with Python: Bytewax cheatsheet

This cheatsheet provides an overview of the key concepts and operators in Bytewax, helping you to quickly grasp its features and integrate them into your workflows effectively.

Bytewax Cheatsheet (1).gif

About Bytewax

Bytewax is an open-source, Python-native streaming framework designed for building and managing distributed data processing applications.

At its core, Bytewax operates within a cluster architecture, where multiple Workers independently process different segments of the data stream. This setup facilitates efficient parallel processing across multiple nodes, enhancing scalability and performance. Bytewax also includes robust recovery mechanisms, allowing dataflows to resume from the last checkpoint in case of failure, ensuring consistent and reliable processing.

Additionally, Bytewax’s partitioning capabilities ensure that related data elements are processed by the same Worker, which is crucial for maintaining state consistency in stateful operations.

It is built on Rust's timely dataflow and provides a Python API, enabling developers to build real-time dataflows while leveraging the Python ecosystem.

To get started, you can install Bytewax via the command:

pip install bytewax

Let's get started with a simple dataflow.

Structuring Your Pipeline with Bytewax Dataflow, Operators and Connectors

Dataflows in Bytewax are structured as Directed Acyclic Graphs (DAGs), with each node representing a processing step. This structure is key to defining the sequence of operations that transform input data into the desired output.

Bytewax connectors enable seamless integration with external systems, allowing for efficient data ingestion and output. This example uses a standard input/output connector, but you can replace it with specific connectors like Kafka or files.

Operators in Bytewax define the logic applied to data as it flows through the pipeline. Operators can be stateful or stateless.

Copy of introducing the Azure AI Search Bytewax sink.png

When combined, these elements enable the creation of real-time end-to-end processes that allow you to transform raw data, and store it and serve it in real time.

Sample dataflow

The dataflow below will set up a simple Bytewax dataflow that processes a list of strings by converting each string to uppercase.

from bytewax.dataflow import Dataflow
import bytewax.operators as op
from bytewax.testing import TestingSource
from bytewax.connectors.stdio import StdOutSink
from bytewax.testing import run_main

# Create and configure the Dataflow
flow = Dataflow("upper_case")

# Input source for dataflow
inp = op.input("inp",flow, TestingSource(["apple", "banana", "cherry"]))

# Define your data processing logic
def process_data(item):
    return item.upper()

# Apply processing logic
out = op.map("process", inp, process_data)

# Output the results to stdout through an StdOutSink, 
# the StdOutSink can easily be changed to an output 
# such as a database through the use of connectors 
op.output("out", out, StdOutSink())

This dataflow can be executed as follows, assuming it is stored in a file named dataflow.py:

python -m bytewax.run dataflow:flow

You can visualize its mermaid graph as follows:

python -m bytewax.visualize dataflow:flow

TestingSource, and Bytewax source input connectors in general, use "lazy" loading behind the scenes through the use of generators, optimizing performance.

The dataflow can also be rewritten using lambda functions:

# Create and configure the Dataflow
flow = Dataflow("stateless_example")

# Input source for dataflow
inp = op.input("inp", flow, TestingSource(["apple", "banana", "cherry"]))

# Apply stateless mapping
out = op.map("uppercase", inp, lambda x: x.upper())

op.output("out", out, StdOutSink())

Deployment

Bytewax allows for easy deployment and management of streaming applications across various environments, from local setups to cloud-based deployments.

https://docs.bytewax.io/stable/guide/deployment/waxctl.html

Bytewax enables you to deploy and manage your dataflows through a CLI.

Waxctl allows you to manage the entire dataflow program lifecycle which includes these phases:

Deployment
Getting Status
Modification
Deletion

waxctl dataflow --help
Manage dataflows in Kubernetes.

Usage:
  waxctl dataflow [command]

Aliases:
  dataflow, df

Available Commands:
  delete      delete a dataflow
  deploy      deploy a dataflow to Kubernetes creating or upgrading it resources
  list        list dataflows deployed

Flags:
  -h, --help   help for dataflow

Global Flags:
      --debug   enable verbose output

Use "waxctl dataflow [command] --help" for more information about a command.

Bytewax also provides a platform to manage dataflows, and the platform may be deployed locally, on cloud and even on a Raspberry Pi.

Conclusion

Bytewax is a framework designed for building real-time data processing pipelines using Python. By integrating with the Python ecosystem and leveraging Rust's Timely Dataflow, Bytewax provides both flexibility and performance for a wide range of data processing tasks.

The framework offers a clear and intuitive API, a diverse set of operators, and deployment tools like waxctl to facilitate the development, deployment, and management of real-time dataflows. Bytewax is suitable for handling both straightforward data streams and more complex distributed workflows, making it a versatile option for various use cases.

In our next blog, we will provide an in-depth overview of Bytewax operators and windowing.