18 changes: 11 additions & 7 deletions content/embeds/rdi-when-to-use.md
RDI is a good fit when:
- Your app can tolerate *eventual* consistency of data in the Redis cache.
- You want a self-managed solution or AWS based solution.
- The source data changes frequently in small increments.
- The source database has no more than 10K changes per second.
- RDI throughput during [full sync]({{< relref "/integrate/redis-data-integration/data-pipelines#pipeline-lifecycle" >}})
stays below 30K records per second, assuming an average record size of 1KB and a pipeline without transformations.
- RDI throughput during [CDC]({{< relref "/integrate/redis-data-integration/data-pipelines#pipeline-lifecycle" >}})
stays below 10K records per second, assuming an average record size of 1KB and a pipeline without transformations.
- The total data size is no larger than 100GB, so a full sync completes in under an hour without exceeding the throughput
limits above.
- You don’t need to join data from several tables into a
  [nested Redis JSON object]({{< relref "/integrate/redis-data-integration/data-pipelines/data-denormalization#joining-one-to-many-relationships" >}}).
- RDI supports the [data transformations]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples" >}}) you need for your app.
- Your data caching needs are too complex or demanding to implement and maintain yourself.
- Your database administrator has reviewed RDI's requirements for the source database and
confirmed that they are acceptable.

{{< note >}}The throughput and data-size limits above assume the
[classic processor]({{< relref "/integrate/redis-data-integration/architecture/classic-vs-flink" >}}).
The Flink processor (currently in Preview) roughly doubles each limit.{{< /note >}}
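As a quick sanity check, the 100GB guidance follows from the full-sync throughput limit. A rough model with the numbers above (classic processor, average 1KB records, no transformations):

```python
# Rough full-sync duration model for the limits above (illustrative only):
# 100 GB of data as 1 KB records, streamed at the 30K records/s full-sync limit.
total_bytes = 100 * 1024**3            # 100 GB
record_bytes = 1024                    # average 1 KB record
records = total_bytes // record_bytes  # 104,857,600 records
minutes = records / 30_000 / 60        # ≈ 58 minutes

assert records == 104_857_600
assert minutes < 60                    # full sync completes in under an hour
```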

### When not to use RDI

RDI is not a good fit when:
categories:
description: Discover the main components of RDI
group: di
headerRange: '[2]'
hideListLinks: false
linkTitle: Architecture
summary: Redis Data Integration keeps Redis in sync with the primary database in near
real time.
It includes:
and exports them as [Prometheus](https://prometheus.io/) metrics.

The *data plane* contains the processes that actually move the data.
It includes the *CDC collector* and the *stream processor* that implement
the two phases of the pipeline lifecycle (initial cache loading and change streaming).

The *management plane* provides tools that let you interact
with the control plane.

- Use the CLI tool to install and administer RDI and to deploy
and manage a pipeline.
- Use the pipeline editor included in Redis Insight to design
or edit a pipeline.

The diagram below shows all RDI components and the interactions between them:

{{< image filename="images/rdi/ingest/ingest-control-plane.webp" >}}

## Stream processor implementations

RDI provides two implementations of the stream processor, *classic* and
*Flink*. You select the implementation per pipeline through the
[`processors.type`]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config#processors" >}})
property in `config.yaml`. The default is `classic`, so existing pipelines
keep their behavior unchanged.
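In `config.yaml`, the selection is a single property (a minimal sketch; all other pipeline sections are omitted here):

```yaml
processors:
  type: flink   # or "classic" (the default)
```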

See
[Differences between the classic and Flink processors]({{< relref "/integrate/redis-data-integration/architecture/classic-vs-flink" >}})
for a side-by-side comparison and
[Migrate from the classic processor to the Flink processor]({{< relref "/integrate/redis-data-integration/installation/migration-classic-to-flink" >}})
for guidance on migrating an existing pipeline to the Flink processor.

## VM and Kubernetes deployments

The following sections describe the VM and Kubernetes configurations you can use to
deploy RDI.

### RDI on your own VMs

For this deployment, you must provide two VMs. The collector and stream processor
are active on one VM, while on the other they are in standby to provide high availability.
The two operators running on both VMs use a leader election algorithm to decide which
VM is the active one (the "leader").
The diagram below shows this configuration:

on [Kubernetes (K8s)](https://kubernetes.io/), including Red Hat

- A K8s [namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) named `rdi`.
You can also use a different namespace name if you prefer.
- [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) and
[services](https://kubernetes.io/docs/concepts/services-networking/service/) for the
[RDI operator]({{< relref "/integrate/redis-data-integration/architecture#how-rdi-is-deployed" >}}),
[metrics exporter]({{< relref "/integrate/redis-data-integration/observability" >}}), and API server.
- A [service account](https://kubernetes.io/docs/concepts/security/service-accounts/)
and [RBAC resources](https://kubernetes.io/docs/reference/access-authn-authz/rbac) for the RDI operator.
- A [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) with RDI database details.
- [Secrets](https://kubernetes.io/docs/concepts/configuration/secret/)
with the RDI database credentials and TLS certificates.
- Other optional K8s resources such as [ingresses](https://kubernetes.io/docs/concepts/services-networking/ingress/)
that can be enabled depending on your K8s environment and needs.

See [Install on Kubernetes]({{< relref "/integrate/redis-data-integration/installation/install-k8s" >}})
for more information.

### Secrets and security considerations

The credentials for the database connections, as well as the certificates
for [TLS](https://en.wikipedia.org/wiki/Transport_Layer_Security) and
[mTLS](https://en.wikipedia.org/wiki/Mutual_authentication#mTLS) are saved in K8s secrets.
RDI stores all state and configuration data inside the Redis Enterprise cluster
and does not store any other data on your RDI VMs or anywhere else outside the cluster.
---
Title: Differences between the classic and Flink processors
alwaysopen: false
categories:
- docs
- integrate
- rs
- rdi
description: Compare the classic and Flink stream processor implementations.
group: di
linkTitle: Classic vs. Flink processor
summary: Redis Data Integration keeps Redis in sync with the primary database in near
real time.
type: integration
weight: 10
---

RDI ships with two stream processor implementations. Both consume the same
source streams, share the same job-level configuration model, and write to
the same Redis target, but they differ in architecture, supported features,
configuration, observability, error handling, and performance.

This page summarizes those differences. See
[Which processor should I use?]({{< relref "/integrate/redis-data-integration/faq#which-processor-should-i-use" >}})
in the FAQ for the recommendation, and
[Migrate from the classic processor to the Flink processor]({{< relref "/integrate/redis-data-integration/installation/migration-classic-to-flink" >}})
for a step-by-step migration guide.

## At a glance

| Aspect | Classic processor | Flink processor |
|---|---|---|
| Implementation | Python | Java on top of [Apache Flink](https://flink.apache.org/) |
| Deployment targets | VM and Kubernetes | Kubernetes only |
| Scaling | Single replica | Horizontal: TaskManager replicas × task slots per TaskManager |
| Fault tolerance | Source-stream consumer-group replay | Source-stream consumer-group replay plus Flink checkpointing |
| Supported `data_type` outputs | `hash`, `json`, `set`, `sorted_set`, `stream`, `string` | `hash`, `json` |
| Metrics endpoint | `rdi-metrics-exporter` service | Flink JobManager `/metrics` (no metrics exporter) |
| Metric naming | `rdi_*` (e.g., `rdi_incoming_entries`) | `flink_*` (e.g., `flink_jobmanager_job_operator_coordinator_stream_type_rdiRecords`) |
| End-to-end latency | Bounded by the per-batch read-process-write cycle | Records flow through pipelined operator chains without a per-batch barrier |
| Snapshot throughput | Limited by single shared reader and writer | Parallelized across all task slots |
| Expression and `redis.lookup` result caching | Not supported | Optional, opt-in per transformation |

## Architecture and deployment

The classic processor runs as a single pod managed by the operator
and can be deployed on either VMs or Kubernetes through the RDI Helm
chart.

The Flink processor runs as an Apache Flink application cluster: one
JobManager pod plus one or more TaskManager pods. Source,
transformation, and sink operators run as parallel subtasks across
all task slots in the cluster. The Flink processor scales
horizontally by changing the number of TaskManager replicas
(`advanced.resources.taskManager.replicas`); with adaptive
parallelism, the default parallelism is the product of TaskManager
replicas and task slots per TaskManager. The Flink processor
currently runs on Kubernetes only; VM support is planned for a future
release.
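For example, the default-parallelism formula above works out as follows (the values are illustrative, not defaults):

```python
# Default parallelism with adaptive parallelism:
# TaskManager replicas × task slots per TaskManager.
taskmanager_replicas = 3        # advanced.resources.taskManager.replicas
task_slots_per_taskmanager = 4  # illustrative value
default_parallelism = taskmanager_replicas * task_slots_per_taskmanager
assert default_parallelism == 12
```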

Both processors retain at-least-once delivery semantics; the Flink
processor adds Flink checkpointing on top of the shared
consumer-group replay mechanism.

See
[Configure the Flink processor]({{< relref "/integrate/redis-data-integration/installation/install-k8s#configure-the-flink-processor" >}})
for the Helm settings.

## Configuration

The two processors share the same `config.yaml` envelope and the same
`connections`, `sources`, `targets`, and `jobs` sections. The only
differences are inside the `processors:` block. You select the implementation via
`processors.type` (`classic` or `flink`; the default is `classic`). Properties
that apply to only one implementation are annotated with
**Classic processor only.** or **Flink processor only.** in the
[pipeline configuration reference]({{< relref "/integrate/redis-data-integration/data-pipelines/pipeline-config#processors" >}}),
and are silently ignored by the other implementation. The Flink
processor exposes additional fine-grained tuning under
`processors.advanced.*`.
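For example, a `processors:` block that opts a pipeline into the Flink implementation might look like this (a sketch; the properties available under `advanced` are listed in the pipeline configuration reference):

```yaml
processors:
  type: flink   # "classic" (default) or "flink"
  # Flink-only tuning lives under processors.advanced.*; properties that
  # belong to the other implementation are silently ignored.
```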

## Supported output formats

The classic processor supports all `data_type` values: `hash`, `json`,
`set`, `sorted_set`, `stream`, and `string`. The Flink processor
currently supports only `hash` and `json`. Pipelines that use any other
output type must remain on the classic processor or rewrite the
affected jobs. Support for the remaining output types is planned for a
future release.

## Transformation extensions

The two processors support the same set of transformation blocks
(`filter`, `map`, `add_field`, `remove_field`, `rename_field`,
`redis.lookup`) and the same expression languages (JMESPath and SQL).
Pipelines written for one processor generally execute on the other
without changes.

The Flink processor adds three optional, performance-oriented
extensions that are not available with the classic processor:

- **Expression result caching** through a per-expression `cache:`
block on `filter`, `map`, `add_field`, and `redis.lookup` arguments.
- **`redis.lookup` result caching** through a `lookup_cache:` block.
- **`redis.lookup` batching**, which groups lookups into a single
Redis pipeline. Batching is enabled by default with sensible
defaults; the optional `batch:` block lets you override them.
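As an illustration, the opt-in blocks might appear in a job's transformation along these lines. This is a sketch only: the placement and every property inside the `cache:`, `lookup_cache:`, and `batch:` blocks here are assumptions, so see the linked pages for the real property lists:

```yaml
transform:
  - uses: redis.lookup
    with:
      # ...lookup arguments...
      cache:          # Flink only: cache expression results (properties illustrative)
        enabled: true
      lookup_cache:   # Flink only: cache lookup results (properties illustrative)
        enabled: true
      batch:          # Flink only: override the default lookup batching
        enabled: true
```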

See
[Caching expression results]({{< relref "/integrate/redis-data-integration/data-pipelines/transform-examples/caching-expression-results" >}})
for examples and
[`redis.lookup`]({{< relref "/integrate/redis-data-integration/reference/data-transformation/lookup" >}})
for the full property list.

## Metrics

The two processors expose different Prometheus metric sets and use
different naming schemes, so dashboards and alerts cannot be reused
as-is between them. The classic processor exposes its metrics through
the `rdi-metrics-exporter` service. The Flink processor emits metrics
directly from the JobManager and TaskManager pods through Flink's
native Prometheus reporter; no metrics exporter is deployed.

See
[Observability — Flink processor metrics]({{< relref "/integrate/redis-data-integration/observability#flink-processor-metrics" >}})
for the customer-facing list of metrics.
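Prometheus scrape jobs for the two processors could be set up along these lines (a sketch: the target names follow the description above, but both port numbers are assumptions; 9249 is Flink's default Prometheus reporter port):

```yaml
scrape_configs:
  # Classic processor: scrape the metrics exporter service.
  - job_name: rdi-classic
    static_configs:
      - targets: ["rdi-metrics-exporter.rdi:9121"]   # port is illustrative
  # Flink processor: scrape the JobManager /metrics endpoint directly.
  - job_name: rdi-flink
    metrics_path: /metrics
    static_configs:
      - targets: ["<jobmanager-host>:9249"]          # Flink's default reporter port
```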

## Error handling and DLQ

Both processors implement a dead-letter queue (DLQ) at
`dlq:{stream_name}` and honor the same top-level `error_handling`
(`dlq` or `ignore`) and `dlq_max_messages` properties. The Flink
processor surfaces a few corner cases as DLQ entries that the classic
processor logs and skips (for example, missing parent
keys in nested writes and exceptions thrown by `when` expressions on
`redis.lookup`). The DLQ entry field set and value encoding also
differ: the classic processor uses Python-stringified values,
while the Flink processor uses JSON.
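If you consume DLQ entries produced by both processor types, the encoding difference matters in practice. A minimal sketch of a tolerant parser (`parse_dlq_value` is a hypothetical helper, not part of RDI):

```python
import ast
import json

def parse_dlq_value(raw: str):
    """Parse a DLQ entry value that may be JSON (Flink processor)
    or a Python-stringified literal (classic processor)."""
    try:
        return json.loads(raw)        # Flink: proper JSON
    except json.JSONDecodeError:
        return ast.literal_eval(raw)  # classic: Python repr, e.g. single quotes

# Flink-style value
assert parse_dlq_value('{"id": 1, "name": "Ada"}') == {"id": 1, "name": "Ada"}
# Classic-style value
assert parse_dlq_value("{'id': 1, 'name': 'Ada'}") == {"id": 1, "name": "Ada"}
```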

## Performance

The Flink processor delivers significantly higher throughput during
the initial snapshot and lower end-to-end latency in steady state.
The classic processor uses a sequential read-process-write batching
cycle, so each record waits for its batch to complete before being
written to the target. The Flink processor pipelines records through
operator chains without a per-batch barrier, and parallelizes work
across all task slots, which both lowers per-record latency and
raises throughput.
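The latency difference can be seen with a back-of-envelope model (a toy model with illustrative numbers, not measurements): in a batched cycle a record waits for its whole batch, while a pipelined record incurs only its own processing time.

```python
# Worst-case added latency: batched cycle vs. pipelined flow (toy model).
batch_size = 1_000
per_record_us = 100                                  # 0.1 ms per record

batched_worst_case_us = batch_size * per_record_us   # record waits for the batch
pipelined_us = per_record_us                         # record flows straight through

assert batched_worst_case_us == 100_000              # 100 ms
assert batched_worst_case_us // pipelined_us == batch_size
```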

The Flink processor has a larger baseline memory footprint (JVM plus
Flink runtime overhead per TaskManager) but, for most pipelines, the
performance gains and the additional features (horizontal scaling, caching)
outweigh that cost.