1. Overview & Use Cases
1.1 What is Milvus CDC
Milvus CDC (Change Data Capture) is the built-in cross-cluster replication mechanism in Milvus 2.6. It subscribes to the source cluster’s WAL (Write-Ahead Log) and forwards changes (DDL / DML / DCL) in order to one or more target clusters, building a near-real-time primary-standby topology.
Important: 2.5 vs 2.6 distinctionMilvus 2.6 CDC is not the same component as the standalone zilliztech/milvus-cdc service used with 2.5. The 2.6 implementation is a complete rewrite:
- From a standalone external service to a built-in module inside the Milvus main repo (
internal/cdc/). - From tightly-coupled msgstream consumption to a log forwarder on top of the new Streaming Service WAL.
- From HTTP task-management APIs to a single unifying RPC:
UpdateReplicateConfiguration.
See Section 10 for full comparison.
1.2 Core Properties
| Property | Detail |
|---|---|
| Real-time | WAL-driven subscription; source-to-target visibility is sub-second to seconds in steady state. |
| Consistency | Relies on strictly monotonic WAL TimeTick plus Target-side ReplicateCheckpoint filtering. Provides exactly-once delivery semantics. |
| Resilience | Supports checkpoint-based resume. After CDC restart or RPC failure, replication resumes from the latest checkpoint with no loss or duplication. |
| Extensibility | Single-primary multi-standby star topology supported. New MessageType additions in Milvus require no CDC changes. |
1.3 Use Cases
| Use Case | Description | Status |
|---|---|---|
| Primary-standby DR | One primary writer + one or more hot read replicas. Switch traffic on failure. | Primary focus |
| Multi-region DR | Star topology: one primary fan-out to geographically distributed standbys. | Supported |
| Data fan-out | Production cluster → multiple analytical / test clusters. | Supported |
| Workload isolation | Online write cluster → read-only analytical cluster. | Supported |
| Historical data migration | Requires the Backup Tool for the full bootstrap; CDC handles the incremental tail. | Requires Backup |
| Bidirectional active-active | Multiple clusters accepting writes simultaneously. | Not supported |
1.4 Version Requirement
- Milvus v2.6.16 or later required for production. Key features (force_promote / Data Salvage / GetReplicateConfiguration / Coordinated Failover) were merged across several patch versions through this baseline.
- Milvus Operator v1.3.4 or later for Kubernetes deployment.
- Message store: Woodpecker, Pulsar, or Kafka are all supported by Milvus 2.6. CDC itself does not require any particular choice; pick what your infrastructure standard dictates.
2. Architecture & Design Philosophy
2.1 Overall Architecture
| Role | Responsibility |
|---|---|
| Source Cluster (Primary) | Source of truth. Handles user DDL / DCL / DML. Writes land in the source WAL via Streaming Nodes. Read-write. |
| CDC Service | Replication engine. A Controller watches the source cluster’s ETCD for ReplicateConfiguration changes; per-pchannel Channel Replicators subscribe to the source WAL and forward messages to the target via a bidirectional ReplicateStream RPC. Deployment principle: CDC is co-located with the Source Milvus cluster (same Kubernetes cluster), minimizing the network cost of ETCD watching and WAL consumption. |
| Target Cluster (Standby) | Replication target. Receives forwarded messages from CDC and writes them into its own WAL. Reads are served; direct writes are rejected while in standby state to prevent split-brain. |
| SDK / Control Plane | Sends UpdateReplicateConfiguration RPCs to source and target clusters to manage the replication topology. |
flowchart LR
subgraph Source["Source Cluster Primary"]
direction TB
SMS["MilvusProxy / StreamingNode"]
SETCD[("ETCD meta")]
SWAL[("WAL")]
SCDC["CDC ServiceWatcher + Replicator"]
SMS -->|save config| SETCD
SMS -->|produce| SWAL
SETCD -.->|watch| SCDC
SWAL -->|sub| SCDC
end
subgraph Target["Target Cluster Standby"]
direction TB
TMS["MilvusProxy / StreamingNode"]
TETCD[("ETCD")]
TWAL[("WAL")]
TCDC["CDC Servicestandby"]
TMS -->|save config| TETCD
TMS -->|produce| TWAL
end
SDK(["SDK / Console"]) -->|UpdateReplicateConfiguration| SMS
SDK -->|UpdateReplicateConfiguration| TMS
SCDC ==>|ReplicateStream RPC pub| TMS
TMS -->|append| TWAL
style Source fill:#e6f4ea,stroke:#1a7f37
style Target fill:#dbeafe,stroke:#1f6feb
style SCDC fill:#fff3bf,stroke:#d4a72c
Why install CDC on both sides?
- Single primary + single standby (A↔B): after a switchover or coordinated failover, the standby becomes the new primary and starts replicating reverse direction; its CDC pod, previously idle, now has to do work. If you didn’t pre-install CDC on the standby, replication will not resume after switchover.
- Single primary + multiple standbys (A→B, A→C): behavior splits by failover mode.
- Graceful / Coordinated (Fence Message reaches all standbys): every standby’s CDC is rewired automatically — C’s
secondaryStateswitches source from A to B and resumes from the Fence position in B’s WAL. No rebuild needed. See §6.2 for the mechanism. - Force Failover (A unreachable, no Fence): C never received a sync barrier, its state may diverge from B in either direction, and there’s no general reconciliation path. C must be rebuilt against the new primary. This matches AWS Aurora Global Database’s stance — once the original primary is gone unceremoniously, sibling secondaries are inconsistent and must be re-seeded.
- Graceful / Coordinated (Fence Message reaches all standbys): every standby’s CDC is rewired automatically — C’s
So: install CDC on every cluster that might become primary, and plan rebuilds for siblings.
Minimal Milvus CR (Kubernetes Custom Resource managed by the Milvus Operator) with CDC enabled:
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
name: source-cluster
spec:
mode: cluster
components:
image: milvusdb/milvus:v2.6.16
cdc:
replicas: 1 # CDC currently supports a single replica
dependencies:
msgStreamType: woodpecker # or pulsar / kafka
2.2 Design Philosophy: CDC as a Log Forwarder
Milvus 2.5 CDC was deeply coupled with the old msgstream layer—it owned TtMsg management and Collection / VChannel mapping logic. In 2.6 CDC is reduced to a pure log forwarder:
- Decoupled from Milvus internals. CDC has no knowledge of Collections, VChannels, or timestamp semantics.
- Simple. CDC’s only job is to subscribe to the source WAL in order and replay to the target in order.
- Extensible. When Milvus adds a new MessageType, CDC needs no change.
Enabling TtMsg replication: the key shift
In 2.5, to keep multi-stream ingest consistent, the legacy CDC disabled Target RootCoord’s TtMsg production and ran an in-CDC TtManager to inject TtMsgs. This made CDC inseparable from Milvus internals.
In 2.6 the new WAL guarantees strictly increasing TimeTick on write, so CDC no longer needs to fabricate TtMsgs—it simply forwards source messages and the WAL stamps timestamps automatically. Consequences:
- No CDC dependency on RootCoord behavior.
- Order consistency falls out of the WAL’s own write ordering.
- The CDC code path is light enough to debug and iterate quickly.
2.3 PChannel Mapping
When forwarding to the target, CDC routes each source PChannel to a target PChannel under a strict 1:1 mapping. The validator (pkg/util/replicateutil/config_validator.go) enforces this at every UpdateReplicateConfiguration:
- Equal counts: at any moment in time, every cluster in a
ReplicateConfigurationmust have the same number of pchannels (line 133–139). Heterogeneous configurations such as A=16 / B=24 are rejected outright. - Append-only growth: across successive
UpdateReplicateConfigurationcalls, each cluster’spchannelslist may grow but never shrink, and existing entries must remain at the same positions (line 271–283). When the new total exceeds the previous total, the validator marks the change asisPChannelIncreasing. - Topology-wide expansion: a growth operation applies to all clusters together — both source and target grow from N to N+k simultaneously in the same submission. You cannot grow A alone.
So “v2.6.12+ relaxes pchannel constraints” really means you can grow the whole topology together over time, not that you can run with mismatched counts. Mismatched counts have never been and are not supported.
flowchart LR
S0[("s-pchannel-0")] --> T0[("t-pchannel-0")]
S1[("s-pchannel-1")] --> T1[("t-pchannel-1")]
Sn[("s-pchannel-n")] --> Tn[("t-pchannel-n")]
style S0 fill:#e6f4ea
style S1 fill:#e6f4ea
style Sn fill:#e6f4ea
style T0 fill:#dbeafe
style T1 fill:#dbeafe
style Tn fill:#dbeafe
Default PChannel name convention: {cluster_id}-rootcoord-dml_{i} for i in 0..pchannel_num-1. The mapping is persisted in the Target’s Meta (ReplicatePChannelMeta) so failover, restart, and scale-out reuse it without manual reconstruction.
What controls the count: rootCoord.dmlChannelNum
The number of PChannels in a cluster is a cluster-level config (one number per Milvus cluster, not per collection):
# configs/milvus.yaml
rootCoord:
dmlChannelNum: 16 # The number of DML-Channels to create at root coord startup.
Default 16, defined in pkg/util/paramtable/component_param.go:1882, available since v2.0. The PChannel names you put into ReplicateConfiguration must list exactly these channels for the cluster ({cluster_id}-rootcoord-dml_0 through {cluster_id}-rootcoord-dml_{dmlChannelNum-1}).
Don’t confuse three related concepts:
| Concept | What it is | Controlled by |
|---|---|---|
| PChannel (physical) | DML message channels created at RootCoord startup. Cluster-wide. | rootCoord.dmlChannelNum (cluster config) |
| VChannel (virtual) | Per-collection shard; each shard maps to exactly one VChannel. | shardNum at create_collection time |
| Mapping | Multiple collections’ VChannels are hashed onto the fixed set of PChannels: hash(collection_id) mod dmlChannelNum | Implicit |
This is why CDC validates pchannel lists at the cluster level: one list per cluster, not per collection.
POC recommendationFor the initial POC, use the same pchannel count on both sides. This:
- Matches the design documents’ canonical configuration.
- Avoids triggering edge cases in the relaxed heterogeneous mode.
- Makes failure diagnosis simpler (1:1 channel correspondence in metrics and logs).
If pchannel-count flexibility ends up being required, request feature confirmation from the Milvus team before going to production.
2.4 ReplicateCheckpoint & Exactly-Once
In 2.5, CDC persisted a checkpoint on the CDC side after every successful message replication. Two issues followed:
- I/O overhead: every message paid a disk write.
- Duplicate-write risk: if a message landed in the Target but the RPC dropped before ack returned, CDC would re-send it, and the Target would write it again.
In 2.6 the ReplicateCheckpoint mechanism flips the responsibility — and the IO model changes entirely:
- Target owns the checkpoint: when a replicated message lands in the Target’s WAL, the Target tracks the corresponding
ReplicateCheckpoint(sourceMessageID+TimeTick) in its in-memory recovery state. - Persistence is batched, not per-message: a background task in
recovery_background_task.gowakes on apersistIntervalticker (default 10s, configurable) and flushes the dirty recovery snapshot in a single batch write to the StreamingNode catalog — vchannels + segment assignments + checkpoint go through one batch transaction. Hot-path writes per replicated message are just the WAL append on the target side, the same write the Target would do anyway for any local write. - WAL is the source of truth, catalog is a snapshot: even if the node crashes between two persist ticks, no data is lost — the WAL already has every message. On restart the Target replays its WAL to rebuild in-memory state and reads the most recently persisted checkpoint from the catalog as the resume anchor.
- Deduplication on resend: if CDC re-sends a message that already landed, the Target’s WAL interceptor sees
incoming.tt ≤ ReplicateCheckpoint.ttand drops it. No double-write. - CDC keeps no local checkpoint: at startup it calls
GetReplicateInfoon the Target to learn the resume position. The CDC side has zero checkpoint persistence — the whole 2.5-style overhead is gone.
Net effect on IO2.5 paid one CDC-side checkpoint write per replicated message (thousands per second under load). 2.6 pays one catalog batch write per persistInterval (~once per 10s by default), shared across multiple recovery state changes. The hot path becomes “append to target WAL, that’s it” — same as a normal local write, no extra checkpoint syscalls per message.
type WALCheckpoint struct {
MessageId *messagespb.MessageID
TimeTick uint64
// For Replicate messages, the source-side MessageID and TimeTick
ReplicateMessageId *messagespb.MessageID
ReplicateTimeTick uint64
}
Two-layer filtering — where deduplication actually happens
The ReplicateCheckpoint participates in two filters at different layers; together they give exactly-once. They are not redundant, they cover different failure modes.
| Layer | Where | What it does | What it covers |
|---|---|---|---|
| ① Source CDC — read-offset optimization | internal/cdc/replication/replicatemanager/channel_replicator.go:127–135 | At startup CDC calls GetReplicateInfo on Target to get the checkpoint, then subscribes the Source WAL with DeliverFilterTimeTickGT(cp.TimeTick). The WAL scanner skips messages whose tt ≤ cp.tt — CDC never even reads them, so it never forwards them. | CDC restart, scale-up, leader change — anything where CDC re-establishes a Scanner from scratch. |
| ② Target ReplicateInterceptor — idempotent fence | internal/streamingnode/server/wal/interceptors/replicate/replicates/impl.go:170–177 | When a message reaches the Target’s WAL write path, the interceptor compares incoming tt against the current ReplicateCheckpoint. If incoming.tt ≤ cp.tt (or < for txn-body messages) it returns IgnoreOperation and the message is dropped before append. | RPC drops between Target WAL append and ack arriving at CDC; CDC’s in-memory send buffer re-pushes a message already written. Source-side filtering can’t catch this — the message is already past the Source WAL read. |
Concretely:
- No-failure steady state: CDC’s Source-side filter is enough; the Target interceptor sees only fresh messages.
- RPC mid-flight drop: CDC’s resend hits the Target interceptor; the interceptor’s idempotent fence drops the duplicate.
- CDC process crash + restart: New CDC re-reads checkpoint, sets
DeliverFilterTimeTickGT, resumes — neither layer sees a duplicate because nothing was in flight.
The Target-side check is the source of truth-level exactly-once; the Source-side check is the source of efficiency-level exactly-once.
The two typical scenarios this mechanism handles:
| Scenario | Flow |
|---|---|
| CDC service restart | 1. Source WAL contains S100, S200, S300. 2. CDC has forwarded S100 and S200 to Target. Target’s Checkpoint = S200. 3. The CDC service itself crashes and restarts. 4. CDC calls GetReplicateInfo on Target and learns the checkpoint is S200. 5. CDC resubscribes from S200 (Exclude) and resumes at S300. |
| RPC connection drop | 1. CDC is in the middle of forwarding S200 when the RPC drops, but S200 actually reached the Target. 2. CDC reconnects and re-sends S200. 3. Target sees S200.tt ≤ Checkpoint.tt and drops it—no duplicate write. 4. CDC forwards S300 normally. |
3. Data Flow & Message Replication
3.1 Startup / Restart Recovery
Role of each stepExcept step 2 (initiated by the control plane / SDK), all subsequent steps below are executed by the Source-side CDC service. Since CDC is co-located with the Source Milvus cluster, the ETCD watch in step 3 and the WAL subscription in step 5 are in-cluster operations; only steps 4 and 6 are cross-cluster RPCs to the Target.
Initial startup
- Start Source Milvus cluster A, Target Milvus cluster B, and CDC service. (operator / K8s)
- Control plane sends
UpdateReplicateConfigurationto both A and B, carrying ClusterID / URI / Token / PChannels for both and a topology edge A→B. (SDK / control plane) - CDC watches A’s ETCD and detects the new ReplicateConfiguration. (Source-side CDC Controller)
- CDC calls B’s
GetReplicateInfofor each PChannel to get the ReplicateCheckpoint. (For two empty clusters, the checkpoint is the WAL’s earliest.) (Source-side CDC Channel Replicator) - CDC creates a WAL Scanner at A starting from the checkpoint and begins consuming. (Source-side CDC, in-cluster WAL Read)
- CDC calls B’s
CreateReplicateStreamto establish the bidirectional gRPC stream and starts forwarding consumed messages. (Source-side CDC Stream Client)
Recovery after CDC restart
- A, B, and CDC are in a healthy replicating state.
- The CDC service crashes and is restarted.
- CDC reads ReplicateConfiguration from A’s ETCD.
- CDC calls B’s
GetReplicateInfofor the pre-crash ReplicateCheckpoint. - CDC resubscribes A’s WAL from that checkpoint and rebuilds the Scanner.
- CDC re-establishes the ReplicateStream RPC with B and resumes forwarding.
3.2 Message Replication Details
DDL example: CreateCollection
- Source Milvus writes CreateCollectionMsg to multiple VChannels.
- CDC forwards the first CreateCollectionMsg to the Target.
- Target resolves VChannelName to PChannelName and locates the WAL.
- Target writes CreateCollectionMsg to its WAL.
- The ACK path triggers RootCoord’s CreateCollection.
- RootCoord creates the collection meta in Creating state with just one vchannel registered.
- CDC forwards CreateCollectionMsg from the remaining VChannels; RootCoord fills in the other vchannels.
- When the meta’s vchannel count equals shardNum, the collection transitions to Created.
DropCollection, CreatePartition, DropPartition, CreateDatabase, DropDatabase, CreateRole, and similar DDLs follow the same pattern: RootCoord processes the first message and subsequent messages are idempotent.
DML: Insert / Delete
- CDC consumes InsertMsg / DeleteMsg.
- CDC calls the Target’s Replicate RPC.
- The Target’s WAL interceptor:
- Rewrites CollectionID / PartitionID / VChannel (source-side IDs are not valid on the target).
- Allocates a new SegmentID.
- Once rewriting is done, the message is appended to the WAL.
3.3 Message Type Replication Matrix
Below is the current matrix of MessageTypes and whether CDC replicates them.
| MessageType | Category | Replicated | Note |
|---|---|---|---|
| MessageTypeTimeTick | System | N | WAL self-stamps tt; nothing to replicate |
| MessageTypeInsert | DML | Y | |
| MessageTypeDelete | DML | Y | |
| MessageTypeImport | DML | N | Not yet supported; on roadmap (in progress) |
| MessageTypeManualFlush | DML | Y | |
| MessageTypeCreateCollection | DDL | Y | |
| MessageTypeDropCollection | DDL | Y | |
| MessageTypeAlterCollection | DDL | Y | |
| MessageTypeCreatePartition | DDL | Y | |
| MessageTypeDropPartition | DDL | Y | |
| MessageTypeCreateDatabase | DDL | Y | |
| MessageTypeAlterDatabase | DDL | Y | |
| MessageTypeDropDatabase | DDL | Y | |
| MessageTypeAlterAlias / DropAlias | DDL | Y | |
| MessageTypeSchemaChange | DDL | Y | |
| MessageTypeAlterLoadConfig / DropLoadConfig | DDL | Y | |
| MessageTypeAlterReplicateConfig | DDL | Y | Topology change carrier (incl. force_promote) |
| MessageTypeCreateIndex / AlterIndex / DropIndex | Index | Y | |
| MessageTypeAlterUser / DropUser | RBAC | Y | |
| MessageTypeAlterRole / DropRole | RBAC | Y | |
| MessageTypeAlterUserRole / DropUserRole | RBAC | Y | |
| MessageTypeAlterPrivilege / DropPrivilege | RBAC | Y | |
| MessageTypeAlterPrivilegeGroup / DropPrivilegeGroup | RBAC | Y | |
| MessageTypeRestoreRBAC | RBAC | Y | |
| MessageTypeAlterResourceGroup / DropResourceGroup | RG | Y | RG-cross-cluster optimization on roadmap |
| MessageTypeBeginTxn / CommitTxn / Txn | System | Y | Transaction messages replicated as a group |
| MessageTypeRollbackTxn | System | N | Filtered before reaching Scanner/CDC |
| MessageTypeCreateSegment | System | N | Segments are managed by the standby itself |
| MessageTypeFlush | System | N | System flush is local to each cluster |
4. API & SDK
gRPC only — no REST endpointThe CDC control plane (UpdateReplicateConfiguration, GetReplicateConfiguration) is exposed solely via gRPC. Clients use PyMilvus or the Go SDK, which speak gRPC under the hood. The Milvus Proxy’s RESTful gateway (v1 / v2) does not route these RPCs.
The older “Manage CDC Tasks” HTTP API found in legacy documentation belongs to the standalone zilliztech/milvus-cdc service used with Milvus 2.5 — a different system from 2.6’s built-in CDC. Do not assume the HTTP API surface carries over.
4.1 UpdateReplicateConfiguration
CDC’s control flow has a single entry point: UpdateReplicateConfiguration. It accepts a ReplicateConfiguration object and the server atomically replaces the persisted configuration. The control plane must call this RPC on every participating cluster so each cluster sees the same topology.
Proto
// milvus.proto
service MilvusService {
rpc UpdateReplicateConfiguration(UpdateReplicateConfigurationRequest)
returns (common.Status) {}
// Added in v2.6.12+: read-only API for the current topology (tokens redacted)
rpc GetReplicateConfiguration(GetReplicateConfigurationRequest)
returns (GetReplicateConfigurationResponse) {}
}
message UpdateReplicateConfigurationRequest {
common.ReplicateConfiguration replicate_configuration = 1;
bool force_promote = 2; // For Force Failover
}
// common.proto
message ReplicateConfiguration {
repeated MilvusCluster clusters = 1;
repeated CrossClusterTopology cross_cluster_topology = 2;
}
message MilvusCluster {
string cluster_id = 1;
ConnectionParam connection_param = 2;
repeated string pchannels = 3;
}
message ConnectionParam {
string uri = 1;
string token = 2;
}
message CrossClusterTopology {
string source_cluster_id = 1;
string target_cluster_id = 2;
}
Behavior
- Full replacement: the new
ReplicateConfigurationreplaces the current persisted one atomically. - Idempotent: re-submitting the same config is a no-op.
- Validated: the server runs validation (see next section) before persisting; invalid requests return a typed error.
4.2 Validation Rules
ReplicateConfiguration represents a directed topology graph: MilvusCluster entries are vertices, CrossClusterTopology entries are edges.
- Per-cluster format: each
MilvusClustermust have a non-emptycluster_id(no whitespace), a non-emptyconnection_param.uri(valid URI), and a non-emptypchannelslist where each entry is prefixed by the cluster_id. - Vertex uniqueness: cluster_ids must be globally unique.
- Edge uniqueness: a given
source → targetedge may appear at most once. - Self-reference required: the cluster receiving the request must appear in
clusters. - Star topology only: at most one vertex has out-degree (= total clusters − 1) and in-degree 0; all others have in-degree 1 and out-degree 0. A single-vertex configuration (no edges) is also legal—and is how you remove replication.
- pchannels count: all clusters must have identical pchannels counts at any moment in time. Across successive updates, the count can grow but only via topology-wide append-only expansion (every cluster from N to N+k together). Mismatched counts between clusters are always rejected.
- pchannels uniqueness: no pchannel may repeat across clusters.
flowchart TB Center((Source)) L1((Target-1)) L2((Target-2)) L3((Target-3)) L4((Target-N)) Center --> L1 Center --> L2 Center --> L3 Center --> L4 style Center fill:#1f6feb,color:#fff
4.3 PyMilvus Example
from pymilvus import MilvusClient
source_cluster_id = "source-cluster"
target_cluster_id = "target-cluster"
pchannel_num = 16
source_cluster_addr = "http://source-cluster-milvus.milvus.svc.cluster.local:19530"
target_cluster_addr = "http://target-cluster-milvus.milvus.svc.cluster.local:19530"
source_cluster_token = "root:Milvus"
target_cluster_token = "root:Milvus"
source_pchannels = [f"{source_cluster_id}-rootcoord-dml_{i}" for i in range(pchannel_num)]
target_pchannels = [f"{target_cluster_id}-rootcoord-dml_{i}" for i in range(pchannel_num)]
replicate_config = {
"clusters": [
{
"cluster_id": source_cluster_id,
"connection_param": {"uri": source_cluster_addr, "token": source_cluster_token},
"pchannels": source_pchannels,
},
{
"cluster_id": target_cluster_id,
"connection_param": {"uri": target_cluster_addr, "token": target_cluster_token},
"pchannels": target_pchannels,
},
],
"cross_cluster_topology": [
{"source_cluster_id": source_cluster_id, "target_cluster_id": target_cluster_id}
],
}
# Submit the same config to both sides
source_client = MilvusClient(uri=source_cluster_addr, token=source_cluster_token)
target_client = MilvusClient(uri=target_cluster_addr, token=target_cluster_token)
try:
source_client.update_replicate_configuration(**replicate_config)
target_client.update_replicate_configuration(**replicate_config)
finally:
source_client.close()
target_client.close()
Operational tipIn production, use short-lived clients dedicated to control-plane operations like this. Avoid sharing a gRPC channel with DML traffic, since role changes briefly enter a write-fenced state.
Why submit to both clusters — strong-consistency semantics
The two update_replicate_configuration calls above are not a “primary forwards to standby” mechanism. They are two independent RPCs initiated by the application; the source cluster and the target cluster do not directly exchange configuration RPCs over the wire.
The two RPCs do different things server-side:
| Recipient | What the cluster does on receipt |
|---|---|
| Primary (the one currently allowed to broadcast) | Acquires the exclusive cluster resource key, broadcasts the AlterReplicateConfigMessage into its WAL, the ACK callback persists the new ReplicatePChannelMeta into its own ETCD, and returns success. |
| Standby (any non-primary in the topology) | broadcaster.StartBroadcastWithResourceKeys fails with ErrNotPrimary. The cluster then enters waitUntilPrimaryChangeOrConfigurationSame — it blocks the RPC until the local ETCD’s ReplicateConfiguration matches the incoming one. The configuration arrives via CDC replicating the AlterReplicateConfigMessage from the primary’s WAL into the standby’s WAL; the standby’s SC consumes that message and writes its own ETCD through the same ACK callback. Only then does the standby’s blocked RPC return success. |
Strong-consistency guarantee when both calls succeed
When both update_replicate_configuration calls return success, the application can rely on the following invariants:
- The new
ReplicateConfigurationhas been broadcast into the primary’s WAL and persisted in the primary’s ETCD. - CDC has already replicated the configuration change to every cluster the application called.
- Each cluster’s ETCD
ReplicatePChannelMetareflects the new topology. - Each cluster’s CDC controller has been notified via its ETCD watch and has had a chance to reconcile its channel replicators.
This both-success = both-applied property is what makes the two-call pattern a strong-consistency fence, not just a redundancy.
Sequence — what actually happens on the wire
Application (PyMilvus) Primary (A) Standby (B)
│ │ │
├──UpdateReplicateConfig──────────►│ │
│ │ broadcaster.StartBroadcast ✓ │
│ │ broadcast AlterReplicateConfigMsg │
│ │ │ │
│ │ ▼ │
│ │ A's WAL ─┐ │
│ │ │ ACK │
│ │ ▼ │
│ │ alterReplicateConfiguration cb │
│ │ │ │
│ │ ▼ │
│ │ A's ETCD (ReplicatePChannelMeta) │
│ │ │ │
│ │ ▼ etcd watch fires │
│ │ A's CDC Controller → CreateReplicator (ACTIVE)
│ │ │
│◄────────────success──────────────┤ │
│ │
│ │ (A's CDC forwards the msg) ─────────►│
│ ReplicateStream RPC
│ │
├──UpdateReplicateConfig──────────────────────────────────────────────────►│
│ │ broadcaster fails ErrNotPrimary
│ │ waitUntilPrimaryChangeOrConfigurationSame
│ │ (blocking watch loop on local config)
│ │ │
│ │ (still forwarding... msg lands)─────►│ B's WAL ─┐
│ │ │ ACK
│ │ ▼
│ │ alterReplicateConfiguration cb
│ │ │
│ │ ▼
│ │ B's ETCD (ReplicatePChannelMeta)
│ │ │
│ │ ▼ etcd watch fires
│ │ B's CDC Controller — sees event
│ │ but sourceChannel ≠ self → stays idle
│ │ │
│ │ watch loop above sees config now matches
│ │ │
│◄─────────────────────────────────────────────────────────success─────────┤
│
▼
At this point both ETCDs are consistent; the application may proceed.
If you only call the primaryThe system still converges — A’s CDC will eventually push the AlterReplicateConfigMessage into B’s WAL and B’s ETCD will catch up via the same ACK callback. But there is no synchronous fence: when the primary call returns, B’s ETCD has not yet been updated. Any application logic that immediately reads back the configuration on B will see the old state. The two-call pattern exists specifically to make this fence synchronous and observable.
Direction agnosticThis pattern is robust to not knowing which cluster is currently primary — useful right after a switchover or in automation where the caller can’t easily determine the role. Both endpoints accept the same payload; only the actual primary executes the broadcast, the others just wait for convergence.
4.4 Go SDK Example (builder pattern)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
cli, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
Address: "localhost:19530",
})
if err != nil { log.Fatal(err) }
sourceCluster := milvusclient.NewMilvusClusterBuilder("source-cluster").
WithURI("localhost:19530").
WithToken("source-token").
WithPchannels("source-cluster-rootcoord-dml_0", "source-cluster-rootcoord-dml_1").
Build()
targetCluster := milvusclient.NewMilvusClusterBuilder("target-cluster").
WithURI("localhost:19531").
WithToken("target-token").
WithPchannels("target-cluster-rootcoord-dml_0", "target-cluster-rootcoord-dml_1").
Build()
config := milvusclient.NewReplicateConfigurationBuilder().
WithCluster(sourceCluster).
WithCluster(targetCluster).
WithTopology("source-cluster", "target-cluster").
Build()
if err := cli.UpdateReplicateConfiguration(ctx, config); err != nil {
log.Printf("Failed to update replicate configuration: %v", err)
return
}
log.Println("Replicate configuration updated successfully")
4.5 GetReplicateInfo & CreateReplicateStream
Internal RPCs—application developers normally do not call these, but understanding them helps with debugging.
GetReplicateInfo
Returns the persisted ReplicateCheckpoint metadata on the target. CDC calls this at startup to learn where to resume.
message GetReplicateInfoRequest {
string source_cluster_id = 1;
string target_pchannel = 2;
}
message GetReplicateInfoResponse {
common.ReplicateCheckpoint replicate_checkpoint = 1;
common.ReplicateCheckpoint salvage_checkpoint = 2; // Captured during force failover
}
CreateReplicateStream
The bidirectional gRPC stream between CDC and the target. Send: ReplicateRequest. Recv: ReplicateResponse with ReplicateConfirmedMessageInfo, used to drain the in-flight queue by TimeTick for back-pressure.
5. Deployment & Quick Start
5.1 Prerequisites
- Milvus v2.6.16 or later.
- Milvus Operator v1.3.4 or later.
- A Kubernetes cluster.
- Network connectivity between source and target clusters.
- Admin credentials for both clusters.
5.2 Install Milvus Operator
helm repo add zilliztech-milvus-operator https://zilliztech.github.io/milvus-operator/
helm repo update zilliztech-milvus-operator
helm -n milvus-operator upgrade --install milvus-operator \
zilliztech-milvus-operator/milvus-operator \
--create-namespace
kubectl get pods -n milvus-operator
5.3 Deploy Source and Target Clusters
Two near-identical CRs—the only difference is metadata.name. Both sides must enable the CDC component.
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
name: source-cluster # change to target-cluster on the other side
namespace: milvus
spec:
mode: cluster
components:
image: milvusdb/milvus:v2.6.16
cdc:
replicas: 1
dependencies:
msgStreamType: woodpecker # or pulsar / kafka
kubectl create namespace milvus
kubectl apply -f milvus_source_cluster.yaml
kubectl apply -f milvus_target_cluster.yaml
# Confirm CDC pods are Running
kubectl get pods -n milvus | grep cdc
# source-cluster-milvus-cdc-66d64747bd-sckxj 1/1 Running 0 2m
# target-cluster-milvus-cdc-7f8c9d6b8c-q4t9m 1/1 Running 0 2m
5.4 Discover Cluster Addresses
kubectl get svc -n milvus | grep -E 'NAME|source-cluster|target-cluster'
# source-cluster-milvus ClusterIP 10.98.124.90 19530/TCP,9091/TCP
# target-cluster-milvus ClusterIP 10.109.234.172 19530/TCP,9091/TCP
Two kinds of addresses
- Cluster-internal address: written into
ReplicateConfiguration; must be reachable from CDC pods. - Client address: only used when running PyMilvus to push
update_replicate_configuration; can be a port-forward / LB / Ingress.
Never put a port-forward 127.0.0.1 address into ReplicateConfiguration—CDC pods cannot reach it.
5.5 Apply the Replication Topology
See 4.3 PyMilvus example. Submit the same config to both sides.
5.6 Verify Replication
- Create a collection on the source, insert data.
- Load and run query / search on source.
- Connect to the target—do not manually
load_collectionthere (DDL replicates automatically). - Run the same query / search on the target; identical results = replication healthy.
6. Primary-Standby Failover
Milvus 2.6 provides three failover modes, ordered by their consistency guarantees:
| Mode | Primary goal | Use case | Data loss | Status |
|---|---|---|---|---|
| Graceful (Planned Switchover) | Consistency first | Original primary still reachable; planned maintenance | RPO = 0 | Available |
| Coordinated Failover | Consistency + HA | Primary partially degraded (e.g. StreamingNode CrashLoop) | RPO = 0 | Available |
| Force Failover | Availability (AP) | Primary fully unreachable (network partition, region loss) | Bounded by CDC lag (salvageable manually) | Available |
6.1 Graceful Failover (Planned Switchover)
Use when the original primary is still reachable—maintenance, version rollout, planned traffic relocation. Replication uses an AlterReplicateConfigMessage as the synchronization point; no data is lost.
When to use
- Original primary A is fully healthy and reachable. No CrashLoop, no degraded components.
- You initiated the switch — it’s planned: version upgrade, maintenance window, traffic redistribution.
- You can tolerate “switch may take seconds to minutes to drain the remaining lag” — RPO=0 has a price in RTO.
- Do not use if A has any failure signal. Pick Coordinated if WAL still writes but some component is down, or Force if A is unreachable.
Pre-check (verify before submitting the switch)
CDC lag is low and steady. Query:
clamp_min(max by (channel_name)(milvus_wal_last_confirmed_time_tick) - min by (channel_name)(milvus_cdc_last_replicated_time_tick), 0)Expect a small, stable number (sub-second to a few seconds). A high or climbing lag means the switch will block until the lag drains — runs the operation for longer than expected.
Stream RPC connections all healthy:
milvus_cdc_stream_rpc_connections{connection_status="disconnected"} == 0.Both clusters reachable from PyMilvus. A simple
get_replicate_configuration()on both should succeed and return the current topology.Network between A and B is healthy. Recent reconnect counter should be flat:
rate(milvus_cdc_stream_rpc_reconnect_times[5m]) ~ 0.
Flow (initial A→B, target B→A)
- Cluster A: control plane calls
UpdateReplicateConfigurationon A with the new topology B→A- A writes
AlterReplicateConfigMessageinto its WAL. - A demotes to standby; its WAL enters write-fenced state.
- A writes
- Cluster B: B receives the replicated
AlterReplicateConfigMessagevia CDC; once confirmed, its WAL unfences and B promotes to primary. - CDC: A’s CDC stops once it has forwarded the message; B’s CDC takes over the reverse direction (B→A).
PyMilvus
switchover_config = {
"clusters": [
{
"cluster_id": cluster_a_id,
"connection_param": {"uri": cluster_a_addr, "token": cluster_a_token},
"pchannels": cluster_a_pchannels,
},
{
"cluster_id": cluster_b_id,
"connection_param": {"uri": cluster_b_addr, "token": cluster_b_token},
"pchannels": cluster_b_pchannels,
},
],
"cross_cluster_topology": [
{"source_cluster_id": cluster_b_id, "target_cluster_id": cluster_a_id}
],
}
# Submit to the current primary (A) first, then to the current standby (B).
# Both calls are synchronous and block until success (or RPC timeout).
# When both return success, the switch is committed — see §4.3.
client_a.update_replicate_configuration(**switchover_config)
client_b.update_replicate_configuration(**switchover_config)
Verify the switch succeeded
- Topology shows the new direction. On either side:
get_replicate_configuration()should reportcross_cluster_topology = [B → A]. - Old primary A rejects writes. Try a small insert / delete via PyMilvus against A — should fail with a “not primary” / write-fenced error.
- New primary B accepts writes. Same insert / delete against B — should succeed.
- Reverse CDC is forwarding. On B (now source), watch
milvus_cdc_replicated_messages_total— the counter for target_cluster=Ashould start incrementing. CDC lag for the new direction should drop into normal range within a few cycles. - Test data round-trips. Insert one row on B → wait a few seconds → query A; you should see the row replicated back.
Failure handling
- One of the two RPCs failed, the other succeeded: state is half-committed. Run
get_replicate_configuration()on both sides — whichever side hasn’t applied the new topology, retry submitting there. The receiving side that already saw the AlterReplicateConfigMessage via CDC will short-circuit to success viawaitUntilPrimaryChangeOrConfigurationSame(see §4.3) — retrying is safe. - Both RPCs hang for a long time: usually means CDC lag is high and B is waiting for the Fence Message to arrive. Watch lag — once it drains to the Fence position, both RPCs return. If it doesn’t drain, CDC has a problem (check pod status + reconnect counter) — abort by canceling the calls and investigate.
- Rolling back (you want A primary again): submit another
UpdateReplicateConfigurationwith the original topology A→B. This is just another Graceful Switchover — RPO=0 still holds because the just-switched-to B never lost anything and the reverse direction back to A will also drain a Fence. Customers running automation should always treat Graceful Switchover as bidirectional from day one.
6.2 Coordinated Failover
For cases where the primary cluster has partial failure—for example, the primary’s StreamingNode is in CrashLoop due to segcore SEGSIGV, but MixCoord / Proxy / WAL remain alive. Still achieves RPO=0.
When to use
- Primary A has a degraded component but its WAL can still accept writes. Typical signals:
- StreamingNode pod is in CrashLoop (rising
restart_counton the SN pod) but the WAL service stays up between restarts. - Delegator / Flusher are throwing fatal errors (e.g. segcore SIGSEGV in logs) but MixCoord / Proxy responses are still healthy.
- Some metric or alert flagged the problem and you’ve manually confirmed A is degraded but reachable.
- StreamingNode pod is in CrashLoop (rising
- You verified A’s WAL is still writable: try a small
insert()on A through PyMilvus. If it succeeds (even slowly), WAL works → Coordinated is the right mode. If it times out or returns a service-unavailable type error → A’s WAL is unreachable → escalate to Force Failover instead. - You can afford a few seconds to tens of seconds RTO (drain remaining lag).
- Do not use if A is fully healthy (that’s Graceful) or A is fully unreachable (that’s Force).
Pre-check
- WAL writability probe:
milvus_client.insert()against A. Success = good, proceed. Time-out or 503 = wrong mode, switch to Force. - CDC stream connection between A and B:
milvus_cdc_stream_rpc_connections{connection_status="connected", target_cluster="B"} > 0. If 0, CDC can’t carry the Fence Message — Coordinated cannot work, escalate to Force. - Standby B is healthy:
get_replicate_configuration()on B returns current topology successfully. - Decide your retry budget. Recommendation: 10 attempts with 2-second backoff between attempts (≈ 20s total) before flipping
pauseConsumption. Adjust based on your RTO target.
Fence Message
AlterReplicateConfigMessage serves as a Fence Message. Two properties exploit WAL ordering:
- Write fencing: once the message is in the WAL, the original primary rejects new writes.
- Sync barrier: when CDC has consumed this message, all prior data messages have been propagated → RPO = 0.
Feasibility of writing the Fence Message
The StreamingNode’s startup order is WAL → Flusher → Delegator. The WAL module finishes OpenWAL and Recover first—it is Ready to accept writes before Delegator/Flusher fully come up. Even if Delegator/Flusher will crash again immediately, there is a window in which the SN can accept the Fence Message write.
Last-resort: pauseConsumption
If the SN crashes so fast that the window is insufficient, the config:
streaming:
walScanner:
pauseConsumption: true # default: false
is in Milvus (pkg/util/paramtable/component_param.go:6887, refreshable). When set, Delegator and Flusher pause WAL consumption, reducing the SN to “WAL-write only” mode and breaking the crash loop, giving the Fence Message enough time to land.
Milvus does not do this automaticallyMilvus only provides the pauseConsumption switch — when you set it, the SN responds to it. Milvus does NOT automatically retry, detect failure, or flip the switch on its own. The whole “try, fail, flip pauseConsumption, retry” loop is something you (or your control-plane script / operator) have to write yourself. The customer’s runbook owns the retry-and-flip-switch logic; Milvus only owns the switch.
Full flow (you / your control plane does this)
- Failure detection. Operator decides to fail over.
- Fence Old Primary: control plane calls
UpdateReplicateConfigurationon A with the reversed topology.- If it succeeds: proceed.
- If it fails after some retries (e.g. 10 — your choice): your script flips
walScanner.pauseConsumption=trueon A and retries until success. Milvus does not do this for you.
- Promote New Primary: call
UpdateReplicateConfigurationon B. After B receives and confirms the Fence Message from A via CDC, B promotes to primary. - Post-handling:
- If A is recoverable: reset
pauseConsumption=false; A auto-rejoins as B’s standby. - If A is unrecoverable: tear down the topology with B and run Standby Cluster Rebuild to provision a new standby C.
- If A is recoverable: reset
Verify the switch succeeded
- Topology shows new direction:
get_replicate_configuration()on either side reportsB → A. - Old primary A no longer accepts writes: a small
insert()against A should fail with a write-fenced error. (Note this is different from the pre-check probe — that was before the switch and writes succeeded.) - New primary B accepts writes: same insert against B succeeds.
- Reverse CDC is live: on B as source,
milvus_cdc_replicated_messages_total{target_cluster="A"}begins to climb. CDC lag for the new direction normalizes within a few cycles. - Test data round-trips: insert one row on B → wait a few seconds → query A; the row should be there.
- If you flipped
pauseConsumption=true: A’s SN is now WAL-write-only — its Delegator/Flusher are not consuming. You must resetpauseConsumption=falseonce A’s underlying issue is fixed (e.g. segcore upgrade) so A can re-join as standby. Leaving it on means A will never catch up reading the replicated stream from B; the auto-rejoin can’t happen.
Failure handling
- Fence Old Primary attempt fails 10 times even with pauseConsumption: A’s WAL is unreachable, not just degraded. Abort Coordinated, escalate to Force Failover.
- B’s promotion appears stuck (Fence Message hasn’t arrived after > 30s): A’s CDC isn’t forwarding. Check
milvus_cdc_stream_rpc_connectionsfor the A→B link. If disconnected, you have a CDC delivery problem on top of the primary failure — likely also needs Force Failover. - Auto-rejoin of A doesn’t work after pauseConsumption reset: query B’s
get_replicate_configuration()to confirm A is incross_cluster_topologyas target. If yes but data isn’t flowing, check A’s CDC stream connection metric. If A’s underlying fault still recurs (re-crashing), treat A as unrecoverable and run Standby Cluster Rebuild on a fresh node. - Rolling back to A as primary: same as a Graceful Switchover from B back to A. Only safe if A is now genuinely fixed.
Multi-standby topology — sibling standbys rewire automatically
In a one-primary-many-standby topology (e.g. A→B, A→C), Coordinated Failover handles all sibling standbys without rebuild. Because the Fence Message is broadcast through A’s CDC to every standby, all of them — not just the chosen new primary — receive the same sync barrier and reach the same state at the cutover moment (RPO = 0 for B and C).
The mechanism:
- The new topology in the
UpdateReplicateConfigurationrequest lists the full edge set (e.g.B→A,B→C). - A’s CDC forwards the Fence Message to both B and C. Both consume it and run
SwitchReplicateMode— B becomes primary, C rebuilds itssecondaryStatewithSourceClusterID=Band a reset checkpoint (TT=0, MessageID=nil). - B’s StreamingCoord writes the new
ReplicatePChannelMetafor the B→C edge, settingInitializedCheckpointto the Fence Message’s position in B’s own WAL (the moment B was promoted). - B’s CDC, on creating the new channel replicator, calls C’s
GetReplicateInfo; C returnsMessageID=nilbecause secondaryState was just reset; the CDC falls back toInitializedCheckpoint. - B’s CDC scanner reads B’s WAL starting from after the Fence — skipping every message B accumulated while it was still a secondary (those entries were already replicated to C from A and don’t need re-sending).
Net result: C continues without rebuild, no data overlap, no data gap. The Fence Message simultaneously serves as write fence, sync barrier, source-switch signal, and new-source start anchor.
6.3 Force Failover
For the worst case: the original primary is fully unreachable (network partition, region outage). Explicitly AP-leaning—does not promise strong consistency, prioritizes availability.
When to use
- Primary A is completely unreachable. Specific signals:
- All RPCs to A time out for > N minutes (N from your RTO target, typically 1–5 min).
- A’s pods are in
Pending/Unknownstate across the cluster, or the K8s namespace itself is unreachable. - Network partition isolates A’s region — confirmed by independent connectivity probes (ping / TCP) from outside the customer’s primary path.
- You’ve already tried Coordinated and the pre-check probe (write to A’s WAL) failed.
- You explicitly accept data loss up to the current CDC lag. This is a business decision — confirm before pulling the trigger.
- You have a plan to remove A from all write endpoints (DNS / LB / Ingress) so that if A recovers later, no application traffic accidentally hits it.
- Do not use if A’s WAL still writes (use Coordinated instead — RPO=0) or if A is healthy (use Graceful).
Pre-check
- Confirm A is really unreachable. From the same network the application uses, run a basic Milvus health probe against A — should hard-fail (timeout, connection refused, DNS not resolving). A slow/degraded response means A is actually reachable; reconsider Coordinated.
- Record the current CDC lag — this is the upper bound on data loss. Capture the metric value before promoting so post-mortem can quantify what was lost.
- Confirm B is healthy and was in
secondarystate.get_replicate_configuration()on B should respond and show A→B topology (about to be torn down by force_promote). - Plan the application-side endpoint cutover. Decide concretely:
- Which DNS record / LB / Ingress points application writes at A?
- How will you flip it to B (and remove A entirely)?
- If using GlobalEndpoint (Zilliz Cloud), this is automatic; otherwise it’s your runbook step.
- For multi-standby topology (A→B, A→C): plan to rebuild C after promoting B. Sibling standbys cannot be salvaged automatically (see §7) — make sure operator time + storage capacity are available for the rebuild.
force_promote
message UpdateReplicateConfigurationRequest {
common.ReplicateConfiguration replicate_configuration = 1;
bool force_promote = 2;
}
Constraints
| Constraint | Validation | Rationale |
|---|---|---|
| Standby cluster only | WithSecondaryClusterResourceKey(); primary call returns ErrNotSecondary | Emergency promotion only applies to standbys |
| clusters must be empty | len(config.Clusters) == 0 | Config is rebuilt from existing meta |
| topology must be empty | len(config.CrossClusterTopology) == 0 | Same as above |
Internal mechanics
- Client calls
UpdateReplicateConfiguration(config={}, force_promote=true). - StreamingCoord validates the empty-config requirement and acquires the secondary-cluster lock via
WithSecondaryClusterResourceKey(). - A standalone-primary config is reconstructed from existing meta.
- A
force_promote=trueAlterReplicateConfigMessageis broadcast. - TxnBuffer: on force_promote, runs
RollbackAllUncommittedTxn()—any in-flight transactions are discarded. - Replicate Interceptor: switches the cluster to primary mode.
- Incomplete broadcasts: prior in-flight
AlterReplicateConfigMessagebroadcasts are marked with theignoreflag so they don’t overwrite the new config. - Config is persisted with
force_promoted=true.
PyMilvus
force_failover_config = {
"clusters": [
{
"cluster_id": cluster_b_id,
"connection_param": {"uri": cluster_b_addr, "token": cluster_b_token},
"pchannels": cluster_b_pchannels,
}
],
"cross_cluster_topology": [],
"force_promote": True,
}
client_b.update_replicate_configuration(**force_failover_config)
Data loss
- Loss bound = CDC lag at the moment of cutover. Zero lag = no loss.
- After promoting B, remove A from every write endpoint (DNS / LB / Ingress) to prevent split-brain.
- Do not reattach A to the old topology even if it recovers—A may contain unreplicated writes and B may already have new writes; reconciliation is not automatic.
Verify the promotion succeeded
- B is now a standalone primary:
get_replicate_configuration()on B returns a config with B as the only cluster, no topology edges, andforce_promoted=truein the metadata. - B accepts writes: a small
insert()through PyMilvus targeting B succeeds (apply the application’s normal write path). - Application traffic landed on B, not on A: confirm the DNS / LB / Ingress cutover actually took effect — application metrics should show writes hitting B’s address.
- A is removed from all write endpoints: any attempt by mis-routed traffic to write to A should fail at the network layer (connection refused / DNS not resolving), not silently succeed.
- SalvageCheckpoint is captured (for later Data Salvage if A recovers):
get_replicate_infoon B returns a non-emptysalvage_checkpointper pchannel.
Post-promotion (next 24 hours)
- Multi-standby: rebuild sibling C. C was previously A’s secondary; it cannot be rewired to B automatically (no Fence Message ever reached C). Use
milvus-backup restore secondaryagainst C with--source_cluster_id B— see §7 for the full procedure. - If A recovers, do not reattach. Run Data Salvage against A to dump any unreplicated messages, then decommission A. Reattaching it would cause split-brain because both A and B believe they are primary.
- SalvageCheckpoint window is 7 days. If you want the option to salvage from A, recover A within that window. After 7 days the salvage_checkpoint may be pruned by Milvus and you lose the opportunity to drain unreplicated data.
Failure handling
- force_promote RPC returns
ErrNotSecondary: B is already in primary role (maybe from a prior failover you forgot about), or you submitted to the wrong cluster. Checkget_replicate_configuration()on B; resolve before retrying. - force_promote RPC returns config-must-be-empty error: you accidentally included
clustersorcross_cluster_topology. Resubmit with empty clusters and topology — onlyforce_promote=true. - B accepts the RPC but writes still fail afterward: usually the application client is still pinned to A’s address. Restart application clients or verify DNS TTL has expired.
- A “recovers” and starts seeing writes too: split-brain in progress, this is dangerous. Immediately remove A from networking (firewall rule, scale to 0, etc.) before any further writes hit it. Plan to discard whatever A wrote post-recovery; only B’s data is authoritative.
- No way to roll back: force_promote is one-way. Once B is promoted, you cannot “unpromote” back to A. If you regret promoting B, your only option is a new Graceful Switchover from B back to a freshly-rebuilt A (with A’s data wiped).
6.4 Data Salvage
After force_promote, the original primary may hold data that was never replicated. Data Salvage provides a manual reconciliation path.
SalvageCheckpoint
At force_promote time, StreamingCoord captures the current ReplicateCheckpoint per pchannel as the salvage checkpoint:
- Key:
$rootPath/streamingnode-meta/salvage-checkpoint/$clusterID/$ChannelName - Value:
commonpb.ReplicateCheckpoint - Retention: 7 days
DumpMessages API
rpc DumpMessages(DumpMessagesRequest) returns (stream DumpMessagesResponse) {}
message DumpMessagesRequest {
string pchannel = 1; // required
common.MessageID start_message_id = 2; // required (from salvage_checkpoint)
uint64 start_timetick = 3; // optional filter
uint64 end_timetick = 4; // optional, 0 = unbounded
bool start_position_exclusive = 5;
}
message DumpMessagesResponse {
oneof response {
common.Status status = 1;
common.ImmutableMessage message = 2;
}
}
How to do it today
Milvus exposes a gRPC streaming RPC DumpMessages on the old primary. You connect to it, drain the stream of raw protobuf messages, and do whatever you want with the data on your side. The detailed steps:
- Wait for old primary A to recover (operator action).
- Get the salvage checkpoint from new primary B: call
GetReplicateInfoRPC on B, the response includes asalvage_checkpointper source cluster — this is your starting position. - Stream messages from A: call the
DumpMessagesRPC on A with thatsalvage_checkpoint.message_idasstart_message_id(plus optionalstart_timetick/end_timetickfilters). The server opens a WAL scanner exclusive ofstart_message_id, filters out system messages (TimeTick,CreateSegment,Flush,RollbackTxn), and streams every other message back ascommon.ImmutableMessageproto. Code:internal/proxy/impl.go:7150–7248. - You write the consumer client: read the stream, persist however your environment needs (S3, Parquet, plain file, etc.). Milvus does not write the data to object storage for you — you write a client to do that.
- Decide replay policy and replay: inspect the dump, apply business-defined conflict resolution (skip / overwrite / merge on primary-key collision), then write back to the new primary using normal SDK calls. Milvus does not do this for you either — conflict policy is business decision.
What you have to buildSteps 4 and 5 are on you. Milvus’s contract here is: expose the raw data via gRPC, no built-in object-storage persistence, no automatic replay. The reason for not automating replay is sound — primary-key collision resolution is business-specific and cannot be safely automated. The reason for not automating persistence is just “not built yet”. For POC planning, budget engineering time for a small consumer client (couple hundred lines of Go / Python that drains a gRPC stream into your preferred storage).
6.5 Mode Comparison
| Aspect | Graceful | Coordinated | Force |
|---|---|---|---|
| Primary state | Healthy | Partial failure | Unreachable |
| RPO | 0 | 0 | ≤ CDC lag (salvageable) |
| RTO | Drain remaining lag | Seconds (fence + fallback) | Seconds (no waiting) |
| Trigger | Operator-initiated | Failure detected + confirmed | Operator decision (emergency) |
| Fence Message needed | No | Yes | No (force_promote) |
| Data Salvage needed | No | No | Optional |
| Old primary handling | Auto-demoted | Auto-demote if recoverable; otherwise rebuild | Must be removed; can be decommissioned |
| Sibling standby (multi-standby) handling | Auto-rewire, no rebuild — Fence Message broadcasts to all standbys; each SwitchReplicateModes to new source | Auto-rewire, no rebuild — same mechanism as Graceful | Must rebuild — no Fence Message reached siblings; their state may diverge from the promoted standby. See §7 |
7. Standby Cluster Rebuild
Two scenarios require rebuilding a standby:
- After a Force Failover, when the topology was multi-standby: sibling standbys (other than the one that was promoted) cannot be safely rewired because they never saw a Fence Message — the old primary was unreachable, so no sync barrier propagated. Their state may have diverged from the promoted standby (either ahead or behind), and no general-purpose mechanism exists to reconcile them in place. Each sibling must be rebuilt from scratch against the new primary. This does not apply to Graceful or Coordinated Failover — those propagate a Fence Message to every sibling and rewire them automatically (see §6.2 and the comparison in §6.5).
- Adding a standby to a non-empty primary: the primary already has historical data; a fresh standby must catch up to the primary before CDC takes over the incremental tail.
7.1 Recovery Checkpoint
The rebuild needs a precise boundary between “data restored from backup” and “incremental replicated via CDC”. Since CDC operates per-PChannel, the boundary is a PChannel-level consistency checkpoint.
The Checkpoint is atomic: data before it is fully persisted to object storage; data after it is still in the log stream. Restoration splits into two phases:
- Bulk import: messages before the CP are imported via Backup.
- Incremental replication: messages after the CP are taken over by CDC’s Replicator.
7.2 FlushAll Refactoring
Milvus 2.6 redefines FlushAll on top of the Streaming Service. A new FlushAllMsg MessageType forces every vchannel to flush; the message itself becomes the Recovery Checkpoint when consumed.
service MilvusService {
rpc FlushAll(FlushAllRequest) returns (FlushAllResponse) {}
rpc GetFlushAllState(GetFlushAllStateRequest) returns (GetFlushAllStateResponse) {}
}
message FlushAllResponse {
common.Status status = 1;
map flush_all_tss = 4; // pchannel -> flush_all_ts
}
message GetFlushAllStateResponse {
common.Status status = 1;
bool flushed = 2;
}
7.3 Tool: zilliztech/milvus-backup
The rebuild is performed by a separate tool that lives in its own repository: zilliztech/milvus-backup. It is not part of the Milvus main repo and not a component bundled with the cluster — install / upgrade it independently.
- Install: download a binary from Releases, or
brew install zilliztech/tap/milvus-backupon macOS. - Modes: CLI (
./milvus-backup ...) or REST API server (./milvus-backup server, default port 8080, Swagger UI at/api/v1/docs/index.html). - Storage backends:
local / minio / s3 / aws / gcp / aliyun / azure / tencent / gcpnative; cross-storage backup supported (e.g. S3 region A → S3 region B). - Version rule: a backup can only be restored to the same or newer Milvus version. A 2.6 backup must restore to 2.6.
The right subcommand for CDC standby rebuild
milvus-backup has a dedicated subcommand for restoring into a cluster that should become a CDC secondary — restore secondary. Do not use plain restore for this scenario — that one creates standalone collections without setting up the CDC seam.
# On source cluster A — create the snapshot
./milvus-backup create -n bkp_xxx
# Move backup files to target side (or use crossStorage)
# On target cluster B — restore into secondary mode
./milvus-backup restore secondary \
--name bkp_xxx \
--source_cluster_id A \
--target_cluster_id B
# Finally, on source cluster A — wire up CDC
# (pymilvus.MilvusClient.update_replicate_configuration with topology A→B)
What restore secondary does under the hood
Code reference: cmd/restore/secondary.go + core/restore/secondary/task.go. The tool opens a CreateReplicateStream RPC to the target’s Milvus proxy, declaring itself as the source cluster — i.e. it really does impersonate CDC. Then it pushes messages in a specific order:
| Step | What gets sent | Transport |
|---|---|---|
| ① Open stream | NewStreamClient(sourceClusterID, taskID, pchannels, grpc) → opens a bidirectional CreateReplicateStream to target | gRPC stream |
| ② DB DDL | Database create messages reconstructed from backup meta | Stream |
| ③ Collection DDL | For each collection: CreateCollection / CreatePartition / CreateIndex messages | Stream |
| ④ Collection DML | Not via stream — uses BulkInsert REST API. The tool points BulkInsert at the backup’s segment files in object storage. Splits into L0 (delta) and non-L0 (insert) batches, sequenced per partition; runs concurrently across collections (controlled by backup.parallelism.RestoreCollection) | BulkInsert RESTful |
| ⑤ Collection Load | Load collection messages | Stream |
| ⑥ RBAC restore | Reconciles backup RBAC vs current; sends RestoreRBACMessage via broadcast | Stream broadcast |
| ⑦ FlushAllMsg replay (★ the seam) | For each pchannel, the FlushAllMsg captured at backup time (stored base64 in backup meta) is forwarded into target’s WAL — this materializes the ReplicateCheckpoint on the target at exactly the source’s FlushAll moment | Stream forward |
| ⑧ Wait confirm | streamCli.WaitConfirm() blocks until all sent messages are acked by target | — |
Design vs implementation noteThe PDF design doc describes DML as “ImportMsg pushed via ReplicateStream”. The implementation uses BulkInsert REST instead for the data phase — stream is reserved for DDL / RBAC / Load / FlushAll. The end result is the same (target’s WAL has all the data and a clean checkpoint seam), but the data plane path is BulkInsert, not stream. This also means the §3.3 matrix entry “MessageTypeImport is not replicated by CDC” is consistent — restore secondary doesn’t generate Import messages on the wire.
7.4 End-to-end flow
| Stage | Where it runs | What happens |
|---|---|---|
| ① Backup | Source cluster A (operator runs milvus-backup create) | - Tool calls A’s FlushAll → captures one FlushAllMsg per pchannel; stores base64-encoded in backup meta. - Tool reads segment-level binlogs directly from A’s object storage and copies them to the backup storage (the configured backupBucketName / backupRootPath). - RBAC, schema, collection / partition / index meta exported. - gcPause on by default (2h) — pauses Milvus GC so segments don’t get cleaned mid-backup. |
| ② Move backup files | Operator | If A and B share storage, nothing to do. Otherwise either use crossStorage: true to back up directly into target storage, or transfer files (rclone / aws s3 sync / etc.) so they’re reachable by B’s object store reads. |
| ③ Restore into secondary | Target cluster B (operator runs milvus-backup restore secondary) | - Tool opens CreateReplicateStream to B, declaring source cluster A — B sees this stream as a regular CDC client. - Reproduces DB / Collection / Index DDLs through the stream. - Data restored via BulkInsert REST, partition by partition, L0 / non-L0 sequenced; uses backup=true or l0_import=true options to identify origin. - Load + RBAC restored via stream. - Final step: replays the FlushAllMsgs into B’s WAL via stream — B’s secondaryState now holds a ReplicateCheckpoint pointing at A’s FlushAll position. - WAL retention guard rail: A’s WAL keeps messages a bounded window (default 3 days). If steps ①–③ together exceed the window, A may have GC’d messages beyond the FlushAll checkpoint that B needs next. For very large datasets, extend WAL retention before starting, or be prepared to redo. |
| ④ Wire CDC | Control plane (PyMilvus / Go SDK) | - Call update_replicate_configuration on A and B with the topology A→B. - A’s CDC controller picks up the new ReplicatePChannelMeta, starts a channel replicator per pchannel. - Each replicator queries B’s GetReplicateInfo — gets back the FlushAll-materialized checkpoint, set up by step ③. - CDC reads A’s WAL from after that checkpoint, forwards every subsequent message to B. No gap (FlushAll is the precise seam), no overlap (target-side deduplication filters anything ≤ checkpoint). |
| ⑤ Serving | Target cluster B | B catches up the tail (writes between FlushAll moment and present). Once CDC lag < configured threshold (e.g. 1s), B is marked Serviceable — DR is fully restored. |
8. Monitoring & Operations
8.1 Prometheus Metrics
All milvus_cdc_* metrics are exposed by the Source-side CDC process (the one co-located with the Source Milvus cluster). Defined in internal/cdc/replication/replicatestream/metrics.go.
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
milvus_cdc_replicated_messages_total | Counter | count | channel_name, target_channel_name, msg_type | Total messages successfully forwarded by CDC |
milvus_cdc_replicated_bytes_total | Counter | bytes | channel_name, target_channel_name, msg_type | Total bytes forwarded |
milvus_cdc_replicate_end_to_end_latency | Histogram | ms | channel_name, target_channel_name | End-to-end latency from Source WAL read to Target WAL ack |
milvus_cdc_last_replicated_time_tick | Gauge | ms | channel_name, target_channel_name | Latest replicated TimeTick |
milvus_cdc_stream_rpc_connections | Gauge | count | target_cluster, connection_status (connected / disconnected) | Stream RPC connection counts to a target cluster, grouped by status. Per-replicator transitions Disconnected↔Connected; sum across both label values is the total channel replicator count for that target. |
milvus_cdc_stream_rpc_reconnect_times | Counter | count | target_cluster | Number of stream RPC reconnects (failure / timeout) |
8.2 CDC Lag — PromQL
CDC lag is the time-tick gap between the primary’s WAL frontier and the latest replicated position. It is the single most important metric for steady-state replication health and the data-loss risk window for force failover.
clamp_min(
max by (channel_name) (
milvus_wal_last_confirmed_time_tick
)
-
min by (channel_name) (
milvus_cdc_last_replicated_time_tick
),
0
)
Units: seconds. Note milvus_wal_last_confirmed_time_tick is exposed by the Source Milvus StreamingNode (not CDC); Prometheus must scrape both the Source Milvus pods and the Source CDC pod to compute lag. For multi-standby topologies the min by (channel_name) term picks the slowest-progressing target. Add namespace / app_kubernetes_io_instance label filters when scraping multiple cluster pairs.
8.3 What Increases CDC Lag
- High primary write rate.
- Cross-cluster network latency / packet loss.
- Overloaded standby.
- Under-provisioned CDC pods.
- Large DDLs or imports in flight.
8.4 Alerting Guidance
All alert sources below are Source-side CDC process metrics (CDC lag combines source-WAL and source-CDC scrapes).
| Alert | Threshold | Response |
|---|---|---|
| CDC lag sustained | > 5 minutes | Check network, standby load, CDC pod resources |
| stream_rpc_connections — disconnected sustained | milvus_cdc_stream_rpc_connections{connection_status="disconnected"} > 0 sustained for 1 minute | Diagnose stream RPC reconnect failures (a per-replicator counter — non-zero means at least one channel can’t connect) |
| stream_rpc_reconnect_times rate | rate(milvus_cdc_stream_rpc_reconnect_times[5m]) * 300 > 10 | Network instability — RPO at risk |
| e2e_latency P99 | sustained > 3 s | System saturation or network jitter |
9. Limitations & Boundaries
| Limitation | Detail |
|---|---|
| No active-active | CDC supports single-primary topology only. Multi-cluster simultaneous writes lead to split-brain. Standbys reject direct writes. |
| Star topology only | Currently only one source fan-out to N targets; chained A→B→C→D is on roadmap, not yet supported. |
| PChannel count | All clusters must have identical counts at any moment. Topology-wide append-only growth is supported (all clusters grow N→N+k together via UpdateReplicateConfiguration). Mismatched counts (A:16 / B:24) are not supported and never have been. |
| No automatic historical sync | CDC captures only changes after it is enabled. For non-empty primaries, use Standby Cluster Rebuild (Backup Tool) for the initial bootstrap. |
| No pause API | UpdateReplicateConfiguration has no pause semantics. Workaround: scale CDC pods to 0 (it resumes from checkpoint when scaled back up), or rewrite the topology to remove the edge. |
| CDC is single-replica | components.cdc.replicas: 1 only. Multi-replica CDC is on roadmap. |
| Whole-cluster granularity | Replication is at the cluster level. Database / Collection-level filtering is on roadmap. |
| WAL retention | Default 3 days. Affects standby rebuild (Import must finish within the window unless retention is extended) and replication deletion (re-creating replication after retention expiry causes data loss). |
| Resource Group cross-cluster | Mismatched primary/standby RG configuration can cause load failures during RG message replication. Optimization on roadmap. |
| Import message not yet replicated | MessageTypeImport replication is in development; do not rely on imports being mirrored to the standby today. |
10. 2.5 vs 2.6 CDC Comparison
| Aspect | Milvus 2.5 CDC | Milvus 2.6 CDC |
|---|---|---|
| Code location | Standalone repo zilliztech/milvus-cdc | Built-in internal/cdc/ in the main repo |
| Underlying messaging | Tightly coupled with 2.5 msgstream | Built on Streaming Service WAL |
| Role | Owned business logic (TtManager, vchannel management) | Pure log forwarder; ignores collections / vchannels |
| Management surface | HTTP API to create / pause / delete tasks | Single RPC: UpdateReplicateConfiguration (full replace), gRPC only |
| Checkpoint | CDC persists locally per message | Target persists ReplicateCheckpoint in WAL |
| Consistency | At-least-once (duplicate-write risk) | Exactly-once (target filters by tt) |
| TtMsg handling | CDC injects TtMsgs via TtManager | WAL stamps tt natively; CDC just forwards |
| PChannel mapping | Multi-to-one allowed | 1:1 mapping (with v2.6.12+ relaxation for count increases) |
| Failover modes | None native | Three modes: Graceful, Coordinated, Force + Data Salvage |
| Deployment | Separate milvus-cdc service | Milvus Operator components.cdc.replicas |
11. POC Guidance
11.1 Validation Plan
- Environment: two K8s clusters (or dual-namespace), Milvus v2.6.16 + CDC on each side. Verify CDC pods Running.
- Basic functions: build A→B, exercise DDL (CreateCollection / CreateIndex / CreatePartition), DML (Insert / Delete / Upsert), DCL (CreateUser / Role / Privilege).
- Failover:
- Graceful: with steady write traffic, A→B → B→A → A→B; verify zero data loss across cycles.
- Coordinated: inject StreamingNode CrashLoop, verify the Fence Message + pauseConsumption fallback.
- Force: simulate region partition, run force_promote, record CDC lag and observed data loss.
- Data Salvage: after the original primary is recovered, run
DumpMessagesand verify the salvaged data range is what’s expected. Heads-up: today only the raw gRPC stream is implemented — there is no Backup-Tool-driven JSON/Parquet export or object-storage persistence yet. Plan to write a small consumer client that drains the stream and stores it however your environment expects (S3, Parquet, plain file, etc.). See §6.4. - Standby Rebuild: start from a primary with existing data, build a fresh standby, run all 5 stages.
- Performance:
- Sustained primary writes at production-like vector dimension + batch size; record CDC lag P50/P99.
- Large DDL bursts (e.g. dozens of partitions) and observe lag spikes.
- Observability: ingest all 6 CDC metrics into Grafana, configure thresholds.
- Chaos: kill -9 CDC pod, drop cross-cluster network briefly, briefly stop ETCD; verify recovery.
- Security: deploy per-cluster mTLS between source and target, verify cross-CA isolation by intentionally misconfiguring a cert (should fail).
11.2 Pre-Production Checklist
- ☐ Both clusters on Milvus v2.6.16+, Operator v1.3.4+.
- ☐ PChannel counts planned in advance and consistent across clusters.
- ☐ CDC component enabled on both primary and standby (standby will need it after switchover).
- ☐ Cross-cluster network reachability validated (CDC pods can reach target ClusterIP / DNS).
- ☐ Per-cluster mTLS enabled on cross-cluster links.
- ☐
ReplicateConfigurationtemplates code-reviewed (the delete-then-rebuild path can lose data — see FAQ). - ☐ Dashboards live, alerts configured against the 6 metrics + the CDC-lag PromQL.
- ☐ All three failover modes drilled.
- ☐ Standby Cluster Rebuild drilled end-to-end.
12. CDC Metadata Reference
This section catalogs every piece of persistent state CDC depends on — keys, schemas, owners, lifecycle. Useful for SREs operating CDC in production, debugging desync, or auditing what survives a restart.
12.1 Where each piece lives
Three storage layers participate:
| Layer | Backing store | Prefix | What it holds |
|---|---|---|---|
| StreamingCoord catalog | ETCD | streamingcoord-meta/ | Cluster-wide replication topology & per-pchannel CDC tasks |
| StreamingNode catalog | ETCD | streamingnode-meta/wal/{channel}/ | Per-PChannel WAL recovery state & replicate checkpoints |
| Streaming messages | WAL (Woodpecker / Pulsar / Kafka) | — | In-flight control-plane messages, e.g. AlterReplicateConfigMessage |
12.2 StreamingCoord meta (cluster-wide topology)
| Key | Proto | Purpose | Lifecycle |
|---|---|---|---|
streamingcoord-meta/replicate-configuration | ReplicateConfigurationMeta | The single source of truth for the current cross-cluster topology on this cluster: list of clusters, edges, and a force_promoted flag indicating whether the current state came from a Force Failover. | Written by the broadcaster ACK callback when an AlterReplicateConfigMessage commits. Replaces the previous value atomically. Read at SC startup to restore in-memory state. |
streamingcoord-meta/replicating-pchannel/{targetClusterID}-{sourceChannelName} | ReplicatePChannelMeta | One entry per source → target PChannel edge. Contains the source/target channel names, the target cluster’s connection info, and an initialized_checkpoint used when no prior progress exists (e.g. brand-new replication, or pchannel-increasing growth). | Written by the same ACK callback as above (SaveReplicateConfiguration writes both keys in one call). Deleted when an AlterReplicateConfigMessage removes the edge. This is the prefix CDC Controller watches to start/stop channel replicators. |
streamingcoord-meta/broadcast-task/... | Internal broadcast task state | Tracks in-flight broadcasts (e.g. an AlterReplicateConfigMessage being propagated to all pchannels). Force Failover scans this prefix to mark stale in-flight broadcasts with the ignore flag, preventing them from overwriting the new force-promote config. | Created when broadcast starts; removed when all pchannels ACK or the broadcast is force-ignored. |
Schema highlights
message ReplicateConfigurationMeta {
common.ReplicateConfiguration replicate_configuration = 1; // clusters + edges
AckedResult acked_result = 2; // broadcast progress
bool force_promoted = 3; // true if from force_promote=true
}
message ReplicatePChannelMeta {
string source_channel_name = 1; // e.g. clusterA-rootcoord-dml_3
string target_channel_name = 2; // e.g. clusterB-rootcoord-dml_3
common.MilvusCluster target_cluster = 3; // target uri / token / cluster_id
common.ReplicateCheckpoint initialized_checkpoint = 4; // start position for new edges
bool skip_get_replicate_checkpoint = 5; // true for pchannel-increasing tasks
}
12.3 StreamingNode meta (per-PChannel WAL recovery)
These keys are scoped under streamingnode-meta/wal/{pchannel}/ on each PChannel’s catalog. The StreamingNode hosting that PChannel owns them; CDC reads them via RPC, not by direct ETCD access.
| Key suffix | Proto | Purpose | Lifecycle |
|---|---|---|---|
consume-checkpoint | WALCheckpoint | Why it’s in this reference: this key is the only persistent record of CDC replication progress on a target cluster. If you want to audit “where exactly has the standby caught up to from the primary?” off-line — without calling RPCs, just inspecting etcd — this is where you look. Its replicate_checkpoint field is the durable mirror of the in-memory secondaryState.checkpoint that GetReplicateInfo returns. On node restart, the target rebuilds its in-memory CDC state from this key. The proto carries two checkpoints serving two audiences: - message_id / time_tick — local WAL position; used by StreamingNode to resume WAL replay. - replicate_checkpoint — source-side position; used by CDC (indirectly via GetReplicateInfo) to know where to resume forwarding. | Batched persistence: recovery_background_task.go wakes every persistInterval (default 10s) and writes the in-memory dirty snapshot. Overwrite semantics — only the latest is kept. Read at node restart to rebuild in-memory state. |
salvage-checkpoint/{sourceClusterID} | commonpb.ReplicateCheckpoint | Captured at the moment of Force Failover. Records the last source-side position this cluster had received from the (now-lost) source, keyed by source cluster ID. Enables Data Salvage: when the old primary comes back, this checkpoint tells Backup Tool from where to dump unreplicated messages. | Written during the force_promote broadcast handling. Retention 7 days. After that the salvage opportunity is lost. |
WALCheckpoint schema highlights
message WALCheckpoint {
common.MessageID message_id = 1; // local WAL position
uint64 time_tick = 2; // local TT
int64 recovery_magic = 3; // schema version hint
common.ReplicateConfiguration replicate_config = 4; // joined topology, if any
common.ReplicateCheckpoint replicate_checkpoint = 5; // SOURCE-SIDE MessageID + TT
AlterWALState alter_wal_state = 6; // in-progress WAL alter ops
}
// Defined in milvus-proto:
message ReplicateCheckpoint {
string cluster_id = 1; // source cluster id this checkpoint belongs to
string pchannel = 2;
common.MessageID message_id = 3; // source-side
uint64 time_tick = 4; // source-side
}
The two checkpoint perspectives in one WALCheckpoint entry — local vs replicate — are the bridge between “where am I in my own WAL” and “where am I in the source’s WAL”.
12.4 Streaming messages (transient, in-WAL)
Not persistent in a catalog — these flow through the WAL itself, but they’re CDC’s control-plane “events” and worth listing because operators see them in WAL traces / logs.
| MessageType | Carrier | Purpose |
|---|---|---|
MessageTypeAlterReplicateConfig (V2) | Source cluster’s WAL (broadcast across all pchannels) | The actual cross-cluster topology change event. Carries the new ReplicateConfiguration, plus force_promote and ignore flags. Doubles as the Fence Message for Coordinated Failover and Graceful Switchover (RPO=0 barrier). |
MessageTypeFlushAll | Source cluster’s WAL | Forces flushing of all vchannels. Used by Standby Rebuild Stage ④ to materialize a ReplicateCheckpoint on the new standby exactly at the snapshot boundary. |
12.5 Who reads / writes each piece
Two groupings here: (a) keys the CDC pod directly touches via etcd; (b) keys CDC never touches but whose contents are part of the CDC data path (CDC interacts with them only through RPC to other components).
(a) etcd keys CDC pod directly operates on
The CDC pod is a reader + conditional deleter of one prefix and never a writer/creator of any etcd key.
| Meta key | CDC pod’s etcd op | Code location | Purpose |
|---|---|---|---|
replicating-pchannel/* | Watch (long-poll, prefix) | internal/cdc/controller/controller.go:90 | Reacts to PUT events to start channelReplicator for new edges; reacts to DELETE events to tear them down. |
replicating-pchannel/* | List/Get (prefix, on startup) | controller.go:107 → meta/replicate_meta.go:96 (etcdCli.Get(prefix, ...)) | Full snapshot on startup or reconnect — reconciles in-memory replicator set with current truth. |
replicating-pchannel/{specific key} | Conditional Delete (txn with ModRevision compare) | replicate_stream_client_impl.go:354 → meta/replicate_meta.go:105 (etcdCli.Txn().If(cmps).Then(delOp)) | When the AlterReplicateConfigMessage indicates this edge is removed (IsReplicationRemovedByAlterReplicateConfigMessage returns true), CDC cleans up its own ReplicatePChannelMeta and exits. The revision compare ensures it only deletes what it was managing. |
That’s all of CDC’s direct etcd footprint. No Put, no Create, no other prefixes.
(b) etcd keys CDC never touches (but their content matters for CDC behavior)
These keys are written and read by other components (StreamingCoord, StreamingNode). CDC interacts with them only via RPC to those owners, never by reading etcd directly.
| Meta key | Writer | Reader(s) | How CDC interacts |
|---|---|---|---|
replicate-configuration | StreamingCoord (ACK callback of AlterReplicateConfigMessage) | StreamingCoord (restart recovery, validation); SDK GetReplicateConfiguration (token-redacted) | Not at all — CDC doesn’t need full topology, only its per-pchannel slice from replicating-pchannel/* |
broadcast-task/* | StreamingCoord (broadcaster) | StreamingCoord (force-promote incomplete-broadcast cleanup) | Not at all — purely SC-internal |
consume-checkpoint | StreamingNode (background batch task, every persistInterval) | StreamingNode itself (WAL recovery; rebuilds in-memory secondaryState.checkpoint on restart) | Indirect — CDC calls GetReplicateInfo RPC; Proxy returns the in-memory secondaryState.checkpoint (whose persistent mirror is this etcd key). CDC never reads etcd here. |
salvage-checkpoint/* | StreamingNode (during force_promote broadcast handling) | StreamingNode (loaded into in-memory SalvageCheckpoints) | Indirect — Backup Tool (not CDC) calls GetReplicateInfo + DumpMessages RPC during Data Salvage |
12.6 Operational implications
- If
replicating-pchannel/*diverges fromreplicate-configuration: serious — the watcher and the validator are looking at different truth. Usually points to an interrupted broadcast or partial restore. Re-applyingUpdateReplicateConfigurationtypically heals. - If
consume-checkpoint.replicate_checkpointis stale (far behind WAL latest): normal — it’s batched (10s default). If consistently > 30s behind under low load, the persist task may be stuck; checkrecovery_background_tasklogs. - If
salvage-checkpoint/*has aged past 7 days after a force failover: Data Salvage window has closed. Any unreplicated data on the old primary is no longer recoverable through the standard flow. - Backup/restore of these metas: standard ETCD backup covers all of them. Restoring an ETCD snapshot mid-stream while CDC is running will desync — drain CDC and pause writes first.
- Multi-cluster operator note: each cluster has its own copies of all keys above. They are not shared across clusters. Cross-cluster consistency is achieved through replicating
AlterReplicateConfigMessage, not through shared meta storage.
13. FAQ
GetReplicateInfo — is it really available?The RPC exists and works; only the high-level PyMilvus wrapper is missing.
Proto / server side:
MilvusService.GetReplicateInfois defined and implemented (internal/proxy/impl.go:7062). Returnscheckpoint+salvage_checkpoint(bothReplicateCheckpoint).Go SDK: implemented at
client/milvusclient/replicate.go:38asClient.GetReplicateInfo(ctx, req, opts...), with example inreplicate_example_test.go.PyMilvus: only the auto-generated low-level gRPC stub is available (
pymilvus/grpc_gen/milvus_pb2_grpc.py:599); there is no high-levelMilvusClient.get_replicate_info()method as of today. Workaround until a wrapper lands:from pymilvus.grpc_gen import milvus_pb2, milvus_pb2_grpc import grpc channel = grpc.insecure_channel("localhost:19530") stub = milvus_pb2_grpc.MilvusServiceStub(channel) resp = stub.GetReplicateInfo(milvus_pb2.GetReplicateInfoRequest( source_cluster_id="source-cluster", target_pchannel="target-cluster-rootcoord-dml_0", )) print(resp.checkpoint, resp.salvage_checkpoint)This bypasses
MilvusClient’s connection management, token injection, and retry — fine for a one-off probe but not recommended for production until the high-level method is added.Roadmap: a high-level
MilvusClient.get_replicate_info()wrapper will land in a subsequent PyMilvus release. Once it ships, applications can call it the same way they callupdate_replicate_configurationtoday.
Not noticeably. CDC consumes the WAL asynchronously and is off the primary data path. It does share node resources with other Milvus components, so resource isolation matters under saturation.
Yes—search, query, and metadata reads all work on the standby. Direct writes are rejected. Do not invoke DDL like load_collection manually on the standby; those operations should run on the primary and replicate over.
No. CDC captures only changes after activation. For non-empty primaries, run the Standby Cluster Rebuild flow first using the standalone zilliztech/milvus-backup tool: ./milvus-backup create -n bkp on the source, then ./milvus-backup restore secondary --source_cluster_id A --target_cluster_id B on the target. The tool replays the FlushAll captured at backup time to materialize the standby's ReplicateCheckpoint, after which CDC seamlessly takes over the incremental tail. See §7.
Not today. CDC's automatic forwarding of MessageTypeImport from a source WAL is on the roadmap, not yet shipped — see the §3.3 message replication matrix. If your application relies heavily on Bulk Insert, this is a real gap to plan around: either pause Bulk Insert on the primary during steady-state CDC operation, or schedule periodic Standby Cluster Rebuilds matched to your Bulk Insert cadence so the standby catches up.
Note this is unrelated to the milvus-backup restore secondary path: that tool itself pushes ImportMsg into the target via ReplicateStream as part of the rebuild flow — that's a different code path and works fine. The gap is only about CDC's automatic forwarding of Import messages during normal replication.
No—RPO = 0. The standby waits until the AlterReplicateConfigMessage has been replicated before promoting.
No. Milvus only provides the streaming.walScanner.pauseConsumption config switch — when you set it true, the SN responds by entering "WAL-write only" mode and breaks the crashloop. The whole "try the switch, fail, flip the config, retry again" loop is your control plane's responsibility, not Milvus internal logic. For POC you need to budget engineering time for a small script (or operator) that implements this escalation. See §6.2.
The upper bound is the CDC lag at the moment of cutover. Zero lag = zero loss. Even when loss happens, Data Salvage on the recovered original primary lets you drain the missing window via the DumpMessages gRPC stream — note that today only the raw stream RPC is implemented (no JSON/Parquet serializer, no automatic object-storage writer, no replay tool), so you have to write a small consumer client that drains the stream and decides replay policy yourself. See §6.4.
They must be rebuilt against the new primary. Sibling standbys never received the Fence Message (the old primary was unreachable), so their state may have diverged from the promoted standby in either direction — they could be ahead or behind. No general-purpose mechanism exists to reconcile in place. Use milvus-backup restore secondary --source_cluster_id <new-primary> --target_cluster_id <sibling> for each sibling.
This is Force-specific. Graceful and Coordinated Failovers propagate the Fence Message to all standbys (broadcast through CDC), so sibling rewire is automatic there — no rebuild needed. See §6.5 and §7.
Vector DB writes (Insert / Upsert / Delete) don't admit cheap cross-cluster conflict resolution—primary-key collisions and vector versions require business decisions. The 2.6 CDC scope is engineering-viable DR; full multi-master is out of scope.
Submit an UpdateReplicateConfiguration whose new topology does not include the source→target edge. CDC detects the edge removal (IsReplicationRemovedByAlterReplicateConfigMessage in internal/cdc/util/util.go), calls RemoveReplicatePChannelWithRevision to delete the ReplicatePChannelMeta from ETCD, and tears down the channel_replicator.
Code-wise there is no lock preventing re-creation. The real risk is that the standby's ReplicateCheckpoint is retained: if you re-establish replication after a window longer than WAL retention (default 3 days), the messages between the delete and re-create are gone from the source WAL, and CDC will resume from the stale checkpoint with a data hole. To safely re-create, run Standby Cluster Rebuild to reset the standby state instead of relying on the residual checkpoint.
There is no pause API. Workaround: scale the CDC pod to 0 — when scaled back up, it resumes from the persisted ReplicateCheckpoint with no data loss (CDC's resume position is on the target side, not in the CDC pod itself).
Do NOT use "remove the edge from topology" as a pause. That's a delete operation (see the next FAQ). Once deleted, the standby's residual ReplicateCheckpoint stays put but the WAL retention window keeps ticking; re-establishing replication after the WAL retention expires will silently lose data between delete and re-create.
Not supported. 2.5 CDC and 2.6 CDC are different implementations on different message subsystems. Use Backup + Restore or in-place upgrade to migrate; CDC works between 2.6+ clusters only.
A cluster's initial pchannel count is set at creation time (tied to shardNum). UpdateReplicateConfiguration supports topology-wide append-only growth — you can submit a new config where every cluster has more pchannels (N→N+k) in lock-step, and CDC will provision replicators for the newly added pchannels using InitializedCheckpoint rather than GetReplicateInfo. The validator still rejects any configuration where clusters have different pchannel counts at the same time. Plan baseline capacity at production scale; treat topology growth as a coordinated operational change, not a quick fix for an undersized cluster.
The primary cluster's writes are unaffected (CDC is off the write path). On restart, CDC resumes from the latest ReplicateCheckpoint and the standby catches up; lag metrics rise temporarily. Kubernetes restarts the pod automatically.
CDC only consumes the WAL—it has no direct coupling to object storage. Source and target may use independent buckets in different regions; the only inter-region requirement is that the ReplicateStream RPC be reachable.
Complementary:
- CDC: incremental, sub-second to seconds of lag, hot standby. Built into Milvus 2.6 core (
internal/cdc/) — no separate install. - Backup: full snapshot, minutes to hours, used for initial bootstrap and migration. Standalone tool
zilliztech/milvus-backup— not part of the Milvus main repo. Install separately (binary release orbrew install zilliztech/tap/milvus-backup). See §7.
No. Conflict resolution (skip / overwrite / merge on primary-key collision) is business-defined. The system surfaces the salvaged data; the user owns the replay policy.