1. Overview & Use Cases

1.1 What is Milvus CDC

Milvus CDC (Change Data Capture) is the built-in cross-cluster replication mechanism in Milvus 2.6. It subscribes to the source cluster’s WAL (Write-Ahead Log) and forwards changes (DDL / DML / DCL) in order to one or more target clusters, building a near-real-time primary-standby topology.

Important: 2.5 vs 2.6 distinctionMilvus 2.6 CDC is not the same component as the standalone zilliztech/milvus-cdc service used with 2.5. The 2.6 implementation is a complete rewrite:

  • From a standalone external service to a built-in module inside the Milvus main repo (internal/cdc/).
  • From tightly-coupled msgstream consumption to a log forwarder on top of the new Streaming Service WAL.
  • From HTTP task-management APIs to a single unifying RPC: UpdateReplicateConfiguration.

See Section 10 for full comparison.

1.2 Core Properties

PropertyDetail
Real-timeWAL-driven subscription; source-to-target visibility is sub-second to seconds in steady state.
ConsistencyRelies on strictly monotonic WAL TimeTick plus Target-side ReplicateCheckpoint filtering. Provides exactly-once delivery semantics.
ResilienceSupports checkpoint-based resume. After CDC restart or RPC failure, replication resumes from the latest checkpoint with no loss or duplication.
ExtensibilitySingle-primary multi-standby star topology supported. New MessageType additions in Milvus require no CDC changes.

1.3 Use Cases

Use CaseDescriptionStatus
Primary-standby DROne primary writer + one or more hot read replicas. Switch traffic on failure.Primary focus
Multi-region DRStar topology: one primary fan-out to geographically distributed standbys.Supported
Data fan-outProduction cluster → multiple analytical / test clusters.Supported
Workload isolationOnline write cluster → read-only analytical cluster.Supported
Historical data migrationRequires the Backup Tool for the full bootstrap; CDC handles the incremental tail.Requires Backup
Bidirectional active-activeMultiple clusters accepting writes simultaneously.Not supported

1.4 Version Requirement

  • Milvus v2.6.16 or later required for production. Key features (force_promote / Data Salvage / GetReplicateConfiguration / Coordinated Failover) were merged across several patch versions through this baseline.
  • Milvus Operator v1.3.4 or later for Kubernetes deployment.
  • Message store: Woodpecker, Pulsar, or Kafka are all supported by Milvus 2.6. CDC itself does not require any particular choice; pick what your infrastructure standard dictates.

2. Architecture & Design Philosophy

2.1 Overall Architecture

RoleResponsibility
Source Cluster (Primary)Source of truth. Handles user DDL / DCL / DML. Writes land in the source WAL via Streaming Nodes. Read-write.
CDC ServiceReplication engine. A Controller watches the source cluster’s ETCD for ReplicateConfiguration changes; per-pchannel Channel Replicators subscribe to the source WAL and forward messages to the target via a bidirectional ReplicateStream RPC. Deployment principle: CDC is co-located with the Source Milvus cluster (same Kubernetes cluster), minimizing the network cost of ETCD watching and WAL consumption.
Target Cluster (Standby)Replication target. Receives forwarded messages from CDC and writes them into its own WAL. Reads are served; direct writes are rejected while in standby state to prevent split-brain.
SDK / Control PlaneSends UpdateReplicateConfiguration RPCs to source and target clusters to manage the replication topology.
flowchart LR
  subgraph Source["Source Cluster  Primary"]
    direction TB
    SMS["MilvusProxy / StreamingNode"]
    SETCD[("ETCD meta")]
    SWAL[("WAL")]
    SCDC["CDC ServiceWatcher + Replicator"]
    SMS -->|save config| SETCD
    SMS -->|produce| SWAL
    SETCD -.->|watch| SCDC
    SWAL -->|sub| SCDC
  end
  subgraph Target["Target Cluster  Standby"]
    direction TB
    TMS["MilvusProxy / StreamingNode"]
    TETCD[("ETCD")]
    TWAL[("WAL")]
    TCDC["CDC Servicestandby"]
    TMS -->|save config| TETCD
    TMS -->|produce| TWAL
  end
  SDK(["SDK / Console"]) -->|UpdateReplicateConfiguration| SMS
  SDK -->|UpdateReplicateConfiguration| TMS
  SCDC ==>|ReplicateStream RPC pub| TMS
  TMS -->|append| TWAL
  style Source fill:#e6f4ea,stroke:#1a7f37
  style Target fill:#dbeafe,stroke:#1f6feb
  style SCDC fill:#fff3bf,stroke:#d4a72c

Why install CDC on both sides?

  • Single primary + single standby (A↔B): after a switchover or coordinated failover, the standby becomes the new primary and starts replicating reverse direction; its CDC pod, previously idle, now has to do work. If you didn’t pre-install CDC on the standby, replication will not resume after switchover.
  • Single primary + multiple standbys (A→B, A→C): behavior splits by failover mode.
    • Graceful / Coordinated (Fence Message reaches all standbys): every standby’s CDC is rewired automatically — C’s secondaryState switches source from A to B and resumes from the Fence position in B’s WAL. No rebuild needed. See §6.2 for the mechanism.
    • Force Failover (A unreachable, no Fence): C never received a sync barrier, its state may diverge from B in either direction, and there’s no general reconciliation path. C must be rebuilt against the new primary. This matches AWS Aurora Global Database’s stance — once the original primary is gone unceremoniously, sibling secondaries are inconsistent and must be re-seeded.

So: install CDC on every cluster that might become primary, and plan rebuilds for siblings.

Minimal Milvus CR (Kubernetes Custom Resource managed by the Milvus Operator) with CDC enabled:

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: source-cluster
spec:
  mode: cluster
  components:
    image: milvusdb/milvus:v2.6.16
    cdc:
      replicas: 1   # CDC currently supports a single replica
  dependencies:
    msgStreamType: woodpecker   # or pulsar / kafka

2.2 Design Philosophy: CDC as a Log Forwarder

Milvus 2.5 CDC was deeply coupled with the old msgstream layer—it owned TtMsg management and Collection / VChannel mapping logic. In 2.6 CDC is reduced to a pure log forwarder:

  • Decoupled from Milvus internals. CDC has no knowledge of Collections, VChannels, or timestamp semantics.
  • Simple. CDC’s only job is to subscribe to the source WAL in order and replay to the target in order.
  • Extensible. When Milvus adds a new MessageType, CDC needs no change.

Enabling TtMsg replication: the key shift

In 2.5, to keep multi-stream ingest consistent, the legacy CDC disabled Target RootCoord’s TtMsg production and ran an in-CDC TtManager to inject TtMsgs. This made CDC inseparable from Milvus internals.

In 2.6 the new WAL guarantees strictly increasing TimeTick on write, so CDC no longer needs to fabricate TtMsgs—it simply forwards source messages and the WAL stamps timestamps automatically. Consequences:

  • No CDC dependency on RootCoord behavior.
  • Order consistency falls out of the WAL’s own write ordering.
  • The CDC code path is light enough to debug and iterate quickly.

2.3 PChannel Mapping

When forwarding to the target, CDC routes each source PChannel to a target PChannel under a strict 1:1 mapping. The validator (pkg/util/replicateutil/config_validator.go) enforces this at every UpdateReplicateConfiguration:

  • Equal counts: at any moment in time, every cluster in a ReplicateConfiguration must have the same number of pchannels (line 133–139). Heterogeneous configurations such as A=16 / B=24 are rejected outright.
  • Append-only growth: across successive UpdateReplicateConfiguration calls, each cluster’s pchannels list may grow but never shrink, and existing entries must remain at the same positions (line 271–283). When the new total exceeds the previous total, the validator marks the change as isPChannelIncreasing.
  • Topology-wide expansion: a growth operation applies to all clusters together — both source and target grow from N to N+k simultaneously in the same submission. You cannot grow A alone.

So “v2.6.12+ relaxes pchannel constraints” really means you can grow the whole topology together over time, not that you can run with mismatched counts. Mismatched counts have never been and are not supported.

flowchart LR
  S0[("s-pchannel-0")] --> T0[("t-pchannel-0")]
  S1[("s-pchannel-1")] --> T1[("t-pchannel-1")]
  Sn[("s-pchannel-n")] --> Tn[("t-pchannel-n")]
  style S0 fill:#e6f4ea
  style S1 fill:#e6f4ea
  style Sn fill:#e6f4ea
  style T0 fill:#dbeafe
  style T1 fill:#dbeafe
  style Tn fill:#dbeafe

Default PChannel name convention: {cluster_id}-rootcoord-dml_{i} for i in 0..pchannel_num-1. The mapping is persisted in the Target’s Meta (ReplicatePChannelMeta) so failover, restart, and scale-out reuse it without manual reconstruction.

What controls the count: rootCoord.dmlChannelNum

The number of PChannels in a cluster is a cluster-level config (one number per Milvus cluster, not per collection):

# configs/milvus.yaml
rootCoord:
  dmlChannelNum: 16   # The number of DML-Channels to create at root coord startup.

Default 16, defined in pkg/util/paramtable/component_param.go:1882, available since v2.0. The PChannel names you put into ReplicateConfiguration must list exactly these channels for the cluster ({cluster_id}-rootcoord-dml_0 through {cluster_id}-rootcoord-dml_{dmlChannelNum-1}).

Don’t confuse three related concepts:

ConceptWhat it isControlled by
PChannel (physical)DML message channels created at RootCoord startup. Cluster-wide.rootCoord.dmlChannelNum (cluster config)
VChannel (virtual)Per-collection shard; each shard maps to exactly one VChannel.shardNum at create_collection time
MappingMultiple collections’ VChannels are hashed onto the fixed set of PChannels: hash(collection_id) mod dmlChannelNumImplicit

This is why CDC validates pchannel lists at the cluster level: one list per cluster, not per collection.

POC recommendationFor the initial POC, use the same pchannel count on both sides. This:

  • Matches the design documents’ canonical configuration.
  • Avoids triggering edge cases in the relaxed heterogeneous mode.
  • Makes failure diagnosis simpler (1:1 channel correspondence in metrics and logs).

If pchannel-count flexibility ends up being required, request feature confirmation from the Milvus team before going to production.

2.4 ReplicateCheckpoint & Exactly-Once

In 2.5, CDC persisted a checkpoint on the CDC side after every successful message replication. Two issues followed:

  • I/O overhead: every message paid a disk write.
  • Duplicate-write risk: if a message landed in the Target but the RPC dropped before ack returned, CDC would re-send it, and the Target would write it again.

In 2.6 the ReplicateCheckpoint mechanism flips the responsibility — and the IO model changes entirely:

  • Target owns the checkpoint: when a replicated message lands in the Target’s WAL, the Target tracks the corresponding ReplicateCheckpoint (source MessageID + TimeTick) in its in-memory recovery state.
  • Persistence is batched, not per-message: a background task in recovery_background_task.go wakes on a persistInterval ticker (default 10s, configurable) and flushes the dirty recovery snapshot in a single batch write to the StreamingNode catalog — vchannels + segment assignments + checkpoint go through one batch transaction. Hot-path writes per replicated message are just the WAL append on the target side, the same write the Target would do anyway for any local write.
  • WAL is the source of truth, catalog is a snapshot: even if the node crashes between two persist ticks, no data is lost — the WAL already has every message. On restart the Target replays its WAL to rebuild in-memory state and reads the most recently persisted checkpoint from the catalog as the resume anchor.
  • Deduplication on resend: if CDC re-sends a message that already landed, the Target’s WAL interceptor sees incoming.tt ≤ ReplicateCheckpoint.tt and drops it. No double-write.
  • CDC keeps no local checkpoint: at startup it calls GetReplicateInfo on the Target to learn the resume position. The CDC side has zero checkpoint persistence — the whole 2.5-style overhead is gone.

Net effect on IO2.5 paid one CDC-side checkpoint write per replicated message (thousands per second under load). 2.6 pays one catalog batch write per persistInterval (~once per 10s by default), shared across multiple recovery state changes. The hot path becomes “append to target WAL, that’s it” — same as a normal local write, no extra checkpoint syscalls per message.

type WALCheckpoint struct {
    MessageId          *messagespb.MessageID
    TimeTick           uint64
    // For Replicate messages, the source-side MessageID and TimeTick
    ReplicateMessageId *messagespb.MessageID
    ReplicateTimeTick  uint64
}

Two-layer filtering — where deduplication actually happens

The ReplicateCheckpoint participates in two filters at different layers; together they give exactly-once. They are not redundant, they cover different failure modes.

LayerWhereWhat it doesWhat it covers
① Source CDC — read-offset optimizationinternal/cdc/replication/replicatemanager/channel_replicator.go:127–135At startup CDC calls GetReplicateInfo on Target to get the checkpoint, then subscribes the Source WAL with DeliverFilterTimeTickGT(cp.TimeTick). The WAL scanner skips messages whose tt ≤ cp.tt — CDC never even reads them, so it never forwards them.CDC restart, scale-up, leader change — anything where CDC re-establishes a Scanner from scratch.
② Target ReplicateInterceptor — idempotent fenceinternal/streamingnode/server/wal/interceptors/replicate/replicates/impl.go:170–177When a message reaches the Target’s WAL write path, the interceptor compares incoming tt against the current ReplicateCheckpoint. If incoming.tt ≤ cp.tt (or < for txn-body messages) it returns IgnoreOperation and the message is dropped before append.RPC drops between Target WAL append and ack arriving at CDC; CDC’s in-memory send buffer re-pushes a message already written. Source-side filtering can’t catch this — the message is already past the Source WAL read.

Concretely:

  • No-failure steady state: CDC’s Source-side filter is enough; the Target interceptor sees only fresh messages.
  • RPC mid-flight drop: CDC’s resend hits the Target interceptor; the interceptor’s idempotent fence drops the duplicate.
  • CDC process crash + restart: New CDC re-reads checkpoint, sets DeliverFilterTimeTickGT, resumes — neither layer sees a duplicate because nothing was in flight.

The Target-side check is the source of truth-level exactly-once; the Source-side check is the source of efficiency-level exactly-once.

The two typical scenarios this mechanism handles:

ScenarioFlow
CDC service restart1. Source WAL contains S100, S200, S300. 2. CDC has forwarded S100 and S200 to Target. Target’s Checkpoint = S200. 3. The CDC service itself crashes and restarts. 4. CDC calls GetReplicateInfo on Target and learns the checkpoint is S200. 5. CDC resubscribes from S200 (Exclude) and resumes at S300.
RPC connection drop1. CDC is in the middle of forwarding S200 when the RPC drops, but S200 actually reached the Target. 2. CDC reconnects and re-sends S200. 3. Target sees S200.tt ≤ Checkpoint.tt and drops it—no duplicate write. 4. CDC forwards S300 normally.

3. Data Flow & Message Replication

3.1 Startup / Restart Recovery

Role of each stepExcept step 2 (initiated by the control plane / SDK), all subsequent steps below are executed by the Source-side CDC service. Since CDC is co-located with the Source Milvus cluster, the ETCD watch in step 3 and the WAL subscription in step 5 are in-cluster operations; only steps 4 and 6 are cross-cluster RPCs to the Target.

Initial startup

  1. Start Source Milvus cluster A, Target Milvus cluster B, and CDC service. (operator / K8s)
  2. Control plane sends UpdateReplicateConfiguration to both A and B, carrying ClusterID / URI / Token / PChannels for both and a topology edge A→B. (SDK / control plane)
  3. CDC watches A’s ETCD and detects the new ReplicateConfiguration. (Source-side CDC Controller)
  4. CDC calls B’s GetReplicateInfo for each PChannel to get the ReplicateCheckpoint. (For two empty clusters, the checkpoint is the WAL’s earliest.) (Source-side CDC Channel Replicator)
  5. CDC creates a WAL Scanner at A starting from the checkpoint and begins consuming. (Source-side CDC, in-cluster WAL Read)
  6. CDC calls B’s CreateReplicateStream to establish the bidirectional gRPC stream and starts forwarding consumed messages. (Source-side CDC Stream Client)

Recovery after CDC restart

  1. A, B, and CDC are in a healthy replicating state.
  2. The CDC service crashes and is restarted.
  3. CDC reads ReplicateConfiguration from A’s ETCD.
  4. CDC calls B’s GetReplicateInfo for the pre-crash ReplicateCheckpoint.
  5. CDC resubscribes A’s WAL from that checkpoint and rebuilds the Scanner.
  6. CDC re-establishes the ReplicateStream RPC with B and resumes forwarding.

3.2 Message Replication Details

DDL example: CreateCollection

  1. Source Milvus writes CreateCollectionMsg to multiple VChannels.
  2. CDC forwards the first CreateCollectionMsg to the Target.
  3. Target resolves VChannelName to PChannelName and locates the WAL.
  4. Target writes CreateCollectionMsg to its WAL.
  5. The ACK path triggers RootCoord’s CreateCollection.
  6. RootCoord creates the collection meta in Creating state with just one vchannel registered.
  7. CDC forwards CreateCollectionMsg from the remaining VChannels; RootCoord fills in the other vchannels.
  8. When the meta’s vchannel count equals shardNum, the collection transitions to Created.

DropCollection, CreatePartition, DropPartition, CreateDatabase, DropDatabase, CreateRole, and similar DDLs follow the same pattern: RootCoord processes the first message and subsequent messages are idempotent.

DML: Insert / Delete

  1. CDC consumes InsertMsg / DeleteMsg.
  2. CDC calls the Target’s Replicate RPC.
  3. The Target’s WAL interceptor:
    • Rewrites CollectionID / PartitionID / VChannel (source-side IDs are not valid on the target).
    • Allocates a new SegmentID.
  4. Once rewriting is done, the message is appended to the WAL.

3.3 Message Type Replication Matrix

Below is the current matrix of MessageTypes and whether CDC replicates them.

MessageTypeCategoryReplicatedNote
MessageTypeTimeTickSystemNWAL self-stamps tt; nothing to replicate
MessageTypeInsertDMLY
MessageTypeDeleteDMLY
MessageTypeImportDMLNNot yet supported; on roadmap (in progress)
MessageTypeManualFlushDMLY
MessageTypeCreateCollectionDDLY
MessageTypeDropCollectionDDLY
MessageTypeAlterCollectionDDLY
MessageTypeCreatePartitionDDLY
MessageTypeDropPartitionDDLY
MessageTypeCreateDatabaseDDLY
MessageTypeAlterDatabaseDDLY
MessageTypeDropDatabaseDDLY
MessageTypeAlterAlias / DropAliasDDLY
MessageTypeSchemaChangeDDLY
MessageTypeAlterLoadConfig / DropLoadConfigDDLY
MessageTypeAlterReplicateConfigDDLYTopology change carrier (incl. force_promote)
MessageTypeCreateIndex / AlterIndex / DropIndexIndexY
MessageTypeAlterUser / DropUserRBACY
MessageTypeAlterRole / DropRoleRBACY
MessageTypeAlterUserRole / DropUserRoleRBACY
MessageTypeAlterPrivilege / DropPrivilegeRBACY
MessageTypeAlterPrivilegeGroup / DropPrivilegeGroupRBACY
MessageTypeRestoreRBACRBACY
MessageTypeAlterResourceGroup / DropResourceGroupRGYRG-cross-cluster optimization on roadmap
MessageTypeBeginTxn / CommitTxn / TxnSystemYTransaction messages replicated as a group
MessageTypeRollbackTxnSystemNFiltered before reaching Scanner/CDC
MessageTypeCreateSegmentSystemNSegments are managed by the standby itself
MessageTypeFlushSystemNSystem flush is local to each cluster

4. API & SDK

gRPC only — no REST endpointThe CDC control plane (UpdateReplicateConfiguration, GetReplicateConfiguration) is exposed solely via gRPC. Clients use PyMilvus or the Go SDK, which speak gRPC under the hood. The Milvus Proxy’s RESTful gateway (v1 / v2) does not route these RPCs.

The older “Manage CDC Tasks” HTTP API found in legacy documentation belongs to the standalone zilliztech/milvus-cdc service used with Milvus 2.5 — a different system from 2.6’s built-in CDC. Do not assume the HTTP API surface carries over.

4.1 UpdateReplicateConfiguration

CDC’s control flow has a single entry point: UpdateReplicateConfiguration. It accepts a ReplicateConfiguration object and the server atomically replaces the persisted configuration. The control plane must call this RPC on every participating cluster so each cluster sees the same topology.

Proto

// milvus.proto
service MilvusService {
  rpc UpdateReplicateConfiguration(UpdateReplicateConfigurationRequest)
      returns (common.Status) {}

  // Added in v2.6.12+: read-only API for the current topology (tokens redacted)
  rpc GetReplicateConfiguration(GetReplicateConfigurationRequest)
      returns (GetReplicateConfigurationResponse) {}
}

message UpdateReplicateConfigurationRequest {
  common.ReplicateConfiguration replicate_configuration = 1;
  bool force_promote = 2;   // For Force Failover
}

// common.proto
message ReplicateConfiguration {
  repeated MilvusCluster clusters = 1;
  repeated CrossClusterTopology cross_cluster_topology = 2;
}

message MilvusCluster {
  string cluster_id = 1;
  ConnectionParam connection_param = 2;
  repeated string pchannels = 3;
}

message ConnectionParam {
  string uri = 1;
  string token = 2;
}

message CrossClusterTopology {
  string source_cluster_id = 1;
  string target_cluster_id = 2;
}

Behavior

  • Full replacement: the new ReplicateConfiguration replaces the current persisted one atomically.
  • Idempotent: re-submitting the same config is a no-op.
  • Validated: the server runs validation (see next section) before persisting; invalid requests return a typed error.

4.2 Validation Rules

ReplicateConfiguration represents a directed topology graph: MilvusCluster entries are vertices, CrossClusterTopology entries are edges.

  1. Per-cluster format: each MilvusCluster must have a non-empty cluster_id (no whitespace), a non-empty connection_param.uri (valid URI), and a non-empty pchannels list where each entry is prefixed by the cluster_id.
  2. Vertex uniqueness: cluster_ids must be globally unique.
  3. Edge uniqueness: a given source → target edge may appear at most once.
  4. Self-reference required: the cluster receiving the request must appear in clusters.
  5. Star topology only: at most one vertex has out-degree (= total clusters − 1) and in-degree 0; all others have in-degree 1 and out-degree 0. A single-vertex configuration (no edges) is also legal—and is how you remove replication.
  6. pchannels count: all clusters must have identical pchannels counts at any moment in time. Across successive updates, the count can grow but only via topology-wide append-only expansion (every cluster from N to N+k together). Mismatched counts between clusters are always rejected.
  7. pchannels uniqueness: no pchannel may repeat across clusters.
flowchart TB
  Center((Source))
  L1((Target-1))
  L2((Target-2))
  L3((Target-3))
  L4((Target-N))
  Center --> L1
  Center --> L2
  Center --> L3
  Center --> L4
  style Center fill:#1f6feb,color:#fff

4.3 PyMilvus Example

from pymilvus import MilvusClient

source_cluster_id = "source-cluster"
target_cluster_id = "target-cluster"
pchannel_num = 16

source_cluster_addr  = "http://source-cluster-milvus.milvus.svc.cluster.local:19530"
target_cluster_addr  = "http://target-cluster-milvus.milvus.svc.cluster.local:19530"
source_cluster_token = "root:Milvus"
target_cluster_token = "root:Milvus"

source_pchannels = [f"{source_cluster_id}-rootcoord-dml_{i}" for i in range(pchannel_num)]
target_pchannels = [f"{target_cluster_id}-rootcoord-dml_{i}" for i in range(pchannel_num)]

replicate_config = {
    "clusters": [
        {
            "cluster_id": source_cluster_id,
            "connection_param": {"uri": source_cluster_addr, "token": source_cluster_token},
            "pchannels": source_pchannels,
        },
        {
            "cluster_id": target_cluster_id,
            "connection_param": {"uri": target_cluster_addr, "token": target_cluster_token},
            "pchannels": target_pchannels,
        },
    ],
    "cross_cluster_topology": [
        {"source_cluster_id": source_cluster_id, "target_cluster_id": target_cluster_id}
    ],
}

# Submit the same config to both sides
source_client = MilvusClient(uri=source_cluster_addr, token=source_cluster_token)
target_client = MilvusClient(uri=target_cluster_addr, token=target_cluster_token)
try:
    source_client.update_replicate_configuration(**replicate_config)
    target_client.update_replicate_configuration(**replicate_config)
finally:
    source_client.close()
    target_client.close()

Operational tipIn production, use short-lived clients dedicated to control-plane operations like this. Avoid sharing a gRPC channel with DML traffic, since role changes briefly enter a write-fenced state.

Why submit to both clusters — strong-consistency semantics

The two update_replicate_configuration calls above are not a “primary forwards to standby” mechanism. They are two independent RPCs initiated by the application; the source cluster and the target cluster do not directly exchange configuration RPCs over the wire.

The two RPCs do different things server-side:

RecipientWhat the cluster does on receipt
Primary (the one currently allowed to broadcast)Acquires the exclusive cluster resource key, broadcasts the AlterReplicateConfigMessage into its WAL, the ACK callback persists the new ReplicatePChannelMeta into its own ETCD, and returns success.
Standby (any non-primary in the topology)broadcaster.StartBroadcastWithResourceKeys fails with ErrNotPrimary. The cluster then enters waitUntilPrimaryChangeOrConfigurationSameit blocks the RPC until the local ETCD’s ReplicateConfiguration matches the incoming one. The configuration arrives via CDC replicating the AlterReplicateConfigMessage from the primary’s WAL into the standby’s WAL; the standby’s SC consumes that message and writes its own ETCD through the same ACK callback. Only then does the standby’s blocked RPC return success.

Strong-consistency guarantee when both calls succeed

When both update_replicate_configuration calls return success, the application can rely on the following invariants:

  • The new ReplicateConfiguration has been broadcast into the primary’s WAL and persisted in the primary’s ETCD.
  • CDC has already replicated the configuration change to every cluster the application called.
  • Each cluster’s ETCD ReplicatePChannelMeta reflects the new topology.
  • Each cluster’s CDC controller has been notified via its ETCD watch and has had a chance to reconcile its channel replicators.

This both-success = both-applied property is what makes the two-call pattern a strong-consistency fence, not just a redundancy.

Sequence — what actually happens on the wire

Application (PyMilvus)               Primary (A)                            Standby (B)
       │                                  │                                       │
       ├──UpdateReplicateConfig──────────►│                                       │
       │                                  │ broadcaster.StartBroadcast ✓          │
       │                                  │ broadcast AlterReplicateConfigMsg     │
       │                                  │       │                               │
       │                                  │       ▼                               │
       │                                  │   A's WAL ─┐                          │
       │                                  │            │ ACK                      │
       │                                  │            ▼                          │
       │                                  │   alterReplicateConfiguration cb      │
       │                                  │            │                          │
       │                                  │            ▼                          │
       │                                  │   A's ETCD (ReplicatePChannelMeta)    │
       │                                  │            │                          │
       │                                  │            ▼ etcd watch fires         │
       │                                  │   A's CDC Controller → CreateReplicator (ACTIVE)
       │                                  │                                       │
       │◄────────────success──────────────┤                                       │
       │                                                                          │
       │                                  │  (A's CDC forwards the msg) ─────────►│
       │                                                                  ReplicateStream RPC
       │                                                                          │
       ├──UpdateReplicateConfig──────────────────────────────────────────────────►│
       │                                                                          │ broadcaster fails ErrNotPrimary
       │                                                                          │ waitUntilPrimaryChangeOrConfigurationSame
       │                                                                          │ (blocking watch loop on local config)
       │                                                                          │       │
       │                                  │  (still forwarding... msg lands)─────►│  B's WAL ─┐
       │                                                                          │           │ ACK
       │                                                                          │           ▼
       │                                                                          │  alterReplicateConfiguration cb
       │                                                                          │           │
       │                                                                          │           ▼
       │                                                                          │  B's ETCD (ReplicatePChannelMeta)
       │                                                                          │           │
       │                                                                          │           ▼ etcd watch fires
       │                                                                          │  B's CDC Controller — sees event
       │                                                                          │  but sourceChannel ≠ self → stays idle
       │                                                                          │           │
       │                                                                          │  watch loop above sees config now matches
       │                                                                          │           │
       │◄─────────────────────────────────────────────────────────success─────────┤
At this point both ETCDs are consistent; the application may proceed.

If you only call the primaryThe system still converges — A’s CDC will eventually push the AlterReplicateConfigMessage into B’s WAL and B’s ETCD will catch up via the same ACK callback. But there is no synchronous fence: when the primary call returns, B’s ETCD has not yet been updated. Any application logic that immediately reads back the configuration on B will see the old state. The two-call pattern exists specifically to make this fence synchronous and observable.

Direction agnosticThis pattern is robust to not knowing which cluster is currently primary — useful right after a switchover or in automation where the caller can’t easily determine the role. Both endpoints accept the same payload; only the actual primary executes the broadcast, the others just wait for convergence.

4.4 Go SDK Example (builder pattern)

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

cli, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
    Address: "localhost:19530",
})
if err != nil { log.Fatal(err) }

sourceCluster := milvusclient.NewMilvusClusterBuilder("source-cluster").
    WithURI("localhost:19530").
    WithToken("source-token").
    WithPchannels("source-cluster-rootcoord-dml_0", "source-cluster-rootcoord-dml_1").
    Build()

targetCluster := milvusclient.NewMilvusClusterBuilder("target-cluster").
    WithURI("localhost:19531").
    WithToken("target-token").
    WithPchannels("target-cluster-rootcoord-dml_0", "target-cluster-rootcoord-dml_1").
    Build()

config := milvusclient.NewReplicateConfigurationBuilder().
    WithCluster(sourceCluster).
    WithCluster(targetCluster).
    WithTopology("source-cluster", "target-cluster").
    Build()

if err := cli.UpdateReplicateConfiguration(ctx, config); err != nil {
    log.Printf("Failed to update replicate configuration: %v", err)
    return
}
log.Println("Replicate configuration updated successfully")

4.5 GetReplicateInfo & CreateReplicateStream

Internal RPCs—application developers normally do not call these, but understanding them helps with debugging.

GetReplicateInfo

Returns the persisted ReplicateCheckpoint metadata on the target. CDC calls this at startup to learn where to resume.

message GetReplicateInfoRequest {
  string source_cluster_id = 1;
  string target_pchannel   = 2;
}
message GetReplicateInfoResponse {
  common.ReplicateCheckpoint replicate_checkpoint = 1;
  common.ReplicateCheckpoint salvage_checkpoint   = 2;  // Captured during force failover
}

CreateReplicateStream

The bidirectional gRPC stream between CDC and the target. Send: ReplicateRequest. Recv: ReplicateResponse with ReplicateConfirmedMessageInfo, used to drain the in-flight queue by TimeTick for back-pressure.

5. Deployment & Quick Start

5.1 Prerequisites

  • Milvus v2.6.16 or later.
  • Milvus Operator v1.3.4 or later.
  • A Kubernetes cluster.
  • Network connectivity between source and target clusters.
  • Admin credentials for both clusters.

5.2 Install Milvus Operator

helm repo add zilliztech-milvus-operator https://zilliztech.github.io/milvus-operator/
helm repo update zilliztech-milvus-operator
helm -n milvus-operator upgrade --install milvus-operator \
  zilliztech-milvus-operator/milvus-operator \
  --create-namespace
kubectl get pods -n milvus-operator

5.3 Deploy Source and Target Clusters

Two near-identical CRs—the only difference is metadata.name. Both sides must enable the CDC component.

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: source-cluster   # change to target-cluster on the other side
  namespace: milvus
spec:
  mode: cluster
  components:
    image: milvusdb/milvus:v2.6.16
    cdc:
      replicas: 1
  dependencies:
    msgStreamType: woodpecker   # or pulsar / kafka
kubectl create namespace milvus
kubectl apply -f milvus_source_cluster.yaml
kubectl apply -f milvus_target_cluster.yaml

# Confirm CDC pods are Running
kubectl get pods -n milvus | grep cdc
# source-cluster-milvus-cdc-66d64747bd-sckxj   1/1 Running 0 2m
# target-cluster-milvus-cdc-7f8c9d6b8c-q4t9m   1/1 Running 0 2m

5.4 Discover Cluster Addresses

kubectl get svc -n milvus | grep -E 'NAME|source-cluster|target-cluster'
# source-cluster-milvus  ClusterIP 10.98.124.90    19530/TCP,9091/TCP
# target-cluster-milvus  ClusterIP 10.109.234.172  19530/TCP,9091/TCP

Two kinds of addresses

  • Cluster-internal address: written into ReplicateConfiguration; must be reachable from CDC pods.
  • Client address: only used when running PyMilvus to push update_replicate_configuration; can be a port-forward / LB / Ingress.

Never put a port-forward 127.0.0.1 address into ReplicateConfiguration—CDC pods cannot reach it.

5.5 Apply the Replication Topology

See 4.3 PyMilvus example. Submit the same config to both sides.

5.6 Verify Replication

  1. Create a collection on the source, insert data.
  2. Load and run query / search on source.
  3. Connect to the target—do not manually load_collection there (DDL replicates automatically).
  4. Run the same query / search on the target; identical results = replication healthy.

6. Primary-Standby Failover

Milvus 2.6 provides three failover modes, ordered by their consistency guarantees:

ModePrimary goalUse caseData lossStatus
Graceful (Planned Switchover)Consistency firstOriginal primary still reachable; planned maintenanceRPO = 0Available
Coordinated FailoverConsistency + HAPrimary partially degraded (e.g. StreamingNode CrashLoop)RPO = 0Available
Force FailoverAvailability (AP)Primary fully unreachable (network partition, region loss)Bounded by CDC lag (salvageable manually)Available

6.1 Graceful Failover (Planned Switchover)

Use when the original primary is still reachable—maintenance, version rollout, planned traffic relocation. Replication uses an AlterReplicateConfigMessage as the synchronization point; no data is lost.

When to use

  • Original primary A is fully healthy and reachable. No CrashLoop, no degraded components.
  • You initiated the switch — it’s planned: version upgrade, maintenance window, traffic redistribution.
  • You can tolerate “switch may take seconds to minutes to drain the remaining lag” — RPO=0 has a price in RTO.
  • Do not use if A has any failure signal. Pick Coordinated if WAL still writes but some component is down, or Force if A is unreachable.

Pre-check (verify before submitting the switch)

  1. CDC lag is low and steady. Query:

    clamp_min(max by (channel_name)(milvus_wal_last_confirmed_time_tick)
             - min by (channel_name)(milvus_cdc_last_replicated_time_tick), 0)
    

    Expect a small, stable number (sub-second to a few seconds). A high or climbing lag means the switch will block until the lag drains — runs the operation for longer than expected.

  2. Stream RPC connections all healthy: milvus_cdc_stream_rpc_connections{connection_status="disconnected"} == 0.

  3. Both clusters reachable from PyMilvus. A simple get_replicate_configuration() on both should succeed and return the current topology.

  4. Network between A and B is healthy. Recent reconnect counter should be flat: rate(milvus_cdc_stream_rpc_reconnect_times[5m]) ~ 0.

Flow (initial A→B, target B→A)

  1. Cluster A: control plane calls UpdateReplicateConfiguration on A with the new topology B→A
    • A writes AlterReplicateConfigMessage into its WAL.
    • A demotes to standby; its WAL enters write-fenced state.
  2. Cluster B: B receives the replicated AlterReplicateConfigMessage via CDC; once confirmed, its WAL unfences and B promotes to primary.
  3. CDC: A’s CDC stops once it has forwarded the message; B’s CDC takes over the reverse direction (B→A).

PyMilvus

switchover_config = {
    "clusters": [
        {
            "cluster_id": cluster_a_id,
            "connection_param": {"uri": cluster_a_addr, "token": cluster_a_token},
            "pchannels": cluster_a_pchannels,
        },
        {
            "cluster_id": cluster_b_id,
            "connection_param": {"uri": cluster_b_addr, "token": cluster_b_token},
            "pchannels": cluster_b_pchannels,
        },
    ],
    "cross_cluster_topology": [
        {"source_cluster_id": cluster_b_id, "target_cluster_id": cluster_a_id}
    ],
}

# Submit to the current primary (A) first, then to the current standby (B).
# Both calls are synchronous and block until success (or RPC timeout).
# When both return success, the switch is committed — see §4.3.
client_a.update_replicate_configuration(**switchover_config)
client_b.update_replicate_configuration(**switchover_config)

Verify the switch succeeded

  1. Topology shows the new direction. On either side: get_replicate_configuration() should report cross_cluster_topology = [B → A].
  2. Old primary A rejects writes. Try a small insert / delete via PyMilvus against A — should fail with a “not primary” / write-fenced error.
  3. New primary B accepts writes. Same insert / delete against B — should succeed.
  4. Reverse CDC is forwarding. On B (now source), watch milvus_cdc_replicated_messages_total — the counter for target_cluster=A should start incrementing. CDC lag for the new direction should drop into normal range within a few cycles.
  5. Test data round-trips. Insert one row on B → wait a few seconds → query A; you should see the row replicated back.

Failure handling

  • One of the two RPCs failed, the other succeeded: state is half-committed. Run get_replicate_configuration() on both sides — whichever side hasn’t applied the new topology, retry submitting there. The receiving side that already saw the AlterReplicateConfigMessage via CDC will short-circuit to success via waitUntilPrimaryChangeOrConfigurationSame (see §4.3) — retrying is safe.
  • Both RPCs hang for a long time: usually means CDC lag is high and B is waiting for the Fence Message to arrive. Watch lag — once it drains to the Fence position, both RPCs return. If it doesn’t drain, CDC has a problem (check pod status + reconnect counter) — abort by canceling the calls and investigate.
  • Rolling back (you want A primary again): submit another UpdateReplicateConfiguration with the original topology A→B. This is just another Graceful Switchover — RPO=0 still holds because the just-switched-to B never lost anything and the reverse direction back to A will also drain a Fence. Customers running automation should always treat Graceful Switchover as bidirectional from day one.

6.2 Coordinated Failover

For cases where the primary cluster has partial failure—for example, the primary’s StreamingNode is in CrashLoop due to segcore SEGSIGV, but MixCoord / Proxy / WAL remain alive. Still achieves RPO=0.

When to use

  • Primary A has a degraded component but its WAL can still accept writes. Typical signals:
    • StreamingNode pod is in CrashLoop (rising restart_count on the SN pod) but the WAL service stays up between restarts.
    • Delegator / Flusher are throwing fatal errors (e.g. segcore SIGSEGV in logs) but MixCoord / Proxy responses are still healthy.
    • Some metric or alert flagged the problem and you’ve manually confirmed A is degraded but reachable.
  • You verified A’s WAL is still writable: try a small insert() on A through PyMilvus. If it succeeds (even slowly), WAL works → Coordinated is the right mode. If it times out or returns a service-unavailable type error → A’s WAL is unreachable → escalate to Force Failover instead.
  • You can afford a few seconds to tens of seconds RTO (drain remaining lag).
  • Do not use if A is fully healthy (that’s Graceful) or A is fully unreachable (that’s Force).

Pre-check

  1. WAL writability probe: milvus_client.insert() against A. Success = good, proceed. Time-out or 503 = wrong mode, switch to Force.
  2. CDC stream connection between A and B: milvus_cdc_stream_rpc_connections{connection_status="connected", target_cluster="B"} > 0. If 0, CDC can’t carry the Fence Message — Coordinated cannot work, escalate to Force.
  3. Standby B is healthy: get_replicate_configuration() on B returns current topology successfully.
  4. Decide your retry budget. Recommendation: 10 attempts with 2-second backoff between attempts (≈ 20s total) before flipping pauseConsumption. Adjust based on your RTO target.

Fence Message

AlterReplicateConfigMessage serves as a Fence Message. Two properties exploit WAL ordering:

  • Write fencing: once the message is in the WAL, the original primary rejects new writes.
  • Sync barrier: when CDC has consumed this message, all prior data messages have been propagated → RPO = 0.

Feasibility of writing the Fence Message

The StreamingNode’s startup order is WAL → Flusher → Delegator. The WAL module finishes OpenWAL and Recover first—it is Ready to accept writes before Delegator/Flusher fully come up. Even if Delegator/Flusher will crash again immediately, there is a window in which the SN can accept the Fence Message write.

Last-resort: pauseConsumption

If the SN crashes so fast that the window is insufficient, the config:

streaming:
  walScanner:
    pauseConsumption: true   # default: false

is in Milvus (pkg/util/paramtable/component_param.go:6887, refreshable). When set, Delegator and Flusher pause WAL consumption, reducing the SN to “WAL-write only” mode and breaking the crash loop, giving the Fence Message enough time to land.

Milvus does not do this automaticallyMilvus only provides the pauseConsumption switch — when you set it, the SN responds to it. Milvus does NOT automatically retry, detect failure, or flip the switch on its own. The whole “try, fail, flip pauseConsumption, retry” loop is something you (or your control-plane script / operator) have to write yourself. The customer’s runbook owns the retry-and-flip-switch logic; Milvus only owns the switch.

Full flow (you / your control plane does this)

  1. Failure detection. Operator decides to fail over.
  2. Fence Old Primary: control plane calls UpdateReplicateConfiguration on A with the reversed topology.
    • If it succeeds: proceed.
    • If it fails after some retries (e.g. 10 — your choice): your script flips walScanner.pauseConsumption=true on A and retries until success. Milvus does not do this for you.
  3. Promote New Primary: call UpdateReplicateConfiguration on B. After B receives and confirms the Fence Message from A via CDC, B promotes to primary.
  4. Post-handling:
    • If A is recoverable: reset pauseConsumption=false; A auto-rejoins as B’s standby.
    • If A is unrecoverable: tear down the topology with B and run Standby Cluster Rebuild to provision a new standby C.

Verify the switch succeeded

  1. Topology shows new direction: get_replicate_configuration() on either side reports B → A.
  2. Old primary A no longer accepts writes: a small insert() against A should fail with a write-fenced error. (Note this is different from the pre-check probe — that was before the switch and writes succeeded.)
  3. New primary B accepts writes: same insert against B succeeds.
  4. Reverse CDC is live: on B as source, milvus_cdc_replicated_messages_total{target_cluster="A"} begins to climb. CDC lag for the new direction normalizes within a few cycles.
  5. Test data round-trips: insert one row on B → wait a few seconds → query A; the row should be there.
  6. If you flipped pauseConsumption=true: A’s SN is now WAL-write-only — its Delegator/Flusher are not consuming. You must reset pauseConsumption=false once A’s underlying issue is fixed (e.g. segcore upgrade) so A can re-join as standby. Leaving it on means A will never catch up reading the replicated stream from B; the auto-rejoin can’t happen.

Failure handling

  • Fence Old Primary attempt fails 10 times even with pauseConsumption: A’s WAL is unreachable, not just degraded. Abort Coordinated, escalate to Force Failover.
  • B’s promotion appears stuck (Fence Message hasn’t arrived after > 30s): A’s CDC isn’t forwarding. Check milvus_cdc_stream_rpc_connections for the A→B link. If disconnected, you have a CDC delivery problem on top of the primary failure — likely also needs Force Failover.
  • Auto-rejoin of A doesn’t work after pauseConsumption reset: query B’s get_replicate_configuration() to confirm A is in cross_cluster_topology as target. If yes but data isn’t flowing, check A’s CDC stream connection metric. If A’s underlying fault still recurs (re-crashing), treat A as unrecoverable and run Standby Cluster Rebuild on a fresh node.
  • Rolling back to A as primary: same as a Graceful Switchover from B back to A. Only safe if A is now genuinely fixed.

Multi-standby topology — sibling standbys rewire automatically

In a one-primary-many-standby topology (e.g. A→B, A→C), Coordinated Failover handles all sibling standbys without rebuild. Because the Fence Message is broadcast through A’s CDC to every standby, all of them — not just the chosen new primary — receive the same sync barrier and reach the same state at the cutover moment (RPO = 0 for B and C).

The mechanism:

  1. The new topology in the UpdateReplicateConfiguration request lists the full edge set (e.g. B→A, B→C).
  2. A’s CDC forwards the Fence Message to both B and C. Both consume it and run SwitchReplicateMode — B becomes primary, C rebuilds its secondaryState with SourceClusterID=B and a reset checkpoint (TT=0, MessageID=nil).
  3. B’s StreamingCoord writes the new ReplicatePChannelMeta for the B→C edge, setting InitializedCheckpoint to the Fence Message’s position in B’s own WAL (the moment B was promoted).
  4. B’s CDC, on creating the new channel replicator, calls C’s GetReplicateInfo; C returns MessageID=nil because secondaryState was just reset; the CDC falls back to InitializedCheckpoint.
  5. B’s CDC scanner reads B’s WAL starting from after the Fence — skipping every message B accumulated while it was still a secondary (those entries were already replicated to C from A and don’t need re-sending).

Net result: C continues without rebuild, no data overlap, no data gap. The Fence Message simultaneously serves as write fence, sync barrier, source-switch signal, and new-source start anchor.

6.3 Force Failover

For the worst case: the original primary is fully unreachable (network partition, region outage). Explicitly AP-leaning—does not promise strong consistency, prioritizes availability.

When to use

  • Primary A is completely unreachable. Specific signals:
    • All RPCs to A time out for > N minutes (N from your RTO target, typically 1–5 min).
    • A’s pods are in Pending/Unknown state across the cluster, or the K8s namespace itself is unreachable.
    • Network partition isolates A’s region — confirmed by independent connectivity probes (ping / TCP) from outside the customer’s primary path.
    • You’ve already tried Coordinated and the pre-check probe (write to A’s WAL) failed.
  • You explicitly accept data loss up to the current CDC lag. This is a business decision — confirm before pulling the trigger.
  • You have a plan to remove A from all write endpoints (DNS / LB / Ingress) so that if A recovers later, no application traffic accidentally hits it.
  • Do not use if A’s WAL still writes (use Coordinated instead — RPO=0) or if A is healthy (use Graceful).

Pre-check

  1. Confirm A is really unreachable. From the same network the application uses, run a basic Milvus health probe against A — should hard-fail (timeout, connection refused, DNS not resolving). A slow/degraded response means A is actually reachable; reconsider Coordinated.
  2. Record the current CDC lag — this is the upper bound on data loss. Capture the metric value before promoting so post-mortem can quantify what was lost.
  3. Confirm B is healthy and was in secondary state. get_replicate_configuration() on B should respond and show A→B topology (about to be torn down by force_promote).
  4. Plan the application-side endpoint cutover. Decide concretely:
    • Which DNS record / LB / Ingress points application writes at A?
    • How will you flip it to B (and remove A entirely)?
    • If using GlobalEndpoint (Zilliz Cloud), this is automatic; otherwise it’s your runbook step.
  5. For multi-standby topology (A→B, A→C): plan to rebuild C after promoting B. Sibling standbys cannot be salvaged automatically (see §7) — make sure operator time + storage capacity are available for the rebuild.

force_promote

message UpdateReplicateConfigurationRequest {
  common.ReplicateConfiguration replicate_configuration = 1;
  bool force_promote = 2;
}

Constraints

ConstraintValidationRationale
Standby cluster onlyWithSecondaryClusterResourceKey(); primary call returns ErrNotSecondaryEmergency promotion only applies to standbys
clusters must be emptylen(config.Clusters) == 0Config is rebuilt from existing meta
topology must be emptylen(config.CrossClusterTopology) == 0Same as above

Internal mechanics

  1. Client calls UpdateReplicateConfiguration(config={}, force_promote=true).
  2. StreamingCoord validates the empty-config requirement and acquires the secondary-cluster lock via WithSecondaryClusterResourceKey().
  3. A standalone-primary config is reconstructed from existing meta.
  4. A force_promote=true AlterReplicateConfigMessage is broadcast.
  5. TxnBuffer: on force_promote, runs RollbackAllUncommittedTxn()—any in-flight transactions are discarded.
  6. Replicate Interceptor: switches the cluster to primary mode.
  7. Incomplete broadcasts: prior in-flight AlterReplicateConfigMessage broadcasts are marked with the ignore flag so they don’t overwrite the new config.
  8. Config is persisted with force_promoted=true.

PyMilvus

force_failover_config = {
    "clusters": [
        {
            "cluster_id": cluster_b_id,
            "connection_param": {"uri": cluster_b_addr, "token": cluster_b_token},
            "pchannels": cluster_b_pchannels,
        }
    ],
    "cross_cluster_topology": [],
    "force_promote": True,
}
client_b.update_replicate_configuration(**force_failover_config)

Data loss

  • Loss bound = CDC lag at the moment of cutover. Zero lag = no loss.
  • After promoting B, remove A from every write endpoint (DNS / LB / Ingress) to prevent split-brain.
  • Do not reattach A to the old topology even if it recovers—A may contain unreplicated writes and B may already have new writes; reconciliation is not automatic.

Verify the promotion succeeded

  1. B is now a standalone primary: get_replicate_configuration() on B returns a config with B as the only cluster, no topology edges, and force_promoted=true in the metadata.
  2. B accepts writes: a small insert() through PyMilvus targeting B succeeds (apply the application’s normal write path).
  3. Application traffic landed on B, not on A: confirm the DNS / LB / Ingress cutover actually took effect — application metrics should show writes hitting B’s address.
  4. A is removed from all write endpoints: any attempt by mis-routed traffic to write to A should fail at the network layer (connection refused / DNS not resolving), not silently succeed.
  5. SalvageCheckpoint is captured (for later Data Salvage if A recovers): get_replicate_info on B returns a non-empty salvage_checkpoint per pchannel.

Post-promotion (next 24 hours)

  • Multi-standby: rebuild sibling C. C was previously A’s secondary; it cannot be rewired to B automatically (no Fence Message ever reached C). Use milvus-backup restore secondary against C with --source_cluster_id B — see §7 for the full procedure.
  • If A recovers, do not reattach. Run Data Salvage against A to dump any unreplicated messages, then decommission A. Reattaching it would cause split-brain because both A and B believe they are primary.
  • SalvageCheckpoint window is 7 days. If you want the option to salvage from A, recover A within that window. After 7 days the salvage_checkpoint may be pruned by Milvus and you lose the opportunity to drain unreplicated data.

Failure handling

  • force_promote RPC returns ErrNotSecondary: B is already in primary role (maybe from a prior failover you forgot about), or you submitted to the wrong cluster. Check get_replicate_configuration() on B; resolve before retrying.
  • force_promote RPC returns config-must-be-empty error: you accidentally included clusters or cross_cluster_topology. Resubmit with empty clusters and topology — only force_promote=true.
  • B accepts the RPC but writes still fail afterward: usually the application client is still pinned to A’s address. Restart application clients or verify DNS TTL has expired.
  • A “recovers” and starts seeing writes too: split-brain in progress, this is dangerous. Immediately remove A from networking (firewall rule, scale to 0, etc.) before any further writes hit it. Plan to discard whatever A wrote post-recovery; only B’s data is authoritative.
  • No way to roll back: force_promote is one-way. Once B is promoted, you cannot “unpromote” back to A. If you regret promoting B, your only option is a new Graceful Switchover from B back to a freshly-rebuilt A (with A’s data wiped).

6.4 Data Salvage

After force_promote, the original primary may hold data that was never replicated. Data Salvage provides a manual reconciliation path.

SalvageCheckpoint

At force_promote time, StreamingCoord captures the current ReplicateCheckpoint per pchannel as the salvage checkpoint:

  • Key: $rootPath/streamingnode-meta/salvage-checkpoint/$clusterID/$ChannelName
  • Value: commonpb.ReplicateCheckpoint
  • Retention: 7 days

DumpMessages API

rpc DumpMessages(DumpMessagesRequest) returns (stream DumpMessagesResponse) {}

message DumpMessagesRequest {
  string pchannel              = 1; // required
  common.MessageID start_message_id = 2; // required (from salvage_checkpoint)
  uint64 start_timetick        = 3; // optional filter
  uint64 end_timetick          = 4; // optional, 0 = unbounded
  bool   start_position_exclusive = 5;
}
message DumpMessagesResponse {
  oneof response {
    common.Status status            = 1;
    common.ImmutableMessage message = 2;
  }
}

How to do it today

Milvus exposes a gRPC streaming RPC DumpMessages on the old primary. You connect to it, drain the stream of raw protobuf messages, and do whatever you want with the data on your side. The detailed steps:

  1. Wait for old primary A to recover (operator action).
  2. Get the salvage checkpoint from new primary B: call GetReplicateInfo RPC on B, the response includes a salvage_checkpoint per source cluster — this is your starting position.
  3. Stream messages from A: call the DumpMessages RPC on A with that salvage_checkpoint.message_id as start_message_id (plus optional start_timetick / end_timetick filters). The server opens a WAL scanner exclusive of start_message_id, filters out system messages (TimeTick, CreateSegment, Flush, RollbackTxn), and streams every other message back as common.ImmutableMessage proto. Code: internal/proxy/impl.go:7150–7248.
  4. You write the consumer client: read the stream, persist however your environment needs (S3, Parquet, plain file, etc.). Milvus does not write the data to object storage for you — you write a client to do that.
  5. Decide replay policy and replay: inspect the dump, apply business-defined conflict resolution (skip / overwrite / merge on primary-key collision), then write back to the new primary using normal SDK calls. Milvus does not do this for you either — conflict policy is business decision.

What you have to buildSteps 4 and 5 are on you. Milvus’s contract here is: expose the raw data via gRPC, no built-in object-storage persistence, no automatic replay. The reason for not automating replay is sound — primary-key collision resolution is business-specific and cannot be safely automated. The reason for not automating persistence is just “not built yet”. For POC planning, budget engineering time for a small consumer client (couple hundred lines of Go / Python that drains a gRPC stream into your preferred storage).

6.5 Mode Comparison

AspectGracefulCoordinatedForce
Primary stateHealthyPartial failureUnreachable
RPO00≤ CDC lag (salvageable)
RTODrain remaining lagSeconds (fence + fallback)Seconds (no waiting)
TriggerOperator-initiatedFailure detected + confirmedOperator decision (emergency)
Fence Message neededNoYesNo (force_promote)
Data Salvage neededNoNoOptional
Old primary handlingAuto-demotedAuto-demote if recoverable; otherwise rebuildMust be removed; can be decommissioned
Sibling standby (multi-standby) handlingAuto-rewire, no rebuild — Fence Message broadcasts to all standbys; each SwitchReplicateModes to new sourceAuto-rewire, no rebuild — same mechanism as GracefulMust rebuild — no Fence Message reached siblings; their state may diverge from the promoted standby. See §7

7. Standby Cluster Rebuild

Two scenarios require rebuilding a standby:

  • After a Force Failover, when the topology was multi-standby: sibling standbys (other than the one that was promoted) cannot be safely rewired because they never saw a Fence Message — the old primary was unreachable, so no sync barrier propagated. Their state may have diverged from the promoted standby (either ahead or behind), and no general-purpose mechanism exists to reconcile them in place. Each sibling must be rebuilt from scratch against the new primary. This does not apply to Graceful or Coordinated Failover — those propagate a Fence Message to every sibling and rewire them automatically (see §6.2 and the comparison in §6.5).
  • Adding a standby to a non-empty primary: the primary already has historical data; a fresh standby must catch up to the primary before CDC takes over the incremental tail.

7.1 Recovery Checkpoint

The rebuild needs a precise boundary between “data restored from backup” and “incremental replicated via CDC”. Since CDC operates per-PChannel, the boundary is a PChannel-level consistency checkpoint.

The Checkpoint is atomic: data before it is fully persisted to object storage; data after it is still in the log stream. Restoration splits into two phases:

  1. Bulk import: messages before the CP are imported via Backup.
  2. Incremental replication: messages after the CP are taken over by CDC’s Replicator.

7.2 FlushAll Refactoring

Milvus 2.6 redefines FlushAll on top of the Streaming Service. A new FlushAllMsg MessageType forces every vchannel to flush; the message itself becomes the Recovery Checkpoint when consumed.

service MilvusService {
  rpc FlushAll(FlushAllRequest) returns (FlushAllResponse) {}
  rpc GetFlushAllState(GetFlushAllStateRequest) returns (GetFlushAllStateResponse) {}
}
message FlushAllResponse {
  common.Status status                = 1;
  map flush_all_tss   = 4; // pchannel -> flush_all_ts
}
message GetFlushAllStateResponse {
  common.Status status = 1;
  bool flushed         = 2;
}

7.3 Tool: zilliztech/milvus-backup

The rebuild is performed by a separate tool that lives in its own repository: zilliztech/milvus-backup. It is not part of the Milvus main repo and not a component bundled with the cluster — install / upgrade it independently.

  • Install: download a binary from Releases, or brew install zilliztech/tap/milvus-backup on macOS.
  • Modes: CLI (./milvus-backup ...) or REST API server (./milvus-backup server, default port 8080, Swagger UI at /api/v1/docs/index.html).
  • Storage backends: local / minio / s3 / aws / gcp / aliyun / azure / tencent / gcpnative; cross-storage backup supported (e.g. S3 region A → S3 region B).
  • Version rule: a backup can only be restored to the same or newer Milvus version. A 2.6 backup must restore to 2.6.

The right subcommand for CDC standby rebuild

milvus-backup has a dedicated subcommand for restoring into a cluster that should become a CDC secondary — restore secondary. Do not use plain restore for this scenario — that one creates standalone collections without setting up the CDC seam.

# On source cluster A — create the snapshot
./milvus-backup create -n bkp_xxx

# Move backup files to target side (or use crossStorage)

# On target cluster B — restore into secondary mode
./milvus-backup restore secondary \
  --name bkp_xxx \
  --source_cluster_id A \
  --target_cluster_id B

# Finally, on source cluster A — wire up CDC
# (pymilvus.MilvusClient.update_replicate_configuration with topology A→B)

What restore secondary does under the hood

Code reference: cmd/restore/secondary.go + core/restore/secondary/task.go. The tool opens a CreateReplicateStream RPC to the target’s Milvus proxy, declaring itself as the source cluster — i.e. it really does impersonate CDC. Then it pushes messages in a specific order:

StepWhat gets sentTransport
① Open streamNewStreamClient(sourceClusterID, taskID, pchannels, grpc) → opens a bidirectional CreateReplicateStream to targetgRPC stream
② DB DDLDatabase create messages reconstructed from backup metaStream
③ Collection DDLFor each collection: CreateCollection / CreatePartition / CreateIndex messagesStream
④ Collection DMLNot via stream — uses BulkInsert REST API. The tool points BulkInsert at the backup’s segment files in object storage. Splits into L0 (delta) and non-L0 (insert) batches, sequenced per partition; runs concurrently across collections (controlled by backup.parallelism.RestoreCollection)BulkInsert RESTful
⑤ Collection LoadLoad collection messagesStream
⑥ RBAC restoreReconciles backup RBAC vs current; sends RestoreRBACMessage via broadcastStream broadcast
⑦ FlushAllMsg replay (★ the seam)For each pchannel, the FlushAllMsg captured at backup time (stored base64 in backup meta) is forwarded into target’s WAL — this materializes the ReplicateCheckpoint on the target at exactly the source’s FlushAll momentStream forward
⑧ Wait confirmstreamCli.WaitConfirm() blocks until all sent messages are acked by target

Design vs implementation noteThe PDF design doc describes DML as “ImportMsg pushed via ReplicateStream”. The implementation uses BulkInsert REST instead for the data phase — stream is reserved for DDL / RBAC / Load / FlushAll. The end result is the same (target’s WAL has all the data and a clean checkpoint seam), but the data plane path is BulkInsert, not stream. This also means the §3.3 matrix entry “MessageTypeImport is not replicated by CDC” is consistent — restore secondary doesn’t generate Import messages on the wire.

7.4 End-to-end flow

StageWhere it runsWhat happens
① BackupSource cluster A (operator runs milvus-backup create)- Tool calls A’s FlushAll → captures one FlushAllMsg per pchannel; stores base64-encoded in backup meta. - Tool reads segment-level binlogs directly from A’s object storage and copies them to the backup storage (the configured backupBucketName / backupRootPath). - RBAC, schema, collection / partition / index meta exported. - gcPause on by default (2h) — pauses Milvus GC so segments don’t get cleaned mid-backup.
② Move backup filesOperatorIf A and B share storage, nothing to do. Otherwise either use crossStorage: true to back up directly into target storage, or transfer files (rclone / aws s3 sync / etc.) so they’re reachable by B’s object store reads.
③ Restore into secondaryTarget cluster B (operator runs milvus-backup restore secondary)- Tool opens CreateReplicateStream to B, declaring source cluster A — B sees this stream as a regular CDC client. - Reproduces DB / Collection / Index DDLs through the stream. - Data restored via BulkInsert REST, partition by partition, L0 / non-L0 sequenced; uses backup=true or l0_import=true options to identify origin. - Load + RBAC restored via stream. - Final step: replays the FlushAllMsgs into B’s WAL via stream — B’s secondaryState now holds a ReplicateCheckpoint pointing at A’s FlushAll position. - WAL retention guard rail: A’s WAL keeps messages a bounded window (default 3 days). If steps ①–③ together exceed the window, A may have GC’d messages beyond the FlushAll checkpoint that B needs next. For very large datasets, extend WAL retention before starting, or be prepared to redo.
④ Wire CDCControl plane (PyMilvus / Go SDK)- Call update_replicate_configuration on A and B with the topology A→B. - A’s CDC controller picks up the new ReplicatePChannelMeta, starts a channel replicator per pchannel. - Each replicator queries B’s GetReplicateInfo — gets back the FlushAll-materialized checkpoint, set up by step ③. - CDC reads A’s WAL from after that checkpoint, forwards every subsequent message to B. No gap (FlushAll is the precise seam), no overlap (target-side deduplication filters anything ≤ checkpoint).
⑤ ServingTarget cluster BB catches up the tail (writes between FlushAll moment and present). Once CDC lag < configured threshold (e.g. 1s), B is marked Serviceable — DR is fully restored.

8. Monitoring & Operations

8.1 Prometheus Metrics

All milvus_cdc_* metrics are exposed by the Source-side CDC process (the one co-located with the Source Milvus cluster). Defined in internal/cdc/replication/replicatestream/metrics.go.

MetricTypeUnitLabelsDescription
milvus_cdc_replicated_messages_totalCountercountchannel_name, target_channel_name, msg_typeTotal messages successfully forwarded by CDC
milvus_cdc_replicated_bytes_totalCounterbyteschannel_name, target_channel_name, msg_typeTotal bytes forwarded
milvus_cdc_replicate_end_to_end_latencyHistogrammschannel_name, target_channel_nameEnd-to-end latency from Source WAL read to Target WAL ack
milvus_cdc_last_replicated_time_tickGaugemschannel_name, target_channel_nameLatest replicated TimeTick
milvus_cdc_stream_rpc_connectionsGaugecounttarget_cluster, connection_status (connected / disconnected)Stream RPC connection counts to a target cluster, grouped by status. Per-replicator transitions Disconnected↔Connected; sum across both label values is the total channel replicator count for that target.
milvus_cdc_stream_rpc_reconnect_timesCountercounttarget_clusterNumber of stream RPC reconnects (failure / timeout)

8.2 CDC Lag — PromQL

CDC lag is the time-tick gap between the primary’s WAL frontier and the latest replicated position. It is the single most important metric for steady-state replication health and the data-loss risk window for force failover.

clamp_min(
  max by (channel_name) (
    milvus_wal_last_confirmed_time_tick
  )
  -
  min by (channel_name) (
    milvus_cdc_last_replicated_time_tick
  ),
  0
)

Units: seconds. Note milvus_wal_last_confirmed_time_tick is exposed by the Source Milvus StreamingNode (not CDC); Prometheus must scrape both the Source Milvus pods and the Source CDC pod to compute lag. For multi-standby topologies the min by (channel_name) term picks the slowest-progressing target. Add namespace / app_kubernetes_io_instance label filters when scraping multiple cluster pairs.

8.3 What Increases CDC Lag

  • High primary write rate.
  • Cross-cluster network latency / packet loss.
  • Overloaded standby.
  • Under-provisioned CDC pods.
  • Large DDLs or imports in flight.

8.4 Alerting Guidance

All alert sources below are Source-side CDC process metrics (CDC lag combines source-WAL and source-CDC scrapes).

AlertThresholdResponse
CDC lag sustained> 5 minutesCheck network, standby load, CDC pod resources
stream_rpc_connections — disconnected sustainedmilvus_cdc_stream_rpc_connections{connection_status="disconnected"} > 0 sustained for 1 minuteDiagnose stream RPC reconnect failures (a per-replicator counter — non-zero means at least one channel can’t connect)
stream_rpc_reconnect_times raterate(milvus_cdc_stream_rpc_reconnect_times[5m]) * 300 > 10Network instability — RPO at risk
e2e_latency P99sustained > 3 sSystem saturation or network jitter

9. Limitations & Boundaries

LimitationDetail
No active-activeCDC supports single-primary topology only. Multi-cluster simultaneous writes lead to split-brain. Standbys reject direct writes.
Star topology onlyCurrently only one source fan-out to N targets; chained A→B→C→D is on roadmap, not yet supported.
PChannel countAll clusters must have identical counts at any moment. Topology-wide append-only growth is supported (all clusters grow N→N+k together via UpdateReplicateConfiguration). Mismatched counts (A:16 / B:24) are not supported and never have been.
No automatic historical syncCDC captures only changes after it is enabled. For non-empty primaries, use Standby Cluster Rebuild (Backup Tool) for the initial bootstrap.
No pause APIUpdateReplicateConfiguration has no pause semantics. Workaround: scale CDC pods to 0 (it resumes from checkpoint when scaled back up), or rewrite the topology to remove the edge.
CDC is single-replicacomponents.cdc.replicas: 1 only. Multi-replica CDC is on roadmap.
Whole-cluster granularityReplication is at the cluster level. Database / Collection-level filtering is on roadmap.
WAL retentionDefault 3 days. Affects standby rebuild (Import must finish within the window unless retention is extended) and replication deletion (re-creating replication after retention expiry causes data loss).
Resource Group cross-clusterMismatched primary/standby RG configuration can cause load failures during RG message replication. Optimization on roadmap.
Import message not yet replicatedMessageTypeImport replication is in development; do not rely on imports being mirrored to the standby today.

10. 2.5 vs 2.6 CDC Comparison

AspectMilvus 2.5 CDCMilvus 2.6 CDC
Code locationStandalone repo zilliztech/milvus-cdcBuilt-in internal/cdc/ in the main repo
Underlying messagingTightly coupled with 2.5 msgstreamBuilt on Streaming Service WAL
RoleOwned business logic (TtManager, vchannel management)Pure log forwarder; ignores collections / vchannels
Management surfaceHTTP API to create / pause / delete tasksSingle RPC: UpdateReplicateConfiguration (full replace), gRPC only
CheckpointCDC persists locally per messageTarget persists ReplicateCheckpoint in WAL
ConsistencyAt-least-once (duplicate-write risk)Exactly-once (target filters by tt)
TtMsg handlingCDC injects TtMsgs via TtManagerWAL stamps tt natively; CDC just forwards
PChannel mappingMulti-to-one allowed1:1 mapping (with v2.6.12+ relaxation for count increases)
Failover modesNone nativeThree modes: Graceful, Coordinated, Force + Data Salvage
DeploymentSeparate milvus-cdc serviceMilvus Operator components.cdc.replicas

11. POC Guidance

11.1 Validation Plan

  1. Environment: two K8s clusters (or dual-namespace), Milvus v2.6.16 + CDC on each side. Verify CDC pods Running.
  2. Basic functions: build A→B, exercise DDL (CreateCollection / CreateIndex / CreatePartition), DML (Insert / Delete / Upsert), DCL (CreateUser / Role / Privilege).
  3. Failover:
    • Graceful: with steady write traffic, A→B → B→A → A→B; verify zero data loss across cycles.
    • Coordinated: inject StreamingNode CrashLoop, verify the Fence Message + pauseConsumption fallback.
    • Force: simulate region partition, run force_promote, record CDC lag and observed data loss.
  4. Data Salvage: after the original primary is recovered, run DumpMessages and verify the salvaged data range is what’s expected. Heads-up: today only the raw gRPC stream is implemented — there is no Backup-Tool-driven JSON/Parquet export or object-storage persistence yet. Plan to write a small consumer client that drains the stream and stores it however your environment expects (S3, Parquet, plain file, etc.). See §6.4.
  5. Standby Rebuild: start from a primary with existing data, build a fresh standby, run all 5 stages.
  6. Performance:
    • Sustained primary writes at production-like vector dimension + batch size; record CDC lag P50/P99.
    • Large DDL bursts (e.g. dozens of partitions) and observe lag spikes.
  7. Observability: ingest all 6 CDC metrics into Grafana, configure thresholds.
  8. Chaos: kill -9 CDC pod, drop cross-cluster network briefly, briefly stop ETCD; verify recovery.
  9. Security: deploy per-cluster mTLS between source and target, verify cross-CA isolation by intentionally misconfiguring a cert (should fail).

11.2 Pre-Production Checklist

  • ☐ Both clusters on Milvus v2.6.16+, Operator v1.3.4+.
  • ☐ PChannel counts planned in advance and consistent across clusters.
  • ☐ CDC component enabled on both primary and standby (standby will need it after switchover).
  • ☐ Cross-cluster network reachability validated (CDC pods can reach target ClusterIP / DNS).
  • ☐ Per-cluster mTLS enabled on cross-cluster links.
  • ReplicateConfiguration templates code-reviewed (the delete-then-rebuild path can lose data — see FAQ).
  • ☐ Dashboards live, alerts configured against the 6 metrics + the CDC-lag PromQL.
  • ☐ All three failover modes drilled.
  • ☐ Standby Cluster Rebuild drilled end-to-end.

12. CDC Metadata Reference

This section catalogs every piece of persistent state CDC depends on — keys, schemas, owners, lifecycle. Useful for SREs operating CDC in production, debugging desync, or auditing what survives a restart.

12.1 Where each piece lives

Three storage layers participate:

LayerBacking storePrefixWhat it holds
StreamingCoord catalogETCDstreamingcoord-meta/Cluster-wide replication topology & per-pchannel CDC tasks
StreamingNode catalogETCDstreamingnode-meta/wal/{channel}/Per-PChannel WAL recovery state & replicate checkpoints
Streaming messagesWAL (Woodpecker / Pulsar / Kafka)In-flight control-plane messages, e.g. AlterReplicateConfigMessage

12.2 StreamingCoord meta (cluster-wide topology)

KeyProtoPurposeLifecycle
streamingcoord-meta/replicate-configurationReplicateConfigurationMetaThe single source of truth for the current cross-cluster topology on this cluster: list of clusters, edges, and a force_promoted flag indicating whether the current state came from a Force Failover.Written by the broadcaster ACK callback when an AlterReplicateConfigMessage commits. Replaces the previous value atomically. Read at SC startup to restore in-memory state.
streamingcoord-meta/replicating-pchannel/{targetClusterID}-{sourceChannelName}ReplicatePChannelMetaOne entry per source → target PChannel edge. Contains the source/target channel names, the target cluster’s connection info, and an initialized_checkpoint used when no prior progress exists (e.g. brand-new replication, or pchannel-increasing growth).Written by the same ACK callback as above (SaveReplicateConfiguration writes both keys in one call). Deleted when an AlterReplicateConfigMessage removes the edge. This is the prefix CDC Controller watches to start/stop channel replicators.
streamingcoord-meta/broadcast-task/...Internal broadcast task stateTracks in-flight broadcasts (e.g. an AlterReplicateConfigMessage being propagated to all pchannels). Force Failover scans this prefix to mark stale in-flight broadcasts with the ignore flag, preventing them from overwriting the new force-promote config.Created when broadcast starts; removed when all pchannels ACK or the broadcast is force-ignored.

Schema highlights

message ReplicateConfigurationMeta {
  common.ReplicateConfiguration replicate_configuration = 1; // clusters + edges
  AckedResult                   acked_result            = 2; // broadcast progress
  bool                          force_promoted          = 3; // true if from force_promote=true
}

message ReplicatePChannelMeta {
  string source_channel_name                  = 1; // e.g. clusterA-rootcoord-dml_3
  string target_channel_name                  = 2; // e.g. clusterB-rootcoord-dml_3
  common.MilvusCluster target_cluster         = 3; // target uri / token / cluster_id
  common.ReplicateCheckpoint initialized_checkpoint = 4; // start position for new edges
  bool   skip_get_replicate_checkpoint        = 5; // true for pchannel-increasing tasks
}

12.3 StreamingNode meta (per-PChannel WAL recovery)

These keys are scoped under streamingnode-meta/wal/{pchannel}/ on each PChannel’s catalog. The StreamingNode hosting that PChannel owns them; CDC reads them via RPC, not by direct ETCD access.

Key suffixProtoPurposeLifecycle
consume-checkpointWALCheckpointWhy it’s in this reference: this key is the only persistent record of CDC replication progress on a target cluster. If you want to audit “where exactly has the standby caught up to from the primary?” off-line — without calling RPCs, just inspecting etcd — this is where you look. Its replicate_checkpoint field is the durable mirror of the in-memory secondaryState.checkpoint that GetReplicateInfo returns. On node restart, the target rebuilds its in-memory CDC state from this key. The proto carries two checkpoints serving two audiences: - message_id / time_ticklocal WAL position; used by StreamingNode to resume WAL replay. - replicate_checkpointsource-side position; used by CDC (indirectly via GetReplicateInfo) to know where to resume forwarding.Batched persistence: recovery_background_task.go wakes every persistInterval (default 10s) and writes the in-memory dirty snapshot. Overwrite semantics — only the latest is kept. Read at node restart to rebuild in-memory state.
salvage-checkpoint/{sourceClusterID}commonpb.ReplicateCheckpointCaptured at the moment of Force Failover. Records the last source-side position this cluster had received from the (now-lost) source, keyed by source cluster ID. Enables Data Salvage: when the old primary comes back, this checkpoint tells Backup Tool from where to dump unreplicated messages.Written during the force_promote broadcast handling. Retention 7 days. After that the salvage opportunity is lost.

WALCheckpoint schema highlights

message WALCheckpoint {
  common.MessageID            message_id            = 1; // local WAL position
  uint64                      time_tick             = 2; // local TT
  int64                       recovery_magic        = 3; // schema version hint
  common.ReplicateConfiguration replicate_config    = 4; // joined topology, if any
  common.ReplicateCheckpoint  replicate_checkpoint  = 5; // SOURCE-SIDE MessageID + TT
  AlterWALState               alter_wal_state       = 6; // in-progress WAL alter ops
}

// Defined in milvus-proto:
message ReplicateCheckpoint {
  string                cluster_id  = 1; // source cluster id this checkpoint belongs to
  string                pchannel    = 2;
  common.MessageID      message_id  = 3; // source-side
  uint64                time_tick   = 4; // source-side
}

The two checkpoint perspectives in one WALCheckpoint entry — local vs replicate — are the bridge between “where am I in my own WAL” and “where am I in the source’s WAL”.

12.4 Streaming messages (transient, in-WAL)

Not persistent in a catalog — these flow through the WAL itself, but they’re CDC’s control-plane “events” and worth listing because operators see them in WAL traces / logs.

MessageTypeCarrierPurpose
MessageTypeAlterReplicateConfig (V2)Source cluster’s WAL (broadcast across all pchannels)The actual cross-cluster topology change event. Carries the new ReplicateConfiguration, plus force_promote and ignore flags. Doubles as the Fence Message for Coordinated Failover and Graceful Switchover (RPO=0 barrier).
MessageTypeFlushAllSource cluster’s WALForces flushing of all vchannels. Used by Standby Rebuild Stage ④ to materialize a ReplicateCheckpoint on the new standby exactly at the snapshot boundary.

12.5 Who reads / writes each piece

Two groupings here: (a) keys the CDC pod directly touches via etcd; (b) keys CDC never touches but whose contents are part of the CDC data path (CDC interacts with them only through RPC to other components).

(a) etcd keys CDC pod directly operates on

The CDC pod is a reader + conditional deleter of one prefix and never a writer/creator of any etcd key.

Meta keyCDC pod’s etcd opCode locationPurpose
replicating-pchannel/*Watch (long-poll, prefix)internal/cdc/controller/controller.go:90Reacts to PUT events to start channelReplicator for new edges; reacts to DELETE events to tear them down.
replicating-pchannel/*List/Get (prefix, on startup)controller.go:107meta/replicate_meta.go:96 (etcdCli.Get(prefix, ...))Full snapshot on startup or reconnect — reconciles in-memory replicator set with current truth.
replicating-pchannel/{specific key}Conditional Delete (txn with ModRevision compare)replicate_stream_client_impl.go:354meta/replicate_meta.go:105 (etcdCli.Txn().If(cmps).Then(delOp))When the AlterReplicateConfigMessage indicates this edge is removed (IsReplicationRemovedByAlterReplicateConfigMessage returns true), CDC cleans up its own ReplicatePChannelMeta and exits. The revision compare ensures it only deletes what it was managing.

That’s all of CDC’s direct etcd footprint. No Put, no Create, no other prefixes.

(b) etcd keys CDC never touches (but their content matters for CDC behavior)

These keys are written and read by other components (StreamingCoord, StreamingNode). CDC interacts with them only via RPC to those owners, never by reading etcd directly.

Meta keyWriterReader(s)How CDC interacts
replicate-configurationStreamingCoord (ACK callback of AlterReplicateConfigMessage)StreamingCoord (restart recovery, validation); SDK GetReplicateConfiguration (token-redacted)Not at all — CDC doesn’t need full topology, only its per-pchannel slice from replicating-pchannel/*
broadcast-task/*StreamingCoord (broadcaster)StreamingCoord (force-promote incomplete-broadcast cleanup)Not at all — purely SC-internal
consume-checkpointStreamingNode (background batch task, every persistInterval)StreamingNode itself (WAL recovery; rebuilds in-memory secondaryState.checkpoint on restart)Indirect — CDC calls GetReplicateInfo RPC; Proxy returns the in-memory secondaryState.checkpoint (whose persistent mirror is this etcd key). CDC never reads etcd here.
salvage-checkpoint/*StreamingNode (during force_promote broadcast handling)StreamingNode (loaded into in-memory SalvageCheckpoints)Indirect — Backup Tool (not CDC) calls GetReplicateInfo + DumpMessages RPC during Data Salvage

12.6 Operational implications

  • If replicating-pchannel/* diverges from replicate-configuration: serious — the watcher and the validator are looking at different truth. Usually points to an interrupted broadcast or partial restore. Re-applying UpdateReplicateConfiguration typically heals.
  • If consume-checkpoint.replicate_checkpoint is stale (far behind WAL latest): normal — it’s batched (10s default). If consistently > 30s behind under low load, the persist task may be stuck; check recovery_background_task logs.
  • If salvage-checkpoint/* has aged past 7 days after a force failover: Data Salvage window has closed. Any unreplicated data on the old primary is no longer recoverable through the standard flow.
  • Backup/restore of these metas: standard ETCD backup covers all of them. Restoring an ETCD snapshot mid-stream while CDC is running will desync — drain CDC and pause writes first.
  • Multi-cluster operator note: each cluster has its own copies of all keys above. They are not shared across clusters. Cross-cluster consistency is achieved through replicating AlterReplicateConfigMessage, not through shared meta storage.

13. FAQ

PyMilvus doesn't expose GetReplicateInfo — is it really available?

The RPC exists and works; only the high-level PyMilvus wrapper is missing.

  • Proto / server side: MilvusService.GetReplicateInfo is defined and implemented (internal/proxy/impl.go:7062). Returns checkpoint + salvage_checkpoint (both ReplicateCheckpoint).

  • Go SDK: implemented at client/milvusclient/replicate.go:38 as Client.GetReplicateInfo(ctx, req, opts...), with example in replicate_example_test.go.

  • PyMilvus: only the auto-generated low-level gRPC stub is available (pymilvus/grpc_gen/milvus_pb2_grpc.py:599); there is no high-level MilvusClient.get_replicate_info() method as of today. Workaround until a wrapper lands:

    from pymilvus.grpc_gen import milvus_pb2, milvus_pb2_grpc
    import grpc
    
    channel = grpc.insecure_channel("localhost:19530")
    stub = milvus_pb2_grpc.MilvusServiceStub(channel)
    resp = stub.GetReplicateInfo(milvus_pb2.GetReplicateInfoRequest(
        source_cluster_id="source-cluster",
        target_pchannel="target-cluster-rootcoord-dml_0",
    ))
    print(resp.checkpoint, resp.salvage_checkpoint)
    

    This bypasses MilvusClient’s connection management, token injection, and retry — fine for a one-off probe but not recommended for production until the high-level method is added.

  • Roadmap: a high-level MilvusClient.get_replicate_info() wrapper will land in a subsequent PyMilvus release. Once it ships, applications can call it the same way they call update_replicate_configuration today.

Does CDC affect primary query performance?

Not noticeably. CDC consumes the WAL asynchronously and is off the primary data path. It does share node resources with other Milvus components, so resource isolation matters under saturation.

Can the standby serve queries?

Yes—search, query, and metadata reads all work on the standby. Direct writes are rejected. Do not invoke DDL like load_collection manually on the standby; those operations should run on the primary and replicate over.

Can CDC backfill historical data that existed before CDC was enabled?

No. CDC captures only changes after activation. For non-empty primaries, run the Standby Cluster Rebuild flow first using the standalone zilliztech/milvus-backup tool: ./milvus-backup create -n bkp on the source, then ./milvus-backup restore secondary --source_cluster_id A --target_cluster_id B on the target. The tool replays the FlushAll captured at backup time to materialize the standby's ReplicateCheckpoint, after which CDC seamlessly takes over the incremental tail. See §7.

Will Bulk Insert / Import operations on the primary replicate to the standby via CDC?

Not today. CDC's automatic forwarding of MessageTypeImport from a source WAL is on the roadmap, not yet shipped — see the §3.3 message replication matrix. If your application relies heavily on Bulk Insert, this is a real gap to plan around: either pause Bulk Insert on the primary during steady-state CDC operation, or schedule periodic Standby Cluster Rebuilds matched to your Bulk Insert cadence so the standby catches up.

Note this is unrelated to the milvus-backup restore secondary path: that tool itself pushes ImportMsg into the target via ReplicateStream as part of the rebuild flow — that's a different code path and works fine. The gap is only about CDC's automatic forwarding of Import messages during normal replication.

Does graceful failover lose data?

No—RPO = 0. The standby waits until the AlterReplicateConfigMessage has been replicated before promoting.

Coordinated Failover's "retry + pauseConsumption escalation" — does Milvus do this automatically?

No. Milvus only provides the streaming.walScanner.pauseConsumption config switch — when you set it true, the SN responds by entering "WAL-write only" mode and breaks the crashloop. The whole "try the switch, fail, flip the config, retry again" loop is your control plane's responsibility, not Milvus internal logic. For POC you need to budget engineering time for a small script (or operator) that implements this escalation. See §6.2.

How much data does force failover lose?

The upper bound is the CDC lag at the moment of cutover. Zero lag = zero loss. Even when loss happens, Data Salvage on the recovered original primary lets you drain the missing window via the DumpMessages gRPC stream — note that today only the raw stream RPC is implemented (no JSON/Parquet serializer, no automatic object-storage writer, no replay tool), so you have to write a small consumer client that drains the stream and decides replay policy yourself. See §6.4.

After Force Failover in a 1-primary-many-standby topology, what happens to the sibling standbys?

They must be rebuilt against the new primary. Sibling standbys never received the Fence Message (the old primary was unreachable), so their state may have diverged from the promoted standby in either direction — they could be ahead or behind. No general-purpose mechanism exists to reconcile in place. Use milvus-backup restore secondary --source_cluster_id <new-primary> --target_cluster_id <sibling> for each sibling.

This is Force-specific. Graceful and Coordinated Failovers propagate the Fence Message to all standbys (broadcast through CDC), so sibling rewire is automatic there — no rebuild needed. See §6.5 and §7.

Why isn't active-active supported?

Vector DB writes (Insert / Upsert / Delete) don't admit cheap cross-cluster conflict resolution—primary-key collisions and vector versions require business decisions. The 2.6 CDC scope is engineering-viable DR; full multi-master is out of scope.

How do I delete an established replication relationship?

Submit an UpdateReplicateConfiguration whose new topology does not include the source→target edge. CDC detects the edge removal (IsReplicationRemovedByAlterReplicateConfigMessage in internal/cdc/util/util.go), calls RemoveReplicatePChannelWithRevision to delete the ReplicatePChannelMeta from ETCD, and tears down the channel_replicator.

Code-wise there is no lock preventing re-creation. The real risk is that the standby's ReplicateCheckpoint is retained: if you re-establish replication after a window longer than WAL retention (default 3 days), the messages between the delete and re-create are gone from the source WAL, and CDC will resume from the stale checkpoint with a data hole. To safely re-create, run Standby Cluster Rebuild to reset the standby state instead of relying on the residual checkpoint.

How do I pause replication?

There is no pause API. Workaround: scale the CDC pod to 0 — when scaled back up, it resumes from the persisted ReplicateCheckpoint with no data loss (CDC's resume position is on the target side, not in the CDC pod itself).

Do NOT use "remove the edge from topology" as a pause. That's a delete operation (see the next FAQ). Once deleted, the standby's residual ReplicateCheckpoint stays put but the WAL retention window keeps ticking; re-establishing replication after the WAL retention expires will silently lose data between delete and re-create.

Cross-version replication 2.5 ↔ 2.6?

Not supported. 2.5 CDC and 2.6 CDC are different implementations on different message subsystems. Use Backup + Restore or in-place upgrade to migrate; CDC works between 2.6+ clusters only.

PChannel count is wrong after creation — can I change it?

A cluster's initial pchannel count is set at creation time (tied to shardNum). UpdateReplicateConfiguration supports topology-wide append-only growth — you can submit a new config where every cluster has more pchannels (N→N+k) in lock-step, and CDC will provision replicators for the newly added pchannels using InitializedCheckpoint rather than GetReplicateInfo. The validator still rejects any configuration where clusters have different pchannel counts at the same time. Plan baseline capacity at production scale; treat topology growth as a coordinated operational change, not a quick fix for an undersized cluster.

What if the CDC pod crashes?

The primary cluster's writes are unaffected (CDC is off the write path). On restart, CDC resumes from the latest ReplicateCheckpoint and the standby catches up; lag metrics rise temporarily. Kubernetes restarts the pod automatically.

Cross-region object storage?

CDC only consumes the WAL—it has no direct coupling to object storage. Source and target may use independent buckets in different regions; the only inter-region requirement is that the ReplicateStream RPC be reachable.

CDC vs Backup?

Complementary:

  • CDC: incremental, sub-second to seconds of lag, hot standby. Built into Milvus 2.6 core (internal/cdc/) — no separate install.
  • Backup: full snapshot, minutes to hours, used for initial bootstrap and migration. Standalone tool zilliztech/milvus-backup — not part of the Milvus main repo. Install separately (binary release or brew install zilliztech/tap/milvus-backup). See §7.
Is there automatic conflict resolution / append on salvage data?

No. Conflict resolution (skip / overwrite / merge on primary-key collision) is business-defined. The system surfaces the salvaged data; the user owns the replay policy.