实验日期 2026-05-18 · 服务端 Apache Pulsar 3.0.2 standalone(对齐 Milvus 2.6 cluster 部署)· 参考文档:《Pulsar go client 对于触发 backlog quota limit 的处理》
一、结论(TL;DR)
二、版本事实(回答"Milvus 2.6 用啥")
| 项目 | Milvus 2.6 实际取值 | 来源 |
|---|---|---|
| pulsar-client-go(客户端) | v0.17.0(含此 bug) | git show v2.6.15:go.mod / pkg/go.mod,master 同为 v0.17.0 |
| Pulsar 服务端 · cluster 模式 | Apache Pulsar 3.0.2 ( apachepulsar/pulsar-all:3.0.2) | Helm 部署 → milvus-helm pulsarv3 子 chart = apache pulsar-helm-chart 3.3.0,Chart.yaml: appVersion 3.0.2,defaultPulsarImageTag 留空回落 appVersion |
| Pulsar 服务端 · 本地 docker | 2.8.2(仅遗留 dev compose) | docker-compose.yml / deployments/docker/dev —— 不代表真实集群部署 |
说明:cluster 模式走 Helm,真实服务端是 Pulsar 3.0.2;仓库里的 2.8.2 是本地开发遗留路径。实验主服务端据此选用 3.0.2。
三、根因(源码层面)
三版 reconnectToBroker() 都已是 internal.Retry(opFn) 结构。差异只有一处——命中 ProducerBlockedQuotaExceededException 后是否提前 return:
// v0.17.0 / v0.18.0 —— BUG:return nil 让 Retry 视为"成功"→ 停止重连
if strings.Contains(errMsg, errMsgProducerBlockedQuotaExceededException) {
p.log.Warn("Producer was blocked by quota exceed exception, failing pending messages, stop reconnecting")
p.failPendingMessages(errors.Join(ErrProducerBlockedQuotaExceeded, err))
return struct{}{}, nil // ← producer 永久死亡,不再重连
}
// v0.19.0 —— FIX(PR #1457):去掉 return,fallthrough 到 return err → Retry 继续 backoff 重连
if strings.Contains(errMsg, errMsgProducerBlockedQuotaExceededException) {
// ProducerBlockedQuotaExceededException is a retryable exception,
// we only fail pending messages but continue trying to reconnect
p.log.Warn("Producer was blocked by quota exceed exception, failing pending messages, will retry reconnecting")
p.failPendingMessages(errors.Join(ErrProducerBlockedQuotaExceeded, err))
// 无 return —— 继续往下走到 return err,Retry 继续重连
}
- 引入:commit
bd11581867· PR #1134(2023-12)—— 把 quota 异常和TopicNotFound/TopicTerminated/ProducerFenced一起当成终态错误放弃重连。但 quota 是可恢复瞬态错误,归错类。 - 修复:commit
d7fafb5a55· PR #1457(2026-03-20)“regard ProducerBlockedQuotaExceededException as retryable exception to continue to reconnect”。 - 关键:该修复仅包含在 v0.19.0,v0.17.0 与 v0.18.0 均有此 bug。→ Milvus 修复的最小升级目标 = v0.19.0,升到 v0.18.x 无效。
四、实验设计(脱离 Milvus,复刻 PDF)
这是纯客户端重连逻辑问题,与 Milvus 无关:PDF 里那些日志(Reconnecting to broker、Failing N messages on timeout 30s、message send timeout: TimeoutError)全是 pulsar-client-go 内部 logrus 日志。harness 把 client logger 接到 logrus 即可逐行复刻。
- 拓扑(沿用 PDF 命名):topic
persistent://public/default/high-concurrency-test-0、消费订阅high-concurrency-sub-0、单 producer 灌 10KB 消息、单 consumer 镜像 PDF 消费端日志。 - 服务端:独立隔离容器
apachepulsar/pulsar:3.0.2standalone(端口 16650/18083,不碰 Milvus 依赖的 6650);调优backlogQuotaCheckIntervalInSeconds=5、小 ledger 加速。 - 配额:
namespaces set-backlog-quota public/default --limit 50K --policy producer_exception;另建一个永不消费的持久订阅 backlog-holder-sub 稳定堆 backlog。 - 三阶段:① P1 flood 灌到 broker 拦截 producer → ② P2 observe 60s 观察卡死 → ③ P3 recover
pulsar-admin clear-backlog(等价"消费者追上")后观察是否自愈 —— P3 是 PDF 没覆盖、本次新增的核心对照。 - 版本切换:单 module,
go get pulsar-client-go@v0.17.0跑一遍 →@v0.19.0再跑一遍;两版实验代码完全相同。
backlogSize 归零、pulsar-client produce 探针"1 messages successfully produced",即 broker 此时已放行新 producer。所以 v0.17 的 NO_RECOVERY 纯粹是客户端自己放弃重连,而非服务端仍拒绝——对照有效。同时验证 Pulsar 3.0.2 仍返回
ProducerBlockedQuotaExceededException 错误串,客户端 strings.Contains 匹配在 Milvus 2.6 实际服务端版本上有效。五、实验结果
| 观测点 | v0.17.0(Milvus 2.6 现状) | v0.19.0(修复后) |
|---|---|---|
| broker 触发拦截 | ProducerBlockedQuotaExceededException @ +4.7s | ProducerBlockedQuotaExceededException @ +3.0s |
| 客户端内部关键日志 | stop reconnecting ×1 | will retry reconnecting ×22 |
重连尝试 Reconnecting to broker | 0 次(立即放弃) | 22 次(持续 backoff) |
| 清 backlog 后成功重连 | 0(已放弃,永不再试) | Reconnected producer to broker ×1 |
| P3(清 backlog 后)写入结果 | 6× message send timeout: TimeoutError,始终未恢复 | +23s 后 sent ok 持续成功(total 58→63…) |
| VERDICT | NO_RECOVERY —— producer 永久死亡,复刻 PDF 结局 | RECOVERED in 23s —— 自动自愈 |
时间线
六、关键日志摘录(逐行可对 PDF)
v0.17.0 —— 复刻 PDF “卡死不恢复”
level=error msg="Failed to create producer at reconnect" error="server error: ProducerBlockedQuotaExceededException: ...TopicBacklogQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" level=warning msg="Producer was blocked by quota exceed exception, failing pending messages, stop reconnecting" [+1m8.774s] PHASE3 clearing backlog (== consumers caught up) [+1m8s ~ +3m] 期间 "Reconnecting to broker" 出现 0 次 · 6× message send timeout: TimeoutError RESULT=NO_RECOVERY producer permanently dead after quota block; clearing backlog did NOT bring it back
v0.19.0 —— 持续重连 → 清 backlog 后自愈
level=warning msg="Producer was blocked by quota exceed exception, failing pending messages, will retry reconnecting" (×22,伴随 backoff "Reconnecting to broker") [+1m34.31s] PHASE3 clearing backlog [+1m35.8s] clear-backlog sub=backlog-holder-sub · [+1m37.4s] clear-backlog sub=high-concurrency-sub-0 level=info msg="Connected producer" cnx="127.0.0.1:51714 -> 127.0.0.1:16650" epoch=11 ... level=info msg="Reconnected producer to broker" cnx="127.0.0.1:51714 -> 127.0.0.1:16650" ... [+1m57.022s] >>> Producer 0 RECOVERED: first ok send +23s after clear-backlog [+1m57.022s] Producer 0 sent ok (recovered) total=58 … 59 … 60 … RESULT=RECOVERED in 23s after backlog cleared (auto self-heal)
七、对 Milvus 2.6 的影响与建议
producer_exception backlog quota 策略下,一旦 topic 触发配额,对应 producer 会永久死亡、即使 backlog 清空也不会自愈,写入持续 30s 超时直至该组件重启。这是生产级隐患。建议(按优先级)
- P0 止血 部署侧把 backlog quota 策略改为
producer_request_hold:纯 Pulsar namespace 配置、不动代码,backlog 恢复后 producer 自动恢复;配套 backlog quota 监控告警。 - P1 根因 Milvus 升级 pulsar-client-go 至 v0.19.0+(含 PR #1457)。注意最小目标是 v0.19.0,v0.18.x 无效;2.6 hotfix 需评估 v0.17→v0.19 的 API/行为差异。
- 注意 升级客户端不是银弹:实验显示 v0.19.0 每次 quota 命中仍会
failPendingMessages(在途消息照样失败抛ErrProducerBlockedQuotaExceeded),修复的只是"producer 不再永久死亡、backlog 清掉后能自愈"。Milvus 上层(msgstream/WAL)仍需对该错误做重试 + 背压,否则 backlog 期间写入仍会失败。 - 观察 v0.19.0 稳态行为是 block → retry → recover →(backlog 再涨)→ block 的振荡,但永不永久死亡;v0.17.0 是首次 block 即永久死亡。这是两者本质区别。
八、复现方式
# 1. 隔离 Pulsar 3.0.2 standalone(端口 16650/18083,不碰 Milvus 依赖)
docker run -d --name pulsar-quota-exp -p 16650:6650 -p 18083:8080 \
apachepulsar/pulsar:3.0.2 bash -c 'printf "\nbacklogQuotaCheckIntervalInSeconds=5\n..." \
>> conf/standalone.conf && exec bin/pulsar standalone --no-functions-worker --no-stream-storage'
# 2. 两版各跑一遍(setup.sh 配额+重置 topic+建滞后订阅;harness 三阶段)
cd /tmp/pulsar-quota-exp
bash run.sh v0.17.0 # → run_v0.17.0.log (NO_RECOVERY)
bash run.sh v0.19.0 # → run_v0.19.0.log (RECOVERED in 23s)
实验工程与原始日志:/tmp/pulsar-quota-exp/(main.go / setup.sh / run.sh / run_v0.17.0.log / run_v0.19.0.log)。
Pulsar backlog quota → producer 重连复刻对比实验 · 生成于 2026-05-18 · 服务端 Apache Pulsar 3.0.2(对齐 Milvus 2.6 cluster)· 客户端 pulsar-client-go v0.17.0 vs v0.19.0 · 上游修复 PR apache/pulsar-client-go#1457