vLLM推理服务假死排查-多模态缓存幽灵Key导致死循环

发布时间：2026/7/2 11:05:49

契机生产环境中的vLLM推理服务突然假死进程存活、显存正常占用、健康检查端点返回正常但所有推理请求全部超时。GPU-util为0CPU拉到100极端环境出现特指高并发环境说明推理框架vLLM-v0.23.0模型多模态大模型驱动Driver Version: 575.57.08cudarelease 12.9, V12.9.86启动参数VLLM_LOGGING_LEVELDEBUG \ PYTHONUNBUFFERED1 \ CUDA_VISIBLE_DEVICESx \ nohup /home/conda/envs/vllm/bin/vllm serve /home/models/xxxx \ --served-model-name xxxx \ --host 0.0.0.0 \ --port 20100 \ --trust-remote-code \ --enable-prefix-caching \ --gpu-memory-utilization 0.9 \ --uvicorn-log-level debug \ --log-error-stack \ --enable-log-requests \ --enable-log-outputs \ --enable-request-id-headers \ --max-log-len 2000 \ --enable-logging-iteration-details \ vllm_debug.log 21 运行时长服务已稳定运行数天约 100 万次迭代突发假死这里多说一嘴由于我是12.9的驱动所以装vllm相当吃力生产环境又不敢升级驱动如果你刚好也是12.x驱动一定要求https://vllm.ai/找安装方法并且安装过程需要把加速源全部禁用安装过程中观察下载连接和地址一定不要从镜像下载pip install vllm --extra-index-url https://wheels.vllm.ai/0.24.0/cu129 --extra-index-url https://download.pytorch.org/whl/cu129 --index-strategy unsafe-best-match故障现象检测项状态备注进程 PID存活EngineCoreGPU 显存正常正常GPU 利用率0%异常CPU 利用率136%EngineCore 进程异常高/health端点200 OK正常返回/v1/models端点200 OK模型列表正常/v1/chat/completions超时15s 无响应诊断过程外部观测# 1. GPU状态nvidia-smi# GPU : 显存 xxxxMiB / 利用率 0%# 2. 进程CPU占用top-H-p 3791245# EngineCore 子进程 CPU 136%远超正常值# 3. 健康检查curl-s http://0.0.0.0:20100/health# 返回 OK# 4. 推理请求curl-s--max-time 15 http://0.0.0.0:20100/v1/chat/completions \-H Content-Type:application/json \-d {...}# 超时无响应服务处于假死状态——http层存活但调度引擎冻结debug日志分析# 5. 查找最后一次正常推理grep-a generation tokens debug.log|tail-20# 最后一次有实际生成吞吐00:30:42之后全部归零# 6. 查找EngineCore最后一条日志grep-a EngineCore debug.log|tail-10# 最后一条00:30:33 EngineCore waiting for work# 之后 EngineCore 再无任何输出EngineCore在00:30:33之后完全静默主循环停止运转Metrics端点# 7. 查看vLLM内部指标curl-s http://0.0.0.0:30000/metrics|grep-E running|waiting|queue# vllm:num_requests_running: 0# vllm:num_requests_waiting: 0请求已被APIServer接收日志中可见 Added request但EngineCore的调度器显示 Running:0,Waiting:0 —— 请求在到达调度器之前就消失了。py-spy抓栈# 8. 安装 py-spy不依赖 ptrace通过 /proc 读取pip install py-spy# 9. 抓取 EngineCore 进程的 Python 调用栈py-spy dump--pid 3794291--locals#Thread-2持有GIL在popitem中死循环。由于CPython的GIL机制该线程阻塞了EngineCore所有其他Python线程Thread 3821499 (activegil):Thread-2 (process_input_sockets)popitem (vllm/utils/cache.py:196) 卡在这里 __setitem__ (cachetools/__init__.py:82) __setitem__ (cachetools/__init__.py:294) get_and_update_item (vllm/multimodal/cache.py:662) get_and_update_features (vllm/multimodal/cache.py:607) preprocess_add_request (vllm/v1/engine/core.py:829) process_input_sockets (vllm/v1/engine/core.py:1528)strace确认# 10. strace跟踪EngineCore子进程strace-p 3794291-f-e tracefutex-c# 3 秒内仅 8 次 futex 调用微秒级# 其余时间全部在用户态Python代码中空转确认纯用户态 CPU 消耗没有系统调用阻塞是Python死循环源码分析# cachetools/__init__.py:82class Cache:def __setitem__(self,key,value):...while self.__currsize size self.maxsize:self.popitem()# ← 调用 vllm 重写的 popitem# vllm/utils/cache.py:196def popitem(self,remove_pinnedFalse):lru_key next(key for key in self.order if key not in self.pinned_items) value self.pop(lru_key)# ← pop 一个不在实际缓存中的 keyreturn (lru_key,value)# 问题pop() 对不存在的 key 返回 Nonecurrsize 没有减少# 回到 cachetools 的 while 循环currsize size maxsize 仍为真# → 再次 popitem → 再次失败 → 死循环刚好去搜索日志中的异常grep-a AssertionError\|Expected a cached debug.log#找到报错日志(EngineCore pid3794291) ERROR 06-29 00:28:41[v1/engine/core.py:1616]AssertionError:Expected a cached item for mm_hasha874ce...4739d0结论touch() 方法cache.py#L120-L124对不存在的 key 将其写入内部 LRU 顺序链表 __order但不写入实际缓存数据 _Cache__dataget_and_update_item()cache.py#L660在 assert 失败时setitem未被执行但 touch() 已写入的幽灵 key 残留在 __order 中后续任意请求触发缓存写入 →setitem进入驱逐循环popitem() 从 __order 取出幽灵 key → pop() 找不到实际数据 → currsize 不减 → 死循环验证复盘# vLLM GitHub Issue #43941# 标题: Bug: Infinite loop in EngineCore during multimodal cache eviction# 状态: 已关闭合入 PR #43595# 修复版本: v0.24.0完全匹配的已知 Bug与我们的 py-spy 堆栈、日志、时间线一致时间线 00:28:41 多模态缓存 AssertionError 触发 └→ touch() 已将幽灵 key 写入 __order └→ assert 失败__setitem__ 未执行幽灵 key 残留 00:28:41 引擎继续处理其他请求幽灵 key 暂不影响 00:30:33 某请求触发缓存写入 → __setitem__ 进入驱逐循环 └→ popitem() 取出幽灵 key └→ pop() 返回 Nonecurrsize 未减少 └→ while 循环条件永真 → 死循环 └→ Thread-2 持有 GILEngineCore 完全冻结 00:30:33 所有后续请求均失败客户端超时位置代码问题cache.py#L120-L124touch()的 except 分支将不存在的 key 写入__order产生幽灵 keycache.py#L196-L208popitem()驱逐逻辑未检查 key 是否真实存在于缓存中cache.py#L658-L664get_and_update_item()touch()和__setitem__之间非原子异常导致不一致修复方案本着不升级vllm的方案我直接去改conda环境中包的源码touch-不存在的key不操作# 原代码L120-L124 def touch(self, key: _K) - None: try: self._LRUCache__order.move_to_end(key) except KeyError: self._LRUCache__order[key] None # ← 幽灵 key 来源 # 修复后 def touch(self, key: _K) - None: if key in self: self._LRUCache__order.move_to_end(key)popitem—跳过并清理幽灵key# 原代码L196-L208 def popitem(self, remove_pinned: bool False): if not remove_pinned: lru_key next( (key for key in self.order if key not in self.pinned_items), ALL_PINNED_SENTINEL, ) if lru_key is ALL_PINNED_SENTINEL: raise RuntimeError(...) else: lru_key next(iter(self.order)) value self.pop(cast(_K, lru_key)) return (lru_key, value) # 修复后 def popitem(self, remove_pinned: bool False): while True: if not remove_pinned: lru_key next( (key for key in self.order if key not in self.pinned_items), ALL_PINNED_SENTINEL, ) if lru_key is ALL_PINNED_SENTINEL: raise RuntimeError(...) else: lru_key next(iter(self.order)) if lru_key in self: # ← 仅驱逐真实存在的数据 value self.pop(cast(_K, lru_key)) return (lru_key, value) self._LRUCache__order.pop(lru_key, None) # ← 清理幽灵 key两层防御改动1从源头杜绝幽灵key产生改动2确保即使有其他路径产生幽灵key驱逐时也不会死循环对应上游PRvllm-project/vllm#43595已合入v24复现路径vLLM v0.23.0 或更早版本使用多模态模型视觉-语言模型启用prefix caching在缓存接近容量上限时发送一条触发了多模态hash不存在于缓存的请求# 1. 启动 vLLM 服务使用 v0.23.0vllm serve 多模态模型路径\--host 0.0.0.0--port 20100 \--enable-prefix-caching \--gpu-memory-utilization 0.9# 2. 发送大量多模态请求填满缓存for i in $(seq 1 500); do curl-s http://0.0.0.0:20100/v1/chat/completions \-H Content-Type:application/json \-d {model:...,messages:[{role:user,content:[{type:text,text:describe},{type:video_url,video_url:video_path}]}]} done# 3. 发送触发幽灵 key 的请求# 需满足mm_hash 在 cache 的 __order 中但不在 _Cache__data 中# 实际场景中较难精确控制高并发下概率触发# 4. 验证假死curl-s--max-time 10 http://0.0.0.0:20100/v1/chat/completions \-H Content-Type:application/json \-d {model:...,messages:[{role:user,content:hello}]}# 预期超时# 5. 验证根因py-spy dump--pid EngineCore_pid--locals# 预期Thread-2 卡在 vllm/utils/cache.py popitem()单元测试如下# 模拟 touch() 产生的幽灵 keyfrom vllm.utils.cache import LRUCache cache LRUCache(capacity100,getsizeoflambda x:10) cache[real_key] real_value# 模拟 touch() 对不存在的 key 写入 __ordercache._LRUCache__order[ghost_key] None# 填满缓存触发驱逐for i in range(9):cache[fkey_{i}] fvalue_{i}# 此时再次 setitem 会触发 popitem()# 如果 popitem 没有防御性检查将死循环try:cache[new_key] new_valueexcept Exception:pass# 在修复前会死循环总结

资讯详情

vLLM推理服务假死排查-多模态缓存幽灵Key导致死循环

相关新闻

深度掌控AMD Ryzen处理器：SMUDebugTool硬件级调试实战指南

Dify 1.15 人工介入功能详解：构建可控AI工作流

KMX63与PIC18LF47K40在HMI手势交互中的应用

智能过热保护系统设计与工程实践

Metasploit渗透测试框架：从模块化架构到实战攻防演练

移动太阳能追踪系统：提升偏远地区电力供应效率

LTE NAS解密实战：从信令采集到密钥推导的完整工程实现

有限域置换多项式在图像加密中的应用与MATLAB实现

《UNIX 网络编程-卷1》 服务类型

BurpSuite Cluster Bomb模式深度避坑指南：从原理到实战的完整爆破策略

Selenium元素定位全解析：从八大方法到实战策略

移动端UI自动化测试框架Maestro终极指南：从入门到实战

管理者的六个层次

华为OD机试2025C卷-座位调整[100分]（ Java _ Python3 _ C++ _ C语言 _ JsNode _ Go）实现100%通过率

棋牌室茶楼收银别只看价格：京东收银K6收银机为什么值得小店老板优先了解

基于Dify与DeepSeek构建私有知识库问答系统实战指南

FAE放射组学分析工具：医学影像特征探索的完整解决方案

DesktopNaotu：你的终极离线思维导图解决方案，告别网络依赖！

《UNIX 网络编程-卷1》服务类型