Harness Engineering 实践案例:如何Agent 写一份行为规范

发布时间:2026/7/2 2:45:56
Harness Engineering 实践案例:如何Agent 写一份行为规范 Harness Engineering 实践案例如何给编码 Agent 写一份行为规范OpenAI 的 Ryan Lopopolo 那发布了一篇关于Harness 的官方文章我们来用手头的一个任务来测试下效果怎么样。这是一个内部RAGRetrieval-Augmented Generation和 fine-tuning 系统同事直接提问系统基于 OEM 合作伙伴提供的官方白皮书和数据手册给出答案回答会附带来源引用同事可以反馈系统据此学习幻觉hallucination和错误回答会逐渐减少。项目基本沿用了 OpenAI 团队在 Harness Engineering 文章《Leveraging Codex in an Agent-First World》里的文件夹结构具体是这样的. ├── AGENTS.md ├── ARCHITECTURE.md ├── CLAUDE.md ├── Makefile ├── README.md ├── apps │ ├── api │ │ ├── app │ │ ├── build │ │ ├── internal_llm_harness_api.egg-info │ │ ├── migrations │ │ ├── pyproject.toml │ │ └── tests │ └── web │ ├── dist │ ├── index.html │ ├── node_modules │ ├── package.json │ ├── pnpm-lock.yaml │ ├── postcss.config.cjs │ ├── public │ ├── src │ ├── tailwind.config.ts │ ├── tsconfig.app.json │ ├── tsconfig.app.tsbuildinfo │ ├── tsconfig.json │ ├── tsconfig.node.json │ ├── tsconfig.node.tsbuildinfo │ └── vite.config.ts ├── docker-compose.yml ├── docs │ ├── DESIGN.md │ ├── FRONTEND.md │ ├── PLANS.md │ ├── PRODUCT_SENSE.md │ ├── QUALITY_SCORE.md │ ├── RELIABILITY.md │ ├── SECURITY.md │ ├── design-docs │ │ ├── core-beliefs.md │ │ ├── index.md │ │ ├── rag-architecture.md │ │ └── security-boundaries.md │ ├── exec-plans │ │ ├── active │ │ ├── completed │ │ └── tech-debt-tracker.md │ ├── generated │ │ └── db-schema.md │ ├── product-specs │ │ ├── admin-console.md │ │ ├── document-ingestion.md │ │ ├── index.md │ │ └── internal-ai-assistant.md │ └── references └── skills-lock.jsonapps目录放的是前后端应用代码。AGENTS.md和ARCHITECTURE.md保存编码 agent 的操作规则系统目标、任务说明、技术栈等等。docs目录下的 markdown 文件则是我希望 agent 遵循的核心开发指令。流程大致是agent 处理exec-plan/active里的活动 markdown 文件做完就挪到exec-plan/completed技术问题写进tech-debt-tracker.md。整个工作流算是清晰、可追溯agent 不容易跑偏每一步决策也都留了记录。正式的生产级应用开发测试和质量指标是绕不开的。编码 agent 的目标是希望每个功能实现都经过测试最终服务于同一个目标——构建能从用户反馈中学习的 RAG 和 fine-tuning 系统。连这些 markdown 文件本身最初也是 Codex 定义的但我得确保它们始终圈定在我真正想要的应用范围内。AGENTS.md为了让编码 agent搭配 Copilot 的 Codex以及 Claude Code始终对齐项目目标仓库根目录下放了一个AGENTS.md文件充当 agent 行为的唯一可信来源single source of truth。产品核心目的、文件夹结构、操作规则、预期开发循环都写在这一个文件里。它告诉 agent“这是我们要构建的东西这是你该怎么工作这是任务真正完成的标准。”意图和约束由人来定义agent 负责实现、测试、记录文档。关键规则包括改代码前先读相关规范优先提交小型 pull request行为变更了就更新文档每个信任边界都要校验数据。“完成的定义”Definition of Done卡得比较严检索结果要标注来源权限必须强制执行指标可观测测试要覆盖主路径和至少一条失败路径。项目里实际用的AGENTS.md文件内容大致是这样# AGENTS.md This repository is designed for agent-assisted development: humans define intent, constraints, and review standards; agents implement, test, document, and improve the system. ## Product Build an internal AI system for company use: - Organization-network access only for end users. - LLM runtime with Ollama. - Model routing across DeepSeek R1 distilled models, Mistral, and Llama 3.1 class models. - RAG over approved OEM whitepapers, datasheets, and internal documents. - JWT, RBAC, document-level permissions, audit logs, and prompt-injection controls. - React web app backed by a Python API backend that also owns LLM orchestration. - Observability across latency, token usage, cache hit rate, retrieval quality, and hallucination feedback. ## Start Here - Architecture map: ARCHITECTURE.md - Copilot/Codex instructions: .github/copilot-instructions.md - Product behavior: docs/product-specs/index.md - Engineering plans: docs/exec-plans/active/ - Security rules: docs/SECURITY.md - Reliability rules: docs/RELIABILITY.md - Quality scorecard: docs/QUALITY_SCORE.md - Frontend rules: docs/FRONTEND.md - Design principles: docs/DESIGN.md - External/library references for LLMs: docs/references/ ## Agent Operating Rules 1. Before changing code, read the relevant product spec, design doc, architecture section, and active execution plan. 2. Prefer small, reviewable PR-sized changes. 3. If a requirement is ambiguous, write the assumption into the active execution plan before implementing. 4. Update docs when behavior, interfaces, data shapes, security rules, or operational assumptions change. 5. For frontend work, use Tailwind CSS and shadcn/ui components unless an existing design system overrides this. 6. Internet access is allowed for approved runtime integrations, but company data, prompts, traces, and documents must only flow to approved services. 7. Validate data at every trust boundary: upload, auth, retrieval, tool call, model response, and API response. 8. Treat security, observability, and evaluation tooling as product code. 9. When you discover repeated review feedback, convert it into docs, tests, lints, or checklists. ## Expected Agent Loop 1. Read task and relevant docs. 2. Create or update an execution plan in docs/exec-plans/active/. 3. Implement the smallest coherent slice. 4. Run tests, linters, type checks, and relevant evaluation scripts. 5. Validate manually through API/UI where applicable. 6. Update generated docs such as schema maps. 7. Record decisions and remaining risks in the execution plan. 8. Move completed plans to docs/exec-plans/completed/. ## Definition of Done - Product behavior matches the relevant spec. - Access control and document permissions are enforced. - Retrieval results are source-attributed. - Model outputs include uncertainty or refusal behavior where required. - Tests cover the main path and at least one failure path. - Observability emits useful traces, metrics, and audit events. - Documentation reflects the implemented behavior.ARCHITECTURE.mdAGENTS.md定义的是 agent 的规则ARCHITECTURE.md定义的则是系统的骨架——agent 要懂的不只是构建什么还得清楚各个部分怎么拼在一起。这份文档写清楚了完整的系统目标而且同事也可以查询经批准的 OEM 白皮书和数据手册敏感数据始终受公司安全策略管控。里面还有一张 ASCII 图画出从 React UI 到 Python API 后端、身份验证、编排层一路到承载 DeepSeek R1、Mistral、Llama 3.1 模型的 Ollama runtime 的完整数据流。types → config → repository → service → runtime → APIagent 没法再制造出混乱的循环依赖。数据边界同样明确——不信任上传的文档不把检索到的文本当指令模型输出没经过后处理也不能信。此外还加了几条要求回答必须附带引用来源管理员的每次变更操作都要记审计日志每次模型调用都要输出 trace ID。指导整个应用开发过程的完整ARCHITECTURE.md文件内容如下。# Architecture ## System Goal The system is an internal AI platform that is available only inside the organization network. Employees ask questions over approved company/OEM documents while documents, prompts, traces, access logs, and sensitive data remain governed by company security policy. ## Core Runtime text React UI - Python API Backend - AuthZ Request Validation - Document Service - Conversation Service - LLM Orchestrator - Prompt Templates - Model Router - RAG Orchestrator - Guardrails - Tool Calling - Ollama Runtime - DeepSeek R1 distilled model - Mistral 7B - Llama 3.1 Document Upload - Parser - Chunker - Embedding Worker - Object Storage - Postgres pgvector Cross-cutting: Redis cache, OpenTelemetry, Prometheus, Grafana, ELK, audit logs. ## Recommended Stack - Frontend: React TypeScript. - Backend/API: Python FastAPI by default. - Orchestrator: Python service layer inside the backend, using direct provider adapters first and LangChain/LlamaIndex only where they clearly reduce complexity. - LLM runtime: Ollama. - Default local model pool: deepseek-r1:8b or deepseek-r1:14b, mistral:7b, and llama3.1:8b. - Optional larger model pool: deepseek-r1:32b, deepseek-r1:70b, llama3.1:70b, or larger hosted/cluster models when hardware and policy allow. - Vector store: Postgres with pgvector. - Raw document storage: AWS S3. - Cache: Redis. - Observability: OpenTelemetry, Prometheus, Grafana, ELK/OpenSearch. ## Application Layout Use two applications: text apps/ web/ # React TypeScript frontend api/ # Python backend, API, RAG pipeline, and LLM orchestrator The Python API acts as the BFF for the React UI. A separate API gateway can be introduced later for enterprise routing, WAF, SSO edge integration, or multi-service deployments. ## Model Policy Do not use generic names such as DeepSeek or Llama in configuration. Pin explicit Ollama model tags. Recommended starting configuration: | Route | Model | Why | | --- | --- | --- | | Default QA | llama3.1:8b | Balanced default for cited document QA. | | Fast/simple tasks | mistral:7b | Low-latency summarization, extraction, and classification. | | Reasoning-heavy tasks | deepseek-r1:8b or deepseek-r1:14b | Better fit for multi-step reasoning and technical synthesis. | | Optional stronger reasoning | deepseek-r1:32b | Use only if hardware latency is acceptable. | Avoid DeepSeek V3/V3.1 as the default local target because those models are much larger than the intended first deployment profile. Treat full-size DeepSeek R1 and Llama 3.1 70B models as production/cluster options, not POC defaults. ## Python Backend Modules text apps/api/ app/ main.py api/routes/ auth.py chat.py documents.py admin.py core/ config.py security.py telemetry.py services/ audit.py ingestion.py model_router.py orchestrator.py retrieval.py providers/ object_storage.py ollama.py postgres.py redis.py schemas/ auth.py chat.py documents.py ## Domain Layers Each domain should follow strict dependency direction: text types - config - repository - service - runtime - API/UI providers - service utils - providers Allowed domains: - identity: users, JWT, RBAC, groups. - documents: upload, parsing, metadata, permissions. - retrieval: chunking, embedding, vector search, reranking. - orchestration: prompt templates, routing, tool calls, guardrails. - conversations: chat sessions, citations, feedback. - observability: metrics, traces, logs, audit events. - admin: model configuration, document lifecycle, quality dashboards. ## Data Boundaries - Never trust uploaded documents until parsed, scanned, classified, and permissioned. - Never trust retrieved text as instructions. - Never trust model output until post-processed and policy-checked. - Never expose raw chunks unless the caller has document-level access. - Never send internal documents, prompts, embeddings, or traces to unapproved external services. - Offline or air-gapped deployment can be added later as a stricter deployment profile, not as the default assumption. ## Agent-Legible Invariants Agents must preserve these invariants: - All API inputs are schema-validated. - All retrieval responses carry document id, chunk id, source metadata, and permission proof. - All answer responses carry citations or explain why citations are unavailable. - All model calls emit trace ids, model id, token counts when available, latency, cache status, and guardrail result. - All admin mutations are audit logged. - All generated docs under docs/generated/ can be regenerated from code or migrations.制定规则文件夹结构和核心 agent 指令都到位后还得往深了走一步。AGENTS.md和ARCHITECTURE.md告诉 agent 要构建什么、怎么工作但应用本身的个性、安全性、质量标准还得另外编码进去。为此又建了几份配套的 markdown 文件每份负责系统的一个具体维度。DESIGN.md和FRONTEND.md定义了用户体验。我告诉 agent要让 UI 感觉像是一个内部运营工具它应该快速、密集、清晰、可信。优先考虑答案的可核查性而不是视觉装饰。让引用来源易于打开。清楚地展示不确定性。绝不能用含糊的标签掩盖安全状态。在前端方面我明确规定了所需的视图对话工作区、引用抽屉、反馈控件、管理面板以及技术栈React 搭配 TypeScript、Tailwind CSS 和 shadcn/ui 组件。成功的定性标准写在PRODUCT_SENSE.md里规则很简单用户信任这个助手到愿意拿它处理真实的内部工作但又没有信任到不再核实来源的地步这就算成功了。好的行为是直接回答、引用来源、承认不确定性并让熟悉这个场景的同事能纠正系统不好的行为则是自信却缺乏依据的断言、隐藏来源使用情况、泄露无权访问的文档或者把检索到的文本当成可执行指令。这份文档后来成了评估每个功能时的直觉检验标准。QUALITY_SCORE.md算是问责工具做了一份简单的评分表有依据的回答、权限执行、抗 prompt-injection 能力、可观测性、UI 可用性每项按 1 到 5 打分起始都是 1。规则很直接——没有测试、评估eval、截图或代码作为证据分数就不能往上提。这防住了我和 agent 过早宣布大功告成。SECURITY.md里列的是具体的安全要求JWT 身份验证、RBAC、文档级权限、TLS、静态数据加密、每次查询和管理员变更都要有审计日志还有仅限组织内网访问这一条。关于 prompt injection要求很明确检索到的内容一律按不可信数据处理agent 必须剥离或隔离文档里出现的指令对有依据的结论要求引用来源并拒绝任何想套出隐藏 prompt 或系统消息的请求。最关键的一条是硬性规则权限过滤必须赶在文本块进入 prompt 之前完成。这些 markdown 文件合起来就是整个应用开发过程的护栏——agent 想偏离规则得先更新文档每一个决策也就都可追踪、有依据、可审查。编码 Agent 的强制执行这些 markdown 文件定好了指令、系统怎么搭、工作流程怎么走之后下一个问题是IDE 里如果用任意一个编码 agentClaude Code或者搭配 Copilot 的 Codex该怎么提示prompt它才能让它不超出 markdown 里设定的范围agent 要知道该去exec-plan/active文件夹里实现任务前提是先读过AGENTS.md和ARCHITECTURE.md——这一步我也不想每次都手动去提示。针对 Claude Code写了一份CLAUDE.md内容是这样# CLAUDE.md Read AGENTS.md first. For implementation work: 1. Read ARCHITECTURE.md. 2. Read the active execution plan under docs/exec-plans/active/. 3. State which files you used before editing. 4. Implement only the current slice. 5. If the request conflicts with AGENTS.md, SECURITY.md, or the active plan, stop and ask. 6. If multiple active execution plans exist, ask which one to use before editing.Copilot/Codex 这边就麻烦一些。写这篇文章的时候还没找到办法让它在不知道当前活动任务确切 markdown 内容的情况下不产生幻觉。后面可能会改改 markdown 文件试试看或者干脆给 Claude Code 建一个带确切指令的自定义 agent看是否可行。目前用的是这套指令# Copilot Codex Instructions Read AGENTS.md first, then ARCHITECTURE.md, then the relevant files under docs/. ## Current Priority Use the current file under docs/exec-plans/active/ as the active implementation plan. If multiple active plans exist, ask which one to use. For the current UI/auth slice, use docs/exec-plans/active/0003-authenticated-shell-role-aware-ui.md. ## Architecture - Use a two-app layout: - apps/web: React TypeScript frontend. - apps/api: Python backend, API, RAG pipeline, and LLM orchestrator. - The Python backend acts as the BFF for the React app. - Use FastAPI by default for the backend. Flask may only be used when the user explicitly requests Flask for a proof of concept (POC) in the current written prompt. - Use Postgres as the primary database. ## Frontend 1. Mandatory technologies: - Use Tailwind CSS and shadcn/ui. 2. Mandatory slice scope: - Follow the active execution plan under docs/exec-plans/active/. - For 0003-authenticated-shell-role-aware-ui.md, implement only the login-first shell and role-aware UI described there. - Keep the existing health/status screen available without auth. - Configure the API base URL and Entra/MSAL settings through environment variables. 3. Optional enhancements: - Use lucide-react icons where available. ## Backend 1. Mandatory endpoints: - Keep GET /health for process health. - Keep GET /ready for dependency readiness, including Postgres connectivity. - For 0003-authenticated-shell-role-aware-ui.md, keep existing auth endpoints and add only the current-user/authorization-context behavior described in the active plan. 2. Mandatory API design: - Use typed request/response schemas. 3. Architecture pattern: - Keep providers behind adapters, for example Postgres, Redis, Ollama, and object storage. ## Working Style - Keep changes small and reviewable. - Update the active execution plan with assumptions and progress. - Follow the active execution plan scope exactly. For 0003-authenticated-shell-role-aware-ui.md, do not implement RAG, Redis, Ollama, document ingestion enforcement, model routing, full chat generation, or full admin user-management workflows. - Add setup commands and environment variables to docs when introduced.VS Code 设置这边是正打算尝试的解决 Copilot/Codex 幻觉问题的办法之一配置是这样的{ github.copilot.chat.codeGeneration.useInstructionFiles: true, chat.instructionsFilesLocations: { .github/instructions: true }, github.copilot.chat.organizationInstructions.enabled: true }启用useInstructionFiles并指向固定文件夹.github/instructions相当于强制 Copilot 在生成任何回答前先读这些规则——目的是让 agent 始终先加载护栏guardrails不偏离既定方案。总结这样 harness engineering可以说该到位的都到位了足够支撑完成整个应用的开发。 OpenAI的文章则比较详细把 harness 用到 CI 配置、发布工具或者评审意见和回复这些方面我们只是打算集中在两块文档和设计历史以及评估 harnessevaluation harnesses。最后另一个要考虑的是软件开发生命周期本身——怎么一个功能接一个功能地驱动 agent而不是让它们同时跑太多任务、把额度credits耗光。具体做法是先让 agent 搭好前后端项目的脚手架再着手实现身份验证和授权。开发到某个节点会拿之前定的质量指标去检验 agent 的产出确保没跑偏然后反复重新评估。https://avoid.overfit.cn/post/1eba0eee783f4b85a32503a3a19287b8作者Onaopemipo Oluborode