AI Agent 上線前,你一定要做的評測與安全把關
Agent 最危險的不是當機,是安靜地做錯——回了一段看起來很合理、其實全錯的答案,沒人發現。這篇講 Agent 為什麼會無聲失敗,怎麼用 Promptfoo 測、用 AgentOps debug 多步驟、用 Langfuse 在線上監控,最後附一份上線前檢查清單。
A startup that developed a customer service Agent found that the number of customer complaints increased instead of decreasing during its third week online. They were puzzled because everything seemed fine during testing. However, after reviewing the records, they discovered that the Agent was confidently citing a non-existent "Company Refund Policy, Article 7". The Agent wasn't failing because it didn't know the answer, but because it was presenting false information as if it were true, and the pre-launch test cases didn't cover this type of scenario.
Why Agents "Fail Silently"
Traditional software failures usually manifest as error messages, 500 errors, or stack traces pointing to a specific line of code. Agents are different because their output is generated, and their language always appears fluent and correct, making it difficult to distinguish between correct and incorrect answers.
Moreover, Agents often involve multiple steps, and any mistake in one step can lead to a cascade of errors, resulting in a "self-consistent but incorrect" outcome. You can't identify where the mistake occurred just by looking at the final answer.
Therefore, ensuring the quality of an Agent can't rely on manual spot checks or subjective feelings. You need a mechanism that can quantify, replay, and continuously monitor the Agent's performance online. This is where the three layers of evaluation, security, and observability mentioned in the article "Building an AI Agent Toolchain" come into play.
Key Points: Four Things to Check Before Launch
- Offline Evaluation: Use a fixed set of test cases to quantify the Agent's quality after each modification.
- Red Team Testing: Proactively identify the Agent's weaknesses by testing what inputs will cause it to fail, provide misleading information, or leak sensitive data.
- Multi-Step Debugging: Be able to replay the entire execution trace to identify which step went wrong when an error occurs.
- Online Monitoring and Guardrails: Continuously monitor the Agent's quality and costs after launch and set up intercepts to prevent dangerous actions.
Missing any of these four points means you're taking a gamble when you launch.
How to Test: Promptfoo
The core idea of offline evaluation is to replace subjective feelings with numerical evidence. Promptfoo allows you to establish a set of test cases, including inputs, expected behaviors, and judgment standards. Then, every time you modify the prompt, switch models, or adjust parameters, you can run the entire test suite to see how the score changes.
Judgment standards can be based on string matching, regular expressions, or even using another model as a judge (LLM-as-judge) to evaluate whether the response correctly cites sources. If the startup had a assertion that "the policy article number mentioned in the response must exist", the non-existent Article 7 would have been caught before launch.
Red team testing is also done at this level. Promptfoo can run a batch of adversarial inputs to try to make the Agent leak system prompts, bypass restrictions, or perform unauthorized actions. It's better to attack your own Agent first.
How to Debug: AgentOps
Evaluation tells you that the answer is wrong, but it won't tell you which step went wrong. Multi-step debugging relies on AgentOps.
It constructs a replayable timeline of the Agent's entire execution, showing which tools were called, what parameters were passed, what results were returned, how many tokens were used, and so on. For the refund policy example, AgentOps would clearly show that the error occurred in an early step when incorrect information was retrieved, and subsequent steps were based on that incorrect data. Without this execution trace, you'd be left staring at the final answer without a clue.
How to Monitor: Langfuse
Launch is not the end; it's another beginning. Production environments involve diverse and unpredictable inputs, and users will ask questions you never thought of during testing. Langfuse is responsible for recording every conversation, every token cost, and every latency issue online, allowing you to track quality drift, identify which types of questions are answered poorly, and monitor costs.
Langfuse and AgentOps divide labor roughly as follows: AgentOps focuses on deep, one-time debugging during development, while Langfuse focuses on long-term, group monitoring online. In practice, many teams use both. The key is having a place where you can always answer, "How did my Agent perform this week?" If you can't answer, you're flying blind.
Guardrails: The Last Line of Defense
Evaluation, debugging, and monitoring are about knowing before or after the fact, while guardrails are about intercepting dangerous actions in real-time. Before the Agent performs a dangerous action—such as making a payment, deleting data, sending external messages, or executing system commands—add a layer of rule checks or human confirmation.
Guardrails should intercept outputs containing personal or confidential information, transactions exceeding a certain threshold, or detecting prompt injection attacks. This layer works in conjunction with the execution sandbox in the toolchain (e.g., Blaxel)—the sandbox limits what the Agent can run, while guardrails limit what the Agent can do.
Focus for Different Teams
Taiwanese Individual Developers: At least integrate Promptfoo. Even with just twenty test cases, it's better than relying on intuition after each modification. Red team testing can focus on the most critical attack surfaces.
Startup Teams: AgentOps and Langfuse should be set up early. Your product is still rapidly iterating, and without observability, every issue becomes a team effort to dig through logs, which could have been spent on developing two more features. Prioritize guardrails for actions that involve spending money or are irreversible.
Enterprise Teams: Red team testing and guardrails are the baseline for compliance. Security and legal teams will ask about data leakage and the ability to audit every decision step, and the answers lie in Promptfoo's adversarial test records and AgentOps' execution traces. Keeping these records is equivalent to having audit evidence ready.
Pre-Launch Checklist
Follow these steps directly:
- Have a test case library covering common and edge cases, running on Promptfoo
- Run a full evaluation for every prompt or model change, ensuring the score doesn't regress
- Conduct at least one round of red team testing, including prompt injection, out-of-bounds, and misleading information tests
- Have assertions checking the existence of factual claims
- Integrate AgentOps for multi-step debugging
- Set up Langfuse for online monitoring
- Implement guardrails or human confirmation for dangerous actions
- Set token and cost limits to prevent background Agents from incurring unexpected costs
- Run dangerous code executions in a sandbox
- Have a person who knows where to look when something goes wrong
TheAI Academy Summary and Review
The biggest risk of launching an Agent isn't that the technology isn't strong enough, but that you won't know when it's doing something wrong. Evaluation lets you know beforehand, observability lets you investigate afterwards, and guardrails intercept in real-time—these three aspects are invaluable, even if their importance isn't felt until something goes wrong.
"An Agent that you can't see inside, whose quality you can't measure, is a time bomb, no matter how smoothly it runs; being able to inspect mediocrity is far better than not being able to see brilliance."
Advice for Taiwanese readers: Before launch, force yourself to answer one question—"If it does something wrong in front of a customer tomorrow, how quickly can I identify which step went wrong and why?" If you can't answer "within ten minutes", don't launch yet. Go back and complete Promptfoo, AgentOps, and Langfuse. For a complete toolchain setup, refer back to "2026 Developer Toolchain Full Map".
Frequently Asked Questions
為什麼 AI Agent 的錯誤比傳統軟體難發現?
因為 Agent 的輸出是生成的,語言永遠通順,錯誤答案會被包裝得跟正確答案一樣有條理,不會像傳統軟體那樣噴 error 或 stack trace。多步驟流程中,某一步偏掉後面會基於錯誤繼續推論,最終給出『自洽但錯誤』的結果,光看答案看不出問題。
Promptfoo 主要解決什麼問題?
它把 Agent 的品質從『我覺得變好了』變成可量化的數字。你建立一組固定測試案例與判斷標準,每次改 prompt、換模型就跑一遍看分數有沒有退步,同時可做紅隊測試,主動找出會被繞過或讓 Agent 唬爛的漏洞。
AgentOps 和 Langfuse 有什麼差別,需要都用嗎?
AgentOps 偏開發期的單次深度 debug,把一輪執行串成可回放的軌跡,方便定位是哪一步出錯;Langfuse 偏線上長期的群體監控,記錄每次對話、成本、延遲,追蹤品質漂移。兩者側重不同,實務上很多團隊會搭配使用。
護欄(guardrails)和評測有什麼不同?
評測是事前知道品質如何,監控是事後查得到問題,護欄則是『當下攔截』——在 Agent 執行付款、刪除、對外發送等危險動作前,加上規則檢查或人工確認,例如金額超過門檻轉人工、偵測到 prompt injection 就中止。