Uber LangEffect: Agent Side Effect Management

Key Takeaways

  • LangEffect: Uber’s internal framework for tracking and reversing agent-caused side effects
  • Problem: agents executing multi-step tasks create side effects that are hard to roll back atomically
  • Solution: effect log + compensating transactions pattern borrowed from distributed systems
  • Saga pattern applied to agent workflows: each action has a compensating undo action
  • Production result: 99.2% successful rollback rate for interrupted agent tasks at Uber

Summary

Uber’s infrastructure team developed LangEffect to solve a specific production problem: agents executing complex, multi-step tasks on production systems (rider matching, driver assignment, payment processing) would sometimes fail mid-task, leaving systems in inconsistent states. Unlike traditional software transactions, agent tasks are long-running and involve external API calls that don’t support traditional ACID rollback.

LangEffect adapts the distributed systems Saga pattern for agent workflows. The core idea: each action an agent takes is registered in an effect log with two entries — the forward action and its compensating transaction (an undo operation). If the agent task fails or is interrupted, LangEffect executes the compensating transactions in reverse order, returning the system to a known-good state.
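
The effect-log idea described above can be sketched in a few lines of Python. This is a minimal illustration of the Saga pattern, not LangEffect's actual code: the names `Effect`, `EffectLog`, `run_task`, and the toy `state` dictionary are all assumptions made for the example. It also assumes that compensations are safe to run for an action that was registered but never completed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Effect:
    """One effect-log entry: a forward action paired with its undo (hypothetical type)."""
    name: str
    forward: Callable[[], None]
    compensate: Callable[[], None]


@dataclass
class EffectLog:
    entries: List[Effect] = field(default_factory=list)

    def run(self, effect: Effect) -> None:
        # Register first, then execute, so a half-applied action is still
        # covered by rollback. Compensations must tolerate that case.
        self.entries.append(effect)
        effect.forward()

    def rollback(self) -> List[str]:
        """Run compensations newest-first; return names whose undo failed."""
        failed = []
        while self.entries:
            effect = self.entries.pop()
            try:
                effect.compensate()
            except Exception:
                failed.append(effect.name)
        return failed


def run_task(log: EffectLog, steps: List[Effect]) -> bool:
    """Run all steps; on any failure, roll back and report False."""
    try:
        for step in steps:
            log.run(step)
        return True
    except Exception:
        log.rollback()
        return False


# Toy system state standing in for real services.
state = {"driver": None, "charged": 0}
undone: List[str] = []


def set_effect(name: str, key: str, new, old) -> Effect:
    def forward() -> None:
        state[key] = new

    def compensate() -> None:
        state[key] = old
        undone.append(name)

    return Effect(name, forward, compensate)


def fail() -> None:
    raise RuntimeError("notification service unavailable")


steps = [
    set_effect("assign_driver", "driver", "d1", None),
    set_effect("charge_payment", "charged", 20, 0),
    Effect("notify_rider", fail, lambda: undone.append("notify_rider")),
]

ok = run_task(EffectLog(), steps)
print(ok, undone, state)
```

When the third step raises, the compensations run in the order `notify_rider`, `charge_payment`, `assign_driver`, and `state` ends up back at its initial values. Collecting failed compensations in `rollback` rather than aborting is one possible design choice; it lets the remaining undo steps still run.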

Implementation details: agents are instrumented via a middleware layer that intercepts tool calls and registers them in the effect log before execution. Each compensating transaction falls into one of three categories: (1) pre-specified for known operations (to cancel a payment, issue a refund), (2) generated by the LLM for novel operations (with human review for high-value compensations), or (3) flagged as “non-compensable”, requiring human intervention.
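
The interception and the three compensation categories might look roughly like the sketch below. Everything here is hypothetical: `EffectMiddleware`, `LoggedCall`, the `propose` callback standing in for an LLM call, and the `amount`-based `review_threshold` used to decide what counts as "high-value" are illustrative assumptions, not LangEffect's real interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Optional


class Source(Enum):
    PRESPECIFIED = "prespecified"        # category (1)
    LLM_GENERATED = "llm_generated"      # category (2)
    NON_COMPENSABLE = "non_compensable"  # category (3)


@dataclass
class LoggedCall:
    tool: str
    args: Dict[str, Any]
    source: Source
    compensation: Optional[str]  # name of the undo tool, if one exists
    needs_human: bool            # review (2) or intervention (3)


class EffectMiddleware:
    def __init__(
        self,
        known: Dict[str, str],
        propose: Optional[Callable[[str, Dict[str, Any]], Optional[str]]] = None,
        review_threshold: float = 100.0,  # assumed cutoff for "high-value"
    ) -> None:
        self.known = known
        self.propose = propose
        self.review_threshold = review_threshold
        self.log: List[LoggedCall] = []

    def _classify(self, tool: str, args: Dict[str, Any]) -> LoggedCall:
        if tool in self.known:
            return LoggedCall(tool, args, Source.PRESPECIFIED, self.known[tool], False)
        proposed = self.propose(tool, args) if self.propose else None
        if proposed is not None:
            high_value = args.get("amount", 0) >= self.review_threshold
            return LoggedCall(tool, args, Source.LLM_GENERATED, proposed, high_value)
        return LoggedCall(tool, args, Source.NON_COMPENSABLE, None, True)

    def call(self, tool: str, tool_fn: Callable[..., Any], /, **args: Any) -> Any:
        # Register the call in the effect log BEFORE executing the tool.
        self.log.append(self._classify(tool, args))
        return tool_fn(**args)


def fake_propose(tool: str, args: Dict[str, Any]) -> Optional[str]:
    # Stand-in for an LLM proposing an undo operation for a novel tool.
    return "unassign_driver" if tool == "assign_driver" else None


mw = EffectMiddleware({"charge_payment": "refund_payment"}, propose=fake_propose)
mw.call("charge_payment", lambda amount: f"charged {amount}", amount=20.0)
mw.call("assign_driver", lambda driver_id: "ok", driver_id="d1")
mw.call("send_sms", lambda text: "sent", text="Your driver is on the way")

for entry in mw.log:
    print(entry.tool, entry.source.value, entry.compensation, entry.needs_human)
```

In this sketch, `charge_payment` resolves to the pre-specified `refund_payment`, `assign_driver` gets a generated compensation, and `send_sms` is logged as non-compensable and flagged for a human, since a sent message cannot be recalled.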

The 99.2% rollback success rate was measured on production incidents over six months. The remaining 0.8% of cases were non-compensable operations: external systems had already processed the agent’s actions beyond the point of reversal.

Relevant Concepts

Relevant Entities