Stop hand-customizing agents for every customer.
Find per-customer issues in your traces. Author the rules that fix them. Ship without writing code.
Hiring is the only way to scale customization today.
Every customer wants different rules, tone, and escalation thresholds. Today that means a person in the loop: engineers reading traces, account managers filing tickets, reviewers listening to calls.
The agent works. The workflow around it doesn’t. What’s missing is the loop that turns “I noticed a problem” into “I shipped a fix,” without an engineer in the middle.
No one size fits all
20% hand-tuned per customer
Templates cover order intake, refunds, appointment booking. Tone, escalation rules, refusal cases get hand-customized for every customer.
Drowning in traces
~1 hour per issue
When something goes wrong for a customer, an engineer combs long traces for the issue and the cause. Tens of reports a day are barely manageable. Hundreds are impossible. Every customer you sign adds more.
Regressions caught as complaints
Hand-built or skipped evals
Per-customer rules mean per-customer regressions. Either you hand-build a regression suite for every customer, slow and manual, or you find out from the next complaint.
Draft, test, ship, iterate.
Without engineering.
Every customer runs through the same four-step loop.
With Kayba it’s automated.
Draft the customization for a new customer
An engineer studies customer call recordings, context, and policy docs. Writes the per-customer behavior in code.
Kayba ingests the base agent and everything the customer provides: requests, workflows, policies, expectations. Supports the implementation.
- Base agent: agent_v2.3
- Customer requests: 12 items
- Workflows: 3 flows
- Policies: 7 rules
- Context & expectations: 4 outcomes
Test that the customization worked
An engineer digs through production traces to verify the rules fired correctly.
Kayba turns each customer request into a named evaluator. Offline tests run against the agent automatically.
- uses_customer_name: 0/24
- refund_under_$50: 0/24
- escalates_unresolved: 0/24
- brand_tone_check: 0/24
- no_unauthorized_promises: 0/24
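To make "a named evaluator" concrete, here is a minimal sketch in Python. It is not Kayba's API: the `Evaluator` class, the `run_offline` helper, and the trace fields (`customer_name`, `agent_reply`, `resolved`, `escalated`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trace shape: a dict with "customer_name", "agent_reply",
# "resolved" and "escalated" keys. This is an assumption, not Kayba's schema.
Trace = dict


@dataclass
class Evaluator:
    """A named pass/fail check applied to a single trace."""
    name: str
    check: Callable[[Trace], bool]


def uses_customer_name(trace: Trace) -> bool:
    # Passes when the agent's reply mentions the customer by name.
    return trace["customer_name"].lower() in trace["agent_reply"].lower()


def escalates_unresolved(trace: Trace) -> bool:
    # Passes when the conversation was either resolved or escalated.
    return bool(trace["resolved"] or trace["escalated"])


EVALUATORS = [
    Evaluator("uses_customer_name", uses_customer_name),
    Evaluator("escalates_unresolved", escalates_unresolved),
]


def run_offline(evaluators: list, traces: list) -> dict:
    """Run every evaluator over every trace and report 'passed/total'."""
    results = {}
    for ev in evaluators:
        passed = sum(1 for t in traces if ev.check(t))
        results[ev.name] = f"{passed}/{len(traces)}"
    return results
```

Each customer request maps to one small named check, so a result such as `0/24` reads as "none of 24 test traces passed this check yet".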
Deploy and watch it run
Code, PR, review, deploy. An engineer watches production for the agent’s behavior.
Production is the real test suite for agents. One-click ship. Kayba monitors the evals in production and auto-flags new issues (tool calls, workflows, behavior). Your team reviews each one in a human-in-the-loop (HITL) dashboard with evidence.
- tool: stripe_charge (auto, 12×)
- escalates_unresolved (eval, 8×)
- workflow: handoff_to_human (auto, 5×)
- refund_threshold (eval, 3×)
When the customer complains, find the issue, fix, repeat
The customer surfaces the issue. An engineer hunts through traces, writes a fix, redeploys.
Don’t search. Ask Kayba in Slack or Discord. The agent scopes the issue, finds the root cause, and prepares the fix.
@kayba customer says the agent approved a refund they shouldn’t have qualified for. Can you scope?
3 traces found in the last 24h matching the complaint.
refund_under_$50 fired before checking loyalty status.
Gate refund_under_$50 behind has_paid_subscription.
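The proposed fix, gating one rule behind another, could look roughly like this. It is a sketch under stated assumptions: the request fields (`amount`, `paid_subscription`), the `gated` helper, and the strict "under 50" threshold are hypothetical and not taken from Kayba.

```python
from typing import Callable

# Hypothetical request shape: {"amount": float, "paid_subscription": bool}.
Rule = Callable[[dict], bool]


def refund_under_50(request: dict) -> bool:
    # Original rule: auto-approve refunds strictly under $50 (assumed threshold).
    return request["amount"] < 50


def has_paid_subscription(request: dict) -> bool:
    # Precondition named in the suggested fix; a missing field counts as False.
    return bool(request.get("paid_subscription", False))


def gated(rule: Rule, precondition: Rule) -> Rule:
    """Return a rule that only fires when the precondition holds first."""
    def wrapped(request: dict) -> bool:
        return precondition(request) and rule(request)
    return wrapped


# The fixed rule checks the subscription before the refund amount.
refund_rule = gated(refund_under_50, has_paid_subscription)
```

With this ordering, a $20 refund for a customer without a paid subscription is no longer auto-approved.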
Built by researchers. Verified on benchmarks and production.

| Metric | Baseline | Kayba | Relative improvement |
|---|---|---|---|
| pass^1 | 41.2% | 55.3% | +34.2% |
| pass^2 | 28.3% | 44.2% | +56.2% |
| pass^3 | 22.5% | 41.2% | +83.1% |
| pass^4 | 20.0% | 40.0% | +100.0% |
τ2-bench is a real-world agent benchmark by Sierra Research.
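For readers new to the metric, here is a minimal sketch of how the table's numbers can be read. It assumes the pass^k definition commonly used with τ-bench (the chance that all k independent trials of a task succeed, estimated per task from n trials with c successes as C(c, k) / C(n, k)). The function names are illustrative, and the benchmark's own code may differ.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Per-task estimate of pass^k: probability that k trials drawn
    from the observed ones all succeeded. Assumes k <= trials."""
    return comb(successes, k) / comb(trials, k)


def relative_improvement(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100


# The last column of the table is a relative change, not a point difference:
# pass^1 moves from 41.2% to 55.3%, which is about +34.2% relative.
pass1_gain = round(relative_improvement(41.2, 55.3), 1)
```

Because pass^k requires every one of k attempts to succeed, it drops quickly for inconsistent agents, which is why the gap between the two columns widens as k grows.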
The agent works.
The work behind the agent doesn’t.
“Every day I get at least 200,000 lines of logging. It’s impossible to go through all that.”
See what agent customization can look like. 20 minutes, no slides.
