Stop hand-customizing agents for every customer.

Find per-customer issues in your traces. Author the rules that fix them. Ship without writing code.

The problem

Hiring is the only way to scale customization today.

Every customer wants different rules, tone, and escalation thresholds. Today that means a person in the loop: engineers reading traces, account managers filing tickets, reviewers listening to calls.

The agent works. The workflow around it doesn’t. What’s missing is the loop that turns “I noticed a problem” into “I shipped a fix,” without an engineer in the middle.

No one size fits all

20% hand-tuned per customer

Templates cover order intake, refunds, and appointment booking. Tone, escalation rules, and refusal cases still get hand-customized for every customer.

Drowning in traces

~1 hour per issue

When something goes wrong for a customer, an engineer combs long traces for the issue and the cause. Tens of reports a day are barely manageable. Hundreds are impossible. Every customer you sign adds more.

Regressions caught as complaints

Hand-built or skipped evals

Per-customer rules mean per-customer regressions. Either you hand-build a regression suite for every customer, slow and manual, or you find out from the next complaint.

The solution

Draft, test, ship, iterate.
Without engineering.

Every customer runs through the same four-step loop.
With Kayba it’s automated.

01

Draft the customization for a new customer

Today

An engineer studies customer call recordings, context, and policy docs, then writes the per-customer behavior in code.

With Kayba

Kayba ingests the base agent and everything the customer provides: requests, workflows, policies, expectations. It then supports the implementation of the per-customer behavior.

Customer intake
customer name
  • Base agent: agent_v2.3
  • Customer requests: 12 items
  • Workflows: 3 flows
  • Policies: 7 rules
  • Context & expectations: 4 outcomes
02

Test that the customization worked

Today

An engineer digs through production traces to verify the rules fired correctly.

With Kayba

Kayba turns each customer request into a named evaluator. Offline tests run against the agent automatically.

Offline tests
customer name
  • uses_customer_name: 0/24
  • refund_under_$50: 0/24
  • escalates_unresolved: 0/24
  • brand_tone_check: 0/24
  • no_unauthorized_promises: 0/24
Offline · run #3 · 0 of 5 passing
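
One way to picture a named evaluator is as a small function that takes a recorded conversation and returns pass or fail. The sketch below is illustrative only: the `Trace` shape, the evaluator bodies, and `run_offline` are assumptions made for this example, not Kayba's actual API. Only the evaluator names are taken from the panel above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Hypothetical record of one agent conversation (not a Kayba type)."""
    customer_name: str
    agent_messages: List[str]
    refund_amount: float = 0.0
    resolved: bool = True
    escalated: bool = False


def uses_customer_name(t: Trace) -> bool:
    # Passes if any agent message addresses the customer by name.
    return any(t.customer_name in m for m in t.agent_messages)


def refund_under_50(t: Trace) -> bool:
    # Assumed reading of refund_under_$50: approved refunds stay below $50.
    return t.refund_amount < 50


def escalates_unresolved(t: Trace) -> bool:
    # Passes if the conversation was either resolved or escalated.
    return t.resolved or t.escalated


# Registry mapping evaluator names to checks (names from the panel above).
EVALUATORS: Dict[str, Callable[[Trace], bool]] = {
    "uses_customer_name": uses_customer_name,
    "refund_under_$50": refund_under_50,
    "escalates_unresolved": escalates_unresolved,
}


def run_offline(traces: List[Trace]) -> Dict[str, str]:
    """Run every evaluator over every trace and report 'passed/total'."""
    return {
        name: f"{sum(check(t) for t in traces)}/{len(traces)}"
        for name, check in EVALUATORS.items()
    }
```

Calling `run_offline` on a list of traces gives one score per evaluator in the same `passed/total` form the panel shows.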
03

Deploy and watch it run

Today

Code, PR, review, deploy. An engineer watches the agent’s behavior in production.

With Kayba

Production is the real test suite for agents. One-click ship. Kayba monitors the evals in production and auto-flags new issues (tool calls, workflows, behavior). Your team reviews each one, with evidence, in a human-in-the-loop (HITL) dashboard.

Live issues
customer name
  • tool: stripe_charge · auto · 12×
  • escalates_unresolved · eval
  • workflow: handoff_to_human · auto
  • refund_threshold · eval
Auto-detected · tool call
Stripe charge timeout in refund flow
12 occurrences · 3 customers · 2h ago
04

When the customer complains, find the issue, fix, repeat

Today

The customer surfaces the issue. An engineer hunts through traces, writes a fix, redeploys.

With Kayba

Don’t search. Ask Kayba in Slack or Discord. Kayba scopes the issue, finds the root cause, and prepares the fix.

kayba-triage
customer name
Sarah (AM) · 2:14 PM

@kayba customer says the agent approved a refund they shouldn’t have qualified for. Can you scope?

Kayba (App) · 2:15 PM
Scoping

3 traces found in the last 24h matching the complaint.

Root cause

refund_under_$50 fired before checking subscription status.

Proposed fix

Gate refund_under_$50 behind has_paid_subscription.
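
As a rough illustration of what gating one rule behind another can mean, the sketch below wraps an amount check in a subscription check. The `RefundRequest` type and both function bodies are assumptions for this example; only the names `refund_under_$50` and `has_paid_subscription` come from the thread above.

```python
from dataclasses import dataclass


@dataclass
class RefundRequest:
    """Hypothetical request shape used only for this sketch."""
    amount: float
    has_paid_subscription: bool


def refund_under_50(req: RefundRequest) -> bool:
    # Original rule as assumed here: approve any refund below $50.
    return req.amount < 50


def gated_refund_under_50(req: RefundRequest) -> bool:
    # Proposed fix: check the subscription gate first, then the amount rule.
    return req.has_paid_subscription and refund_under_50(req)
```

With the gate in place, a $30 refund for a customer without a paid subscription is no longer auto-approved, while the same refund for a subscriber still is.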

Research

Built by researchers. Verified on benchmarks and production.

Team from
UC Berkeley · Oxford · EPFL · ETH Zurich · HSG
Compliance
SOC 2: Underway
Benchmark
τ2-bench · Claude Haiku 4.5
          Baseline   Kayba    Improvement
pass^1    41.2%      55.3%    +34.2%
pass^2    28.3%      44.2%    +56.2%
pass^3    22.5%      41.2%    +83.1%
pass^4    20.0%      40.0%    +100.0%

τ2-bench is a real-world agent benchmark by Sierra Research.
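
The Improvement figures are consistent with relative gain over the baseline, (Kayba − Baseline) / Baseline, rather than a difference in percentage points. A minimal check, assuming that reading and using the benchmark numbers above (the function and variable names are made up for this sketch):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100


# (baseline %, Kayba %) per metric, copied from the benchmark figures above.
RESULTS = {
    "pass^1": (41.2, 55.3),
    "pass^2": (28.3, 44.2),
    "pass^3": (22.5, 41.2),
    "pass^4": (20.0, 40.0),
}

for metric, (base, kayba) in RESULTS.items():
    print(f"{metric}: +{relative_improvement(base, kayba):.1f}%")
```

For pass^1, for example, this is (55.3 − 41.2) / 41.2 ≈ 34.2%, which matches the reported figure.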

The cost today

The agent works.
The work behind the agent doesn’t.

“Every day I get at least 200,000 lines of logging. It’s impossible to go through all that.”
— CTO building Agents

See what agent customization can look like. 20 minutes, no slides.
