Layi Docs

Reliability & Governance

Planner enforcement, clarification loops, telemetry, and security layers that keep runs safe

Last updated: December 2025

🧱 Reliability layers

Reliability is enforced as a layered stack. Each layer can emit telemetry, sit behind a feature flag, and operate independently, so one layer can be upgraded or disabled without rewriting the others.

  1. Planner contract: Plans must match the schema (steps, agents, tools) before execution. Violations emit `planner_validation_fail_total`.
  2. Role dispatch: Single-action steps keep supervisor and specialist responsibilities isolated and auditable.
  3. Structured data reuse: Contacts, IDs, and previous tool outputs are cached to avoid re-parsing unstructured text.
  4. Autofill & retries: Missing parameters trigger auto-lookups and retries before the user is asked.
  5. Clarification flow: Ambiguity emits `clarification_required` events with masked choices and REST resolution hooks.
  6. Ingestion structuring: Documents are normalized at ingest so tools receive predictable schemas and cache hits.
  7. Metrics & observability: Counters, gauges, and structured events feed Grafana dashboards and alerting.
  8. Streaming finalization: Dual streaming envelopes keep UI clients synced even if they only support legacy formats.
  9. Security & masking: Secrets are redacted before logging, and Vault policies prevent cross-tenant access.
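
As an illustration of the planner contract layer, the sketch below checks a plan before execution and bumps `planner_validation_fail_total` on any violation. The field names (`steps`, `agent`, `tool`), the `validate_plan` helper, and the in-process `METRICS` counter are assumptions made for this example, not the actual schema or metrics client.

```python
from collections import Counter

# Hypothetical in-process metrics registry (stand-in for a real metrics client).
METRICS = Counter()

# Assumed per-step fields; the real plan schema may differ.
REQUIRED_STEP_KEYS = ("agent", "tool")


def validate_plan(plan: dict) -> list[str]:
    """Return contract violations; an empty list means the plan may run."""
    errors = []
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        errors.append("plan.steps must be a non-empty list")
        steps = []
    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            errors.append(f"steps[{index}] must be an object")
            continue
        for key in REQUIRED_STEP_KEYS:
            if not isinstance(step.get(key), str) or not step[key]:
                errors.append(f"steps[{index}].{key} must be a non-empty string")
    if errors:
        # One increment per rejected plan, regardless of how many violations it has.
        METRICS["planner_validation_fail_total"] += 1
    return errors
```

A caller would refuse to execute the plan when the returned list is non-empty, or only log the violations when running in a softer enforcement mode.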

💬 Clarification lifecycle

Clarifications keep workflows durable without guessing. The orchestration layer persists pending prompts with both masked and raw candidates so humans can resolve them safely.

  1. Detection: Autofill or schema validation encounters multiple candidates for a required parameter.
  2. Emission: `_emit_clarification_event` streams `clarification_required` and records pending state under the `workflow:<workflow_id>:clarifications` cache entry.
  3. Resolution: Operators POST to `/internal/health/clarification/resolve` with the chosen value; the entry moves to the resolved set.
  4. Auto-resume: On the next tool invocation the orchestrator merges the resolved value, retries once, and emits `tool_retry (reason=clarification_resolved)`.
  5. Telemetry: `clarification_requests_total` and `clarification_resolution_total` power dashboards to spot noisy tools or prompts.
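
The lifecycle above can be sketched with an in-memory stand-in for the cache. Only the cache key format, the `clarification_required` event name, and the two counter names come from this page; the function names, the masking rule, and the entry layout are hypothetical.

```python
from collections import Counter

CACHE: dict[str, dict] = {}  # stand-in for the real cache backend
METRICS = Counter()          # stand-in for the real metrics client


def _key(workflow_id: str) -> str:
    return f"workflow:{workflow_id}:clarifications"


def mask(value: str) -> str:
    """Illustrative masking rule: hide all but the last two characters."""
    return "*" * max(len(value) - 2, 0) + value[-2:]


def emit_clarification(workflow_id: str, param: str, candidates: list[str]) -> dict:
    """Record pending state (raw and masked candidates) and build the event."""
    entry = CACHE.setdefault(_key(workflow_id), {"pending": {}, "resolved": {}})
    masked = [mask(c) for c in candidates]
    entry["pending"][param] = {"raw": list(candidates), "masked": masked}
    METRICS["clarification_requests_total"] += 1
    # Only masked choices leave the orchestrator in the streamed event.
    return {"event": "clarification_required", "param": param, "choices": masked}


def resolve_clarification(workflow_id: str, param: str, value: str) -> None:
    """Move a pending clarification to the resolved set."""
    entry = CACHE[_key(workflow_id)]
    if value not in entry["pending"][param]["raw"]:
        raise ValueError(f"{value!r} is not a candidate for {param}")
    del entry["pending"][param]
    entry["resolved"][param] = value
    METRICS["clarification_resolution_total"] += 1


def merge_resolved(workflow_id: str, tool_args: dict) -> dict:
    """Merge resolved values into the arguments of the next tool invocation."""
    resolved = CACHE.get(_key(workflow_id), {}).get("resolved", {})
    return {**tool_args, **resolved}
```

In this sketch the resolve endpoint handler would call `resolve_clarification`, and the orchestrator would call `merge_resolved` just before its single retry.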

🎚️ Feature flags & controls

Reliability features ship behind environment flags so we can roll out gradually and disable layers per tenant if required.

  • `RELIABILITY_PLAN_ENFORCE`: Soft or strict validation of planner output before execution.
  • `RELIABILITY_AUTOFILL_RETRY`: Turns on auto-lookup + retry before surfacing clarifications.
  • `RELIABILITY_STRUCTURED_CACHE`: Enables structured cache merge for contacts, IDs, and tool payloads.
  • `PLAN_SCHEMA_ENFORCE`: `log` or `strict` modes for JSON Schema response contracts.
  • `ORCHESTRATOR_STREAM_SIZE_GUARD`: (Planned) Truncates oversized streaming payloads while flagging telemetry.
  • `STRICT_SCHEMA_EARLY_ENFORCE`: Short-circuits finalization when required keys are missing.
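
A minimal sketch of how a service might read these flags from the environment. The accepted truthy strings, the `log` default for `PLAN_SCHEMA_ENFORCE`, and both helper names are assumptions; only the flag names and the `log`/`strict` modes are taken from the list above.

```python
import os
from typing import Mapping

# Assumed encoding for on/off flags; the real parser may accept other values.
TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Treat an unset or unrecognized value as disabled."""
    return env.get(name, "").strip().lower() in TRUTHY


def schema_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return 'log' or 'strict'; defaulting to 'log' is an assumption."""
    mode = env.get("PLAN_SCHEMA_ENFORCE", "log").strip().lower()
    if mode not in ("log", "strict"):
        raise ValueError(f"PLAN_SCHEMA_ENFORCE must be 'log' or 'strict', got {mode!r}")
    return mode
```

Passing the environment as a mapping keeps the helpers easy to test and leaves room for a per-tenant override mapping in place of `os.environ`.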

📈 Telemetry & dashboards

All layers emit structured events so SREs and customers can observe health in Grafana.

  • `planner_validation_fail_total`, `plan_injected_steps_total`, and `planner.plan_diff` highlight planning drift.
  • `tool_retry_attempt_total`, `tool_retry_success_total`, and `clarification_requests_total` track stability per tool.
  • `observability.summary` bundles provenance, memory access, replan stats, and normalizer usage for each workflow.
  • `planner_replan_total` and `planner_replan_suppressed_total` expose bounded replan behavior.
  • Grafana dashboard JSON lives in `config/grafana/grafana_orchestrator_phase_dashboard.json` for import or provisioning.
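
As a worked example of how the retry counters can be combined, the sketch below derives a per-tool retry success rate from `tool_retry_attempt_total` and `tool_retry_success_total`. Keying counters by tool name and the two helper functions are assumptions for illustration; a dashboard would typically compute the same ratio in its query language.

```python
from collections import Counter
from typing import Optional

# Hypothetical in-process counters keyed by (metric name, tool name).
counters = Counter()


def record_retry(tool: str, succeeded: bool) -> None:
    """Count every retry attempt, and separately those that succeeded."""
    counters[("tool_retry_attempt_total", tool)] += 1
    if succeeded:
        counters[("tool_retry_success_total", tool)] += 1


def retry_success_rate(tool: str) -> Optional[float]:
    """Successes divided by attempts, or None when the tool never retried."""
    attempts = counters[("tool_retry_attempt_total", tool)]
    if attempts == 0:
        return None
    return counters[("tool_retry_success_total", tool)] / attempts
```

A tool with many attempts and a low success rate is a candidate for a better autofill lookup or an earlier clarification.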

🔐 Security commitments

Every reliability feature works hand-in-hand with the security review checklist.

  • Auth: Org Manager-issued tokens plus RBAC gate internal endpoints; workflow stubs stay behind feature flags.
  • Data handling: No PII is logged inside policy or agent events; diagnostic payloads enforce size limits.
  • Secrets: Vault + Tool Gateway ensure credentials never reach LLM responses, clarifications, or telemetry.
  • Rate limiting: Global and per-org throttles are staged for production to prevent abuse of dev stubs.
  • Transport: CORS and channel gateways restrict origins while LiveKit voice flows keep tenant routing data isolated.
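
To make the "credentials never reach telemetry" commitment concrete, here is a small sketch of redacting a payload before it is logged. The list of sensitive key names, the `[REDACTED]` placeholder, and the `redact` helper are assumptions; the actual masking is handled by Vault and the Tool Gateway as described above.

```python
from typing import Any

# Assumed key names that should never be logged; the real list may differ.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}


def redact(payload: Any) -> Any:
    """Return a copy of the payload with sensitive values replaced."""
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload  # scalars pass through unchanged
```

Because `redact` builds a new structure, the original payload can still be handed to the tool while only the redacted copy goes to logs or events.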
