Reliability & Governance

Planner enforcement, clarification loops, telemetry, and security layers that keep runs safe

Last updated: December 2025

🧱 Reliability layers

Reliability is enforced as a layered stack. Each layer can emit telemetry, be feature-flagged, and operate independently, so one layer can be upgraded without rewriting the others.

1. Planner contract

Plans must match schema (steps, agents, tools) before execution. Violations emit `planner_validation_fail_total`.

2. Role dispatch

Single-action steps keep supervisor and specialist responsibilities isolated and auditable.

3. Structured data reuse

Contacts, IDs, and previous tool outputs are cached to avoid re-parsing unstructured text.

4. Autofill & retries

Missing parameters trigger automatic lookups and retries before the user is asked for input.

5. Clarification flow

Ambiguity emits `clarification_required` events with masked choices and REST resolution hooks.

6. Ingestion structuring

Documents are normalized at ingest so tools receive predictable schemas and cache hits.

7. Metrics & observability

Counters, gauges, and structured events feed Grafana dashboards and alerting.

8. Streaming finalization

Dual streaming envelopes keep UI clients synced even if they only support legacy formats.

9. Security & masking

Secrets are redacted before logging, and Vault policies prevent cross-tenant access.
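As an illustration of the planner contract in layer 1, here is a minimal sketch of a pre-execution check. The step field names (`agent`, `tool`), the `validate_plan` helper, and the in-process `METRICS` counter are assumptions made for this example and are not the actual Layi schema or metrics client; only the counter name `planner_validation_fail_total` comes from the text above.

```python
from collections import Counter

# Hypothetical in-process metrics registry; a real deployment would use its metrics client.
METRICS = Counter()

# Assumed required keys for each plan step.
REQUIRED_STEP_KEYS = ("agent", "tool")


def validate_plan(plan, known_agents, known_tools):
    """Return a list of violations; an empty list means the plan may execute."""
    violations = []
    steps = plan.get("steps") if isinstance(plan, dict) else None
    if not isinstance(steps, list) or not steps:
        violations.append("plan must contain a non-empty 'steps' list")
        steps = []

    for index, step in enumerate(steps):
        if not isinstance(step, dict):
            violations.append(f"step {index}: not an object")
            continue
        for key in REQUIRED_STEP_KEYS:
            if key not in step:
                violations.append(f"step {index}: missing '{key}'")
        if "agent" in step and step["agent"] not in known_agents:
            violations.append(f"step {index}: unknown agent '{step['agent']}'")
        if "tool" in step and step["tool"] not in known_tools:
            violations.append(f"step {index}: unknown tool '{step['tool']}'")

    if violations:
        # One increment per rejected plan, not per violation.
        METRICS["planner_validation_fail_total"] += 1
    return violations
```

Returning the full list of violations, rather than stopping at the first one, lets a soft mode log everything while a strict mode refuses to run the plan.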

💬 Clarification lifecycle

Clarifications keep workflows durable without guessing. The orchestration layer persists pending prompts with both masked and raw candidates so humans can resolve them safely.

Detection: Autofill or schema validation encounters multiple candidates for a required parameter.
Emission: `_emit_clarification_event` streams `clarification_required` and records pending state under the `workflow:<workflow_id>:clarifications` cache entry.
Resolution: Operators POST to `/internal/health/clarification/resolve` with the chosen value; the entry moves to the resolved set.
Auto-resume: On the next tool invocation the orchestrator merges the resolved value, retries once, and emits `tool_retry (reason=clarification_resolved)`.
Telemetry: `clarification_requests_total` and `clarification_resolution_total` power dashboards to spot noisy tools or prompts.
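The lifecycle above can be sketched with an in-memory stand-in for the cache. Everything here other than the cache key format, the `clarification_required` event name, and the two counter names is hypothetical: the `ClarificationStore` class, the `mask` helper, and the event payload shape are assumptions for illustration, not the orchestrator's real interfaces.

```python
def mask(value, visible=2):
    """Hide all but the last `visible` characters of a candidate."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


class ClarificationStore:
    """In-memory stand-in for the shared clarification cache."""

    def __init__(self):
        self._cache = {}
        self.metrics = {
            "clarification_requests_total": 0,
            "clarification_resolution_total": 0,
        }

    @staticmethod
    def _key(workflow_id):
        return f"workflow:{workflow_id}:clarifications"

    def emit(self, workflow_id, parameter, candidates):
        """Record pending state and return the event with masked choices only."""
        entry = self._cache.setdefault(
            self._key(workflow_id), {"pending": {}, "resolved": {}}
        )
        entry["pending"][parameter] = list(candidates)  # raw values stay server-side
        self.metrics["clarification_requests_total"] += 1
        return {
            "event": "clarification_required",
            "workflow_id": workflow_id,
            "parameter": parameter,
            "choices": [mask(c) for c in candidates],
        }

    def resolve(self, workflow_id, parameter, chosen):
        """Move a pending parameter to the resolved set."""
        entry = self._cache[self._key(workflow_id)]
        if chosen not in entry["pending"][parameter]:
            raise ValueError("chosen value is not one of the pending candidates")
        del entry["pending"][parameter]
        entry["resolved"][parameter] = chosen
        self.metrics["clarification_resolution_total"] += 1

    def merge_resolved(self, workflow_id, params):
        """Merge resolved values into tool parameters before the retry."""
        entry = self._cache.get(self._key(workflow_id), {"resolved": {}})
        return {**params, **entry["resolved"]}
```

In this sketch the streamed event carries only masked choices, while the raw candidates stay in the cache entry until an operator resolves the prompt.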

🎚️ Feature flags & controls

Reliability features ship behind environment flags so we can roll out gradually and disable layers per tenant if required.

RELIABILITY_PLAN_ENFORCE

Soft or strict validation of planner output before execution.

RELIABILITY_AUTOFILL_RETRY

Turns on automatic lookup and retry before surfacing clarifications.

RELIABILITY_STRUCTURED_CACHE

Enables structured cache merge for contacts, IDs, and tool payloads.

PLAN_SCHEMA_ENFORCE

`log` or `strict` modes for JSON Schema response contracts.

ORCHESTRATOR_STREAM_SIZE_GUARD

(Planned) Truncates oversized streaming payloads while flagging telemetry.

STRICT_SCHEMA_EARLY_ENFORCE

Short-circuits finalization when required keys are missing.
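A minimal sketch of how a service might read these flags from the environment. The `read_mode` and `read_bool` helpers are hypothetical, and the accepted boolean spellings (`1`, `true`, `yes`, `on`) are an assumption; only the flag names and the `log`/`strict` modes come from the list above.

```python
import os


def read_mode(name, allowed, default, environ=None):
    """Read a mode flag, falling back to `default` on unknown values."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, default).strip().lower()
    return raw if raw in allowed else default


def read_bool(name, default=False, environ=None):
    """Read an on/off flag; an unset flag returns `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Example: resolve two flags from the process environment.
schema_mode = read_mode("PLAN_SCHEMA_ENFORCE", {"log", "strict"}, "log")
autofill_enabled = read_bool("RELIABILITY_AUTOFILL_RETRY")
```

Falling back to the default on an unrecognized value keeps a typo in a deployment manifest from silently switching a layer into an unintended mode.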

📈 Telemetry & dashboards

All layers emit structured events so SREs and customers can observe health in Grafana.

`planner_validation_fail_total`, `plan_injected_steps_total`, and `planner.plan_diff` highlight planning drift.
`tool_retry_attempt_total`, `tool_retry_success_total`, and `clarification_requests_total` track stability per tool.
`observability.summary` bundles provenance, memory access, replan stats, and normalizer usage for each workflow.
`planner_replan_total` and `planner_replan_suppressed_total` expose bounded replan behavior.
Grafana dashboard JSON lives in `config/grafana/grafana_orchestrator_phase_dashboard.json` for import or provisioning.
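One way to read the retry counters together is as a per-tool success ratio. This is a sketch of a derived signal, not a metric the platform is documented to export: it assumes both counters are labeled per tool and have been scraped into plain dictionaries, and `retry_success_ratio` is a hypothetical helper.

```python
def retry_success_ratio(attempts, successes):
    """Map each tool to successes / attempts, skipping tools with no retries."""
    ratios = {}
    for tool, attempt_count in attempts.items():
        if attempt_count > 0:
            ratios[tool] = successes.get(tool, 0) / attempt_count
    return ratios


# Assumed snapshots of tool_retry_attempt_total and tool_retry_success_total.
attempts = {"crm_lookup": 4, "calendar": 0}
successes = {"crm_lookup": 3}
print(retry_success_ratio(attempts, successes))
```

A tool whose ratio stays low while `clarification_requests_total` climbs is a candidate for the noisy tools the dashboards are meant to surface.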

🔐 Security commitments

Every reliability feature works hand-in-hand with the security review checklist.

Auth: Org Manager-issued tokens plus RBAC gate internal endpoints; workflow stubs stay behind feature flags.
Data handling: No PII is logged inside policy or agent events; diagnostic payloads enforce size limits.
Secrets: Vault + Tool Gateway ensure credentials never reach LLM responses, clarifications, or telemetry.
Rate limiting: Global and per-org throttles are staged for production to prevent abuse of dev stubs.
Transport: CORS and channel gateways restrict origins while LiveKit voice flows keep tenant routing data isolated.
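To make the secrets commitment concrete, here is a minimal sketch of redacting a payload before it is logged or attached to telemetry. The key list, the bearer-token pattern, and the `redact` helper are assumptions for illustration; the actual masking rules applied by Vault and the Tool Gateway are not described in this page.

```python
import re

# Assumed set of key names treated as secret-bearing.
SECRET_KEYS = {"password", "token", "api_key", "secret", "authorization"}

# Matches "Bearer <credential>" inside free-form strings.
BEARER_PATTERN = re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)


def redact(value):
    """Return a copy of `value` with secret-looking fields replaced."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SECRET_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return BEARER_PATTERN.sub("Bearer [REDACTED]", value)
    return value
```

Because `redact` builds a new structure instead of mutating its input, the original payload can still be passed to the tool while only the redacted copy reaches logs.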