Layi Docs
Reliability & Governance
Planner enforcement, clarification loops, telemetry, and security layers that keep runs safe
Last updated: December 2025
🧱 Reliability layers
Reliability is enforced as a layered stack. Each layer can emit telemetry, be feature-flagged, and operate independently, so one layer can be upgraded or disabled without rewriting the others.
1. Planner contract
Plans must match the plan schema (steps, agents, tools) before execution. Violations emit `planner_validation_fail_total`.
2. Role dispatch
Each step performs a single action, which keeps supervisor and specialist responsibilities isolated and auditable.
3. Structured data reuse
Contacts, IDs, and previous tool outputs are cached to avoid re-parsing unstructured text.
4. Autofill & retries
Missing parameters trigger automatic lookups and retries before the user is asked.
5. Clarification flow
Ambiguous parameters emit `clarification_required` events that carry masked choices and can be resolved through a REST endpoint.
6. Ingestion structuring
Documents are normalized at ingest so tools receive predictable schemas and cache lookups hit more often.
7. Metrics & observability
Counters, gauges, and structured events feed Grafana dashboards and alerting.
8. Streaming finalization
Dual streaming envelopes keep UI clients synced even if they only support legacy formats.
9. Security & masking
Secrets are redacted before logging, and Vault policies prevent cross-tenant access.
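The planner contract (layer 1) can be illustrated with a minimal sketch. Only the metric name `planner_validation_fail_total` comes from this page; the plan shape (a `steps` list whose entries carry `agent` and `tool`), the `validate_plan` helper, and the in-process `METRICS` counter are assumptions made for illustration, not the actual schema or metrics client.

```python
from collections import Counter

# Hypothetical in-process counter store; a real deployment would use its metrics client.
METRICS = Counter()

# Assumed step fields for illustration; the real plan schema may differ.
REQUIRED_STEP_KEYS = ("agent", "tool")


def validate_plan(plan: dict) -> list[str]:
    """Return a list of violations; an empty list means the plan may execute."""
    errors = []
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        errors.append("plan.steps must be a non-empty list")
    else:
        for i, step in enumerate(steps):
            if not isinstance(step, dict):
                errors.append(f"steps[{i}] must be an object")
                continue
            for key in REQUIRED_STEP_KEYS:
                if not isinstance(step.get(key), str) or not step[key]:
                    errors.append(f"steps[{i}].{key} must be a non-empty string")
    if errors:
        # One increment per rejected plan, regardless of how many violations it has.
        METRICS["planner_validation_fail_total"] += 1
    return errors
```

A caller would run `validate_plan` before dispatching any step and refuse to execute when the returned list is non-empty.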
💬 Clarification lifecycle
Clarifications keep workflows durable without guessing. The orchestration layer persists pending prompts with both masked and raw candidates so humans can resolve them safely.
- Detection: Autofill or schema validation encounters multiple candidates for a required parameter.
- Emission: `_emit_clarification_event` streams `clarification_required` and records pending state under the `workflow:<workflow_id>:clarifications` cache entry.
- Resolution: Operators POST to `/internal/health/clarification/resolve` with the chosen value; the entry moves to the resolved set.
- Auto-resume: On the next tool invocation, the orchestrator merges the resolved value, retries once, and emits `tool_retry (reason=clarification_resolved)`.
- Telemetry: `clarification_requests_total` and `clarification_resolution_total` power dashboards to spot noisy tools or prompts.
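The lifecycle above can be sketched end to end with in-memory stand-ins. The cache key pattern, event names, and counter names are taken from this page; the function names, the pending/resolved entry layout, the masking rule, and the event fields are hypothetical and only meant to show how the stages connect.

```python
from collections import Counter

CACHE: dict[str, dict] = {}  # stand-in for the real cache backend
EVENTS: list[dict] = []      # stand-in for the event stream
METRICS = Counter()          # stand-in for the metrics client


def _key(workflow_id: str) -> str:
    return f"workflow:{workflow_id}:clarifications"


def mask(value: str) -> str:
    """Assumed masking rule: hide everything except the last two characters."""
    return "*" * max(len(value) - 2, 0) + value[-2:]


def emit_clarification(workflow_id: str, param: str, candidates: list[str]) -> None:
    """Persist raw candidates as pending state; stream masked choices only."""
    entry = CACHE.setdefault(_key(workflow_id), {"pending": {}, "resolved": {}})
    entry["pending"][param] = list(candidates)
    EVENTS.append({
        "type": "clarification_required",
        "workflow_id": workflow_id,
        "param": param,
        "choices": [mask(c) for c in candidates],
    })
    METRICS["clarification_requests_total"] += 1


def resolve_clarification(workflow_id: str, param: str, value: str) -> None:
    """What a handler behind the resolve endpoint might do with the chosen value."""
    entry = CACHE[_key(workflow_id)]
    if value not in entry["pending"].get(param, []):
        raise ValueError(f"{value!r} is not a pending candidate for {param!r}")
    del entry["pending"][param]
    entry["resolved"][param] = value
    METRICS["clarification_resolution_total"] += 1


def merge_resolved(workflow_id: str, tool_args: dict) -> dict:
    """On the next tool invocation, fill arguments from resolved values."""
    resolved = CACHE.get(_key(workflow_id), {}).get("resolved", {})
    if resolved:
        EVENTS.append({
            "type": "tool_retry",
            "reason": "clarification_resolved",
            "workflow_id": workflow_id,
        })
    return {**tool_args, **resolved}
```

Keeping raw candidates in the cache while streaming only masked choices mirrors the split described above: clients see enough to choose, and the orchestrator keeps the exact value for the retry.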
🎚️ Feature flags & controls
Reliability features ship behind environment flags so we can roll out gradually and disable layers per tenant if required.
RELIABILITY_PLAN_ENFORCE
Soft or strict validation of planner output before execution.
RELIABILITY_AUTOFILL_RETRY
Turns on automatic lookup and retry before surfacing clarifications.
RELIABILITY_STRUCTURED_CACHE
Enables structured cache merge for contacts, IDs, and tool payloads.
PLAN_SCHEMA_ENFORCE
`log` or `strict` modes for JSON Schema response contracts.
ORCHESTRATOR_STREAM_SIZE_GUARD
(Planned) Truncates oversized streaming payloads while flagging telemetry.
STRICT_SCHEMA_EARLY_ENFORCE
Short-circuits finalization when required keys are missing.
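As a rough illustration of how a service might read these flags at startup, the sketch below parses them from the environment into a typed object. The flag names come from the list above; the accepted values for `RELIABILITY_PLAN_ENFORCE` (`soft`/`strict`), the boolean encoding, the defaults, and the `load_flags` helper are assumptions. The planned `ORCHESTRATOR_STREAM_SIZE_GUARD` flag is left out.

```python
import os
from dataclasses import dataclass

TRUTHY = {"1", "true", "yes", "on"}  # assumed boolean encoding


@dataclass(frozen=True)
class ReliabilityFlags:
    plan_enforce: str                  # RELIABILITY_PLAN_ENFORCE, assumed "soft" | "strict"
    autofill_retry: bool               # RELIABILITY_AUTOFILL_RETRY
    structured_cache: bool             # RELIABILITY_STRUCTURED_CACHE
    plan_schema_enforce: str           # PLAN_SCHEMA_ENFORCE, "log" | "strict"
    strict_schema_early_enforce: bool  # STRICT_SCHEMA_EARLY_ENFORCE


def _choice(env, name, allowed, default):
    """Read an enumerated flag and reject values outside the allowed set."""
    value = env.get(name, default).strip().lower()
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    return value


def _switch(env, name, default=False):
    """Read an on/off flag; unset means the (assumed) default."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY


def load_flags(env=None) -> ReliabilityFlags:
    env = os.environ if env is None else env
    return ReliabilityFlags(
        plan_enforce=_choice(env, "RELIABILITY_PLAN_ENFORCE", {"soft", "strict"}, "soft"),
        autofill_retry=_switch(env, "RELIABILITY_AUTOFILL_RETRY"),
        structured_cache=_switch(env, "RELIABILITY_STRUCTURED_CACHE"),
        plan_schema_enforce=_choice(env, "PLAN_SCHEMA_ENFORCE", {"log", "strict"}, "log"),
        strict_schema_early_enforce=_switch(env, "STRICT_SCHEMA_EARLY_ENFORCE"),
    )
```

Failing fast on an unrecognized mode is a deliberate choice in this sketch: a typo such as `PLAN_SCHEMA_ENFORCE=strcit` raises at startup rather than silently falling back to a weaker mode.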
📈 Telemetry & dashboards
All layers emit structured events so SREs and customers can observe health in Grafana.
- `planner_validation_fail_total`, `plan_injected_steps_total`, and `planner.plan_diff` highlight planning drift.
- `tool_retry_attempt_total`, `tool_retry_success_total`, and `clarification_requests_total` track stability per tool.
- `observability.summary` bundles provenance, memory access, replan stats, and normalizer usage for each workflow.
- `planner_replan_total` and `planner_replan_suppressed_total` expose bounded replan behavior.
- Grafana dashboard JSON lives in `config/grafana/grafana_orchestrator_phase_dashboard.json` for import or provisioning.
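To show how the per-tool counters above might be read, here is a small sketch that derives a retry success ratio and a list of clarification-heavy tools. The metric names are from this page; the `tool` label, the in-memory `COUNTERS` store, and both helper functions are hypothetical, and a production setup would more likely compute these in dashboard queries.

```python
from collections import Counter

# Counters keyed by (metric name, tool label); the per-tool label is an assumption.
COUNTERS: Counter = Counter()


def inc(metric: str, tool: str, amount: int = 1) -> None:
    COUNTERS[(metric, tool)] += amount


def retry_success_ratio(tool: str):
    """Share of retries that succeeded for a tool, or None when it never retried."""
    attempts = COUNTERS[("tool_retry_attempt_total", tool)]
    if attempts == 0:
        return None
    return COUNTERS[("tool_retry_success_total", tool)] / attempts


def noisy_tools(min_requests: int) -> list[str]:
    """Tools whose clarification count meets the threshold, busiest first."""
    hits = [
        (count, tool)
        for (metric, tool), count in COUNTERS.items()
        if metric == "clarification_requests_total" and count >= min_requests
    ]
    # Sort by count descending, then by tool name for a stable order.
    return [tool for count, tool in sorted(hits, key=lambda h: (-h[0], h[1]))]
```

A low retry success ratio combined with a high clarification count for the same tool is the kind of signal the dashboards are meant to surface.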
🔐 Security commitments
Every reliability feature works hand-in-hand with the security review checklist.
- Auth: Org Manager-issued tokens plus RBAC gate internal endpoints; workflow stubs stay behind feature flags.
- Data handling: No PII is logged inside policy or agent events; diagnostic payloads enforce size limits.
- Secrets: Vault + Tool Gateway ensure credentials never reach LLM responses, clarifications, or telemetry.
- Rate limiting: Global and per-org throttles are staged for production to prevent abuse of dev stubs.
- Transport: CORS and channel gateways restrict origins while LiveKit voice flows keep tenant routing data isolated.
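The data-handling and secrets commitments can be pictured with a short sketch that redacts sensitive fields and caps diagnostic payload size before anything is logged. The key list, the `[REDACTED]` marker, the 2048-character limit, and both function names are assumptions for illustration; they do not describe the actual Vault or Tool Gateway behavior.

```python
import json

# Assumed set of sensitive field names, compared case-insensitively.
SENSITIVE_KEYS = {"authorization", "api_key", "password", "token", "secret"}


def redact(payload):
    """Recursively replace values of sensitive keys in dicts and lists."""
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload


def safe_diagnostic(payload: dict, max_chars: int = 2048) -> str:
    """Serialize a redacted payload and truncate it to an assumed size limit."""
    text = json.dumps(redact(payload), sort_keys=True)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
```

Redaction runs before serialization and truncation, so a secret can never survive by straddling the cut-off point.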