Observability (OpenTelemetry)
You can't operate an agent runtime in production without observability, and the cost of bolting it on after the fact is paid in incidents you can't reproduce. AgentOS treats spans, metrics, and log correlation as first-class concerns — but it does not own the OpenTelemetry SDK lifecycle. The SDK is an application-level concern: your host owns exporters, sampling, and context propagation, because the right answer for a CLI process is different from the right answer for a long-running server is different from the right answer for an edge worker.
What AgentOS owns is the emit side: opt-in spans around turns and tool-result handling, opt-in counters and histograms for the operations worth measuring, optional trace-correlation in logs and streamed response metadata, and an optional path to export application logs as OTEL LogRecords. All defaults are off. Turning them on is a single config change, and the runtime will surface to whatever exporter your host has wired (OTLP to Honeycomb, Tempo, Jaeger, Grafana Cloud — the runtime doesn't care, because your host SDK is what does the export).
The implementation lives in src/evaluation/observability/ and uses @opentelemetry/api directly — never bundled, always peer-dep-style imported, so your host SDK is the one and only OTEL provider in the process.
Table of Contents
- Overview
- Opt-In Policy
- Enable via AgentOS Config (Recommended)
- Enable via Environment Variables
- Host Setup (Node.js Example)
- What AgentOS Emits
- Logging (Pino + OTEL Logs)
- Privacy & Cardinality
- Performance Notes
- SOTA Techniques (TypeScript Agentic AI)
Overview
Defaults (all OFF):
- Manual AgentOS spans
- AgentOS metrics
- Trace IDs in streamed responses
- Log correlation (
trace_id,span_id) - OTEL LogRecord export
When enabled, AgentOS emits:
- Spans around turns and tool-result handling
- Metrics for turn/tool counters + histograms
- Optional: trace correlation in logs and streamed response metadata
- Optional: OTEL LogRecords (exported by your host, via OTLP)
Opt-In Policy
There are two layers:
-
Host OTEL SDK (required for export)
- In Node:
@opentelemetry/sdk-node+ exporters/instrumentations. - In browsers: the web OTEL SDK (if you choose to export from the client).
- In Node:
-
AgentOS instrumentation toggles (controls what AgentOS emits)
- Env flags (global defaults)
AgentOSConfig.observability(per-agent control)
Precedence:
AgentOSConfig.observability.enabled = falsehard-disables all AgentOS observability helpers (even if env is set).- Otherwise, config fields override env fields, and env provides defaults.
Enable via AgentOS Config (Recommended)
import { AgentOS } from '@framers/agentos';
const agentos = await AgentOS.create({
observability: {
// Master switch: when true, defaults to enabling tracing/metrics + log correlation.
// Keep explicit per-signal toggles if you want a tighter blast radius.
// enabled: true,
tracing: {
enabled: true,
includeTraceInResponses: true, // adds metadata.trace to select streamed chunks
},
metrics: {
enabled: true,
},
logging: {
includeTraceIds: true, // adds trace_id/span_id to pino log meta
exportToOtel: false, // keep OFF unless you want OTLP log export
// otelLoggerName: '@framers/agentos',
},
},
});
Enable via Environment Variables
# Master switch: enables tracing + metrics + log trace_id/span_id injection defaults
AGENTOS_OBSERVABILITY_ENABLED=true
# Optional fine-grained toggles
AGENTOS_TRACING_ENABLED=true
AGENTOS_METRICS_ENABLED=true
AGENTOS_TRACE_IDS_IN_RESPONSES=true
AGENTOS_LOG_TRACE_IDS=true
# Optional: emit OTEL LogRecords (still requires a host SDK + logs exporter)
AGENTOS_OTEL_LOGS_ENABLED=true
# Names (advanced; usually keep defaults)
AGENTOS_OTEL_TRACER_NAME=@framers/agentos
AGENTOS_OTEL_METER_NAME=@framers/agentos
AGENTOS_OTEL_LOGGER_NAME=@framers/agentos
Host Setup (Node.js Example)
AgentOS only uses OTEL APIs; your host must install/start an OTEL SDK to export anything.
Typical Node setup:
- Install dependencies:
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
- Configure env (OTLP/HTTP collector example):
OTEL_SERVICE_NAME=my-agent-host
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
# OTEL_LOGS_EXPORTER=otlp # keep explicit opt-in for log export
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
# Optional sampling
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
- Start the SDK early (before most imports) so auto-instrumentation can patch libraries:
// otel.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
instrumentations: [getNodeAutoInstrumentations()],
});
export async function startOtel(): Promise<void> {
await sdk.start();
}
export async function shutdownOtel(): Promise<void> {
await sdk.shutdown();
}
In this monorepo, the backend bootstrap lives at backend/src/observability/otel.ts.
What AgentOS Emits
Spans
When tracing is enabled, AgentOS emits spans such as:
agentos.turnagentos.gmi.get_or_createagentos.gmi.process_turn_streamagentos.tool_resultagentos.gmi.handle_tool_resultagentos.conversation.save(stage-tagged)
Metrics
When metrics are enabled, AgentOS records:
agentos.turns(counter)agentos.turn.duration_ms(histogram)agentos.turn.tokens.total|prompt|completion(histograms; only when usage is available)agentos.turn.cost.usd(histogram; only when cost is available)agentos.tool_results(counter)agentos.tool_result.duration_ms(histogram)
Trace IDs in Streamed Responses
When enabled, AgentOS attaches trace metadata to select streamed chunks:
{
"metadata": {
"trace": {
"traceId": "...",
"spanId": "...",
"traceparent": "00-...-...-01"
}
}
}
Logging (Pino + OTEL Logs)
Stdout Logs (Default)
AgentOS uses pino for structured logs. When includeTraceIds is enabled and an active span exists, AgentOS adds:
trace_idspan_id
to log metadata to make correlation easy in any log backend.
OTEL LogRecord Export (Optional)
If you want logs to flow through the OTLP pipeline (instead of, or in addition to, stdout shipping):
-
Enable AgentOS OTEL log emission:
AGENTOS_OTEL_LOGS_ENABLED=true, orobservability.logging.exportToOtel = true
-
Enable a host logs exporter (Node example):
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
Recommendation:
- Keep OTEL log export OFF by default.
- Use stdout logs + trace correlation for most deployments.
- Turn on OTEL log export when you explicitly want one unified OTLP pipeline for traces/metrics/logs.
Privacy & Cardinality
Defaults are conservative:
- Prompts, model outputs, and tool arguments are not recorded by default.
- Prefer safe metadata only (durations, status, tool names, model/provider ids, token usage, cost).
Cardinality guidance:
- Avoid labeling metrics/spans with high-cardinality values (user ids, conversation ids, raw URLs, prompt text).
- Keep attributes stable and low-cardinality (e.g.
status,tool_name,persona_id).
Performance Notes
- AgentOS observability helpers are a safe no-op when disabled.
- Traces/metrics are typically low overhead when sampling is enabled.
- OTEL log export can add noticeable CPU/network overhead at
debugvolume; use it intentionally.
SOTA Techniques (TypeScript Agentic AI)
Patterns that work well in practice:
- Structured event stream: emit agent lifecycle events (turn started, tool called, tool returned, policy decision, final output) as strongly-typed records (AgentOS already streams chunks; persist them if you need audits).
- W3C context propagation: propagate
traceparentacross inbound HTTP, SSE/WebSocket streaming, and tool calls; use OTEL context managers (AsyncLocalStorage) in Node. - GenAI semantic conventions: add
gen_ai.*attributes/events to spans when instrumenting model calls, tool calls, and token usage; keep raw content behind explicit opt-in and redaction. - Redaction and data classification: treat prompt/tool args/output as sensitive by default; add allowlists + hashing for debugging without content exfiltration.
Common library choices:
- Telemetry:
@opentelemetry/sdk-node,@opentelemetry/auto-instrumentations-node - Logging:
pino(+@opentelemetry/instrumentation-pinoif you want automatic injection everywhere) - Agent/LLM observability layers: Langfuse, Helicone, Sentry AI monitoring, OpenLIT, OpenLLMetry-js
References
OpenTelemetry
- W3C. (2021). Trace Context Level 1. W3C Recommendation. — The W3C standard for distributed-trace context propagation across process boundaries; AgentOS uses it to correlate spans across microservices. w3.org/TR/trace-context-1
- OpenTelemetry Specification (current). OpenTelemetry signal specifications: traces, metrics, and logs. — The protocol contract AgentOS emits against. opentelemetry.io/docs/specs/otel
- OpenTelemetry. (current). Semantic conventions. — Naming and attribute schema for spans/metrics/logs; AgentOS follows the GenAI semantic conventions for LLM-call attributes. opentelemetry.io/docs/specs/semconv
- OpenTelemetry GenAI working group. (current). Generative AI semantic conventions. — The schema for
gen_ai.*attributes (request.model, usage.input_tokens, etc.) AgentOS sets on LLM-call spans. opentelemetry.io/docs/specs/semconv/gen-ai
Distributed tracing foundations
- Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., & Shanbhag, C. (2010). Dapper, a large-scale distributed systems tracing infrastructure. Google Technical Report. — The original distributed-tracing paper that defined the span/trace abstractions used today. Google Research
- Mace, J., Roelke, R., & Fonseca, R. (2015). Pivot tracing: Dynamic causal monitoring for distributed systems. SOSP 2015. — Causal monitoring methodology informing the trace-id propagation through
AgentOSResponsemetadata. DOI
Logging
- OpenTelemetry. (current). OpenTelemetry logging specification. — The bridge spec connecting
LogRecordevents to span context; AgentOS's optionalexportToOtellog path follows it. opentelemetry.io/docs/specs/otel/logs - Pino contributors. (current). Pino: Very low overhead Node.js logger. — The logger AgentOS wraps via
PinoLogger; chosen for its sub-microsecond per-line cost in hot paths. GitHub
Implementation references
packages/agentos/src/evaluation/observability/Tracer.ts— span creation around turn / tool / guardrail / LLM-call boundariespackages/agentos/src/evaluation/observability/otel.ts— OpenTelemetry API peer-dep wiringpackages/agentos/src/logging/PinoLogger.ts— structured logger with trace-id / span-id field injectionpackages/agentos/src/evaluation/SqlTaskOutcomeTelemetryStore.ts— persisted per-turn outcome KPIs for rolling-quality dashboards