Improve OTel tracing for analytics and aws cli #231

Merged: carole-lavillonniere merged 1 commit into main from fix-telemetry on May 11, 2026
Collaborator

@carole-lavillonniere carole-lavillonniere commented May 7, 2026

Motivation

Traces emitted under LSTK_OTEL=1 were not fine-grained enough around the lstk aws wrapper. Investigation revealed several blind spots: the telemetry call, the Docker daemon-ID lookup used to derive machine_id, and the aws CLI subprocess itself were either uninstrumented or producing orphaned/dropped spans.

Changes

  • Instrument the telemetry HTTP client with otelhttp.NewTransport
  • Instrument the Docker client used by LoadOrCreateMachineID
  • Wrap awscli.Exec in a span (aws cli)
  • Reorder defers in cmd/root.go:Execute so tel.Close() runs before the tracer-provider shutdown (otherwise the analytics POST span is created on a noop tracer after shutdown and never reaches the exporter)
  • Make machine-ID resolution lazy (in GetEnvironment(ctx)) so the Docker info / _ping spans are parented to the active command span instead of orphaned at process init.
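The last two bullets are both about ordering relative to the tracer's lifetime and the active span. Below is a minimal, stdlib-only simulation of that ordering, not the code from this PR: the `app` type, `spanKey`, `getEnvironment`, `execute`, and `run` are hypothetical stand-ins, and a context value plays the role of the active command span. It relies on two Go facts: deferred calls run in LIFO order, and `sync.Once` runs its function on first use rather than at process init.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// spanKey is a hypothetical context key standing in for an active span.
type spanKey struct{}

// app is an illustrative stand-in for the CLI's process-wide state.
type app struct {
	events    []string  // order in which the simulated steps ran
	once      sync.Once // guards the lazy machine-ID lookup
	machineID string
}

// getEnvironment mimics a lazy GetEnvironment(ctx): the lookup runs on first
// use, so it sees whatever parent the calling command placed in ctx.
func (a *app) getEnvironment(ctx context.Context) string {
	a.once.Do(func() {
		parent, _ := ctx.Value(spanKey{}).(string)
		if parent == "" {
			parent = "none"
		}
		a.events = append(a.events, "machine-id lookup, parent="+parent)
		a.machineID = "example-machine-id" // placeholder value
	})
	return a.machineID
}

// execute mimics the defer layout: the tracer shutdown is deferred first and
// the telemetry close last, so the telemetry close runs first on return.
func (a *app) execute() {
	defer func() { a.events = append(a.events, "tracer shutdown") }()
	defer func() { a.events = append(a.events, "tel.Close") }()

	ctx := context.WithValue(context.Background(), spanKey{}, "command")
	a.getEnvironment(ctx)
	a.getEnvironment(ctx) // second call reuses the cached value
}

// run executes the simulation and returns the recorded order.
func run() []string {
	a := &app{}
	a.execute()
	return a.events
}

func main() {
	for _, e := range run() {
		fmt.Println(e)
	}
}
```

In this sketch the lookup is recorded once with the command as its parent, and the telemetry close is recorded before the tracer shutdown, which is the order the reordered defers are meant to guarantee.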

toward DRG-809

Result of running lstk aws sts get-caller-identity:

[Screenshot omitted]

@carole-lavillonniere changed the title from "Improve OTel tracing for analytics, machine_id, and aws cli" to "Improve OTel tracing for analytics and aws cli" on May 7, 2026
Collaborator

@anisaoshafi anisaoshafi left a comment

LGTM 👏🏼

@anisaoshafi
Collaborator

Question: there is a difference in time for lstk aws sts get-caller-identity from before to after 700 ms -> 1.98 s. Is that perhaps related to the improvement itself or just random?

@carole-lavillonniere
Collaborator Author

> Question: there is a difference in time for lstk aws sts get-caller-identity from before to after 700 ms -> 1.98 s. Is that perhaps related to the improvement itself or just random?

Good observation! I did some research and testing, and found that I was pointing to aws cli v1, which is a pip-installed Python package (v2 ships as a self-contained bundle). aws cli v1 is much slower and needs to do I/O when the cache is not warm. The slowness does not seem to be due to anything added in this PR.
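For anyone comparing timings, a quick way to rule this out is to check which major version of the aws cli is being resolved. A minimal sketch, assuming the `aws --version` output starts with a token of the form `aws-cli/<major>.<minor>.<patch>`; the helper name `awsCLIMajor` and the sample strings are illustrative and not part of lstk.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// awsCLIMajor extracts the major version from the first token of
// `aws --version` output, assumed to look like "aws-cli/1.32.0 ...".
func awsCLIMajor(versionOutput string) (int, error) {
	fields := strings.Fields(versionOutput)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty version output")
	}
	const prefix = "aws-cli/"
	if !strings.HasPrefix(fields[0], prefix) {
		return 0, fmt.Errorf("unexpected version output: %q", versionOutput)
	}
	version := strings.TrimPrefix(fields[0], prefix)
	major := strings.SplitN(version, ".", 2)[0]
	return strconv.Atoi(major)
}

func main() {
	// Illustrative sample strings, not captured from a real run.
	samples := []string{
		"aws-cli/1.32.0 Python/3.11.6 Linux/6.1.0 botocore/1.34.0",
		"aws-cli/2.15.30 Python/3.11.8 Linux/6.1.0 exe/x86_64 prompt/off",
	}
	for _, s := range samples {
		major, err := awsCLIMajor(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("major version %d\n", major)
	}
}
```

The helper could be fed the output of running `aws --version` with the same PATH that lstk sees, to make sure a before/after comparison uses the same aws cli.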

@carole-lavillonniere carole-lavillonniere merged commit 3fcef76 into main May 11, 2026
12 checks passed
@carole-lavillonniere carole-lavillonniere deleted the fix-telemetry branch May 11, 2026 10:33