LLMOps

digital.vasic.llmops -- LLM operations with continuous evaluation pipelines, A/B experiment management, dataset management, prompt versioning, and alerting.

Overview

LLMOps is a Go module that provides a complete MLOps/LLMOps toolkit for managing the lifecycle of LLM-powered applications. It covers four key operational areas: continuous evaluation of model quality against datasets, A/B experimentation for comparing prompts and models, prompt template versioning with variable rendering, and alerting for performance regressions.

The continuous evaluator runs evaluation datasets against specified prompt/model combinations, tracks pass rates and per-metric scores, automatically detects regressions against previous runs, and triggers alerts when thresholds are breached. Evaluations can be scheduled for recurring execution.

The experiment manager implements full A/B testing with traffic splitting, deterministic user-to-variant assignment (hash-based), metric recording, and statistical significance calculation using z-test approximation. It supports prompt experiments (comparing prompt versions) and model experiments (comparing different LLM providers).

Prompt versioning provides semantic versioning, variable templates with validation, active version tracking, and a diff tool for comparing prompt versions. All components are tied together by the LLMOpsSystem orchestrator which wires up debate-based evaluation, verifier integration, and component initialization.

Architecture

+----------------------------+
|       LLMOpsSystem         |
|    (Main Orchestrator)     |
+----+-------+-------+------+
     |       |       |
+----v--+ +--v---+ +-v---------+
|Prompt | |Exper.| |Continuous  |
|Registry| |Mgr  | |Evaluator  |
+-------+ +--+--+ +-----+-----+
              |          |
         +----v----+ +---v--------+
         | Variant | | Alert      |
         | Assign  | | Manager    |
         | Metrics | | (Regress.) |
         +---------+ +------------+

Integration Adapters:
  - DebateLLMEvaluator    (debate-based scoring)
  - VerifierIntegration   (provider selection)

Package Structure

Package	Purpose
`llmops`	Core module: evaluator, experiments, prompts, alerts, integration

Source Files

File	Description
`types.go`	All type definitions, interfaces, enums: prompts, experiments, evaluations, datasets, alerts
`evaluator.go`	`InMemoryContinuousEvaluator` -- evaluation runs, dataset management, regression detection, scheduling
`experiments.go`	`InMemoryExperimentManager` -- A/B testing, traffic splitting, variant assignment, statistical significance
`prompts.go`	`InMemoryPromptRegistry` -- prompt versioning, variable rendering, version comparison/diffing
`integration.go`	`LLMOpsSystem` orchestrator, alert manager, debate adapter, verifier integration

API Reference

Core Interfaces

// PromptRegistry manages prompt versions
type PromptRegistry interface {
    Create(ctx context.Context, prompt *PromptVersion) error
    Get(ctx context.Context, name, version string) (*PromptVersion, error)
    GetLatest(ctx context.Context, name string) (*PromptVersion, error)
    List(ctx context.Context, name string) ([]*PromptVersion, error)
    ListAll(ctx context.Context) ([]*PromptVersion, error)
    Activate(ctx context.Context, name, version string) error
    Delete(ctx context.Context, name, version string) error
    Render(ctx context.Context, name, version string, vars map[string]interface{}) (string, error)
}

// ExperimentManager manages A/B testing
type ExperimentManager interface {
    Create(ctx context.Context, exp *Experiment) error
    Get(ctx context.Context, id string) (*Experiment, error)
    List(ctx context.Context, status ExperimentStatus) ([]*Experiment, error)
    Start(ctx context.Context, id string) error
    Pause(ctx context.Context, id string) error
    Complete(ctx context.Context, id string, winner string) error
    Cancel(ctx context.Context, id string) error
    AssignVariant(ctx context.Context, experimentID, userID string) (*Variant, error)
    RecordMetric(ctx context.Context, experimentID, variantID, metric string, value float64) error
    GetResults(ctx context.Context, experimentID string) (*ExperimentResult, error)
}

// ContinuousEvaluator manages evaluation pipelines
type ContinuousEvaluator interface {
    CreateRun(ctx context.Context, run *EvaluationRun) error
    StartRun(ctx context.Context, runID string) error
    GetRun(ctx context.Context, runID string) (*EvaluationRun, error)
    ListRuns(ctx context.Context, filter *EvaluationFilter) ([]*EvaluationRun, error)
    ScheduleRun(ctx context.Context, run *EvaluationRun, schedule string) error
    CompareRuns(ctx context.Context, runID1, runID2 string) (*RunComparison, error)
}

// AlertManager manages performance alerts
type AlertManager interface {
    Create(ctx context.Context, alert *Alert) error
    List(ctx context.Context, filter *AlertFilter) ([]*Alert, error)
    Acknowledge(ctx context.Context, alertID string) error
    Subscribe(ctx context.Context, callback AlertCallback) error
}

LLMOpsSystem Methods

func NewLLMOpsSystem(config *LLMOpsConfig, logger *logrus.Logger) *LLMOpsSystem
func (s *LLMOpsSystem) Initialize() error
func (s *LLMOpsSystem) SetDebateEvaluator(evaluator DebateLLMEvaluator)
func (s *LLMOpsSystem) SetVerifierIntegration(vi *VerifierIntegration)
func (s *LLMOpsSystem) GetPromptRegistry() PromptRegistry
func (s *LLMOpsSystem) GetExperimentManager() ExperimentManager
func (s *LLMOpsSystem) GetEvaluator() ContinuousEvaluator
func (s *LLMOpsSystem) GetAlertManager() AlertManager
func (s *LLMOpsSystem) CreatePromptExperiment(ctx, name, control, treatment, trafficSplit) (*Experiment, error)
func (s *LLMOpsSystem) CreateModelExperiment(ctx, name, models, parameters) (*Experiment, error)

Usage Examples

Prompt versioning and rendering

registry := llmops.NewInMemoryPromptRegistry(logger)

registry.Create(ctx, &llmops.PromptVersion{
    Name:    "code-review",
    Version: "1.0.0",
    Content: "Review this {{language}} code for {{focus_area}}: {{code}}",
    Variables: []llmops.PromptVariable{
        {Name: "language", Type: "string", Required: true},
        {Name: "focus_area", Type: "string", Required: true, Default: "bugs"},
        {Name: "code", Type: "string", Required: true},
    },
})

rendered, _ := registry.Render(ctx, "code-review", "1.0.0", map[string]interface{}{
    "language":   "Go",
    "focus_area": "performance",
    "code":       "func main() { ... }",
})

A/B experiment for prompts

system := llmops.NewLLMOpsSystem(llmops.DefaultLLMOpsConfig(), logger)
system.Initialize()

exp, _ := system.CreatePromptExperiment(ctx,
    "code-review-v2-test",
    controlPrompt,   // *PromptVersion
    treatmentPrompt,  // *PromptVersion
    0.3,              // 30% traffic to treatment
)

mgr := system.GetExperimentManager()
mgr.Start(ctx, exp.ID)

// During request handling:
variant, _ := mgr.AssignVariant(ctx, exp.ID, userID)
// ... use variant.PromptName/PromptVersion ...
mgr.RecordMetric(ctx, exp.ID, variant.ID, "quality", 0.85)

// Check results
results, _ := mgr.GetResults(ctx, exp.ID)
// results.Confidence, results.Winner, results.Recommendation

Continuous evaluation with regression detection

eval := system.GetEvaluator().(*llmops.InMemoryContinuousEvaluator)

// Create dataset
eval.CreateDataset(ctx, &llmops.Dataset{
    Name: "golden-set",
    Type: llmops.DatasetTypeGolden,
})
eval.AddSamples(ctx, datasetID, samples)

// Create and run evaluation
eval.CreateRun(ctx, &llmops.EvaluationRun{
    Name:       "nightly-eval",
    Dataset:    datasetID,
    PromptName: "code-review",
    ModelName:  "claude-3-sonnet",
    Metrics:    []string{"accuracy", "helpfulness"},
})
eval.StartRun(ctx, runID)
// Automatic regression alerts if pass rate drops > 5%

Alert subscription

alerts := system.GetAlertManager()
alerts.Subscribe(ctx, func(alert *llmops.Alert) error {
    log.Printf("[%s] %s: %s", alert.Severity, alert.Type, alert.Message)
    return nil
})

Configuration

type LLMOpsConfig struct {
    EnableAutoEvaluation   bool               // Enable background evaluation (default: true)
    EvaluationInterval     time.Duration      // Evaluation frequency (default: 24h)
    MinSamplesForSignif    int                // Min samples for statistical significance (default: 100)
    AlertThresholds        map[string]float64 // Metric thresholds for alerts
    EnableDebateEvaluation bool               // Use debate for evaluation (default: true)
}

Experiment Status Lifecycle

draft -> running -> paused -> running -> completed (with winner) draft -> running -> cancelled

Alert Types and Severities

Type	Description
`regression`	Performance regression detected
`threshold`	Metric threshold breached
`anomaly`	Anomaly detected
`experiment`	Experiment result notification

Severities: info, warning, critical

Dataset Types

Type	Purpose
`golden`	Curated golden test set
`regression`	Regression test cases
`benchmark`	Benchmark evaluation set
`user`	User-generated examples

Testing

go build ./...
go test ./... -count=1 -race

Integration with HelixAgent

LLMOps connects to HelixAgent through the adapter layer:

Debate Evaluation: The DebateLLMEvaluator interface allows HelixAgent's debate service to evaluate model responses during continuous evaluation runs, providing multi-LLM consensus scoring.
Verifier Integration: VerifierIntegration uses LLMsVerifier's provider scores and health checks to select the best provider for experiments and evaluations.
Prompt Management: The prompt registry serves as the central store for all system prompts used by HelixAgent, enabling versioned rollouts and A/B testing of prompt changes.
Regression Alerts: Alerts integrate with HelixAgent's notification system (SSE, WebSocket, Webhooks) to surface quality regressions to operators.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
Upstreams		Upstreams
challenges/scripts		challenges/scripts
docs		docs
llmops		llmops
pkg/i18n		pkg/i18n
scripts/host-power-management		scripts/host-power-management
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CONSTITUTION.md		CONSTITUTION.md
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLMOps

Overview

Architecture

Package Structure

Source Files

API Reference

Core Interfaces

LLMOpsSystem Methods

Usage Examples

Prompt versioning and rendering

A/B experiment for prompts

Continuous evaluation with regression detection

Alert subscription

Configuration

Experiment Status Lifecycle

Alert Types and Severities

Dataset Types

Testing

Integration with HelixAgent

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLMOps

Overview

Architecture

Package Structure

Source Files

API Reference

Core Interfaces

LLMOpsSystem Methods

Usage Examples

Prompt versioning and rendering

A/B experiment for prompts

Continuous evaluation with regression detection

Alert subscription

Configuration

Experiment Status Lifecycle

Alert Types and Severities

Dataset Types

Testing

Integration with HelixAgent

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages