SelfImprove

digital.vasic.selfimprove -- AI self-improvement through RLHF, reward modelling, feedback integration, policy optimization, and dimension-weighted scoring.

Overview

SelfImprove is a Go module that implements a complete Reinforcement Learning from AI Feedback (RLAIF) pipeline for continuous AI quality improvement. It provides reward models that score LLM responses across multiple quality dimensions (accuracy, relevance, helpfulness, harmlessness, honesty, coherence, creativity, formatting), feedback collection with aggregation and trend analysis, and policy optimization that refines system prompts based on accumulated evidence.

The module supports two evaluation modes: single-LLM evaluation where one provider scores responses, and multi-LLM debate evaluation where an ensemble of models collectively assess response quality through structured debate. Constitutional AI principles are enforced through self-critique filtering, ensuring that policy updates never violate safety constraints.

SelfImprove integrates with HelixAgent's debate service and LLMsVerifier to leverage provider trust scores when making optimization decisions. The background optimization loop periodically exports feedback as training examples, generates policy improvements, and applies them with configurable daily rate limits and minimum improvement thresholds.

Architecture

                    +---------------------------+
                    |  SelfImprovementSystem    |
                    |  (Main Orchestrator)      |
                    +------+--------+-----------+
                           |        |
              +------------+        +-------------+
              |                                   |
   +----------v----------+           +-----------v-----------+
   |   AIRewardModel     |           |  LLMPolicyOptimizer   |
   | (Score, Compare,    |           | (Optimize, Apply,     |
   |  ScoreWithDimensions)|          |  Rollback, Critique)  |
   +----------+----------+           +-----------+-----------+
              |                                   |
   +----------v----------+           +-----------v-----------+
   | Debate / LLM Eval   |           | Debate / LLM Analysis |
   | (Fallback chain)    |           | (Fallback chain)      |
   +---------------------+           +-----------------------+
              |
   +----------v-----------+
   | FeedbackCollector     |
   | (InMemory / Auto)     |
   | Collect, Aggregate,   |
   | Export, Trend          |
   +------------------------+

Package Structure

Package	Purpose
`selfimprove`	Core module: reward models, feedback collection, policy optimization, integration orchestrator

Source Files

File	Description
`types.go`	All type definitions, interfaces, enums, and configuration defaults
`reward.go`	`AIRewardModel` -- LLM/debate-based response scoring with caching and dimension weights
`feedback.go`	`InMemoryFeedbackCollector` and `AutoFeedbackCollector` -- feedback storage, filtering, aggregation, export
`optimizer.go`	`LLMPolicyOptimizer` -- feedback pattern analysis, policy generation, constitutional self-critique
`integration.go`	`SelfImprovementSystem` orchestrator, `DebateServiceAdapter`, background optimization loop

API Reference

Types

Feedback types: FeedbackTypePositive, FeedbackTypeNegative, FeedbackTypeNeutral, FeedbackTypeSuggestion, FeedbackTypeCorrection

Feedback sources: FeedbackSourceHuman, FeedbackSourceAI, FeedbackSourceDebate, FeedbackSourceVerifier, FeedbackSourceMetric

Evaluation dimensions: DimensionAccuracy, DimensionRelevance, DimensionHelpfulness, DimensionHarmless, DimensionHonest, DimensionCoherence, DimensionCreativity, DimensionFormatting

Policy update types: PolicyUpdatePromptRefinement, PolicyUpdateGuidelineAddition, PolicyUpdateExampleAddition, PolicyUpdateConstraintUpdate, PolicyUpdateToneAdjustment

Core Interfaces

// RewardModel evaluates response quality
type RewardModel interface {
    Score(ctx context.Context, prompt, response string) (float64, error)
    ScoreWithDimensions(ctx context.Context, prompt, response string) (map[DimensionType]float64, error)
    Compare(ctx context.Context, prompt, response1, response2 string) (*PreferencePair, error)
    Train(ctx context.Context, examples []*TrainingExample) error
}

// FeedbackCollector collects and processes feedback
type FeedbackCollector interface {
    Collect(ctx context.Context, feedback *Feedback) error
    GetBySession(ctx context.Context, sessionID string) ([]*Feedback, error)
    GetByPrompt(ctx context.Context, promptID string) ([]*Feedback, error)
    GetAggregated(ctx context.Context, filter *FeedbackFilter) (*AggregatedFeedback, error)
    Export(ctx context.Context, filter *FeedbackFilter) ([]*TrainingExample, error)
}

// PolicyOptimizer optimizes policies based on feedback
type PolicyOptimizer interface {
    Optimize(ctx context.Context, examples []*TrainingExample) ([]*PolicyUpdate, error)
    Apply(ctx context.Context, update *PolicyUpdate) error
    Rollback(ctx context.Context, updateID string) error
    GetHistory(ctx context.Context, limit int) ([]*PolicyUpdate, error)
    GetCurrentPolicy() string
    SetCurrentPolicy(policy string)
}

SelfImprovementSystem Methods

func NewSelfImprovementSystem(config *SelfImprovementConfig, logger *logrus.Logger) *SelfImprovementSystem
func (sis *SelfImprovementSystem) Initialize(provider LLMProvider, debateService DebateService) error
func (sis *SelfImprovementSystem) Start() error
func (sis *SelfImprovementSystem) Stop()
func (sis *SelfImprovementSystem) CollectFeedback(ctx context.Context, feedback *Feedback) error
func (sis *SelfImprovementSystem) CollectAutoFeedback(ctx, sessionID, promptID, prompt, response, provider, model string) (*Feedback, error)
func (sis *SelfImprovementSystem) ScoreResponse(ctx context.Context, prompt, response string) (float64, error)
func (sis *SelfImprovementSystem) CompareResponses(ctx context.Context, prompt, r1, r2 string) (*PreferencePair, error)
func (sis *SelfImprovementSystem) GetFeedbackStats(ctx context.Context, filter *FeedbackFilter) (*AggregatedFeedback, error)
func (sis *SelfImprovementSystem) GetPolicyHistory(ctx context.Context, limit int) ([]*PolicyUpdate, error)
func (sis *SelfImprovementSystem) RollbackPolicy(ctx context.Context, updateID string) error

Usage Examples

Basic setup and auto-feedback collection

config := selfimprove.DefaultSelfImprovementConfig()
config.AutoCollectFeedback = true
config.UseDebateForReward = true

system := selfimprove.NewSelfImprovementSystem(config, logger)
err := system.Initialize(llmProvider, debateService)
system.Start() // starts background optimization loop

// Auto-collect feedback on every response
feedback, err := system.CollectAutoFeedback(ctx,
    sessionID, promptID, prompt, response, "claude", "claude-3-sonnet")

// Score a response manually
score, err := system.ScoreResponse(ctx, prompt, response)

// Compare two responses (DPO-style)
pair, err := system.CompareResponses(ctx, prompt, responseA, responseB)
// pair.Chosen, pair.Rejected, pair.Margin

system.Stop()

Manual feedback collection with filtering

collector := selfimprove.NewInMemoryFeedbackCollector(logger, 10000)
collector.Collect(ctx, &selfimprove.Feedback{
    SessionID: "session-1",
    PromptID:  "prompt-1",
    Type:      selfimprove.FeedbackTypePositive,
    Source:    selfimprove.FeedbackSourceHuman,
    Score:     0.9,
    Dimensions: map[selfimprove.DimensionType]float64{
        selfimprove.DimensionAccuracy:    0.95,
        selfimprove.DimensionHelpfulness: 0.85,
    },
})

stats, _ := collector.GetAggregated(ctx, &selfimprove.FeedbackFilter{
    Sources:  []selfimprove.FeedbackSource{selfimprove.FeedbackSourceHuman},
    MinScore: float64Ptr(0.5),
})

Configuration

type SelfImprovementConfig struct {
    RewardModelProvider      string        // LLM provider for scoring (default: "claude")
    RewardModelName          string        // Model name (default: "claude-3-sonnet")
    MinRewardThreshold       float64       // Minimum score threshold (default: 0.5)
    AutoCollectFeedback      bool          // Enable auto AI feedback (default: true)
    FeedbackBatchSize        int           // Batch size for processing (default: 100)
    MinConfidenceForAuto     float64       // Min confidence for auto-feedback (default: 0.8)
    OptimizationInterval     time.Duration // How often to run optimization (default: 24h)
    MinExamplesForUpdate     int           // Min examples before optimizing (default: 50)
    MaxPolicyUpdatesPerDay   int           // Daily update limit (default: 3)
    ConstitutionalPrinciples []string      // Safety principles for self-critique
    EnableSelfCritique       bool          // Filter updates against principles (default: true)
    UseDebateForReward       bool          // Use debate ensemble for scoring (default: true)
    UseDebateForOptimize     bool          // Use debate for optimization (default: true)
    MaxBufferSize            int           // Max feedback buffer size (default: 10000)
}

Dimension Weights (Default)

Dimension	Weight
Accuracy	0.25
Relevance	0.20
Helpfulness	0.20
Harmlessness	0.15
Honesty	0.10
Coherence	0.10

Testing

go build ./...
go test ./... -count=1 -race

Integration with HelixAgent

SelfImprove connects to HelixAgent through the adapter layer at internal/adapters/selfimprove/. The main integration points are:

Debate Service: The DebateServiceAdapter wraps HelixAgent's debate service to enable multi-LLM evaluation and comparison of responses.
LLMsVerifier: Provider trust scores from the verifier influence which feedback is prioritized and which providers are considered most reliable for reward evaluation.
Background Optimization: The system runs as a background service alongside HelixAgent, periodically analyzing collected feedback and updating system prompts.
Policy Management: Policy updates (system prompt refinements) are applied with rollback capability, maintaining a complete history of all changes.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
Upstreams		Upstreams
challenges/scripts		challenges/scripts
docs		docs
pkg/i18n		pkg/i18n
scripts/host-power-management		scripts/host-power-management
selfimprove		selfimprove
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CONSTITUTION.md		CONSTITUTION.md
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SelfImprove

Overview

Architecture

Package Structure

Source Files

API Reference

Types

Core Interfaces

SelfImprovementSystem Methods

Usage Examples

Basic setup and auto-feedback collection

Manual feedback collection with filtering

Configuration

Dimension Weights (Default)

Testing

Integration with HelixAgent

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SelfImprove

Overview

Architecture

Package Structure

Source Files

API Reference

Types

Core Interfaces

SelfImprovementSystem Methods

Usage Examples

Basic setup and auto-feedback collection

Manual feedback collection with filtering

Configuration

Dimension Weights (Default)

Testing

Integration with HelixAgent

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages