@lionakhnazarov lionakhnazarov commented Dec 17, 2025

The Keep Core node now exposes 31+ performance metrics via the /metrics endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.

Integrated Metrics by Category

1. DKG (Distributed Key Generation) Metrics (6 metrics)

Counters:

  • performance_dkg_joined_total - Total number of DKG joins (members joined)
  • performance_dkg_failed_total - Total number of failed DKG executions
  • performance_dkg_validation_total - Total number of DKG result validations performed
  • performance_dkg_challenges_submitted_total - Total number of DKG challenges submitted on-chain
  • performance_dkg_approvals_submitted_total - Total number of DKG approvals submitted on-chain

Duration Metrics:

  • performance_dkg_duration_seconds - Average duration of DKG operations
  • performance_dkg_duration_seconds_count - Total count of DKG operations

Performance Insights:

  • Success Rate: dkg_joined_total / (dkg_joined_total + dkg_failed_total) - Monitor DKG participation and success rates
  • Duration Monitoring: Alert if dkg_duration_seconds exceeds 300 seconds (5 minutes) - indicates slow DKG operations
  • On-chain Activity: Track dkg_challenges_submitted_total and dkg_approvals_submitted_total to monitor dispute resolution activity
  • Validation Rate: High dkg_validation_total relative to joins indicates active validation of DKG results
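The success-rate formula above can be sketched as a small helper. This is only an illustration: the function name is made up here, and the inputs are assumed to be the current values of performance_dkg_joined_total and performance_dkg_failed_total read from the /metrics endpoint.

```go
package main

// successRate returns joined / (joined + failed) as a fraction in [0, 1].
// The second return value is false when no DKG attempts were recorded,
// so callers can tell "no data yet" apart from a 0% success rate.
func successRate(joined, failed float64) (float64, bool) {
	total := joined + failed
	if total == 0 {
		return 0, false
	}
	return joined / total, true
}
```

Returning the extra boolean avoids a division by zero on a freshly started node that has not taken part in any DKG yet.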

2. Signing Operations Metrics (5 metrics)

Counters:

  • performance_signing_operations_total - Total number of signing operations attempted
  • performance_signing_success_total - Total number of successful signing operations
  • performance_signing_failed_total - Total number of failed signing operations
  • performance_signing_timeouts_total - Total number of signing operations that timed out

Duration Metrics:

  • performance_signing_duration_seconds - Average duration of signing operations
  • performance_signing_duration_seconds_count - Total count of signing operations

Performance Insights:

  • Success Rate: signing_success_total / signing_operations_total - Critical metric for node reliability
  • Failure Rate: Alert if signing_failed_total grows by more than 10% of the growth in signing_operations_total over the same period
  • Timeout Rate: signing_timeouts_total / signing_operations_total - Indicates network or coordination issues
  • Performance: Alert if signing_duration_seconds exceeds 60 seconds - indicates slow signing operations
  • Throughput: Monitor signing_operations_total rate to understand signing workload
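Because the signing counters are cumulative, a failure-rate alert is usually evaluated over the difference between two scrapes rather than over the lifetime totals. A minimal sketch, assuming a hypothetical counterSample type that holds two values read from /metrics (neither the type nor the function exists in the PR):

```go
package main

// counterSample is a hypothetical snapshot of two cumulative counters
// read from the /metrics endpoint at one point in time.
type counterSample struct {
	Operations float64 // performance_signing_operations_total
	Failed     float64 // performance_signing_failed_total
}

// failureRateExceeded compares two snapshots and reports whether the share of
// failed operations between them is above threshold (for example 0.10).
// It returns false when no operations happened in the window or when a
// counter went backwards, which usually means the node restarted.
func failureRateExceeded(prev, curr counterSample, threshold float64) bool {
	ops := curr.Operations - prev.Operations
	failed := curr.Failed - prev.Failed
	if ops <= 0 || failed < 0 {
		return false
	}
	return failed/ops > threshold
}
```

The same shape works for the timeout rate by swapping in performance_signing_timeouts_total.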

3. Wallet Dispatcher Metrics (7 metrics)

Counters:

  • performance_wallet_actions_total - Total number of wallet actions dispatched
  • performance_wallet_action_success_total - Total number of successfully completed wallet actions
  • performance_wallet_action_failed_total - Total number of failed wallet actions
  • performance_wallet_dispatcher_rejected_total - Total number of wallet actions rejected (wallet busy)
  • performance_wallet_heartbeat_failures_total - Total number of wallet heartbeat failures

Gauges:

  • performance_wallet_dispatcher_active_actions - Current number of wallets with active actions

Duration Metrics:

  • performance_wallet_action_duration_seconds - Average duration of wallet actions
  • performance_wallet_action_duration_seconds_count - Total count of wallet actions

Performance Insights:

  • Rejection Rate: wallet_dispatcher_rejected_total / wallet_actions_total - Alert if > 5%, which indicates wallet saturation
  • Success Rate: wallet_action_success_total / wallet_actions_total - Monitor wallet action reliability
  • Utilization: wallet_dispatcher_active_actions shows current wallet workload
  • Bottleneck Detection: High rejection rate + high active actions = wallet bottleneck
  • Health Monitoring: wallet_heartbeat_failures_total indicates wallet connectivity issues
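The bottleneck rule above (high rejection rate combined with high active actions) could be checked roughly as follows. The walletSnapshot type, the function name, and the idea of passing a minimum active-action count are assumptions for illustration; the description gives 5% for the rejection rate but no number for "high active actions".

```go
package main

// walletSnapshot is a hypothetical view of the wallet dispatcher metrics.
// Actions and Rejected are counter increases over some window;
// ActiveActions is the current value of the gauge.
type walletSnapshot struct {
	Actions       float64 // performance_wallet_actions_total
	Rejected      float64 // performance_wallet_dispatcher_rejected_total
	ActiveActions float64 // performance_wallet_dispatcher_active_actions
}

// walletBottleneck reports whether the rejection rate is above
// maxRejectionRate while at least minActive wallets are busy.
func walletBottleneck(s walletSnapshot, maxRejectionRate, minActive float64) bool {
	if s.Actions <= 0 {
		return false // nothing dispatched, so no rate to evaluate
	}
	rejectionRate := s.Rejected / s.Actions
	return rejectionRate > maxRejectionRate && s.ActiveActions >= minActive
}
```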

4. Coordination Operations Metrics (4 metrics)

Counters:

  • performance_coordination_windows_detected_total - Total number of coordination windows detected
  • performance_coordination_procedures_executed_total - Total number of coordination procedures executed successfully
  • performance_coordination_failed_total - Total number of failed coordination procedures

Duration Metrics:

  • performance_coordination_duration_seconds - Average duration of coordination procedures
  • performance_coordination_duration_seconds_count - Total count of coordination procedures

Performance Insights:

  • Execution Rate: coordination_procedures_executed_total / coordination_windows_detected_total - Success rate of coordination
  • Failure Rate: Alert if coordination_failed_total grows by more than 5% of the growth in coordination_windows_detected_total over the same period
  • Window Detection: Monitor coordination_windows_detected_total to understand coordination frequency
  • Performance: Track coordination_duration_seconds to identify slow coordination operations

5. Network Operations Metrics (10 metrics)

Peer Connection Metrics:

  • performance_peer_connections_total - Total number of peer connections established
  • performance_peer_disconnections_total - Total number of peer disconnections

Message Metrics:

  • performance_message_broadcast_total - Total number of messages broadcast to the network
  • performance_message_received_total - Total number of messages received from the network

Queue Size Metrics (Gauges):

  • performance_incoming_message_queue_size - Current size of incoming message queue (with channel label)
  • performance_message_handler_queue_size - Current size of message handler queues (with channel and handler labels)

Ping Test Metrics:

  • performance_ping_test_total - Total number of ping tests performed
  • performance_ping_test_success_total - Total number of successful ping tests
  • performance_ping_test_failed_total - Total number of failed ping tests
  • performance_ping_test_duration_seconds - Average duration of ping tests
  • performance_ping_test_duration_seconds_count - Total count of ping tests

Performance Insights:

  • Network Health: peer_connections_total vs peer_disconnections_total - Monitor connection stability
  • Message Throughput: Track message_broadcast_total and message_received_total rates
  • Queue Backlog: Alert if incoming_message_queue_size > 3000 (roughly 73% of the 4096 capacity) - indicates message processing bottleneck
  • Handler Backlog: Alert if message_handler_queue_size > 400 (roughly 78% of the 512 capacity) - indicates handler saturation
  • Network Latency: ping_test_duration_seconds shows network round-trip time
  • Connectivity: Alert if ping_test_failed_total grows by more than 10% of the growth in ping_test_total over the same period - indicates network issues
  • Message Balance: Compare broadcast vs received to detect message loss

6. System-Level Metrics (3 metrics)

  • CPU Utilization: Estimated CPU utilization based on goroutine count and GC activity
  • Memory Usage: Current allocated memory (heap) in bytes
  • Goroutine Count: Current number of active goroutines
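The memory and goroutine values can be read from the Go runtime directly. The sketch below shows one way to do that; whether the PR uses HeapAlloc or a different MemStats field is an assumption, the type and function names are invented for illustration, and the CPU estimate is left out because its formula is not given in the description.

```go
package main

import "runtime"

// systemSnapshot is a hypothetical container for two of the system metrics.
type systemSnapshot struct {
	HeapAllocBytes uint64 // bytes of allocated heap objects
	Goroutines     int    // number of goroutines that currently exist
}

// readSystemSnapshot samples the Go runtime once.
// runtime.ReadMemStats briefly stops the world, so it should be called
// on a timer (for example every few seconds) rather than per request.
func readSystemSnapshot() systemSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return systemSnapshot{
		HeapAllocBytes: m.HeapAlloc,
		Goroutines:     runtime.NumGoroutine(),
	}
}
```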

- Introduced a new performance metrics system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities.
- Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled.
- Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations.
- Added documentation detailing implemented metrics and their usage.

This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
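
The "optional integration" idea from the commit message above can be illustrated with a recorder interface and a no-op fallback. Everything in this sketch is hypothetical: the real interface in the PR may have different method names and signatures (only IncrementCounter is mentioned elsewhere in this conversation), and signWithMetrics is just a stand-in for an instrumented operation.

```go
package main

import "time"

// Recorder is a hypothetical stand-in for the metrics interface.
type Recorder interface {
	IncrementCounter(name string, delta float64)
	RecordDuration(name string, d time.Duration)
}

// noopRecorder discards everything, so disabled metrics cost almost nothing.
type noopRecorder struct{}

func (noopRecorder) IncrementCounter(string, float64)     {}
func (noopRecorder) RecordDuration(string, time.Duration) {}

// countingRecorder keeps counters in memory. It is not safe for concurrent
// use and exists only to make the example observable.
type countingRecorder struct {
	counters map[string]float64
}

func (c *countingRecorder) IncrementCounter(name string, delta float64) {
	c.counters[name] += delta
}
func (c *countingRecorder) RecordDuration(string, time.Duration) {}

// signWithMetrics wraps an operation with the signing counters.
// A nil recorder is replaced by the no-op implementation.
func signWithMetrics(r Recorder, sign func() error) error {
	if r == nil {
		r = noopRecorder{}
	}
	start := time.Now()
	r.IncrementCounter("performance_signing_operations_total", 1)
	err := sign()
	r.RecordDuration("performance_signing_duration_seconds", time.Since(start))
	if err != nil {
		r.IncrementCounter("performance_signing_failed_total", 1)
		return err
	}
	r.IncrementCounter("performance_signing_success_total", 1)
	return nil
}
```

With this shape, callers never need to nil-check the recorder themselves, and every error path records a failure exactly once.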
@lionakhnazarov lionakhnazarov marked this pull request as ready for review December 31, 2025 18:43
- Introduced performance metrics for the deposit and redemption processes, including execution and proof submission metrics.
- Updated the .gitignore file to exclude new directories: data/, logs/, and storage/.
- Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks.
- Added documentation outlining the new metrics and their implementation details.

@jose-blockchain jose-blockchain left a comment

Updated recommendations:

  1. Fix the deadlock in wallet.go before merge - it is confirmed and will freeze the node if triggered
  2. Add context cancellation to monitorQueueSizes - a minor resource leak, not urgent but good to fix
  3. Document that the metrics endpoint should be firewalled - standard practice, just worth noting in the docs

The code doesn't introduce direct vulnerabilities such as injection or auth bypass, and the metrics are useful operational data that node operators need. Just ensure port 9601 isn't exposed publicly (standard practice for any metrics endpoint).

- Updated the performance metrics initialization to accept an existing instance, preventing duplicate registrations.
- Improved error handling in the metrics observer to log duplicate registrations at the debug level instead of warnings.
- Added a method to periodically observe gauge metrics, ensuring better monitoring capabilities.
- Updated the performance metrics registration to include a suffix for duration metrics, enhancing consistency.
- Ensured that metrics recorder is set for all cached coordination executors, improving reliability in metric tracking.
- Streamlined the coordination procedure execution by eliminating redundant metric recordings, relying on the executor for accurate timing and failure counts.
- Implemented a new method to periodically collect and update system metrics, including CPU utilization, memory usage, and goroutine count.
…erver

- Simplified the registration of performance metrics by consolidating the source map creation and conditionally excluding the count metric for 'ping_test_duration_seconds'.
- Removed the unused gauge observation method, as gauge metrics are now automatically handled by the ObserveApplicationSource function.
…r's sign function for clarity and maintainability.
…on in tbtc.go for improved clarity and maintainability.
piotr-roslaniec and others added 6 commits January 14, 2026 16:44
Fix 15 issues identified by code review:

Data Race Fixes:
- channel_manager.go: move metricsRecorder assignment inside mutex lock
- wallet.go: add RWMutex to protect metricsRecorder concurrent access
- spv.go: add RWMutex for globalMetricsRecorder package-level variable
- libp2p.go: use atomic.Value for metricsRecorder in notifiee callbacks
- rpc_health.go: fix 3 data races by storing error locally before mutex unlock

Resource Leak Fixes:
- channel.go: prevent duplicate goroutines in setMetricsRecorder using sync.Once
- channel.go: fix handler queue metrics to include handler index suffix
- performance.go: add context cancellation for observeSystemMetrics goroutine

Logic and Documentation Fixes:
- start.go: add nil guards for all clientInfoRegistry usages
- signing.go: clarify error metrics recording comment
- performance.go: add overflow bucket for histogram durations >600s
- performance.go: update NewPerformanceMetrics to accept context parameter
- docs/performance-metrics.adoc: add missing coordination and relay entry count metrics
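
The overflow-bucket fix can be pictured with a small fixed-bucket histogram. This is a generic sketch, not the data structure in performance.go: the bucket bounds other than 600 seconds are made up, and the type is not safe for concurrent use.

```go
package main

// histogram counts observations per bucket. counts has one more slot than
// bounds; the extra last slot is the overflow bucket.
type histogram struct {
	bounds []float64 // ascending upper bounds in seconds
	counts []uint64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe puts the value into the first bucket whose bound is not exceeded.
// Values above the largest bound land in the overflow bucket instead of
// being dropped.
func (h *histogram) observe(seconds float64) {
	for i, b := range h.bounds {
		if seconds <= b {
			h.counts[i]++
			return
		}
	}
	h.counts[len(h.bounds)]++
}
```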

All changes validated with successful build.
- Fix double mutex lock in IncrementCounter for better performance
- Remove duplicate registry observer registrations (memory leak)
- Extract histogram magic keys (-1, -2) to named constants
- Move metrics recording outside messageHandlersMutex to prevent deadlock
- Ensure all error paths in sign() record failure metrics
- Replace hardcoded metric strings with constants across codebase
- Add comprehensive unit tests with race detection
- Remove unused metricsRecorder field from spvMaintainer struct
- Fix metric type inconsistencies in documentation (Counter→Gauge)
- Add sync.Once guard to RPCHealthChecker.Start() for concurrency safety
- Replace hardcoded metric name strings with defined constants
- Add explanatory comment for timeout metrics accuracy
- Fix flaky TestWatchCoordinationWindows test with proper synchronization
- Introduced new metrics for individual wallet actions: total, success, failed, and duration.
- Updated performance metrics registration to include these new metrics dynamically.
- Enhanced documentation to reflect the new per-action metrics structure and examples.
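
One of the fixes listed earlier moves metrics recording outside messageHandlersMutex to prevent a deadlock. The general pattern is to copy what is needed while holding the lock and to call the recorder only after releasing it. The registry type, the record callback, and the metric name below are all hypothetical.

```go
package main

import "sync"

// handlerRegistry is a hypothetical stand-in for a type guarded by a mutex.
type handlerRegistry struct {
	mu       sync.Mutex
	handlers []func(string)
	// record is a hypothetical metrics callback. It may call back into the
	// registry, so it must never run while mu is held.
	record func(name string, value float64)
}

// count returns the number of handlers; it takes the same mutex.
func (r *handlerRegistry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.handlers)
}

// addHandler appends a handler and reports the new total.
func (r *handlerRegistry) addHandler(h func(string)) {
	r.mu.Lock()
	r.handlers = append(r.handlers, h)
	n := len(r.handlers) // capture the value while the lock is held
	r.mu.Unlock()

	// Recording happens after Unlock. If it ran under the lock, a recorder
	// that calls count() would block forever, because sync.Mutex is not
	// reentrant.
	if r.record != nil {
		r.record("example_handlers_registered", float64(n))
	}
}
```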
@piotr-roslaniec piotr-roslaniec merged commit e22493d into threshold-network:main Jan 15, 2026
29 checks passed