129 changes: 67 additions & 62 deletions docs/metrics/metrics.md
@@ -214,121 +214,126 @@ Metrics for zkproof-worker are to be added in future releases, if/when needed. C

#### Metric Name: `kms_connector_gw_listener_event_received_counter`
- **Type**: Counter
- **Labels**:
- `event_type`: can be used to filter by event type (public_decryption_request, user_decryption_request, crsgen_request, ...).
- **Description**: Counts the number of events received by the GW listener.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Alarm**: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.
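The per-`event_type` flat-line check above can be sketched in plain Rust. This is only an illustration of the alarm logic, not how Prometheus or the connector evaluates it: `Snapshot` and `flat_event_types` are hypothetical names, two snapshots stand in for the start and end of the 1 minute window, and counter resets (which PromQL's `increase` compensates for) are simply treated as no increase.

```rust
use std::collections::HashMap;

/// Cumulative counter values keyed by `event_type`, as one scrape might report them.
/// (Hypothetical type, introduced for this sketch only.)
type Snapshot = HashMap<String, u64>;

/// Returns the watched event types whose counter did not increase between
/// two snapshots. A label missing from a snapshot is treated as 0.
fn flat_event_types(before: &Snapshot, after: &Snapshot, watched: &[&str]) -> Vec<String> {
    let mut flat: Vec<String> = watched
        .iter()
        .filter(|ty| {
            let old = before.get(**ty).copied().unwrap_or(0);
            let new = after.get(**ty).copied().unwrap_or(0);
            // Simplification: a counter reset counts as "no increase".
            new.saturating_sub(old) == 0
        })
        .map(|ty| ty.to_string())
        .collect();
    flat.sort();
    flat
}

fn main() {
    let before = Snapshot::from([
        ("public_decryption_request".to_string(), 10),
        ("user_decryption_request".to_string(), 4),
        ("crsgen_request".to_string(), 1),
    ]);
    let after = Snapshot::from([
        ("public_decryption_request".to_string(), 15),
        ("user_decryption_request".to_string(), 4),
        ("crsgen_request".to_string(), 1),
    ]);
    // Only the two decryption event types are watched; `crsgen_request`
    // is infrequent, so a flat line there is not treated as an alarm.
    let watched = ["public_decryption_request", "user_decryption_request"];
    println!("{:?}", flat_event_types(&before, &after, &watched));
}
```

Restricting the check to the decryption event types avoids alarms on event types that are expected to be rare.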

#### Metric Name: `kms_connector_gw_listener_event_received_errors`
- **Type**: Counter
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of errors encountered by the GW listener while receiving events.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

#### Metric Name: `kms_connector_gw_listener_event_stored_counter`
- **Type**: Counter
- **Description**: Counts the number of events successfully stored in the DB by the GW listener.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

#### Metric Name: `kms_connector_gw_listener_event_storage_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the GW listener while storing events in the DB.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

### kms-worker

#### Metric Name: `kms_connector_worker_event_received_counter`
- **Type**: Counter
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of events received by the KMS worker.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Alarm**: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.

#### Metric Name: `kms_connector_worker_event_received_errors`
- **Type**: Counter
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of errors encountered while listening for events in the KMS worker.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.
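The reason for wrapping the expression in `sum(...)` once the counter carries an `event_type` label can be shown with a small Rust sketch. It assumes two snapshots of the per-label counter values taken one minute apart; `total_increase` and `error_alarm` are hypothetical helpers written for this illustration, and counter resets are ignored.

```rust
use std::collections::HashMap;

/// Sums the per-label increases between two snapshots of a labeled counter.
/// A label missing from `before` is treated as starting at 0.
fn total_increase(before: &HashMap<String, u64>, after: &HashMap<String, u64>) -> u64 {
    after
        .iter()
        .map(|(label, new)| {
            let old = before.get(label).copied().unwrap_or(0);
            new.saturating_sub(old)
        })
        .sum()
}

/// Fires when the summed increase across all labels exceeds `threshold`.
fn error_alarm(before: &HashMap<String, u64>, after: &HashMap<String, u64>, threshold: u64) -> bool {
    total_increase(before, after) > threshold
}

fn main() {
    let before = HashMap::from([
        ("public_decryption_request".to_string(), 100),
        ("user_decryption_request".to_string(), 200),
    ]);
    let after = HashMap::from([
        ("public_decryption_request".to_string(), 140), // +40 errors
        ("user_decryption_request".to_string(), 230),   // +30 errors
    ]);
    // Neither label exceeds 60 on its own, but the sum (70) does.
    println!("total increase: {}", total_increase(&before, &after));
    println!("alarm fires: {}", error_alarm(&before, &after, 60));
}
```

Without the sum, each label would be compared against the threshold separately, so errors spread over several event types could stay below it.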

#### Metric Name: `kms_connector_worker_decryption_request_sent_counter`
#### Metric Name: `kms_connector_worker_grpc_request_sent_counter`
- **Type**: Counter
- **Description**: Counts the number of decryption requests sent by the KMS worker to the KMS core.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of successful gRPC requests sent by the KMS worker to the KMS Core.
- **Alarm**: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.

#### Metric Name: `kms_connector_worker_decryption_request_sent_errors`
#### Metric Name: `kms_connector_worker_grpc_request_sent_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the KMS worker while sending decryption requests to the KMS core.
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of errors encountered by the KMS worker while sending gRPC requests to the KMS Core.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

#### Metric Name: `kms_connector_worker_decryption_response_counter`
#### Metric Name: `kms_connector_worker_grpc_response_polled_counter`
- **Type**: Counter
- **Description**: Counts the number of decryption responses received by the KMS worker from the KMS core.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of responses successfully polled from the KMS Core via gRPC.
- **Alarm**: If the counter is a flat line over a period of time, only for `event_type` `public_decryption_request` and `user_decryption_request`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{event_type="..."}[1m]) == 0`.

#### Metric Name: `kms_connector_worker_decryption_response_errors`
#### Metric Name: `kms_connector_worker_grpc_response_polled_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the KMS worker while receiving decryption responses from the KMS core.
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
- **Description**: Counts the number of errors encountered by the KMS worker while polling responses from the KMS Core.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

#### Metric Name: `kms_connector_worker_key_management_request_sent_counter`
- **Type**: Counter
- **Description**: Counts the number of key management requests sent by the KMS worker to the KMS core.
- **Alarm**: N/A - key management requests are infrequent events.

#### Metric Name: `kms_connector_worker_key_management_request_sent_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the KMS worker while sending key management requests to the KMS core.
- **Alarm**: If the counter increases from 0. Key management is an important event that should not fail.
- **Recommendation**: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.

#### Metric Name: `kms_connector_worker_key_management_response_counter`
- **Type**: Counter
- **Description**: Counts the number of key management responses received by the KMS worker from the KMS core.
- **Alarm**: N/A - key management responses are infrequent events.

#### Metric Name: `kms_connector_worker_key_management_response_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the KMS worker while receiving key management responses from the KMS core.
- **Alarm**: If the counter increases from 0. Key management is an important event that should not fail.
- **Recommendation**: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

#### Metric Name: `kms_connector_worker_s3_ciphertext_retrieval_counter`
- **Type**: Counter
- **Description**: Counts the number of ciphertexts retrieved by the KMS worker from S3.
- **Alarm**: N/A - key management events are infrequent.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

#### Metric Name: `kms_connector_worker_s3_ciphertext_retrieval_errors`
- **Type**: Counter
- **Description**: Counts the number of errors encountered by the KMS worker while retrieving ciphertexts from S3.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

### tx-sender

#### Metric Name: `kms_connector_tx_sender_response_received_counter`
- **Type**: Counter
- **Labels**:
- `response_type`: can be used to filter by response type (public_decryption_response, user_decryption_response, crsgen_response, ...).
- **Description**: Counts the number of responses received by the TX sender.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Alarm**: If the counter is a flat line over a period of time, only for `response_type` `public_decryption_response` and `user_decryption_response`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{response_type = "..."}[1m]) == 0`.

#### Metric Name: `kms_connector_tx_sender_response_received_errors`
- **Type**: Counter
- **Labels**:
- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
- **Description**: Counts the number of errors encountered by the TX sender while listening for responses.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

#### Metric Name: `kms_connector_tx_sender_gateway_tx_sent_counter`
- **Type**: Counter
- **Labels**:
- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
- **Description**: Counts the number of transactions sent to the Gateway by the TX sender.
- **Alarm**: If the counter is a flat line over a period of time.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
- **Alarm**: If the counter is a flat line over a period of time, only for `response_type` `public_decryption_response` and `user_decryption_response`.
- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter{response_type = "..."}[1m]) == 0`.

#### Metric Name: `kms_connector_tx_sender_gateway_tx_sent_errors`
- **Type**: Counter
- **Labels**:
- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
- **Description**: Counts the number of errors encountered by the TX sender while sending transactions to the Gateway.
- **Alarm**: If the counter increases over a period of time.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
- **Recommendation**: more than 60 failures in 1 minute, i.e. `sum(increase(counter[1m])) > 60`.

#### Metric Name: `kms_connector_pending_events`
- **Type**: Gauge
- **Labels**:
- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter) (only available for decryption right now!)
- **Description**: Tracks the number of Gateway events not yet processed in the kms-connector's DB.
- **Alarm**: Need more experience with this metric first.

#### Metric Name: `kms_connector_pending_responses`
- **Type**: Gauge
- **Labels**:
- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter) (only available for decryption right now!)
- **Description**: Tracks the number of KMS responses not yet sent to the Gateway in the kms-connector's DB.
- **Alarm**: Need more experience with this metric first.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion kms-connector/Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion kms-connector/Cargo.toml
@@ -18,7 +18,7 @@ gw-listener.path = "crates/gw-listener"
kms-worker.path = "crates/kms-worker"
tx-sender.path = "crates/tx-sender"
connector-utils.path = "crates/utils"
fhevm_gateway_bindings = { git = "https://github.com/zama-ai/fhevm.git", tag = "v0.10.0-2", default-features = false }
fhevm_gateway_bindings = { git = "https://github.com/zama-ai/fhevm.git", tag = "v0.10.0", default-features = false }
kms-grpc = { git = "https://github.com/zama-ai/kms.git", tag = "v0.12.4", default-features = true }
bc2wrap = { git = "https://github.com/zama-ai/kms.git", tag = "v0.12.4", default-features = true }
tfhe = "=1.4.0-alpha.3"
4 changes: 4 additions & 0 deletions kms-connector/config/tx-sender.toml
@@ -70,6 +70,10 @@ private_key = "8da4ef21b864d2cc526dbdb2a120bd2874c36c9d0a1fb7f8c63d7f7a8b41de8f"
# ENV: KMS_CONNECTOR_MONITORING_ENDPOINT
# monitoring_endpoint = "0.0.0.0:9100"

# The interval between updates of gauge metrics (optional, defaults to 10s)
# ENV: KMS_CONNECTOR_GAUGE_UPDATE_INTERVAL_SECS
# gauge_update_interval_secs = 10

# The timeout to perform each external service connection healthcheck (optional, defaults to 3s)
# ENV: KMS_CONNECTOR_HEALTHCHECK_TIMEOUT_SECS
# healthcheck_timeout_secs = 3
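How an optional setting with a default, such as `gauge_update_interval_secs`, might be resolved is sketched below. The connector's actual config loading is not part of this diff, so `gauge_update_interval` and `DEFAULT_GAUGE_UPDATE_INTERVAL_SECS` are hypothetical; only the environment variable name and the 10s default come from the config comments above.

```rust
use std::num::ParseIntError;
use std::time::Duration;

/// Default taken from the config comment ("defaults to 10s").
const DEFAULT_GAUGE_UPDATE_INTERVAL_SECS: u64 = 10;

/// Resolves the gauge update interval from an optional raw string value.
/// `None` (setting absent) falls back to the default; an unparsable
/// value is reported as an error rather than silently ignored.
fn gauge_update_interval(raw: Option<&str>) -> Result<Duration, ParseIntError> {
    match raw {
        Some(value) => value.trim().parse::<u64>().map(Duration::from_secs),
        None => Ok(Duration::from_secs(DEFAULT_GAUGE_UPDATE_INTERVAL_SECS)),
    }
}

fn main() {
    // Assumption: the value is read directly from the environment variable
    // named in the config file comment.
    let raw = std::env::var("KMS_CONNECTOR_GAUGE_UPDATE_INTERVAL_SECS").ok();
    match gauge_update_interval(raw.as_deref()) {
        Ok(interval) => println!("gauge update interval: {interval:?}"),
        Err(err) => eprintln!("invalid gauge update interval: {err}"),
    }
}
```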
12 changes: 8 additions & 4 deletions kms-connector/crates/gw-listener/src/core/gw_listener.rs
@@ -3,7 +3,7 @@ use crate::{
core::{publish::update_last_block_polled, publish_event},
monitoring::{
health::State,
metrics::{EVENT_RECEIVED_COUNTER, EVENT_RECEIVED_ERRORS, EVENT_STORAGE_ERRORS},
metrics::{EVENT_RECEIVED_COUNTER, EVENT_RECEIVED_ERRORS},
},
};
use alloy::{
@@ -198,14 +198,19 @@ where
match events.next().await {
Some(Ok((event, log))) => {
*last_block = log.block_number;
EVENT_RECEIVED_COUNTER.inc();
EVENT_RECEIVED_COUNTER
.with_label_values(&[event_type.as_str()])
.inc();

let db = self.db_pool.clone();
spawn_with_limit(handle_gateway_event(db, event.into(), log.block_number))
.await;
}
Some(Err(err)) => {
error!("Error while listening for {event_type} events: {err}");
EVENT_RECEIVED_ERRORS.inc();
EVENT_RECEIVED_ERRORS
.with_label_values(&[event_type.as_str()])
.inc();
continue;
}
None => break error!("Alloy Provider was dropped for {event_type}"),
@@ -274,7 +279,6 @@ async fn handle_gateway_event(
);
if let Err(err) = publish_event(&db_pool, event, block_number).await {
error!("Failed to publish event: {err}");
EVENT_STORAGE_ERRORS.inc();
}
}

2 changes: 0 additions & 2 deletions kms-connector/crates/gw-listener/src/core/publish.rs
@@ -1,4 +1,3 @@
use crate::monitoring::metrics::EVENT_STORED_COUNTER;
use alloy::primitives::U256;
use anyhow::anyhow;
use connector_utils::{
@@ -44,7 +43,6 @@ pub async fn publish_event(

if query_result.rows_affected() == 1 {
info!("Event successfully stored in DB!");
EVENT_STORED_COUNTER.inc();
} else {
warn!("Unexpected query result while publishing event: {query_result:?}");
}
32 changes: 9 additions & 23 deletions kms-connector/crates/gw-listener/src/monitoring/metrics.rs
@@ -1,34 +1,20 @@
use prometheus::{IntCounter, register_int_counter};
use prometheus::{IntCounterVec, register_int_counter_vec};
use std::sync::LazyLock;

pub static EVENT_RECEIVED_COUNTER: LazyLock<IntCounter> = LazyLock::new(|| {
register_int_counter!(
pub static EVENT_RECEIVED_COUNTER: LazyLock<IntCounterVec> = LazyLock::new(|| {
register_int_counter_vec!(
"kms_connector_gw_listener_event_received_counter",
"Number of events received by the GatewayListener"
"Number of events received by the GatewayListener",
&["event_type"]
)
.unwrap()
});

pub static EVENT_RECEIVED_ERRORS: LazyLock<IntCounter> = LazyLock::new(|| {
register_int_counter!(
pub static EVENT_RECEIVED_ERRORS: LazyLock<IntCounterVec> = LazyLock::new(|| {
register_int_counter_vec!(
"kms_connector_gw_listener_event_received_errors",
"Number of errors encountered by the GatewayListener while receiving events"
)
.unwrap()
});

pub static EVENT_STORED_COUNTER: LazyLock<IntCounter> = LazyLock::new(|| {
register_int_counter!(
"kms_connector_gw_listener_event_stored_counter",
"Number of events stored in DB by the GatewayListener"
)
.unwrap()
});

pub static EVENT_STORAGE_ERRORS: LazyLock<IntCounter> = LazyLock::new(|| {
register_int_counter!(
"kms_connector_gw_listener_event_storage_errors",
"Number of errors encountered by the GatewayListener while trying to store events in DB"
"Number of errors encountered by the GatewayListener while receiving events",
&["event_type"]
)
.unwrap()
});
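The move from a single counter to a counter keyed by `event_type` can be illustrated with a std-only sketch. `LabeledCounter` is a hypothetical stand-in written for this example; it is not the `prometheus` crate's `IntCounterVec` and does no registration or exposition. It only shows the idea of one lazily initialized static holding a separate count per label value.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

/// Hypothetical stand-in for a labeled counter family: one count per label value.
pub struct LabeledCounter {
    values: Mutex<HashMap<String, u64>>,
}

impl LabeledCounter {
    pub fn new() -> Self {
        Self { values: Mutex::new(HashMap::new()) }
    }

    /// Increments the count for `label`, creating it at 0 on first use.
    pub fn inc(&self, label: &str) {
        let mut values = self.values.lock().unwrap();
        *values.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Returns the current count for `label` (0 if never incremented).
    pub fn get(&self, label: &str) -> u64 {
        let values = self.values.lock().unwrap();
        values.get(label).copied().unwrap_or(0)
    }
}

/// Lazily initialized on first access, mirroring the `LazyLock` pattern in the diff.
pub static EVENT_RECEIVED: LazyLock<LabeledCounter> = LazyLock::new(|| LabeledCounter::new());

fn main() {
    EVENT_RECEIVED.inc("public_decryption_request");
    EVENT_RECEIVED.inc("public_decryption_request");
    EVENT_RECEIVED.inc("user_decryption_request");
    println!("public: {}", EVENT_RECEIVED.get("public_decryption_request"));
    println!("user: {}", EVENT_RECEIVED.get("user_decryption_request"));
}
```

Keeping the counts separate per label is what makes the per-`event_type` alarms in the metrics documentation possible.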