Commit 8a4cd6d

docs(kms-connector): update metrics documentation
1 parent 345a971 commit 8a4cd6d

2 files changed: +62 -57 lines changed


docs/metrics/metrics.md

Lines changed: 61 additions & 56 deletions
@@ -214,121 +214,126 @@ Metrics for zkproof-worker are to be added in future releases, if/when needed. C

 #### Metric Name: `kms_connector_gw_listener_event_received_counter`
 - **Type**: Counter
+- **Labels**:
+- `event_type`: can be used to filter by event type (public_decryption_request, user_decryption_request, crsgen_request, ...).
 - **Description**: Counts the number of events received by the GW listener.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

 #### Metric Name: `kms_connector_gw_listener_event_received_errors`
 - **Type**: Counter
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
 - **Description**: Counts the number of errors encountered by the GW listener while receiving events.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
-
-#### Metric Name: `kms_connector_gw_listener_event_stored_counter`
-- **Type**: Counter
-- **Description**: Counts the number of events successfully stored in the DB by the GW listener.
-- **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
-
-#### Metric Name: `kms_connector_gw_listener_event_storage_errors`
-- **Type**: Counter
-- **Description**: Counts the number of errors encountered by the GW listener while storing events in the DB.
-- **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

 ### kms-worker

 #### Metric Name: `kms_connector_worker_event_received_counter`
 - **Type**: Counter
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
 - **Description**: Counts the number of events received by the KMS worker.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

 #### Metric Name: `kms_connector_worker_event_received_errors`
 - **Type**: Counter
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
 - **Description**: Counts the number of errors encountered while listening for events in the KMS worker.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

-#### Metric Name: `kms_connector_worker_decryption_request_sent_counter`
+#### Metric Name: `kms_connector_worker_grpc_request_sent_counter`
 - **Type**: Counter
-- **Description**: Counts the number of decryption requests sent by the KMS worker to the KMS core.
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
+- **Description**: Number of successful GRPC requests sent by the KMS worker to the KMS Core.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

-#### Metric Name: `kms_connector_worker_decryption_request_sent_errors`
+#### Metric Name: `kms_connector_worker_grpc_request_sent_errors`
 - **Type**: Counter
-- **Description**: Counts the number of errors encountered by the KMS worker while sending decryption requests to the KMS core.
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
+- **Description**: Counts the number of errors encountered by the KMS worker while sending grpc requests to the KMS Core.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

-#### Metric Name: `kms_connector_worker_decryption_response_counter`
+#### Metric Name: `kms_connector_worker_grpc_response_polled_counter`
 - **Type**: Counter
-- **Description**: Counts the number of decryption responses received by the KMS worker from the KMS core.
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
+- **Description**: Counts the number of responses successfully polled from the KMS Core via GRPC.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

-#### Metric Name: `kms_connector_worker_decryption_response_errors`
+#### Metric Name: `kms_connector_worker_grpc_response_polled_errors`
 - **Type**: Counter
-- **Description**: Counts the number of errors encountered by the KMS worker while receiving decryption responses from the KMS core.
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter)
+- **Description**: Counts the number of errors encountered by the KMS worker while polling responses from the KMS Core.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
-
-#### Metric Name: `kms_connector_worker_key_management_request_sent_counter`
-- **Type**: Counter
-- **Description**: Counts the number of key management requests sent by the KMS worker to the KMS core.
-- **Alarm**: N/A - key management requests are infrequent events.
-
-#### Metric Name: `kms_connector_worker_key_management_request_sent_errors`
-- **Type**: Counter
-- **Description**: Counts the number of errors encountered by the KMS worker while sending key management requests to the KMS core.
-- **Alarm**: If the counter increases from 0. Key management is an important event that should not fail.
-- **Recommendation**: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
-
-#### Metric Name: `kms_connector_worker_key_management_response_counter`
-- **Type**: Counter
-- **Description**: Counts the number of key management responses received by the KMS worker from the KMS core.
-- **Alarm**: N/A - key management responses are infrequent events.
-
-#### Metric Name: `kms_connector_worker_key_management_response_errors`
-- **Type**: Counter
-- **Description**: Counts the number of errors encountered by the KMS worker while receiving key management responses from the KMS core.
-- **Alarm**: If the counter increases from 0. Key management is an important event that should not fail.
-- **Recommendation**: alarm on any failures over a 1 minute period, i.e. `increase(counter[1m]) > 0`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

 #### Metric Name: `kms_connector_worker_s3_ciphertext_retrieval_counter`
 - **Type**: Counter
 - **Description**: Counts the number of ciphertexts retrieved by the KMS worker from S3.
-- **Alarm**: N/A - key management events are infrequent.
+- **Alarm**: If the counter is a flat line over a period of time.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

 #### Metric Name: `kms_connector_worker_s3_ciphertext_retrieval_errors`
 - **Type**: Counter
 - **Description**: Counts the number of errors encountered by the KMS worker while retrieving ciphertexts from S3.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

 ### tx-sender

 #### Metric Name: `kms_connector_tx_sender_response_received_counter`
 - **Type**: Counter
+- **Labels**:
+- `response_type`: can be used to filter by response type (public_decryption_response, user_decryption_response, crsgen_response, ...).
 - **Description**: Counts the number of responses received by the TX sender.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

 #### Metric Name: `kms_connector_tx_sender_response_received_errors`
 - **Type**: Counter
+- **Labels**:
+- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
 - **Description**: Counts the number of errors encountered by the TX sender while listening for responses.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.

 #### Metric Name: `kms_connector_tx_sender_gateway_tx_sent_counter`
 - **Type**: Counter
+- **Labels**:
+- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
 - **Description**: Counts the number of transactions sent to the Gateway by the TX sender.
 - **Alarm**: If the counter is a flat line over a period of time.
-- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.
+- **Recommendation**: 0 for more than 1 minute, i.e. `increase(counter[1m]) == 0`.

 #### Metric Name: `kms_connector_tx_sender_gateway_tx_sent_errors`
 - **Type**: Counter
+- **Labels**:
+- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter)
 - **Description**: Counts the number of errors encountered by the TX sender while sending transactions to the Gateway.
 - **Alarm**: If the counter increases over a period of time.
-- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+- **Recommendation**: more than 60 failures in 1 minute, i.e. `increase(counter[1m]) > 60`.
+
+#### Metric Name: `kms_connector_pending_events`
+- **Type**: Gauge
+- **Labels**:
+- `event_type`: see [description](#metric-name-kms_connector_gw_listener_event_received_counter) (only available for decryption right now!)
+- **Description**: Tracks the number of Gateway events not yet processed in the kms-connector's DB.
+- **Alarm**: Need more experience with this metric first.
+
+#### Metric Name: `kms_connector_pending_responses`
+- **Type**: Gauge
+- **Labels**:
+- `response_type`: see [description](#metric-name-kms_connector_tx_sender_response_received_counter) (only available for decryption right now!)
+- **Description**: Tracks the number of KMS responses not yet sent to the Gateway in the kms-connector's DB.
+- **Alarm**: Need more experience with this metric first.
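
The recommendations in this file write `counter` as a placeholder for the metric name. As a rough sketch only, two of them could be expressed as Prometheus alerting rules along these lines. The group name, alert names, `for` duration, `severity` labels, and the choice to `sum` across `event_type` are assumptions made for illustration and are not part of this commit.

```yaml
# Hypothetical rule file: names, durations, and labels are illustrative assumptions.
groups:
  - name: kms-connector-example
    rules:
      # Flat-line alarm: the GW listener received no events during the last minute.
      - alert: KmsConnectorGwListenerNoEvents
        # sum() aggregates over event_type so that rare event types
        # (for example crsgen_request) do not trigger the alert on their own.
        expr: sum(increase(kms_connector_gw_listener_event_received_counter[1m])) == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "GW listener received no events in the last minute"
      # Error alarm: more than 60 receive errors within one minute.
      - alert: KmsConnectorGwListenerReceiveErrors
        expr: sum(increase(kms_connector_gw_listener_event_received_errors[1m])) > 60
        labels:
          severity: critical
        annotations:
          summary: "GW listener hit more than 60 receive errors in one minute"
```

If a series does not exist at all, these expressions return no result and the alert stays silent; PromQL's `absent()` function can be used to cover that case separately.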

kms-connector/crates/gw-listener/src/monitoring/metrics.rs

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ pub static EVENT_RECEIVED_COUNTER: LazyLock<IntCounterVec> = LazyLock::new(|| {
 pub static EVENT_RECEIVED_ERRORS: LazyLock<IntCounterVec> = LazyLock::new(|| {
     register_int_counter_vec!(
         "kms_connector_gw_listener_event_received_errors",
-        "Number of errors encountered by the GatewayListener while listening for Gateway events",
+        "Number of errors encountered by the GatewayListener while receiving events",
         &["event_type"]
     )
     .unwrap()
