@hloeffler
Contributor

Summary

  • Adds a new "Dashboards" page under the Monitoring section (src/multiplayer-servers/monitoring/dashboards.md)
  • Documents the "BBE Probes from Nodes" dashboard explaining:
    • What BlackBox Exporter (BBE) probes are and what targets they check (AWS, Azure, GCP, DNS servers)
    • Purpose: helps identify network issues vs application bugs
    • How to interpret the dashboard (red sections, short vs prolonged failures)
    • Best practices for handling network issues (retry logic, graceful termination)
  • Updates sidebar to add "Dashboards" as the 2nd item (between Introduction and Audit Logs)

Context

This addresses customer confusion about the BBE Probes dashboard (internal ref: CSM-165). The documentation explains what the dashboard shows and helps customers understand that short probe failures are usually nothing to worry about.

@hloeffler requested a review from a team as a code owner January 13, 2026 10:07
Copilot AI review requested due to automatic review settings January 13, 2026 10:07
Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation for the BBE (BlackBox Exporter) Probes dashboard to help customers understand network probe monitoring. The documentation explains what BBE probes are, what they monitor (cloud providers and DNS servers), how to interpret probe failures, and best practices for handling network issues in game servers.

Changes:

  • Added new "Dashboards" documentation page explaining BBE Probes from Nodes dashboard
  • Updated monitoring sidebar to include the new Dashboards page between Introduction and Audit Logs

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/multiplayer-servers/monitoring/dashboards.md | New documentation page explaining BBE probe monitoring, dashboard interpretation, and best practices for handling network issues |
| src/multiplayer-servers/monitoring/sidebar.json | Added "Dashboards" navigation item to the monitoring section sidebar |


### Best Practices

Nodes can occasionally experience network issues—100% reliability is not guaranteed. Game developers should implement their servers to be tolerant of network issues by:
Contributor

I know this is written by AI, but the em dash isn't the right punctuation here, or at least the sentence should probably be inverted/changed like this:

"Full network reliability is not guaranteed{,.} [Nn]odes can occasionally experience network issues."

Contributor

Please add a highlighted section that this dashboard is not causally consistent with network issues. It only provides indicators that specific routes from the server to the predefined targets might be disrupted. Network issues may occur despite no probes failing, and vice versa, game servers might not experience connectivity issues even though there are probes failing. Failing probes do not equal "network issues" per se.

Because of the vantage point of the probes, this is also only a local view. Probes towards cloud provider endpoints are:

a) highly selective: there is one public, global endpoint per cloud provider, with no regional or "other" paths to the targets. Those paths might not be disrupted to the same degree as the path to the global endpoint; this depends on their implementation.

b) aimed at individual cloud provider services, not the entire cloud platform, so they only give a selective view of which services might encounter issues. For example, failing probes towards AWS S3 do not mean that all traffic towards AWS experiences connection issues.

Contributor Author

thank you, will do


- **Red sections** indicate the timespan during which a probe failed.
- **Short probe failures** are usually nothing to worry about.
- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
Contributor

see above comment

- Add warning callout explaining probe results are not causally
  consistent with network issues
- Clarify that probes only test specific routes and cloud services,
  not entire platforms
- Restructure Best Practices section for clarity
- Improve list introduction wording per style guidelines
@hloeffler requested a review from zoomoid January 13, 2026 10:48
Comment on lines +8 to +12
This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).

### Purpose

Use this dashboard to quickly identify whether game server issues are caused by network connectivity problems to a particular cloud provider rather than bugs in your application code.
Member

Suggested change
Before:
This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
### Purpose
Use this dashboard to quickly identify whether game server issues are caused by network connectivity problems to a particular cloud provider rather than bugs in your application code.
After:
This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
This dashboard helps you determine whether game server incidents originate from cloud-provider connectivity issues rather than defects in the application.

IMO no need for a Purpose sub-section, just state the purpose directly

Comment on lines +14 to +28
### Interpreting the Dashboard

- **Red sections** indicate the timespan during which a probe failed.
- **Short probe failures** are usually nothing to worry about.
- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
- If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).

:::warning Probe results are not causally consistent with network issues
Failing probes do not necessarily indicate network issues, and network issues may occur even when all probes succeed. Probes only test specific routes from nodes to predefined targets.

The dashboard provides a limited view:

- Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
:::
Member

IMO this is a weird use for a bullet list: the 1st point is about what red sections are, and the 3 following ones are about how to interpret various durations of the 1st point. Those are not siblings/parallel if that makes sense. I'd suggest rephrasing this whole section like such, wdyt?

Suggested change
Before:
### Interpreting the Dashboard
- **Red sections** indicate the timespan during which a probe failed.
- **Short probe failures** are usually nothing to worry about.
- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
- If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).
:::warning Probe results are not causally consistent with network issues
Failing probes do not necessarily indicate network issues, and network issues may occur even when all probes succeed. Probes only test specific routes from nodes to predefined targets.
The dashboard provides a limited view:
- Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
:::
After:
### Interpreting the Dashboard
Red segments represent periods where a probe failed.
In practice:
- Brief probe failures are common and usually not actionable.
- A sustained failure to a single target may still have no impact—for example, if the target is a provider your game does not use or a backup DNS endpoint.
- If failures persist across multiple targets, GameFabric automatically marks the service as **Degraded** on [status.gamefabric.com](https://status.gamefabric.com).
:::note About probe results
Probe results are not a definitive measure of network health: a failing probe does not necessarily indicate a network issue, and network issues can occur even when probes succeed. Probes test only specific routes from our nodes to a fixed set of predefined targets.
Limitations:
- Only one public, global endpoint is probed per cloud provider; regional routes may behave differently.
- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
:::

Also, the admonition here should be more of a note than a warning, there's nothing dangerous here, it's more of an information panel/PSA IMO, would you agree?
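
For illustration only, the interpretation guidance in the suggested wording above can be sketched as a small triage function. Everything here is an assumption: the `FailureWindow` type, the target labels, and the 120-second cutoff for "brief" are hypothetical, since the documentation does not define numeric thresholds and GameFabric does not expose such an API. The result is a hint, not a definitive measure of network health.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureWindow:
    target: str        # probe target label, e.g. "aws" (hypothetical naming)
    duration_s: float  # length of one red segment, in seconds


def interpret(windows, used_targets, brief_s=120.0):
    """Return a coarse triage hint for a set of probe failure windows.

    brief_s is an assumed cutoff between "brief" and "sustained" failures.
    """
    # Targets with at least one failure window at or above the cutoff.
    sustained = {w.target for w in windows if w.duration_s >= brief_s}
    if not sustained:
        return "brief failures only: usually not actionable"
    if len(sustained) > 1:
        return "sustained failures across multiple targets: check status.gamefabric.com"
    if sustained & set(used_targets):
        return "sustained failure to a target you depend on: investigate"
    return "sustained failure to an unused target: likely no impact"
```

For example, `interpret([FailureWindow("azure", 600)], {"aws"})` falls into the last branch, because the only sustained failure is towards a provider the game does not use.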

Move network reliability guidance to production-workloads/requirements.md
where it fits better contextually (PR #103).

Addresses review comment from Ullaakut.
Copilot AI review requested due to automatic review settings January 14, 2026 09:01
hloeffler added a commit that referenced this pull request Jan 14, 2026
Add guidance on handling network issues gracefully:
- Implement retry logic for failed connections
- Gracefully terminate after multiple connection attempts fail

Content moved from Monitoring > Dashboards page (PR #191) where
it was out of context.
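
A minimal sketch of the two points in this commit message (retry failed connections, then stop cleanly once several attempts have failed). The function name, exception type, attempt count, and backoff delays are all hypothetical and not taken from the GameFabric documentation; the `connect` callable stands in for whatever your game server uses to reach its backend.

```python
import time
from typing import Callable


class ConnectionGaveUp(Exception):
    """Raised when every connection attempt has failed."""


def connect_with_retry(
    connect: Callable[[], object],
    max_attempts: int = 5,       # assumed limit, tune for your workload
    base_delay: float = 0.5,     # assumed first backoff delay in seconds
    max_delay: float = 8.0,      # assumed cap on the backoff delay
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call connect() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError as exc:  # includes ConnectionError and socket timeouts
            if attempt == max_attempts:
                raise ConnectionGaveUp(
                    f"gave up after {attempt} attempts"
                ) from exc
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

A caller would typically catch `ConnectionGaveUp` at the top level, run its own cleanup (for example, saving state and notifying players), and exit with a non-zero status so that the server terminates gracefully instead of hanging.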
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines +16 to +18
- **Red sections** indicate the timespan during which a probe failed.
- **Short probe failures** are usually nothing to worry about.
- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
Copilot AI Jan 14, 2026

The list items lack parallel structure in punctuation. The first two items use periods, but the third item does not. According to the technical writing guidelines, list items should be consistent in punctuation. Either add a period to line 18 or remove periods from lines 16 and 17.

Copilot generated this review using guidance from repository custom instructions.
The dashboard provides a limited view:

- Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
Copilot AI Jan 14, 2026

The second list item appears to use "for example" parenthetically but is describing a specific example with "AWS S3". According to the technical writing guidelines for clarity and consistency, consider rephrasing to "Probes target specific cloud services (such as AWS S3), not the entire cloud platform." to match the usage pattern established earlier in the document on line 18.


## BBE Probes from Nodes

This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
Copilot AI Jan 14, 2026

The list items in parentheses should follow the technical writing guideline to define acronyms/abbreviations on first use. While "AWS, Azure, GCP" are defined in the BBE expansion above, "1.1.1.1 and 8.8.8.8" are presented without context. Consider clarifying what these IP addresses represent (Cloudflare and Google DNS respectively) for readers who may not immediately recognize them.

- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
- If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).

:::warning Probe results are not causally consistent with network issues
Copilot AI Jan 14, 2026

The warning title "Probe results are not causally consistent with network issues" uses unclear terminology. The phrase "causally consistent" is ambiguous and may confuse readers. Consider rephrasing to something clearer like "Probe results do not always reflect network issues" or "Probes provide limited visibility into network issues" to better match the technical writing guideline for clarity and directness.

3 participants