docs: add dashboards documentation page with BBE Probes explanation #191
Pull request overview
This PR adds comprehensive documentation for the BBE (BlackBox Exporter) Probes dashboard to help customers understand network probe monitoring. The documentation explains what BBE probes are, what they monitor (cloud providers and DNS servers), how to interpret probe failures, and best practices for handling network issues in game servers.
Changes:
- Added new "Dashboards" documentation page explaining BBE Probes from Nodes dashboard
- Updated monitoring sidebar to include the new Dashboards page between Introduction and Audit Logs
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/multiplayer-servers/monitoring/dashboards.md | New documentation page explaining BBE probe monitoring, dashboard interpretation, and best practices for handling network issues |
| src/multiplayer-servers/monitoring/sidebar.json | Added "Dashboards" navigation item to the monitoring section sidebar |
| ### Best Practices
|
| Nodes can occasionally experience network issues—100% reliability is not guaranteed. Game developers should implement their servers to be tolerant of network issues by: |
I know this is written by AI, but the em dash isn't the right punctuation here, or at least the sentence should probably be inverted/changed like this:
"Full network reliability is not guaranteed{,.} [Nn]odes can occasionally experience network issues."
Please add a highlighted section stating that this dashboard is not causally consistent with network issues. It only provides indicators that specific routes from the server to the predefined targets might be disrupted. Network issues may occur even though no probes are failing and, conversely, game servers might not experience connectivity issues even though probes are failing. Failing probes do not equal "network issues" per se.
Because of the probes' vantage point, this is also only a local view. Probes towards cloud provider endpoints are:
a) highly selective: one public, global endpoint per cloud provider, with no regional or "other" paths to the targets, which might not be disrupted in the same way as the path to the global endpoint (this depends on their implementation); and
b) aimed at cloud provider services, not the entire cloud platform, so they give only a selective view of which services might encounter issues. For example, failing probes towards AWS S3 do not mean that all traffic towards AWS experiences connection issues.
thank you, will do
| - **Red sections** indicate the timespan during which a probe failed.
| - **Short probe failures** are usually nothing to worry about.
| - **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
see above comment
- Add warning callout explaining probe results are not causally consistent with network issues
- Clarify that probes only test specific routes and cloud services, not entire platforms
- Restructure Best Practices section for clarity
- Improve list introduction wording per style guidelines
| This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
|
| ### Purpose
|
| Use this dashboard to quickly identify whether game server issues are caused by network connectivity problems to a particular cloud provider rather than bugs in your application code.
| This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8). | |
| ### Purpose | |
| Use this dashboard to quickly identify whether game server issues are caused by network connectivity problems to a particular cloud provider rather than bugs in your application code. | |
| This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8). | |
| This dashboard helps you determine whether game server incidents originate from cloud-provider connectivity issues rather than defects in the application. |
IMO no need for a Purpose sub-section, just state the purpose directly
| ### Interpreting the Dashboard
|
| - **Red sections** indicate the timespan during which a probe failed.
| - **Short probe failures** are usually nothing to worry about.
| - **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
| - If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).
|
| :::warning Probe results are not causally consistent with network issues
| Failing probes do not necessarily indicate network issues, and network issues may occur even when all probes succeed. Probes only test specific routes from nodes to predefined targets.
|
| The dashboard provides a limited view:
|
| - Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
| - Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
| :::
IMO this is a weird use for a bullet list: the 1st point is about what red sections are, and the 3 following ones are about how to interpret various durations of the 1st point. Those are not siblings/parallel if that makes sense. I'd suggest rephrasing this whole section like such, wdyt?
| ### Interpreting the Dashboard | |
| - **Red sections** indicate the timespan during which a probe failed. | |
| - **Short probe failures** are usually nothing to worry about. | |
| - **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers. | |
| - If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com). | |
| :::warning Probe results are not causally consistent with network issues | |
| Failing probes do not necessarily indicate network issues, and network issues may occur even when all probes succeed. Probes only test specific routes from nodes to predefined targets. | |
| The dashboard provides a limited view: | |
| - Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently. | |
| - Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected. | |
| ::: | |
| ### Interpreting the Dashboard | |
| Red segments represent periods where a probe failed. | |
| In practice: | |
| - Brief probe failures are common and usually not actionable. | |
| - A sustained failure to a single target may still have no impact—for example, if the target is a provider your game does not use or a backup DNS endpoint. | |
| - If failures persist across multiple targets, GameFabric automatically marks the service as **Degraded** on [status.gamefabric.com](https://status.gamefabric.com). | |
| :::note About probe results | |
| Probe results are not a definitive measure of network health: a failing probe does not necessarily indicate a network issue, and network issues can occur even when probes succeed. Probes test only specific routes from our nodes to a fixed set of predefined targets. | |
| Limitations: | |
| - Only one public, global endpoint is probed per cloud provider; regional routes may behave differently. | |
| - Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected. | |
| ::: |
Also, the admonition here should be more of a note than a warning, there's nothing dangerous here, it's more of an information panel/PSA IMO, would you agree?
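The interpretation rules in the suggestion above can be sketched as a small triage helper. This is only an illustration: `ProbeFailure`, `interpret`, and the `BRIEF_FAILURE_SECONDS` threshold are hypothetical names and values, since the documentation does not define how long a "brief" or "sustained" failure is, and this is not how GameFabric computes its status. As the note says, probe results are indicators, so the returned text is a hint and not a diagnosis.

```python
from dataclasses import dataclass
from typing import Iterable, Set

# Hypothetical threshold: the docs do not define "brief" versus "sustained".
BRIEF_FAILURE_SECONDS = 120.0


@dataclass(frozen=True)
class ProbeFailure:
    target: str              # probed target, e.g. "aws" or "1.1.1.1"
    duration_seconds: float  # length of the red segment on the dashboard


def interpret(failures: Iterable[ProbeFailure], used_targets: Set[str]) -> str:
    """Return a rough triage hint for the red segments currently visible."""
    # Collect targets whose failure lasted at least the assumed threshold.
    sustained = {
        f.target for f in failures
        if f.duration_seconds >= BRIEF_FAILURE_SECONDS
    }
    if not sustained:
        return "brief or no failures: usually not actionable"
    if len(sustained) > 1:
        return "sustained failures across multiple targets: check status.gamefabric.com"
    if sustained & used_targets:
        return "sustained failure to a target you use: investigate"
    return "sustained failure to an unused target: likely no impact"
```

For example, `interpret([ProbeFailure("gcp", 600)], {"aws"})` falls into the last branch, because the only sustained failure is towards a provider the game does not use.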
Move network reliability guidance to production-workloads/requirements.md where it fits better contextually (PR #103). Addresses review comment from Ullaakut.
Add guidance on handling network issues gracefully:
- Implement retry logic for failed connections
- Gracefully terminate after multiple connection attempts fail

Content moved from Monitoring > Dashboards page (PR #191) where it was out of context.
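As a rough illustration of the two points in this commit message (retry failed connections, then terminate gracefully once several attempts have failed), a game server could wrap its outbound connections as sketched below. The helper name `connect_with_retry`, the attempt count, and the backoff values are assumptions made for this example and do not come from the GameFabric documentation.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call connect() until it succeeds or max_attempts is reached.

    Returns the connection on success, or None once every attempt has
    failed, so the caller can shut down cleanly instead of hanging.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # OSError covers refused connections, timeouts, and DNS errors.
            if attempt == max_attempts:
                break
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
    return None
```

A caller might pass `lambda: socket.create_connection((host, port), timeout=3)` as `connect` and, when the result is `None`, flush its state and exit with a non-zero status. The `sleep` parameter is injectable only so the backoff can be tested without waiting.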
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| - **Red sections** indicate the timespan during which a probe failed.
| - **Short probe failures** are usually nothing to worry about.
| - **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
Copilot AI, Jan 14, 2026
The list items lack parallel structure in punctuation. The first two items use periods, but the third item does not. According to the technical writing guidelines, list items should be consistent in punctuation. Either add a period to line 18 or remove periods from lines 16 and 17.
| The dashboard provides a limited view:
|
| - Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
| - Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
Copilot AI, Jan 14, 2026
The second list item appears to use "for example" parenthetically but is describing a specific example with "AWS S3". According to the technical writing guidelines for clarity and consistency, consider rephrasing to "Probes target specific cloud services (such as AWS S3), not the entire cloud platform." to match the usage pattern established earlier in the document on line 18.
| ## BBE Probes from Nodes
|
| This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
Copilot AI, Jan 14, 2026
The list items in parentheses should follow the technical writing guideline to define acronyms/abbreviations on first use. While "AWS, Azure, GCP" are defined in the BBE expansion above, "1.1.1.1 and 8.8.8.8" are presented without context. Consider clarifying what these IP addresses represent (Cloudflare and Google DNS respectively) for readers who may not immediately recognize them.
| - **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
| - If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).
|
| :::warning Probe results are not causally consistent with network issues
Copilot AI, Jan 14, 2026
The warning title "Probe results are not causally consistent with network issues" uses unclear terminology. The phrase "causally consistent" is ambiguous and may confuse readers. Consider rephrasing to something clearer like "Probe results do not always reflect network issues" or "Probes provide limited visibility into network issues" to better match the technical writing guideline for clarity and directness.
Summary
src/multiplayer-servers/monitoring/dashboards.md)
Context
This addresses customer confusion about the BBE Probes dashboard (internal ref: CSM-165). The documentation explains what the dashboard shows and helps customers understand that short probe failures are usually nothing to worry about.