docs: add dashboards documentation page with BBE Probes explanation #191

hloeffler · 2026-01-13T10:07:36Z

Summary

- Adds a new "Dashboards" page under the Monitoring section (src/multiplayer-servers/monitoring/dashboards.md)
- Documents the "BBE Probes from Nodes" dashboard, explaining:
  - What BlackBox Exporter (BBE) probes are and what targets they check (AWS, Azure, GCP, DNS servers)
  - Purpose: helps identify network issues vs application bugs
  - How to interpret the dashboard (red sections, short vs prolonged failures)
  - Best practices for handling network issues (retry logic, graceful termination)
- Updates sidebar to add "Dashboards" as the 2nd item (between Introduction and Audit Logs)

Context

This addresses customer confusion about the BBE Probes dashboard (internal ref: CSM-165). The documentation explains what the dashboard shows and helps customers understand that short probe failures are usually nothing to worry about.

Copilot

Pull request overview

This PR adds comprehensive documentation for the BBE (BlackBox Exporter) Probes dashboard to help customers understand network probe monitoring. The documentation explains what BBE probes are, what they monitor (cloud providers and DNS servers), how to interpret probe failures, and best practices for handling network issues in game servers.

Changes:

- Added new "Dashboards" documentation page explaining BBE Probes from Nodes dashboard
- Updated monitoring sidebar to include the new Dashboards page between Introduction and Audit Logs

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/multiplayer-servers/monitoring/dashboards.md | New documentation page explaining BBE probe monitoring, dashboard interpretation, and best practices for handling network issues |
| src/multiplayer-servers/monitoring/sidebar.json | Added "Dashboards" navigation item to the monitoring section sidebar |

src/multiplayer-servers/monitoring/dashboards.md

zoomoid · 2026-01-13T10:19:19Z

src/multiplayer-servers/monitoring/dashboards.md

+
+### Best Practices
+
+Nodes can occasionally experience network issues—100% reliability is not guaranteed. Game developers should implement their servers to be tolerant of network issues by:


I know this is written by AI, but the em dash isn't the right punctuation here, or at least the sentence should probably be inverted/changed like this:

"Full network reliability is not guaranteed{,.} [Nn]odes can occasionally experience network issues."

zoomoid · 2026-01-13T10:22:17Z

src/multiplayer-servers/monitoring/dashboards.md

Please add a highlighted section stating that this dashboard is not causally consistent with network issues. It only provides indicators that specific routes from the server to the predefined targets might be disrupted. Network issues may occur even though no probes are failing and, conversely, game servers might not experience connectivity issues even though probes are failing. Failing probes do not equal "network issues" per se.

Because of the vantage point of the probes, this is also only a local view. Probes towards cloud provider endpoints are (a) highly selective: there is one public, global endpoint per cloud provider and no regional or other route to the targets, and those routes might not be disrupted to the same degree as the route to the global endpoint (this depends on the provider's implementation); and (b) aimed at individual cloud provider services, not the entire cloud platform, so they give only a selective view of which services might encounter issues. For example, failing probes towards AWS S3 do not mean that all traffic towards AWS experiences connection issues.

thank you, will do

zoomoid · 2026-01-13T10:23:27Z

src/multiplayer-servers/monitoring/dashboards.md

+
+- **Red sections** indicate the timespan during which a probe failed.
+- **Short probe failures** are usually nothing to worry about.
+- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.


see above comment

- Add warning callout explaining probe results are not causally consistent with network issues
- Clarify that probes only test specific routes and cloud services, not entire platforms
- Restructure Best Practices section for clarity
- Improve list introduction wording per style guidelines

Ullaakut · 2026-01-13T19:04:10Z

src/multiplayer-servers/monitoring/dashboards.md

+This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).
+
+### Purpose
+
+Use this dashboard to quickly identify whether game server issues are caused by network connectivity problems to a particular cloud provider rather than bugs in your application code.


Suggested change

This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).

This dashboard helps you determine whether game server incidents originate from cloud-provider connectivity issues rather than defects in the application.

IMO no need for a Purpose sub-section, just state the purpose directly

Ullaakut · 2026-01-13T19:20:14Z

src/multiplayer-servers/monitoring/dashboards.md

+### Interpreting the Dashboard
+
+- **Red sections** indicate the timespan during which a probe failed.
+- **Short probe failures** are usually nothing to worry about.
+- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
+- If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).
+
+:::warning Probe results are not causally consistent with network issues
+Failing probes do not necessarily indicate network issues, and network issues may occur even when all probes succeed. Probes only test specific routes from nodes to predefined targets.
+
+The dashboard provides a limited view:
+
+- Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
+- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.
+:::


IMO this is a weird use for a bullet list: the 1st point is about what red sections are, and the 3 following ones are about how to interpret various durations of the 1st point. Those are not siblings/parallel if that makes sense. I'd suggest rephrasing this whole section like such, wdyt?

Suggested change

### Interpreting the Dashboard

Red segments represent periods where a probe failed.

In practice:

- Brief probe failures are common and usually not actionable.

- A sustained failure to a single target may still have no impact—for example, if the target is a provider your game does not use or a backup DNS endpoint.

- If failures persist across multiple targets, GameFabric automatically marks the service as **Degraded** on [status.gamefabric.com](https://status.gamefabric.com).

:::note About probe results

Probe results are not a definitive measure of network health: a failing probe does not necessarily indicate a network issue, and network issues can occur even when probes succeed. Probes test only specific routes from our nodes to a fixed set of predefined targets.

Limitations:

- Only one public, global endpoint is probed per cloud provider; regional routes may behave differently.

- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.

:::

Also, the admonition here should be more of a note than a warning, there's nothing dangerous here, it's more of an information panel/PSA IMO, would you agree?
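
As a rough illustration of the reading rules in the suggestion above, the sketch below maps a set of currently failing probes to one of the interpretations described there. Everything in it is an assumption made for illustration: the `Failure` record, the target labels, and especially the 60-second cutoff between "brief" and "sustained", since the documentation under review gives no numeric threshold. In line with the note in the suggestion, the result is only an indicator, not a measure of network health.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    target: str        # hypothetical probe target label, e.g. "aws" or "dns-1.1.1.1"
    duration_s: float  # length of the red segment, in seconds


def interpret(failures, used_targets, brief_s=60.0):
    """Map current probe failures to a rough reading of the dashboard.

    brief_s is an assumed cutoff; the dashboard docs define no such number.
    """
    sustained = [f for f in failures if f.duration_s >= brief_s]
    if not sustained:
        return "brief or no failures: usually not actionable"
    targets = {f.target for f in sustained}
    if len(targets) > 1:
        return "sustained failures across multiple targets: check status.gamefabric.com"
    (target,) = targets
    if target in used_targets:
        return f"sustained failure to {target}: may affect your game, investigate"
    return f"sustained failure to {target}: likely no impact, target unused"
```

For example, `interpret([Failure("gcp", 300)], used_targets={"aws"})` reports a sustained failure to a target the game does not depend on.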

Move network reliability guidance to production-workloads/requirements.md where it fits better contextually (PR #103). Addresses review comment from Ullaakut.

Add guidance on handling network issues gracefully:

- Implement retry logic for failed connections
- Gracefully terminate after multiple connection attempts fail

Content moved from Monitoring > Dashboards page (PR #191) where it was out of context.
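
The two practices named in this commit message, retrying failed connections and terminating gracefully once repeated attempts fail, might look roughly like the following sketch. It is not taken from the GameFabric documentation: the function names, the attempt count, the exponential backoff, and the `backend.example.com` host are all assumptions chosen for illustration.

```python
import socket
import sys
import time


def connect_with_retry(host, port, attempts=5, base_delay=0.5, timeout=3.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection, backing off exponentially between attempts.

    Returns the connected socket, or None once every attempt has failed.
    """
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=timeout)
        except OSError as err:
            # OSError covers refused connections, timeouts, and DNS failures.
            print(f"attempt {attempt + 1}/{attempts} failed: {err}", file=sys.stderr)
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


def run():
    # backend.example.com is a placeholder, not a real GameFabric endpoint.
    conn = connect_with_retry("backend.example.com", 443)
    if conn is None:
        # Give up cleanly: a non-zero exit code reports the failure
        # to whatever supervises the game server process.
        sys.exit(1)
    with conn:
        pass  # normal server work would go here
```

The `connect` and `sleep` parameters exist only so the retry loop can be exercised without real network access or real delays.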

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot · 2026-01-14T09:03:55Z

src/multiplayer-servers/monitoring/dashboards.md

+- **Red sections** indicate the timespan during which a probe failed.
+- **Short probe failures** are usually nothing to worry about.
+- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.


The list items lack parallel structure in punctuation. The first two items use periods, but the third item does not. According to the technical writing guidelines, list items should be consistent in punctuation. Either add a period to line 18 or remove periods from lines 16 and 17.

Copilot · 2026-01-14T09:03:56Z

src/multiplayer-servers/monitoring/dashboards.md

+The dashboard provides a limited view:
+
+- Only one public, global endpoint is probed per cloud provider. Regional routes may behave differently.
+- Probes target specific cloud services (for example, AWS S3), not the entire cloud platform. Other services on the same provider may be unaffected.


The second list item appears to use "for example" parenthetically but is describing a specific example with "AWS S3". According to the technical writing guidelines for clarity and consistency, consider rephrasing to "Probes target specific cloud services (such as AWS S3), not the entire cloud platform." to match the usage pattern established earlier in the document on line 18.

Copilot · 2026-01-14T09:03:56Z

src/multiplayer-servers/monitoring/dashboards.md

+
+## BBE Probes from Nodes
+
+This dashboard shows BlackBox Exporter (BBE) probe results from each of your assigned nodes to predefined targets, including major cloud providers (AWS, Azure, GCP) and DNS servers (such as 1.1.1.1 and 8.8.8.8).


The list items in parentheses should follow the technical writing guideline to define acronyms/abbreviations on first use. While "AWS, Azure, GCP" are defined in the BBE expansion above, "1.1.1.1 and 8.8.8.8" are presented without context. Consider clarifying what these IP addresses represent (Cloudflare and Google DNS respectively) for readers who may not immediately recognize them.

Copilot · 2026-01-14T09:03:56Z

src/multiplayer-servers/monitoring/dashboards.md

+- **Prolonged failures** to a single target (for example, a cloud provider your game doesn't use, or a backup DNS server) may have no impact on your game servers.
+- If probe failures to **multiple targets persist**, GameFabric automatically sets the status to degraded on [status.gamefabric.com](https://status.gamefabric.com).
+
+:::warning Probe results are not causally consistent with network issues


The warning title "Probe results are not causally consistent with network issues" uses unclear terminology. The phrase "causally consistent" is ambiguous and may confuse readers. Consider rephrasing to something clearer like "Probe results do not always reflect network issues" or "Probes provide limited visibility into network issues" to better match the technical writing guideline for clarity and directness.

docs: add dashboards documentation page with BBE Probes explanation

4d2aadf

hloeffler requested a review from a team as a code owner January 13, 2026 10:07

Copilot AI review requested due to automatic review settings January 13, 2026 10:07

Copilot started reviewing on behalf of hloeffler January 13, 2026 10:08

Copilot AI reviewed Jan 13, 2026

zoomoid reviewed Jan 13, 2026

hloeffler requested a review from zoomoid January 13, 2026 10:48

Ullaakut requested changes Jan 13, 2026

docs: remove Best Practices section from dashboards page

5eaff10

Copilot AI review requested due to automatic review settings January 14, 2026 09:01

Copilot started reviewing on behalf of hloeffler January 14, 2026 09:02

Copilot AI reviewed Jan 14, 2026

