Skip to content

Conversation

@nschimme
Copy link
Contributor

@nschimme nschimme commented Dec 22, 2025

Summary by Sourcery

Improve Unicode-to-Latin1 handling and newline normalization to prevent crashes and ensure consistent user-facing messages.

New Features:

  • Map the narrow non-breaking space (U+202F) to a non-breaking space in the charset transliteration tables.

Bug Fixes:

  • Normalize output strings in sendToUser to always end with a newline, avoiding crashes caused by missing line terminators.
  • Correct the GMCP failure message format to use the proper argument placeholders and include a trailing newline.
  • Ensure Unicode-to-Latin1 conversion produces valid Latin1 characters or a fallback '?' for unsupported code points.

Enhancements:

  • Relax the ASCII validation helper to treat all non-control Latin1 bytes as valid and mark intermediate conversion results as maybe-unused to avoid warnings.

Tests:

  • Add regression tests for narrow non-breaking space handling and mixed surrogate/narrow-NBSP input in the Latin1 conversion path.

@sourcery-ai
Copy link

sourcery-ai bot commented Dec 22, 2025

Reviewer's Guide

Updates Latin-1 transliteration and validation to avoid crashes when converting certain Unicode characters (notably narrow non-breaking space) to Latin-1, centralizes QLatin1Char conversion logic, adds tests, and ensures user-visible messages always end with a newline, including a specific GMCP failure message path.

Sequence diagram for GMCP failure message newline-safe delivery

sequenceDiagram
    participant M as MudTelnet
    participant G as global_sendToUser
    participant R as SendToUser_Callback
    participant U as UI

    M->>G: sendToUser("Failure %1 remote message: %2\n".arg(action, result))
    activate G
    alt str endsWith(QC_NEWLINE)
        G->>R: invoke(str)
    else missing newline
        G->>R: invoke(str + QS_NEWLINE)
    end
    deactivate G

    R->>U: onMessage(str_with_newline)
    activate U
    U-->>U: render message line
    deactivate U
Loading

Class diagram for updated Latin1 transliteration and sendToUser flow

classDiagram
    class mmqt_Charset {
        <<namespace>>
        +bool is_latin_control(uint8_t uc)
        +bool is_noncontrol_latin1(char byte)
        +char16_t simple_unicode_translit(char16_t input)
        +QChar simple_unicode_translit(QChar qc)
        +bool isValidLatin1(char latin1)
        +QLatin1Char toQLatin1Char(QChar input)
        +QString~ref~ toLatin1InPlace(QString~ref~ str)
        +QString toLatin1(QString str)
    }

    class global_SendToUser {
        <<namespace>>
        +void registerSendToUser(Signal2Lifetime~ref~ lifetime, Signal2QString_Function callback)
        +Signal2QString_Function~ref~ getSendToUser()
        +void sendToUser(QString~const ref~ str)
    }

    class MudTelnet {
        +void virt_receiveGmcpMessage(GmcpMessage~const ref~ msg)
    }

    class UIMessageSink {
        <<interface>>
        +void onMessage(QString~const ref~ str)
    }

    mmqt_Charset ..> global_SendToUser : uses_via_QString
    global_SendToUser ..> UIMessageSink : invokes_callback
    MudTelnet ..> global_SendToUser : calls_sendToUser

    %% Relationships inside mmqt_Charset
    mmqt_Charset : isValidLatin1(char latin1) calls is_latin_control(uint8_t uc)
    mmqt_Charset : toQLatin1Char(QChar input) calls simple_unicode_translit(QChar qc)
    mmqt_Charset : toQLatin1Char(QChar input) calls isValidLatin1(char latin1)
    mmqt_Charset : toLatin1InPlace(QString~ref~ str) calls toQLatin1Char(QChar input)
Loading

File-Level Changes

Change Details Files
Broaden and tighten Latin-1 validation/transliteration, including support for narrow non‑breaking space and safer ASCII/Latin-1 control handling.
  • Add Unicode mapping entry for narrow non-breaking space (U+202F) to the transliteration table.
  • Replace is_noncontrol_ascii helper with is_noncontrol_latin1 that reuses is_latin_control and operates on bytes instead of raw ASCII mask checks.
  • Update static_asserts for transliteration tables to use the new is_noncontrol_latin1 helper.
  • Avoid unused-variable warnings in simple_unicode_translit by marking the windows-125x lookup variable MAYBE_UNUSED.
src/global/Charset.cpp
Introduce a robust helper for converting QChar to QLatin1Char with validation and fallback, and use it for in-place Latin-1 conversion.
  • Add isValidLatin1 helper that rejects NUL and Latin-1 control characters while still allowing whitespace controls like tab and newlines.
  • Add toQLatin1Char that runs simple_unicode_translit on a QChar, converts to Latin-1, validates via isValidLatin1, and falls back to '?' when invalid.
  • Refactor toLatin1InPlace to use toQLatin1Char instead of manually calling simple_unicode_translit_in_place and direct QLatin1Char construction.
src/global/Charset.cpp
Extend tests to cover narrow non-breaking space behavior and surrogate-driven code paths in mmqt Latin-1 conversion.
  • Add tests verifying that U+202F narrow NBSP is transliterated to NBSP under mmqt::toLatin1.
  • Add a mixed-string test that exercises the surrogate-path mmqt::toLatin1 code, ensuring NNBSP, commas, replacement for unsupported characters, and newlines behave as expected.
src/global/Charset.cpp
Ensure sendToUser messages are always newline-terminated and fix a formatting bug in a GMCP failure message.
  • Wrap sendToUser to append a newline when the input string does not end with QC_NEWLINE, logging a warning when it has to do so.
  • Update the GMCP "Failure remote message" path to use the corrected format string with %2 and an explicit trailing newline so sendToUser does not have to patch it.
  • Import mmqt namespace in SendToUser.cpp to use the QC_NEWLINE and QS_NEWLINE constants.
src/global/SendToUser.cpp
src/proxy/MudTelnet.cpp

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The new isValidLatin1/is_noncontrol_latin1 logic relies on char but passes values through is_latin_control, which works in practice but is subtle with signed char; consider making these helpers explicitly take uint8_t (or unsigned char) to avoid any ambiguity around negative values or platform-dependent char signedness.
  • The sendToUser change will now emit a qWarning() for every missing newline; if this can be hit frequently (e.g., in loops or from noisy sources), consider downgrading to debug logging or adding some form of rate limiting to avoid log spam.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `isValidLatin1`/`is_noncontrol_latin1` logic relies on `char` but passes values through `is_latin_control`, which works in practice but is subtle with signed `char`; consider making these helpers explicitly take `uint8_t` (or `unsigned char`) to avoid any ambiguity around negative values or platform-dependent `char` signedness.
- The `sendToUser` change will now emit a `qWarning()` for every missing newline; if this can be hit frequently (e.g., in loops or from noisy sources), consider downgrading to debug logging or adding some form of rate limiting to avoid log spam.

## Individual Comments

### Comment 1
<location> `src/global/Charset.cpp:80-83` </location>
<code_context>
 }
 template<typename T>
-NODISCARD constexpr bool is_noncontrol_ascii(const T ascii)
+NODISCARD constexpr bool is_noncontrol_latin1(const T byte)
 {
     // non-control ASCII
-    return std::is_same_v<char, T>                                                 //
-           && (static_cast<uint8_t>(ascii) & 0x7Fu) == static_cast<uint8_t>(ascii) //
-           && static_cast<uint8_t>(ascii) >= 0x20u;                                //
+    return std::is_same_v<char, T> && !is_latin_control(static_cast<uint8_t>(byte));
 }
 template<typename T>
</code_context>

<issue_to_address>
**issue (bug_risk):** The new is_noncontrol_latin1 no longer enforces the non-control range invariant due to a type mismatch with is_latin_control.

`is_noncontrol_ascii` previously guaranteed `_ascii` was a non-control ASCII char (0x20–0x7F). In the new version, `is_latin_control` is constrained to `T == char16_t`, but here `T` is deduced as `uint8_t`, so `is_latin_control(static_cast<uint8_t>(byte))` is never actually called and effectively always yields false. The condition then collapses to `std::is_same_v<char, T>`, so `static_assert(is_noncontrol_latin1(_ascii));` no longer excludes Latin1 control bytes or out-of-range values. To preserve the invariant, either add an 8-bit overload of `is_latin_control` or restore an explicit range check on the `uint8_t` value.
</issue_to_address>

### Comment 2
<location> `src/global/Charset.cpp:298-302` </location>
<code_context>
     return QChar{charset::simple_unicode_translit(static_cast<char16_t>(qc.unicode()))};
 }

+NODISCARD static bool isValidLatin1(char latin1)
+{
+    // Eventually we may want to exclude some other ASCII control codes,
+    // but be careful not to exclude things like "\t\r\n".
+    return latin1 != C_NUL && !is_latin_control(latin1);
+}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** isValidLatin1 also relies on is_latin_control with a mismatched type, so Latin1 control characters will be treated as valid.

Here `latin1` is a `char`, so this calls `is_latin_control<char>`, whose implementation always returns `false` because it only handles `char16_t`. As a result, `isValidLatin1` only rejects `C_NUL` and treats all Latin1 control characters (0x80–0x9F) as valid. To actually filter those controls, cast to a 16-bit type (e.g. `char16_t`) before calling `is_latin_control`, or add a dedicated 8-bit overload and use it consistently (including in `is_noncontrol_latin1`).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +80 to +83
NODISCARD constexpr bool is_noncontrol_latin1(const T byte)
{
// non-control ASCII
return std::is_same_v<char, T> //
&& (static_cast<uint8_t>(ascii) & 0x7Fu) == static_cast<uint8_t>(ascii) //
&& static_cast<uint8_t>(ascii) >= 0x20u; //
return std::is_same_v<char, T> && !is_latin_control(static_cast<uint8_t>(byte));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): The new is_noncontrol_latin1 no longer enforces the non-control range invariant due to a type mismatch with is_latin_control.

is_noncontrol_ascii previously guaranteed _ascii was a non-control ASCII char (0x20–0x7F). In the new version, is_latin_control is constrained to T == char16_t, but here T is deduced as uint8_t, so is_latin_control(static_cast<uint8_t>(byte)) is never actually called and effectively always yields false. The condition then collapses to std::is_same_v<char, T>, so static_assert(is_noncontrol_latin1(_ascii)); no longer excludes Latin1 control bytes or out-of-range values. To preserve the invariant, either add an 8-bit overload of is_latin_control or restore an explicit range check on the uint8_t value.

Comment on lines +298 to +302
NODISCARD static bool isValidLatin1(char latin1)
{
// Eventually we may want to exclude some other ASCII control codes,
// but be careful not to exclude things like "\t\r\n".
return latin1 != C_NUL && !is_latin_control(latin1);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): isValidLatin1 also relies on is_latin_control with a mismatched type, so Latin1 control characters will be treated as valid.

Here latin1 is a char, so this calls is_latin_control<char>, whose implementation always returns false because it only handles char16_t. As a result, isValidLatin1 only rejects C_NUL and treats all Latin1 control characters (0x80–0x9F) as valid. To actually filter those controls, cast to a 16-bit type (e.g. char16_t) before calling is_latin_control, or add a dedicated 8-bit overload and use it consistently (including in is_noncontrol_latin1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant