Skip to content

Conversation

@youngbrioche
Copy link

@youngbrioche youngbrioche commented Sep 2, 2025

This PR removes misleading GPT‑4o “reasoning” labels and ensures reasoning_effort is only sent to models that support it.

Why

  • Pages like “ChatGPT-4o (Medium/High Reasoning)” implied capabilities 4o doesn’t have.
  • Passing reasoning_effort to unsupported models is incorrect and may cause API errors.

What

  • assess.py: remove 4o reasoning variants; keep a single “ChatGPT-4o”.
  • models/openai.py: add ALLOWED_REASONING_MODELS (O‑series + GPT‑5) and gate reasoning_effort; minor refactor/formatting.

Impact

  • Accurate labeling for 4o.
  • No reasoning_effort sent to unsupported models.
  • No changes to evaluation logic; focuses on labeling/parameter hygiene.

…dels

The site was generating pages like “ChatGPT-4o (Medium/High Reasoning)”
and passing a reasoning_effort parameter to 4o. GPT‑4o is not a
reasoning model, so these labels were misleading and suggested eval
claims that aren’t true. Passing reasoning_effort to unsupported models
is also incorrect and can lead to API errors or undefined behavior.

Changes
- assess.py:
  - Remove “ChatGPT-4o (Medium/High Reasoning)” entries; expose a single
    canonical “ChatGPT-4o”.
  - Drop legacy normalization hook to avoid silently rewriting model keys.
- models/openai.py:
  - Add ALLOWED_REASONING_MODELS and only include reasoning_effort for
    models that support it (O-series + GPT‑5 family).
  - Include GPT‑5 variants in the allowed list.
  - Refactor request construction to conditionally attach
    reasoning_effort via kwargs.
  - Minor formatting of SKIP_TEMPERATURE for readability.

Impact
- Eliminates misleading 4o “reasoning” pages/URLs and labels.
- Prevents sending reasoning_effort to unsupported models.
- Keeps 4o results page without any reasoning level.
- No change to evaluation logic; only labeling/parameter hygiene.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant