Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
70 changes: 70 additions & 0 deletions docs/case-studies/issue-96/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
# Case Study: Issue #96 — `agent --version` output missing in screen isolation

## Summary

When running a quick-completing command (like `agent --version`) through screen isolation:

```
$ --isolated screen -- agent --version
```

the command executes successfully (exit code 0) but the output is **not displayed** — the
version string `0.13.2` is silently swallowed.

## Reproduction

```
$ --isolated screen -- agent --version
│ session 1d4a5afb-00f3-40cb-b506-0edde9288b77
│ isolation screen
│ mode attached
│ screen screen-1773491715194-1ppv87
$ agent --version


│ exit 0
│ duration 0.391s
```

The expected output `0.13.2` is missing between `$ agent --version` and the `✓` marker.

## Fix (v0.24.9 / PR #97)

Two changes were made:

### 1. Add `logfile flush 0` to screen's configuration

For the **native logging path** (`-L -Logfile`, screen ≥ 4.5.1), screen uses a periodic
flush timer that fires every **10 seconds** by default. If the command completes and the
screen session terminates before this timer fires, output buffered inside screen's internal
`FILE*` buffer may not be flushed to the log file before the session ends.

The fix passes a temporary screenrc with `logfile flush 0` via screen's `-c` option. This
forces screen to flush the log after every write, eliminating the race condition.

Before fix:
```
screen -dmS <session> -L -Logfile <logfile> <shell> -c '<command>'
```

After fix:
```
screen -dmS <session> -c <screenrc> -L -Logfile <logfile> <shell> -c '<command>'
```

where `<screenrc>` contains `logfile flush 0`.

### 2. Add integration test for quick-completing commands

Added a test case specifically for the issue: `runInScreen('agent --version')` must
capture the version output correctly.

## See Also

- [root-cause.md](root-cause.md) — Detailed root cause analysis
- [timeline.md](timeline.md) — Sequence of events
- [solutions.md](solutions.md) — Solutions considered and chosen
- Related: [Case Study issue-15](../issue-15/README.md) — Original screen output capture fix
- Related: [Case Study issue-25](../issue-25/README.md) — Screen quoting fix
148 changes: 148 additions & 0 deletions docs/case-studies/issue-96/root-cause.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
# Root Cause Analysis: Issue #96

## The Bug

`agent --version` (or any quick-completing command) run through screen isolation
produces no output, even though the command succeeds with exit code 0.

---

## Root Cause 1: GNU Screen's Periodic Log Flush (Native Logging Path, Screen ≥ 4.5.1)

### GNU Screen log flush architecture

GNU Screen's `-L` logging feature writes output to a libc `FILE*` buffer (not directly
to disk). Flush is controlled by the `log_flush` variable (default: `10` seconds):

```c
// src/screen.c
int log_flush = 10;

// src/ansi.c — WLogString (called for every logged character)
static void WLogString(Window *win, char *buf, size_t len)
{
if (!win->w_log) return;
logfwrite(win->w_log, buf, len); // writes to FILE* buffer
if (!log_flush)
logfflush(win->w_log); // only flushes immediately if log_flush == 0
}
```

With `log_flush = 10` (the default), `logfflush` is **not called** after each write.
Instead, the periodic flush timer fires every 10 seconds:

```c
// src/window.c — DoStartLog
SetTimeout(&logflushev, n * 1000); // n = log_flush = 10
evenq(&logflushev);
```

For `agent --version` which runs in ~50ms, the screen session exits before this timer
ever fires.

### Why fclose should flush (but may not)

On graceful shutdown, `Finit()` → `FreeWindow()` → `logfclose()` → `fclose()` is called.
POSIX guarantees that `fclose()` flushes the stdio buffer before closing. So in the
**normal exit path**, data is not lost.

However, if screen is terminated via a signal that invokes `_exit()` rather than `exit()`,
stdio buffers are NOT flushed. This can happen when:
1. The child process terminates and screen receives SIGCHLD
2. Screen's signal handler calls `_exit()` directly in certain code paths

This makes the issue **intermittent** — it depends on the OS, screen version, and exact
timing of the signal handling.

### The fix

Add `logfile flush 0` to a temporary screenrc passed via `screen -c`:

```
logfile flush 0
```

This sets `log_flush = 0` in screen's internal variable, causing `WLogString` to call
`logfflush()` after every write:

```c
if (!log_flush)
logfflush(win->w_log); // now always executes
```

This completely eliminates the flush delay for log file writes.

---

## Root Cause 2: Tee Pipe Buffering Race (Tee Fallback, Screen < 4.5.1)

### macOS bundled screen 4.0.3

On macOS, the system-bundled GNU Screen is version 4.0.3, which predates the `-Logfile`
option (added in 4.5.1). The code falls back to:

```js
effectiveCommand = `(${effectiveCommand}) 2>&1 | tee "${logFile}"`;
```

The screen session runs:
```
screen -dmS session /bin/zsh -c "(agent --version) 2>&1 | tee /tmp/logfile"
```

### The race condition

When `agent --version` completes in ~5ms:
1. The agent process writes `0.13.2\n` to stdout
2. The tee process receives the data and writes it to the log file
3. The zsh process exits
4. Screen detects the child exit and terminates the session

The RACE: between step 2 (tee writes to OS page cache) and our poller reading the file,
there may be a brief window where:
- The log file exists but the `write()` syscall from tee hasn't been committed to the
page cache yet (on some OS implementations)
- OR the file still has 0 bytes because tee's userspace buffer hasn't been `fwrite`'d
yet when screen terminates it

This is a TOCTOU (time-of-check-time-of-use) race between the session appearing gone in
`screen -ls` and the log file having its complete content.

### The fix

The `logfile flush 0` fix in the screenrc does NOT directly help the tee fallback path
(since tee is an external process, not governed by screen's log flush). However, an
additional retry mechanism can be added: if the log file is empty when first read but the
session just terminated, retry the read after a brief delay to let the OS flush complete.

In practice, on Linux the tee fallback is not used (screen >= 4.5.1 is available), and
on macOS the `logfile flush 0` option works on screen 4.0.3 as well (it's a screenrc
command, not a version-gated feature).

---

## Why This Is Intermittent

The issue doesn't always reproduce because:

1. **On Linux with screen 4.09.01**: The normal `fclose()` path usually flushes the buffer
correctly. Only under certain timing conditions (SIGCHLD handling, `_exit`) does it fail.

2. **On macOS with screen 4.0.3**: The tee fallback's race window is very small (~1ms).
Most of the time the file has content when read. But for very fast commands with a
busy system, the window widens.

3. **The `screen -ls` timing**: The check `!sessions.includes(sessionName)` returns true
as soon as the session process exits, but the OS may still be in the middle of flushing
the log file's write buffers to the page cache.

---

## Test Coverage Gap

There were no tests specifically for:
1. Quick-completing commands (like `--version` flags) in screen isolation
2. Verification that version-string output (short, non-whitespace text) is captured

The existing tests use `echo "hello"` which is also fast but the string is longer and
may flush more reliably in certain conditions.
108 changes: 108 additions & 0 deletions docs/case-studies/issue-96/solutions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
# Solutions: Issue #96

## Problem Statement

Quick-completing commands (like `agent --version`) run through screen isolation
produce no output. The screen session exits cleanly but the log file is empty.

---

## Solutions Considered

### Option A: Add `sleep` after the command in screen

**Approach:** Append `; sleep 0.1` to every command run in screen to give the log a
chance to flush before the session exits.

**Pros:** Simple, one-line change.

**Cons:**
- Slows every command by at least 100ms (unacceptable for interactive use)
- Hacky — fights symptoms rather than addressing the root cause
- Still a race condition (just smaller)

### Option B: Retry reading the log file with exponential backoff

**Approach:** After detecting session completion, if the log file is empty, retry
reading it up to 3 times with 10ms, 20ms, 40ms delays.

**Pros:** Addresses the TOCTOU race condition for the tee fallback path.

**Cons:**
- Does NOT fix the native logging path's `log_flush = 10` issue
- Adds complexity to the polling logic
- Still a race for very slow systems

### Option C: Use a temporary screenrc with `logfile flush 0` (CHOSEN)

**Approach:** For the native logging path (`-L -Logfile`, screen ≥ 4.5.1), create a
temporary screenrc file containing `logfile flush 0` and pass it to screen via the
`-c` option.

```
echo "logfile flush 0" > /tmp/screen-rc-XXXX.rc
screen -dmS session -c /tmp/screen-rc-XXXX.rc -L -Logfile logfile shell -c command
```

`logfile flush 0` sets `log_flush = 0` in screen's internal state, which causes
`WLogString()` in `ansi.c` to call `logfflush()` (equivalent to `fflush()`) after
every write. This completely eliminates the 10-second buffering delay.

**Pros:**
- Eliminates the flush delay entirely for the native logging path
- Standard approach recommended in GNU Screen documentation
- No performance impact on the command itself (the flush is async in screen's event loop)
- Clean implementation — the temp file is created and deleted by our code
- Works on all screen versions that support the `-c` option (very old)

**Cons:**
- Requires creating/deleting an additional temp file per invocation
- The `-c` option overrides the user's `~/.screenrc` — but since we only put
`logfile flush 0` in it, there's no conflict for our use case

**Note on the tee fallback path (screen < 4.5.1):**
The `logfile flush 0` setting in a screenrc also works on older screen versions (it's
a screenrc directive, not version-gated). However, for the tee fallback path, screen
itself doesn't write the log — `tee` does. So `logfile flush 0` doesn't help for
the tee path. For this path, Option B (retry) is added as a complementary fix.

### Option D: Use `script` instead of screen's `-L`

**Approach:** Replace screen's log capture with `script -c "command" logfile`, which
uses a pty and has more reliable flush-on-exit behavior.

**Pros:** `script` is designed for output capture and handles flush correctly.

**Cons:**
- `script` behavior varies significantly between Linux (util-linux) and macOS (BSD)
- macOS `script` uses different flags (`-q`, `-c` not available)
- Complex to handle cross-platform
- Loses screen's session management features

---

## Chosen Solution: Option C + Option B

**Implementation:**

For native logging path (screen ≥ 4.5.1):
```js
// Create temporary screenrc with immediate log flush
const screenrcFile = path.join(os.tmpdir(), `screenrc-${sessionName}`);
fs.writeFileSync(screenrcFile, 'logfile flush 0\n');

const logArgs = ['-dmS', sessionName, '-c', screenrcFile, '-L', '-Logfile', logFile];
screenArgs = [...logArgs, shell, shellArg, effectiveCommand];

// Clean up screenrc on completion
try { fs.unlinkSync(screenrcFile); } catch {}
```

For tee fallback path (screen < 4.5.1): add a brief retry when the log file is empty
immediately after session completion.

**Why this is correct:**
- Setting `logfile flush 0` is the standard, official way to control screen's log
flush behavior, documented in the GNU Screen manual
- It directly addresses the root cause (buffered writes) rather than fighting symptoms
- The retry for the tee path handles the TOCTOU race for the remaining edge case
Loading
Loading