0.16.2: concurrent sessions, web LiteRT-LM inference, backend reporting #294

DenisovAV merged 28 commits into main from release/0.16.2 on May 29, 2026

@DenisovAV
Owner

0.16.2

Release PR for 0.16.2. Highlights:

Concurrent sessions (#226)

openSession() / openChat() let one loaded model serve several independent dialogues — shared weights, isolated history per session. Concurrent contexts, serialized inference: only one session generates at a time (parallel on-device inference would contend for the accelerator and risk OOM). Single InferenceModel interface across all paths; legacy createSession() / session singleton untouched; optional maxConcurrentSessions cap.

  • .litertlm (FFI, all native): virtual-session multiplexer. The engine allows one live conversation, so each session keeps its history in Dart and replays it into that conversation (via the messages_json preface) when it becomes the active session.
  • .litertlm (web, @litert-lm/core): separate conversations, serialized.
  • .task (MediaPipe, Android/iOS): N real LlmInferenceSession instances live at once (each with its own KV cache), with generation serialized by a mutex. Added via session-scoped HostApi methods keyed by sessionId (pigeon bumped 24→26).
  • .task (MediaPipe web): not yet — openSession() throws UnsupportedError.

Verified end-to-end: .litertlm 20/20 on macOS, Android, Linux (T4 GPU), Windows (Lunar Lake), iOS; web probe in Chrome; .task 3/3 on Android (Pixel 8) and iOS Simulator. Legacy 18-test FFI gate stays green on every native platform.
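
The "concurrent contexts, serialized inference" contract can be illustrated with a minimal, self-contained sketch. It does not use the flutter_gemma API: `SimpleLock`, `FakeModel`, `FakeSession` and `ask()` are hypothetical names invented for this example, and the "generation" is just a delayed echo.

```dart
import 'dart:async';

/// Minimal FIFO async lock. Hypothetical stand-in for a real mutex package.
class SimpleLock {
  Future<void> _tail = Future<void>.value();

  /// Runs [action] only after every previously queued action has finished.
  Future<T> protect<T>(Future<T> Function() action) {
    final previous = _tail;
    final done = Completer<void>();
    _tail = done.future;
    return previous.then<T>((_) => action()).whenComplete(() => done.complete());
  }
}

/// Hypothetical model: one shared generation lock, many sessions.
class FakeModel {
  final SimpleLock _generationLock = SimpleLock();
  int activeGenerations = 0;
  int maxObservedParallel = 0;

  FakeSession openSession() => FakeSession(this);
}

/// Hypothetical session: owns its history, borrows the model's lock.
class FakeSession {
  FakeSession(this._model);

  final FakeModel _model;
  final List<String> history = [];

  Future<String> ask(String prompt) => _model._generationLock.protect(() async {
        _model.activeGenerations++;
        if (_model.activeGenerations > _model.maxObservedParallel) {
          _model.maxObservedParallel = _model.activeGenerations;
        }
        // Pretend to decode tokens.
        await Future<void>.delayed(const Duration(milliseconds: 10));
        final reply = 'echo(${history.length}): $prompt';
        history
          ..add(prompt)
          ..add(reply);
        _model.activeGenerations--;
        return reply;
      });
}

Future<void> main() async {
  final model = FakeModel();
  final a = model.openSession();
  final b = model.openSession();

  // Both requests are issued at once, but the lock runs them one at a time.
  await Future.wait([a.ask('I am A'), b.ask('I am B')]);

  print(a.history); // only A's turn
  print(b.history); // only B's turn
  print(model.maxObservedParallel);
}
```

In this sketch each session's history list only ever contains its own turns, and the counter shows that at most one generation was in flight at any moment.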

Web .litertlm inference

Gemma .litertlm models run in the browser via @litert-lm/core (WebGPU + WASM, early preview). Text-only subset — no vision/audio/thinking/function-calling/LoRA yet; stopGeneration() is best-effort via conversation.cancel(); OPFS streaming (WebStorageMode.streaming) for models >2 GB. Documented in README under "Web .litertlm support & limitations".

Backend reporting (#288, thanks @merlinnot)

InferenceModel.activeBackend getter + NPU→GPU→CPU fallback on the FFI path with BackendInitException carrying per-attempt detail.

Other

Native dylibs unchanged (no native-v bump).

DenisovAV added 28 commits May 24, 2026 15:10

Brings @litert-lm/core (LiteRT-LM v0.12.0+ web JS API) into the web
platform. .litertlm models now run in the browser via WebGPU/WASM,
text-only per the upstream early-preview status. Verified end-to-end
in Chrome with Gemma 4 E2B web variant (~2 GB).

- new lib/web/web_model_source.dart: sealed WebModelSource +
  WebModelSourceResolver — single resolver shared by both MediaPipe
  (WebInferenceModel) and LiteRT-LM (LiteRtLmWebInferenceModel) paths.
  Replaces the inline activeModel-lookup + storage-mode branch that
  used to live in WebInferenceModel.createSession.
- new lib/web/litert_lm_web.dart + lib/web/litert_lm_web_inference.dart:
  JS interop wrappers for @litert-lm/core Engine / Conversation, and
  the Dart-side LiteRtLmWebInferenceModel + LiteRtLmWebSession.
  Engine routed via @JS('Engine') to match the host page's
  window.Engine = m.Engine ESM shim. AsyncIterable returned by
  sendMessageStreaming is normalised via [Symbol.asyncIterator]() to
  an AsyncIterator before pumping with .next() (dart:js_interop has
  no first-class async-iterator type yet — dart-lang/sdk#60457).
  stopGeneration now also calls conversation.cancel() upstream.
- lib/web/flutter_gemma_web.dart: createModel() branches on
  ModelFileType.litertlm vs .task and hands both engines the same
  WebModelSourceResolver. Pre-existing cast-on-singleton bug fixed
  (it would throw the moment a second InferenceModel type existed
  on web).
- lib/core/infrastructure/web_opfs_*: add getStream(filename) →
  Future<JSAny> alongside the existing getStreamReader, returning a
  raw ReadableStream rather than a reader (the form @litert-lm/core
  Engine.create accepts).
- example/web/index.html: load @litert-lm/core@0.12.1 ESM and expose
  window.Engine + window.litertLmReady promise so Dart can await the
  module before static interop calls.
- example/lib/main.dart: switch default WebStorageMode to streaming
  (required for .litertlm web models >2 GB to avoid Chrome's blob
  fetch limit).
- example/lib/models/model.dart: add gemma4_E2B_litertlm and
  gemma4_E4B_litertlm entries pointing at the upstream -web.litertlm
  HF artefacts; expose webUrl on existing gemma3n_*_litertlm entries
  so they show up on web too.
- pubspec.yaml + ios/podspec + CLAUDE.md: bump version to 0.16.2,
  large_file_handler ^0.3.1 → ^0.4.0.
- CHANGELOG.md + README.md "What's new in 0.16" + test/web smoke +
  example/integration_test/litertlm_web_test.dart for flutter drive
  -d chrome (see updated Rule 6 in CLAUDE.md — web target needs
  flutter drive, native targets still must not).

- New lib/core/parsing/sdk_text_extractor.dart: shared text extractor for
  LiteRT-LM JSON response chunks (text vs `channels.thought`). Single source
  of truth for both the native FFI client (lib/core/ffi/litert_lm_client.dart)
  and the web @litert-lm/core path. FFI client now delegates to it.
- LiteRtLmWebSession now `with RawSdkResponseSession` + accumulates
  _lastRawResponse on the Gemma 4 branch — chat.dart:151 reads it and runs
  SdkResponseParser.extractToolCalls automatically, no changes to chat.dart.
- Tools path uses SdkResponseParser.serializeToolsForSdk(tools) so the
  preface.tools[] JSON shape is byte-identical to what native FFI sends to
  the same SDK. Sets enableConstrainedDecoding when tools are present.
- LiteRtLmConversationOptions gains enableConstrainedDecoding + audit
  surface (sessionConfig, filterChannelContentFromKvCache,
  prefillPrefaceOnInit) to mirror native conversation_config setters.
- sendMessageStreaming widened from JSString to JSAny so multimodal
  Message objects (role+content[]) can be passed alongside text strings.
- Cancel surface: conversation.cancel() wired in for stopGeneration().
- example/lib/models/model.dart: add functionGemma_270M_litertlm entry
  for probing tool-calling on @litert-lm/core (blocked by upstream
  Streaming kTfLitePrefillDecode not supported — see drafts/email).
- .gitignore: /.drafts/ so local email drafts stay local.

PR #288 added `activeBackend` as a required getter on `InferenceModel`.
The web LiteRT-LM inference path (LiteRtLmWebInferenceModel) was
introduced separately in release/0.16.2 and didn't pick up the new
member during the merge — implement it as null (same as
WebInferenceModel: the @litert-lm/core engine doesn't surface a final
backend).

Also document #288 in the 0.16.2 CHANGELOG entry.

Add openSession() / openChat() / sessions getter to InferenceModel
as the public surface for concurrent dialogues on a single loaded
model (#226).

Default impls:
- openSession() throws UnsupportedError with a message pointing to
  the .litertlm + .task concrete impls landing in subsequent steps.
- openChat() builds an InferenceChat with sessionCreator routed
  through openSession() — the chat owns an independent session that
  doesn't touch the legacy `session` field.
- sessions getter returns an unmodifiable view of [session ?? nothing]
  on the abstract base; concrete impls extend this with their open
  sessions in later steps.

Add dartdoc on the legacy `session` getter clarifying that it tracks
only the createSession() singleton; multi-session apps should read
sessions instead. No @deprecated annotation yet — that's a 1.0 call.

No behavior change for existing callers: createSession() / createChat()
contracts are unchanged.
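
A rough sketch of how such default implementations might look on an abstract base. This is not the plugin's actual code: `ModelBase`, `Session`, `Chat` and `LegacyOnlyModel` are simplified, hypothetical stand-ins for `InferenceModel`, its session type and `InferenceChat`.

```dart
import 'dart:async';

abstract class Session {
  Future<void> close();
}

/// Hypothetical chat wrapper that owns exactly one session.
class Chat {
  Chat(this.session);
  final Session session;
}

abstract class ModelBase {
  /// Legacy singleton slot, written only by createSession().
  Session? session;

  Future<Session> createSession();

  /// Default: unsupported until a concrete backend overrides it.
  Future<Session> openSession() async {
    throw UnsupportedError('openSession() is not implemented for this backend');
  }

  /// Default: the chat gets its own session via openSession(),
  /// so the legacy [session] field is never touched.
  Future<Chat> openChat() async => Chat(await openSession());

  /// Default: read-only view containing just the legacy session, if any.
  List<Session> get sessions {
    final legacy = session;
    return List<Session>.unmodifiable([if (legacy != null) legacy]);
  }
}

class _NoopSession implements Session {
  @override
  Future<void> close() async {}
}

/// A backend that has not opted in to multi-session support yet.
class LegacyOnlyModel extends ModelBase {
  @override
  Future<Session> createSession() async => session = _NoopSession();
}

Future<void> main() async {
  final model = LegacyOnlyModel();
  print(model.sessions.length); // no legacy session yet
  await model.createSession();
  print(model.sessions.length); // legacy session is now listed
  try {
    await model.openChat();
  } on UnsupportedError catch (e) {
    print('openChat failed: ${e.message}');
  }
}
```

Because openChat() is written in terms of openSession(), a backend in this sketch only needs to override openSession() (and extend the sessions getter) to get working multi-session chats.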
Introduce LiteRtLmConversationHandle to decouple conversation lifetime
from LiteRtLmFfiClient (#226). Each handle owns one
Pointer<LiteRtLmConversation>; the client holds the engine and tracks
live handles in a Set for shutdown cleanup. The LiteRT-LM C API already
supports multiple conversations per engine — this removes the
single-conversation assumption from the Dart wrapper.

- createConversationHandle() is the new factory returning a handle.
- Per-conversation native calls moved to private _…On(conv, …) methods
  (_chatOn, _chatRawOn, _sendMessageOn, _sendMessageStreamRawOn,
  _cancelOn, _getMetricsOn, _deleteConversation). The handle delegates
  to these with its own conversation pointer.
- Legacy single-session methods (createConversation, chat, chatRaw,
  sendMessage, sendMessageStreamRaw, cancelGeneration,
  getSessionMetrics, closeConversation) now route through an internal
  _legacyHandle, preserving the existing FfiInferenceModelSession path
  unchanged. _assertConversation checks the legacy handle.
- shutdown() closes every live handle before deleting the engine.

Verified: flutter analyze clean; 387 unit tests green. No behavior
change for the existing single-session FFI path.

Note: `flutter build macos --debug` currently fails with a "Cycle
inside Flutter Assemble" from a duplicated "[flutter_gemma] Setup
LiteRT-LM macOS" script phase in example/macos/.../project.pbxproj —
this is preexisting (reproduces on the prior commit) and unrelated to
this Dart-only refactor; tracked separately.

Wire openSession() into FfiInferenceModel (#226). Each open session
owns its own LiteRtLmConversationHandle — independent KV cache,
history, and raw-response buffer.

- Extract ConversationHandle interface (litert_lm_client.dart) so the
  session depends on an abstraction the test layer can fake.
  LiteRtLmConversationHandle implements it.
- FfiInferenceModelSession now takes a ConversationHandle instead of
  the shared LiteRtLmFfiClient; routes chat/chatRaw/cancel/metrics/close
  through its own handle. Static extractTextFromResponse is unchanged.
- FfiInferenceModel: createSession() uses createConversationHandle()
  for the legacy singleton lane (overwrite + close old). New
  openSession() appends a detached session to _openSessions. sessions
  getter returns the union. close() cascade-closes every session in
  both lanes before engine shutdown.

Verification (host VM, no native engine needed):
- test/core/ffi/multi_session_test.dart — 6 tests using a fake
  ConversationHandle. Proves session-level isolation: two sessions with
  distinct handles produce distinct outputs ("I am A" vs "I am B");
  close() of one doesn't touch the other; Gemma 4 raw-response capture;
  StateError after close; stopGeneration routes to the handle.
- flutter analyze clean; full 387-test suite green.

The fake-handle seam mirrors PR #288's injectable-client pattern and
lets us verify multi-session orchestration without the native build
(macOS `flutter build` is blocked by a preexisting Xcode script-phase
cycle, tracked separately).

The example macOS build failed with "Cycle inside Flutter Assemble"
because the "[flutter_gemma] Setup LiteRT-LM macOS" post_install
script phase was added to both the Runner and RunnerTests targets,
and declared no outputs.

RunnerTests inherits Runner's framework search paths (`inherit!
:search_paths`) and has no Contents/Frameworks of its own, so a copy
of the phase there created a cross-target dependency on Runner's
framework output that Xcode flagged as a cycle. The missing declared
output also made Xcode treat the phase as "runs every build" and
prevented deterministic ordering relative to the qdrant native-asset
node.

Fix in example/macos/Podfile post_install:
- Only attach the phase to the `Runner` app target; remove any stale
  copy from non-app targets (covers projects that ran the old Podfile).
- Declare a sentinel output
  ($(DERIVED_FILE_DIR)/flutter_gemma_litertlm_macos.stamp) and `touch`
  it at the end of the script so Xcode can order the phase.

Regenerated project.pbxproj (single phase on Runner, with outputPaths)
and Podfile.lock. `flutter clean && flutter build macos --debug` now
succeeds. Unblocks macOS integration tests for the multi-session work.

Add package:mutex ^3.1.0 and serialize native conversation generation
on LiteRtLmFfiClient (#226). The LiteRT-LM C API is not documented as
reentrant on a single engine, so two concurrent sessions could race
inside liblitert_lm.

- _sendMessageStreamRawOn now acquires _nativeMutex for the whole
  generation (async* wrapper around the renamed _doSendMessageStreamRawOn
  body) and releases on completion/error.
- _sendMessageOn wraps its native call in _nativeMutex.protect().
- Cancel (_cancelOn) intentionally does NOT take the lock — it must be
  able to interrupt an in-flight streaming call.

Concurrent sessions live independently (own KV cache + history) but
their inference serializes at the native boundary — "concurrent
contexts, serialized inference". Uncontended on the single-session
fast path (one acquire/release on an empty lock).

Verified: flutter analyze clean; 393 unit tests green; macOS native
build succeeds (after `flutter clean` — the Flutter Native Assets
graph requires a clean after dependency changes to avoid the
qdrant-hook Flutter Assemble cycle).

Wire openSession() into LiteRtLmWebInferenceModel (#226). Each open
session owns its own @litert-lm/core Conversation JS object —
independent KV cache + history; the upstream Engine.createConversation()
already supports multiple conversations per Engine.

- Extract _buildConversation() helper shared by createSession (legacy
  singleton, overwrites _session) and openSession (detached, appends to
  _openSessions). Keeps the sampler/preface/tools/thinking JS interop
  wiring in one place so the two lanes can't drift.
- Add _openSessions Set + sessions getter (union of legacy + open) +
  close() cascade-closes both lanes before engine.delete().
- Vision/audio remain force-disabled on both lanes (upstream
  @litert-lm/core@0.12.1 doesn't expose the executor setters).

Also fix a preexisting web-compile gap: litert_lm_client_stub.dart was
missing shutdown(), which flutter_gemma_mobile.dart references via the
PR #288 initializeFfiRuntime shutdownClient callback. The chrome web
test wasn't run after #288 so it stayed latent; add the stub method.

Verified: flutter analyze clean; web library compiles in Chrome
(test/web/litert_lm_web_test.dart); 393 host unit tests green.

Add optional `int? maxConcurrentSessions` to createModel() /
getActiveModel(), threaded through every InferenceModel impl (#226).
Default null = no cap (backward-compatible). When set, the (cap+1)-th
openSession() throws StateError so callers must close a session first
— a guard against OOM from multiple concurrent KV caches on mobile.

Threaded through:
- interface createModel() + getActiveModel()
- FfiInferenceModel (cap enforced in openSession, before native call)
- LiteRtLmWebInferenceModel (cap enforced in openSession)
- MobileInferenceModel + WebInferenceModel (field plumbed; their
  openSession still throws UnsupportedError until the MediaPipe
  ProxyApi path lands in step 7)
- FfiInferenceModel web stub constructor (so dart2js compiles the
  mobile createModel call site that passes the param)
- all four createModel call sites (mobile/desktop/web)

Verified: flutter analyze clean; 395 host tests green, including two new cap
tests (cap=0 → StateError before native; null → unlimited); web
library compiles in Chrome.
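
The cap semantics described above (null means unlimited, and the (cap+1)-th openSession() throws StateError until a session is closed) can be sketched as follows. `CappedModel` and `CappedSession` are hypothetical names used only for illustration; the real implementations enforce the cap inside each InferenceModel.

```dart
/// Hypothetical model that enforces an optional session cap.
class CappedModel {
  CappedModel({this.maxConcurrentSessions});

  /// null means "no cap" (the backward-compatible default).
  final int? maxConcurrentSessions;
  final Set<CappedSession> _open = {};

  int get openCount => _open.length;

  CappedSession openSession() {
    final cap = maxConcurrentSessions;
    // Check the cap before allocating anything for the new session.
    if (cap != null && _open.length >= cap) {
      throw StateError(
          'maxConcurrentSessions ($cap) reached; close a session first');
    }
    final session = CappedSession._(this);
    _open.add(session);
    return session;
  }
}

class CappedSession {
  CappedSession._(this._owner);
  final CappedModel _owner;

  /// Closing frees a slot under the cap.
  void close() => _owner._open.remove(this);
}

void main() {
  final model = CappedModel(maxConcurrentSessions: 2);
  final first = model.openSession();
  model.openSession();
  try {
    model.openSession(); // third session exceeds cap of 2
  } on StateError catch (e) {
    print('rejected: ${e.message}');
  }
  first.close();
  model.openSession(); // succeeds again after a close
  print(model.openCount);
}
```

Checking the cap before creating the session mirrors the "cap enforced in openSession, before native call" note above, so a rejected call allocates nothing.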
LiteRT-LM allows only one live conversation per engine, so openSession()
sessions multiplex: each virtual session keeps its history in Dart and
replays it into the single shared conversation via a messages_json preface
on switch. Mutex-serialized. Logically concurrent contexts, serialized
inference. Verified on a real macOS engine (isolated A/B histories) and the
18-test FFI regression gate stays green. Web probe confirms @litert-lm/core
has no such limit, so the web openSession (N real Conversations) is correct.
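
The replay-on-switch idea can be sketched without a native engine. Everything here is an assumption made for illustration: `FakeEngine`, `Multiplexer`, `VirtualSession` and `resetWithPreface()` are hypothetical, the sketch is synchronous, and it omits the mutex that serializes the real path.

```dart
/// Hypothetical engine that can hold only one conversation context.
class FakeEngine {
  List<String> context = [];
  int resets = 0;

  /// Stands in for recreating the conversation with a history preface.
  void resetWithPreface(List<String> messages) {
    context = List.of(messages);
    resets++;
  }

  /// "Generates" a reply that lists every user message in the context.
  String generate(String prompt) {
    context.add('user: $prompt');
    final seen = context.where((m) => m.startsWith('user: ')).join(' | ');
    final reply = 'seen [$seen]';
    context.add('assistant: $reply');
    return reply;
  }
}

class Multiplexer {
  Multiplexer(this.engine);

  final FakeEngine engine;
  VirtualSession? _active;

  VirtualSession open() => VirtualSession(this);

  String send(VirtualSession session, String prompt) {
    // On a switch, replay the incoming session's history into the engine.
    if (!identical(_active, session)) {
      engine.resetWithPreface(session.history);
      _active = session;
    }
    final reply = engine.generate(prompt);
    session.history
      ..add('user: $prompt')
      ..add('assistant: $reply');
    return reply;
  }
}

/// Keeps its own history on the Dart side.
class VirtualSession {
  VirtualSession(this._mux);

  final Multiplexer _mux;
  final List<String> history = [];

  String ask(String prompt) => _mux.send(this, prompt);
}

void main() {
  final mux = Multiplexer(FakeEngine());
  final a = mux.open();
  final b = mux.open();

  a.ask('my name is alice');
  b.ask('my name is bob');
  // Switching back to A replays A's history, so B's turn is not visible.
  print(a.ask('who am I?'));
}
```

The cost of this design is that every switch re-prefills the incoming session's history, which is why consecutive turns on the same session skip the replay.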
Move the two multi-session integration tests (isolated A/B history; close
one, other survives) into the canonical litertlm_ffi_test.dart gate as a
Multi-session group, reusing _localPath() for cross-platform model
resolution and the shared GPU model. Drops the standalone macOS-hardcoded
file. Gate now 20/20 on macOS GPU (18 legacy + 2 multi-session).

pubspec.lock picks up the mutex dependency added for the multi-session
conversation serializer; iOS Podfile.lock syncs integration_test + audio
pods from the test run.

Bump pigeon dev dep ^24.1.0 → ^26.0.0 (resolved 26.1.0). Schema
unchanged; regenerated pigeon.g.dart, PigeonInterface.g.kt,
PigeonInterface.g.swift. Diff is internal generator renames only
(pigeonVar_ prefixes) — public PlatformService methods unchanged.
Verified: analyze clean, 395 unit tests pass, apk + ios build green.
Preparation for the MediaPipe MultiSession ProxyApi in 7b.

Add 9 session-scoped methods to PlatformService keyed by int sessionId
(createSessionForId, closeSessionId, addQueryChunkToSession,
addImageToSession, addAudioToSession, generateResponseForSession,
generateResponseAsyncForSession, stopGenerationForSession,
sizeInTokensForSession). Legacy singleton methods unchanged. Regenerated
pigeon bridge. Native Kotlin/Swift impls land in 7c/7d (intentionally
incomplete native state until then).

Add sessionMap<Long, InferenceSession> alongside the singleton session and
implement the 9 session-scoped pigeon methods. Each resolves the session by
id and delegates to the existing InferenceSession API; createSessionForId
builds via engine.createSession (MediaPipe allows N live LlmInferenceSession
per engine). generateResponseAsyncForSession uses a new
MediaPipeSession.generateResponseAsyncTagged that tags each chunk with
sessionId and pushes over the shared event channel directly — no
endOfStream(), so other sessions' streams aren't closed. Legacy singleton
path untouched. apk build green.

Mirror the Kotlin session-scoped methods on iOS: sessionMap<Int64,
InferenceSession> guarded by a serial queue, plus the 9 *ForSession pigeon
methods. createSessionForId builds an independent InferenceSession (MediaPipe
allows N live sessions per engine); generateResponseAsyncForSession tags each
token with sessionId and emits a tagged done event instead of
FlutterEndOfEventStream (which would close the channel for other sessions).
closeModel now clears sessionMap. Legacy singleton path untouched. ios build
green.

MobileInferenceModel.openSession() now creates concurrent MediaPipe sessions
via the session-scoped pigeon methods: a Set of
MultiSessionMobileInferenceModelSession, a monotonic sessionId,
maxConcurrentSessions cap, sessions
getter (union of legacy + open), and a close() cascade — mirroring the FFI
path. The new session class routes through *ForSession methods keyed by id,
serializes generation through a shared Mutex, and demuxes the shared
flutter_gemma_stream EventChannel by sessionId (closing on the tagged
done event). Legacy singleton listener now ignores tagged events. openChat()
works via the interface default (routes through openSession). analyze clean,
395 unit tests pass.

Add multi_session_mediapipe_test.dart: two openSession dialogues on a .task
model keep isolated history, closing one leaves the other usable, and legacy
createSession still works after openSession. Verified 3/3 on Android (Pixel 8,
gemma3-1b-it-int4.task, CPU): A recalled "alice", B "bob".

Add IOS_TEST_DOCS_DIR dart-define support to the MediaPipe multi-session
test so the iOS Simulator can read a host-side .task (its app sandbox is
ephemeral across runs). Verified 3/3 on iOS Simulator (Swift, CPU):
A recalled "alice", B "bob", legacy createSession unaffected — confirming
the Swift session-scoped impl alongside the Android/Kotlin one.

Document the new multi-session API (openSession/openChat): why it exists
(shared weights, N independent dialogues), when to use it, the
concurrent-contexts/serialized-inference contract, per-platform behavior
table, and the memory/cap caveats — in both the README and the
InferenceModel.openSession dartdoc. Add a "Web .litertlm support &
limitations" README section spelling out the @litert-lm/core early-preview
subset (text-only; no vision/audio/thinking/function-calling/LoRA;
best-effort stopGeneration via conversation.cancel; OPFS streaming for
>2GB). Fix the stale web stopGeneration docstring (it does call
conversation.cancel). No hard version promise for web .task multi-session.

…ate guard

C1: serialize generation across web sessions with a shared Mutex (the
documented "concurrent contexts, serialized inference" was not enforced on
web — separate JS conversations ran in parallel). getResponseAsync acquires
it in onListen and releases in every terminal path (done/error/cancel) so an
abandoned stream can't hold it forever.

Also: close() now calls conversation.cancel() before the model frees the
engine (prevents use-after-free of WASM/WebGPU state while an iter.next()
Promise is in flight); _ensureEngine() gets a concurrent-call guard so two
overlapping openSession/createSession calls can't both run Engine.create and
leak the first engine.
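
One common way to build such a concurrent-call guard is to cache the in-flight Future so that overlapping callers share it. The sketch below assumes that approach; `EngineHolder`, `ensureEngine()` and the plain `Object` engine are hypothetical, and failure handling (clearing the cached Future so a later call can retry) is left out.

```dart
import 'dart:async';

/// Hypothetical holder that creates its engine at most once.
class EngineHolder {
  int createCalls = 0;
  Future<Object>? _pending;

  /// Overlapping callers all receive the same in-flight Future.
  Future<Object> ensureEngine() => _pending ??= _create();

  Future<Object> _create() async {
    createCalls++;
    // Stand-in for an expensive asynchronous engine creation.
    await Future<void>.delayed(const Duration(milliseconds: 10));
    return Object();
  }
}

Future<void> main() async {
  final holder = EngineHolder();

  // Two overlapping calls, as from openSession() and createSession().
  final engines =
      await Future.wait([holder.ensureEngine(), holder.ensureEngine()]);

  print(identical(engines[0], engines[1])); // same engine instance
  print(holder.createCalls); // creation ran once
}
```

Because the Future is stored synchronously before the first await, the second caller can never observe a state where creation has started but nothing is cached.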
C2: startVirtualTurn now uses a StreamController (not async*) so the native
mutex is released on done/error AND consumer cancel/abandon — an abandoned
stream no longer holds the lock forever, deadlocking other sessions.
C4: cancelVirtualTurn is token-scoped — a session's stopGeneration() can no
longer cancel a different session's in-flight generation.
C5: releaseVirtualConversation defers the native teardown (and cancels) when
a turn is in flight, instead of deleting the conversation pointer the live
stream is using (use-after-free on close-during-generate).
I2: _run records the user+assistant turns in a finally, so a mid-stream error
doesn't silently drop a turn the native already saw (context divergence).
Plus: virtual sessions reject image/audio loudly (text-only history replay)
instead of silently dropping them.
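
The C2 fix pattern (a StreamController whose onCancel releases the lock) can be sketched as below. This is an illustration under stated assumptions, not the plugin's code: `SimpleMutex` and `generate()` are hypothetical, and a real implementation would also cancel the native generation on abandon.

```dart
import 'dart:async';
import 'dart:collection';

/// Minimal mutex with explicit acquire/release (hypothetical helper).
class SimpleMutex {
  bool _locked = false;
  final Queue<Completer<void>> _waiters = Queue<Completer<void>>();

  bool get isLocked => _locked;

  Future<void> acquire() async {
    if (!_locked) {
      _locked = true;
      return;
    }
    final waiter = Completer<void>();
    _waiters.add(waiter);
    await waiter.future; // ownership is handed over by release()
  }

  void release() {
    if (_waiters.isNotEmpty) {
      _waiters.removeFirst().complete(); // stays locked for the next owner
    } else {
      _locked = false;
    }
  }
}

/// Emits [tokens] while holding [mutex]; releases on done, error or cancel.
Stream<String> generate(SimpleMutex mutex, List<String> tokens) {
  late final StreamController<String> controller;
  var acquired = false;
  var released = false;
  var cancelled = false;

  void releaseOnce() {
    if (acquired && !released) {
      released = true;
      mutex.release();
    }
  }

  Future<void> pump() async {
    await mutex.acquire();
    acquired = true;
    try {
      for (final token in tokens) {
        if (cancelled) break;
        await Future<void>.delayed(const Duration(milliseconds: 5));
        if (cancelled) break;
        controller.add(token);
      }
    } finally {
      releaseOnce();
      unawaited(controller.close());
    }
  }

  controller = StreamController<String>(
    onListen: pump,
    // Runs as soon as the consumer cancels, even mid-generation.
    onCancel: () {
      cancelled = true;
      releaseOnce();
    },
  );
  return controller.stream;
}

Future<void> main() async {
  final mutex = SimpleMutex();

  // Session A's consumer abandons the stream after the first token.
  final first = await generate(mutex, ['a1', 'a2', 'a3']).first;
  print('$first, locked afterwards: ${mutex.isLocked}');

  // Session B is not blocked by A's abandoned stream.
  print(await generate(mutex, ['b1', 'b2']).toList());
}
```

The `releaseOnce()` guard matters because both the pump's finally block and onCancel may run for the same stream; without it the mutex could be released twice.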
C2: MultiSessionMobileInferenceModelSession.getResponseAsync uses a
StreamController (not async*/yield*) so the generation mutex is released on
done/error AND consumer cancel/abandon — an abandoned stream no longer holds
it forever and deadlocks other sessions.
C3: a native generation error now arrives as a TAGGED DATA event
{code: ERROR, sessionId, message} which is demuxed to the right session,
closes its controller, and releases the mutex. A synchronous failure of the
generateResponseAsyncForSession RPC is caught (.catchError) so it surfaces
and releases the mutex instead of hanging the controller.

…+iOS)

C3 (native half): a .task generation-time error is now emitted as a TAGGED
DATA event {code: ERROR, sessionId, message} over the shared event channel,
not via eventSink.error()/FlutterError. A channel error reaches every
session's listener and carries no usable sessionId, so Dart could neither
route it nor close+release the right session (deadlock). The tagged data
event is demuxed by sessionId on the Dart side.
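
A minimal sketch of demultiplexing tagged events from one shared channel. The `{code: ERROR, sessionId, message}` shape comes from the commit message; the `'DONE'` code for the tagged done event and the `'token'` key are assumptions, and `SessionDemux` is a hypothetical class, not the plugin's implementation.

```dart
import 'dart:async';

/// Routes events from one shared stream to per-session controllers.
class SessionDemux {
  SessionDemux(Stream<Map<String, Object?>> shared) {
    _sub = shared.listen(_route);
  }

  late final StreamSubscription<Map<String, Object?>> _sub;
  final Map<int, StreamController<String>> _bySession = {};

  Stream<String> register(int sessionId) {
    final controller = StreamController<String>();
    _bySession[sessionId] = controller;
    return controller.stream;
  }

  void _route(Map<String, Object?> event) {
    final id = event['sessionId'];
    if (id is! int) return; // untagged event: belongs to the legacy path
    final controller = _bySession[id];
    if (controller == null) return;

    final code = event['code'];
    if (code == 'DONE') {
      // Assumed code for the tagged done event.
      _bySession.remove(id);
      controller.close();
    } else if (code == 'ERROR') {
      // Only this session sees the error; others keep streaming.
      _bySession.remove(id);
      controller.addError(StateError('${event['message']}'));
      controller.close();
    } else {
      controller.add('${event['token']}'); // 'token' key is an assumption
    }
  }

  Future<void> dispose() => _sub.cancel();
}

Future<void> main() async {
  final channel = StreamController<Map<String, Object?>>();
  final demux = SessionDemux(channel.stream);

  final aTokens = demux.register(1).toList();
  final bTokens = <String>[];
  Object? bError;
  final bDone = Completer<void>();
  demux.register(2).listen(bTokens.add,
      onError: (Object e) => bError = e, onDone: bDone.complete);

  channel.add({'sessionId': 1, 'token': 'al'});
  channel.add({'sessionId': 2, 'token': 'bo'});
  channel.add({'sessionId': 2, 'code': 'ERROR', 'message': 'out of memory'});
  channel.add({'sessionId': 1, 'token': 'ice'});
  channel.add({'sessionId': 1, 'code': 'DONE'});

  print(await aTokens); // session 1 is unaffected by session 2's error
  await bDone.future;
  print('$bTokens, error: $bError');
  await demux.dispose();
  await channel.close();
}
```

Carrying errors as ordinary tagged data is what lets the router pick the right controller; a channel-level error would arrive without a sessionId and could not be attributed to one session.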
DenisovAV merged commit 2785056 into main on May 29, 2026
3 checks passed
DenisovAV added a commit that referenced this pull request May 29, 2026
…se tests (#295)

* 0.16.2: tests — edge cases for the multi-session review fixes

Add three integration tests to the Multi-session group covering the PR #294
review fixes: abandoning a stream releases the mutex (C2, no deadlock),
closing a session mid-generation is safe (C5, no use-after-free), and
openSession rejects image/audio loudly on the text-only .litertlm virtual
path.

* 0.16.2: exclude chromedriver from pub package

The chromedriver binary (a maintainer tool for running web integration
tests via `flutter drive -d chrome`) sits at repo root and is ~16 MB. It
leaked into the published package, bloating the archive to 8 MB. Add
`chromedriver/` to .pubignore — archive drops to ~683 KB.