0.16.2: concurrent sessions, web LiteRT-LM inference, backend reporting by DenisovAV · Pull Request #294 · DenisovAV/flutter_gemma

DenisovAV · 2026-05-29T13:21:57Z

0.16.2

Release PR for 0.16.2. Highlights:

Concurrent sessions (#226)

openSession() / openChat() let one loaded model serve several independent dialogues — shared weights, isolated history per session. Concurrent contexts, serialized inference: only one session generates at a time (parallel on-device inference would contend for the accelerator and risk OOM). Single InferenceModel interface across all paths; legacy createSession() / session singleton untouched; optional maxConcurrentSessions cap.

.litertlm (FFI, all native): virtual-session multiplexer — the engine allows one live conversation, so the active session's history is replayed (via the messages_json preface) on switch.
.litertlm (web, @litert-lm/core): separate conversations, serialized.
.task (MediaPipe, Android/iOS): N real LlmInferenceSession live at once (each its own KV cache), generation serialized by a mutex. Added via session-scoped HostApi methods keyed by sessionId (pigeon bumped 24→26).
.task (MediaPipe web): not yet — openSession() throws UnsupportedError.
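
The "concurrent contexts, serialized inference" contract above can be pictured with a small sketch. It is written in TypeScript purely as an illustration and is not the plugin's Dart code; `SimpleMutex`, `SketchSession`, and the fake inference step are hypothetical names.

```typescript
// Illustrative sketch only: hypothetical names, not the flutter_gemma API.
// A promise-chain mutex: each caller waits for the previous holder to finish.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();

  protect<T>(task: () => Promise<T>): Promise<T> {
    const run = this.tail.then(task);
    // Keep the chain usable even if the task rejects.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

// Each session keeps its own history; all sessions share one generation lock.
class SketchSession {
  readonly history: string[] = [];

  constructor(
    private readonly name: string,
    private readonly lock: SimpleMutex,
    private readonly trace: string[],
  ) {}

  ask(prompt: string): Promise<string> {
    return this.lock.protect(async () => {
      this.trace.push(`${this.name}:start`);
      await Promise.resolve(); // stand-in for the real (slow) inference call
      const reply = `${this.name} reply #${this.history.length / 2 + 1}`;
      this.history.push(prompt, reply);
      this.trace.push(`${this.name}:end`);
      return reply;
    });
  }
}
```

Two sessions asked at the same time still generate one after the other, while each `history` only ever contains its own turns.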

Verified end-to-end: .litertlm 20/20 on macOS, Android, Linux (T4 GPU), Windows (Lunar Lake), iOS; web probe in Chrome; .task 3/3 on Android (Pixel 8) and iOS Simulator. Legacy 18-test FFI gate stays green on every native platform.

Web `.litertlm` inference

Gemma .litertlm models run in the browser via @litert-lm/core (WebGPU + WASM, early preview). Text-only subset — no vision/audio/thinking/function-calling/LoRA yet; stopGeneration() is best-effort via conversation.cancel(); OPFS streaming (WebStorageMode.streaming) for models >2 GB. Documented in README under "Web .litertlm support & limitations".

Backend reporting (#288, thanks @merlinnot)

InferenceModel.activeBackend getter + NPU→GPU→CPU fallback on the FFI path with BackendInitException carrying per-attempt detail.
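
A minimal sketch of an ordered backend fallback that records each failed attempt, assuming a hypothetical `tryInit` callback that throws on failure. It is TypeScript for illustration only; `BackendInitError`, `Attempt`, and `initWithFallback` are invented names and do not reproduce the plugin's `BackendInitException`.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart types.
type Backend = "npu" | "gpu" | "cpu";

interface Attempt {
  backend: Backend;
  error: string;
}

// Carries what went wrong for every backend that was tried.
class BackendInitError extends Error {
  constructor(readonly attempts: Attempt[]) {
    super("all backends failed: " + attempts.map(a => `${a.backend} (${a.error})`).join(", "));
  }
}

// Try each backend in preference order; return the first one that initializes.
function initWithFallback(
  tryInit: (backend: Backend) => void, // hypothetical initializer; throws on failure
  order: Backend[] = ["npu", "gpu", "cpu"],
): { activeBackend: Backend; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (const backend of order) {
    try {
      tryInit(backend);
      return { activeBackend: backend, attempts };
    } catch (e) {
      attempts.push({ backend, error: e instanceof Error ? e.message : String(e) });
    }
  }
  throw new BackendInitError(attempts);
}
```

Returning the failed attempts alongside the winning backend lets a caller log why a faster backend was skipped even when initialization ultimately succeeds.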

Other

Fix getActiveModel() after app restart (#227, "Get Active Model fails, even when plugin reports model is installed"): mobile + web auto-restore from prefs.
Web function calling, symmetric with the native FFI parsers.
Bump large_file_handler ^0.3.1 → ^0.4.0; add mutex ^3.1.0 (session serializer).

Native dylibs unchanged (no native-v bump).

Brings @litert-lm/core (LiteRT-LM v0.12.0+ web JS API) into the web platform. .litertlm models now run in the browser via WebGPU/WASM, text-only per the upstream early-preview status. Verified end-to-end in Chrome with Gemma 4 E2B web variant (~2 GB).

- new lib/web/web_model_source.dart: sealed WebModelSource + WebModelSourceResolver, a single resolver shared by both MediaPipe (WebInferenceModel) and LiteRT-LM (LiteRtLmWebInferenceModel) paths. Replaces the inline activeModel-lookup + storage-mode branch that used to live in WebInferenceModel.createSession.
- new lib/web/litert_lm_web.dart + lib/web/litert_lm_web_inference.dart: JS interop wrappers for @litert-lm/core Engine / Conversation, and the Dart-side LiteRtLmWebInferenceModel + LiteRtLmWebSession. Engine routed via @js('Engine') to match the host page's window.Engine = m.Engine ESM shim. AsyncIterable returned by sendMessageStreaming is normalised via [Symbol.asyncIterator]() to an AsyncIterator before pumping with .next() (dart:js_interop has no first-class async-iterator type yet, see dart-lang/sdk#60457). stopGeneration now also calls conversation.cancel() upstream.
- lib/web/flutter_gemma_web.dart: createModel() branches on ModelFileType.litertlm vs .task and hands both engines the same WebModelSourceResolver. Pre-existing cast-on-singleton bug fixed (it would throw the moment a second InferenceModel type existed on web).
- lib/core/infrastructure/web_opfs_*: add getStream(filename) → Future<JSAny> alongside the existing getStreamReader, returning a raw ReadableStream rather than a reader (the form @litert-lm/core Engine.create accepts).
- example/web/index.html: load @litert-lm/core@0.12.1 ESM and expose window.Engine + window.litertLmReady promise so Dart can await the module before static interop calls.
- example/lib/main.dart: switch default WebStorageMode to streaming (required for .litertlm web models >2 GB to avoid Chrome's blob fetch limit).
- example/lib/models/model.dart: add gemma4_E2B_litertlm and gemma4_E4B_litertlm entries pointing at the upstream -web.litertlm HF artefacts; expose webUrl on existing gemma3n_*_litertlm entries so they show up on web too.
- pubspec.yaml + ios/podspec + CLAUDE.md: bump version to 0.16.2, large_file_handler ^0.3.1 → ^0.4.0.
- CHANGELOG.md + README.md "What's new in 0.16" + test/web smoke + example/integration_test/litertlm_web_test.dart for flutter drive -d chrome (see updated Rule 6 in CLAUDE.md: web target needs flutter drive, native targets still must not).
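
The iterator normalisation mentioned above (take the `AsyncIterable`, call `[Symbol.asyncIterator]()`, then pump with `.next()`) looks roughly like this on the JavaScript side. This is a hedged TypeScript sketch: `fakeStream` and `pump` are hypothetical stand-ins, not the @litert-lm/core API or the plugin's Dart interop wrapper.

```typescript
// Illustrative sketch: `fakeStream` is a stand-in, not the @litert-lm/core API.
async function* fakeStream(): AsyncGenerator<string> {
  yield "Hel";
  yield "lo";
}

// Pump an AsyncIterable by hand: obtain the iterator, then call next() until done.
async function pump<T>(
  source: AsyncIterable<T>,
  onChunk: (chunk: T) => void,
): Promise<void> {
  const iterator = source[Symbol.asyncIterator]();
  while (true) {
    const step = await iterator.next();
    if (step.done) break; // the stream has finished
    onChunk(step.value);
  }
}
```

Driving the iterator manually is what a caller falls back to when `for await` is unavailable, which is the situation the commit describes for dart:js_interop.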

- New lib/core/parsing/sdk_text_extractor.dart: shared text extractor for LiteRT-LM JSON response chunks (text vs `channels.thought`). Single source of truth for both the native FFI client (lib/core/ffi/litert_lm_client.dart) and the web @litert-lm/core path. FFI client now delegates to it.
- LiteRtLmWebSession now `with RawSdkResponseSession` + accumulates _lastRawResponse on the Gemma 4 branch; chat.dart:151 reads it and runs SdkResponseParser.extractToolCalls automatically, no changes to chat.dart.
- Tools path uses SdkResponseParser.serializeToolsForSdk(tools) so the preface.tools[] JSON shape is byte-identical to what native FFI sends to the same SDK. Sets enableConstrainedDecoding when tools are present.
- LiteRtLmConversationOptions gains enableConstrainedDecoding + audit surface (sessionConfig, filterChannelContentFromKvCache, prefillPrefaceOnInit) to mirror native conversation_config setters.
- sendMessageStreaming widened from JSString to JSAny so multimodal Message objects (role+content[]) can be passed alongside text strings.
- Cancel surface: conversation.cancel() wired in for stopGeneration().
- example/lib/models/model.dart: add functionGemma_270M_litertlm entry for probing tool-calling on @litert-lm/core (blocked by upstream "Streaming kTfLitePrefillDecode not supported", see drafts/email).
- .gitignore: /.drafts/ so local email drafts stay local.

PR #288 added `activeBackend` as a required getter on `InferenceModel`. The web LiteRT-LM inference path (LiteRtLmWebInferenceModel) was introduced separately in release/0.16.2 and didn't pick up the new member during the merge — implement it as null (same as WebInferenceModel: the @litert-lm/core engine doesn't surface a final backend). Also document #288 in the 0.16.2 CHANGELOG entry.

Add openSession() / openChat() / sessions getter to InferenceModel as the public surface for concurrent dialogues on a single loaded model (#226). Default impls:

- openSession() throws UnsupportedError with a message pointing to the .litertlm + .task concrete impls landing in subsequent steps.
- openChat() builds an InferenceChat with sessionCreator routed through openSession(); the chat owns an independent session that doesn't touch the legacy `session` field.
- sessions getter returns an unmodifiable view of [session ?? nothing] on the abstract base; concrete impls extend this with their open sessions in later steps.

Add dartdoc on the legacy `session` getter clarifying that it tracks only the createSession() singleton; multi-session apps should read sessions instead. No @deprecated annotation yet (that's a 1.0 call).

No behavior change for existing callers: createSession() / createChat() contracts are unchanged.
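
The shape of those defaults (throwing `openSession()`, `openChat()` routed through it, and a read-only `sessions` view) can be sketched as follows. This is an illustration in TypeScript under assumed names (`BaseModel`, `DialogueSession`, `DialogueChat`, `MultiModel`); the real interface is the Dart `InferenceModel`.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart classes.
interface DialogueSession {
  readonly label: string;
}

class DialogueChat {
  constructor(readonly session: DialogueSession) {}
}

abstract class BaseModel {
  // Legacy singleton lane; in the real plugin createSession() would set this.
  protected legacySession: DialogueSession | null = null;

  // Default: unsupported until a concrete implementation overrides it.
  openSession(): DialogueSession {
    throw new Error("openSession() is not supported by this model type");
  }

  // The chat owns a session obtained through openSession(), not the legacy one.
  openChat(): DialogueChat {
    return new DialogueChat(this.openSession());
  }

  // Default view: only the legacy session, if any. Frozen so callers cannot mutate it.
  get sessions(): readonly DialogueSession[] {
    return Object.freeze(this.legacySession ? [this.legacySession] : []);
  }
}

// Inherits the defaults unchanged.
class UnsupportedModel extends BaseModel {}

// A concrete implementation extends the view with its own open sessions.
class MultiModel extends BaseModel {
  private readonly open: DialogueSession[] = [];
  private counter = 0;

  openSession(): DialogueSession {
    const session: DialogueSession = { label: `session-${++this.counter}` };
    this.open.push(session);
    return session;
  }

  get sessions(): readonly DialogueSession[] {
    const legacy = this.legacySession ? [this.legacySession] : [];
    return Object.freeze([...legacy, ...this.open]);
  }
}
```

Because `openChat()` only depends on `openSession()`, a concrete model gets working chats for free once it overrides that one method.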

Introduce LiteRtLmConversationHandle to decouple conversation lifetime from LiteRtLmFfiClient (#226). Each handle owns one Pointer<LiteRtLmConversation>; the client holds the engine and tracks live handles in a Set for shutdown cleanup. The LiteRT-LM C API already supports multiple conversations per engine; this removes the single-conversation assumption from the Dart wrapper.

- createConversationHandle() is the new factory returning a handle.
- Per-conversation native calls moved to private _…On(conv, …) methods (_chatOn, _chatRawOn, _sendMessageOn, _sendMessageStreamRawOn, _cancelOn, _getMetricsOn, _deleteConversation). The handle delegates to these with its own conversation pointer.
- Legacy single-session methods (createConversation, chat, chatRaw, sendMessage, sendMessageStreamRaw, cancelGeneration, getSessionMetrics, closeConversation) now route through an internal _legacyHandle, preserving the existing FfiInferenceModelSession path unchanged. _assertConversation checks the legacy handle.
- shutdown() closes every live handle before deleting the engine.

Verified: flutter analyze clean; 387 unit tests green. No behavior change for the existing single-session FFI path.

Note: `flutter build macos --debug` currently fails with a "Cycle inside Flutter Assemble" from a duplicated "[flutter_gemma] Setup LiteRT-LM macOS" script phase in example/macos/.../project.pbxproj. This is preexisting (reproduces on the prior commit) and unrelated to this Dart-only refactor; tracked separately.

Wire openSession() into FfiInferenceModel (#226). Each open session owns its own LiteRtLmConversationHandle: independent KV cache, history, and raw-response buffer.

- Extract ConversationHandle interface (litert_lm_client.dart) so the session depends on an abstraction the test layer can fake. LiteRtLmConversationHandle implements it.
- FfiInferenceModelSession now takes a ConversationHandle instead of the shared LiteRtLmFfiClient; routes chat/chatRaw/cancel/metrics/close through its own handle. Static extractTextFromResponse is unchanged.
- FfiInferenceModel: createSession() uses createConversationHandle() for the legacy singleton lane (overwrite + close old). New openSession() appends a detached session to _openSessions. sessions getter returns the union. close() cascade-closes every session in both lanes before engine shutdown.

Verification (host VM, no native engine needed):

- test/core/ffi/multi_session_test.dart: 6 tests using a fake ConversationHandle. Proves session-level isolation: two sessions with distinct handles produce distinct outputs ("I am A" vs "I am B"); close() of one doesn't touch the other; Gemma 4 raw-response capture; StateError after close; stopGeneration routes to the handle.
- flutter analyze clean; full 387-test suite green.

The fake-handle seam mirrors PR #288's injectable-client pattern and lets us verify multi-session orchestration without the native build (macOS `flutter build` is blocked by a preexisting Xcode script-phase cycle, tracked separately).

The example macOS build failed with "Cycle inside Flutter Assemble" because the "[flutter_gemma] Setup LiteRT-LM macOS" post_install script phase was added to both the Runner and RunnerTests targets, and declared no outputs. RunnerTests inherits Runner's framework search paths (`inherit! :search_paths`) and has no Contents/Frameworks of its own, so a copy of the phase there created a cross-target dependency on Runner's framework output that Xcode flagged as a cycle. The missing declared output also made Xcode treat the phase as "runs every build" and prevented deterministic ordering relative to the qdrant native-asset node.

Fix in example/macos/Podfile post_install:

- Only attach the phase to the `Runner` app target; remove any stale copy from non-app targets (covers projects that ran the old Podfile).
- Declare a sentinel output ($(DERIVED_FILE_DIR)/flutter_gemma_litertlm_macos.stamp) and `touch` it at the end of the script so Xcode can order the phase.

Regenerated project.pbxproj (single phase on Runner, with outputPaths) and Podfile.lock. `flutter clean && flutter build macos --debug` now succeeds. Unblocks macOS integration tests for the multi-session work.

Add package:mutex ^3.1.0 and serialize native conversation generation on LiteRtLmFfiClient (#226). The LiteRT-LM C API is not documented as reentrant on a single engine, so two concurrent sessions could race inside liblitert_lm.

- _sendMessageStreamRawOn now acquires _nativeMutex for the whole generation (async* wrapper around the renamed _doSendMessageStreamRawOn body) and releases on completion/error.
- _sendMessageOn wraps its native call in _nativeMutex.protect().
- Cancel (_cancelOn) intentionally does NOT take the lock, because it must be able to interrupt an in-flight streaming call.

Concurrent sessions live independently (own KV cache + history) but their inference serializes at the native boundary: "concurrent contexts, serialized inference". Uncontended on the single-session fast path (one acquire/release on an empty lock).

Verified: flutter analyze clean; 393 unit tests green; macOS native build succeeds (after `flutter clean`; the Flutter Native Assets graph requires a clean after dependency changes to avoid the qdrant-hook Flutter Assemble cycle).

Wire openSession() into LiteRtLmWebInferenceModel (#226). Each open session owns its own @litert-lm/core Conversation JS object (independent KV cache + history); the upstream Engine.createConversation() already supports multiple conversations per Engine.

- Extract _buildConversation() helper shared by createSession (legacy singleton, overwrites _session) and openSession (detached, appends to _openSessions). Keeps the sampler/preface/tools/thinking JS interop wiring in one place so the two lanes can't drift.
- Add _openSessions Set + sessions getter (union of legacy + open) + close() cascade-closes both lanes before engine.delete().
- Vision/audio remain force-disabled on both lanes (upstream @litert-lm/core@0.12.1 doesn't expose the executor setters).

Also fix a preexisting web-compile gap: litert_lm_client_stub.dart was missing shutdown(), which flutter_gemma_mobile.dart references via the PR #288 initializeFfiRuntime shutdownClient callback. The chrome web test wasn't run after #288 so it stayed latent; add the stub method.

Verified: flutter analyze clean; web library compiles in Chrome (test/web/litert_lm_web_test.dart); 393 host unit tests green.

Add optional `int? maxConcurrentSessions` to createModel() / getActiveModel(), threaded through every InferenceModel impl (#226). Default null = no cap (backward-compatible). When set, the (cap+1)-th openSession() throws StateError so callers must close a session first, a guard against OOM from multiple concurrent KV caches on mobile.

Threaded through:

- interface createModel() + getActiveModel()
- FfiInferenceModel (cap enforced in openSession, before native call)
- LiteRtLmWebInferenceModel (cap enforced in openSession)
- MobileInferenceModel + WebInferenceModel (field plumbed; their openSession still throws UnsupportedError until the MediaPipe ProxyApi path lands in step 7)
- FfiInferenceModel web stub constructor (so dart2js compiles the mobile createModel call site that passes the param)
- all four createModel call sites (mobile/desktop/web)

Verified: flutter analyze clean; 395 host tests green incl two new cap tests (cap=0 → StateError before native; null → unlimited); web library compiles in Chrome.
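
A minimal sketch of the cap check, in TypeScript for illustration. `CappedModel`, `Closable`, and `openCount` are hypothetical names, and a plain `Error` stands in for Dart's `StateError`.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart classes.
interface Closable {
  close(): void;
}

class CappedModel {
  private readonly open = new Set<Closable>();

  // null means "no cap", mirroring the backward-compatible default.
  constructor(private readonly maxConcurrentSessions: number | null = null) {}

  openSession(): Closable {
    const cap = this.maxConcurrentSessions;
    // Check the cap before doing any expensive (native) work.
    if (cap !== null && this.open.size >= cap) {
      throw new Error(`session cap reached (${cap}); close a session first`);
    }
    const session: Closable = {
      close: () => {
        this.open.delete(session); // closing frees a slot under the cap
      },
    };
    this.open.add(session);
    return session;
  }

  get openCount(): number {
    return this.open.size;
  }
}
```

With a cap of 0 the very first `openSession()` fails, which matches the `cap=0` test case listed above; with `null` the check is skipped entirely.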

LiteRT-LM allows only one live conversation per engine, so openSession() sessions multiplex: each virtual session keeps its history in Dart and replays it into the single shared conversation via a messages_json preface on switch. Mutex-serialized. Logically concurrent contexts, serialized inference. Verified on a real macOS engine (isolated A/B histories) and the 18-test FFI regression gate stays green. Web probe confirms @litert-lm/core has no such limit, so the web openSession (N real Conversations) is correct.
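
The replay-on-switch idea can be sketched as below. This is an illustration in TypeScript under stated assumptions: `SingleConversation` stands in for the one live native conversation, its `reset(prefaceJson)` is a hypothetical method, and the `Turn` shape is invented rather than taken from the actual messages_json format. Mutex serialization is omitted for brevity.

```typescript
// Illustrative sketch only: hypothetical names and JSON shape.
interface Turn {
  role: "user" | "model";
  text: string;
}

// Stand-in for the single live conversation the engine allows.
class SingleConversation {
  resets = 0;
  private context: Turn[] = [];

  // Rebuild the conversation from a JSON preface (hypothetical API).
  reset(prefaceJson: string): void {
    this.context = JSON.parse(prefaceJson) as Turn[];
    this.resets++;
  }

  send(text: string): string {
    this.context.push({ role: "user", text });
    const userTurns = this.context.filter(t => t.role === "user").length;
    const reply = `seen ${userTurns} user turn(s)`;
    this.context.push({ role: "model", text: reply });
    return reply;
  }
}

// A virtual session is just a history kept on the caller's side.
class VirtualSession {
  readonly history: Turn[] = [];
}

class Multiplexer {
  private active: VirtualSession | null = null;

  constructor(private readonly conversation: SingleConversation) {}

  run(session: VirtualSession, text: string): string {
    if (this.active !== session) {
      // Switch: replay this session's history into the shared conversation.
      this.conversation.reset(JSON.stringify(session.history));
      this.active = session;
    }
    const reply = this.conversation.send(text);
    session.history.push({ role: "user", text }, { role: "model", text: reply });
    return reply;
  }
}
```

Consecutive turns on the same session skip the replay; only a switch pays the cost of rebuilding context from history.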

Move the two multi-session integration tests (isolated A/B history; close one, other survives) into the canonical litertlm_ffi_test.dart gate as a Multi-session group, reusing _localPath() for cross-platform model resolution and the shared GPU model. Drops the standalone macOS-hardcoded file. Gate now 20/20 on macOS GPU (18 legacy + 2 multi-session).

pubspec.lock picks up the mutex dependency added for the multi-session conversation serializer; iOS Podfile.lock syncs integration_test + audio pods from the test run.

Bump pigeon dev dep ^24.1.0 → ^26.0.0 (resolved 26.1.0). Schema unchanged; regenerated pigeon.g.dart, PigeonInterface.g.kt, PigeonInterface.g.swift. Diff is internal generator renames only (pigeonVar_ prefixes) — public PlatformService methods unchanged. Verified: analyze clean, 395 unit tests pass, apk + ios build green. Preparation for the MediaPipe MultiSession ProxyApi in 7b.

Add 9 session-scoped methods to PlatformService keyed by int sessionId (createSessionForId, closeSessionId, addQueryChunkToSession, addImageToSession, addAudioToSession, generateResponseForSession, generateResponseAsyncForSession, stopGenerationForSession, sizeInTokensForSession). Legacy singleton methods unchanged. Regenerated pigeon bridge. Native Kotlin/Swift impls land in 7c/7d (intentionally incomplete native state until then).

Add sessionMap<Long, InferenceSession> alongside the singleton session and implement the 9 session-scoped pigeon methods. Each resolves the session by id and delegates to the existing InferenceSession API; createSessionForId builds via engine.createSession (MediaPipe allows N live LlmInferenceSession per engine). generateResponseAsyncForSession uses a new MediaPipeSession.generateResponseAsyncTagged that tags each chunk with sessionId and pushes over the shared event channel directly — no endOfStream(), so other sessions' streams aren't closed. Legacy singleton path untouched. apk build green.

Mirror the Kotlin session-scoped methods on iOS: sessionMap<Int64, InferenceSession> guarded by a serial queue, plus the 9 *ForSession pigeon methods. createSessionForId builds an independent InferenceSession (MediaPipe allows N live sessions per engine); generateResponseAsyncForSession tags each token with sessionId and emits a tagged done event instead of FlutterEndOfEventStream (which would close the channel for other sessions). closeModel now clears sessionMap. Legacy singleton path untouched. ios build green.

MobileInferenceModel.openSession() now creates concurrent MediaPipe sessions via the session-scoped pigeon methods: a Set of MultiSessionMobileInference- ModelSession, a monotonic sessionId, maxConcurrentSessions cap, sessions getter (union of legacy + open), and a close() cascade — mirroring the FFI path. The new session class routes through *ForSession methods keyed by id, serializes generation through a shared Mutex, and demuxes the shared flutter_gemma_stream EventChannel by sessionId (closing on the tagged done event). Legacy singleton listener now ignores tagged events. openChat() works via the interface default (routes through openSession). analyze clean, 395 unit tests pass.

Add multi_session_mediapipe_test.dart: two openSession dialogues on a .task model keep isolated history, closing one leaves the other usable, and legacy createSession still works after openSession. Verified 3/3 on Android (Pixel 8, gemma3-1b-it-int4.task, CPU): A recalled "alice", B "bob".

Add IOS_TEST_DOCS_DIR dart-define support to the MediaPipe multi-session test so the iOS Simulator can read a host-side .task (its app sandbox is ephemeral across runs). Verified 3/3 on iOS Simulator (Swift, CPU): A recalled "alice", B "bob", legacy createSession unaffected — confirming the Swift session-scoped impl alongside the Android/Kotlin one.

Document the new multi-session API (openSession/openChat): why it exists (shared weights, N independent dialogues), when to use it, the concurrent-contexts/serialized-inference contract, per-platform behavior table, and the memory/cap caveats — in both the README and the InferenceModel.openSession dartdoc. Add a "Web .litertlm support & limitations" README section spelling out the @litert-lm/core early-preview subset (text-only; no vision/audio/thinking/function-calling/LoRA; best-effort stopGeneration via conversation.cancel; OPFS streaming for >2GB). Fix the stale web stopGeneration docstring (it does call conversation.cancel). No hard version promise for web .task multi-session.

…ate guard C1: serialize generation across web sessions with a shared Mutex (the documented "concurrent contexts, serialized inference" was not enforced on web — separate JS conversations ran in parallel). getResponseAsync acquires it in onListen and releases in every terminal path (done/error/cancel) so an abandoned stream can't hold it forever. Also: close() now calls conversation.cancel() before the model frees the engine (prevents use-after-free of WASM/WebGPU state while an iter.next() Promise is in flight); _ensureEngine() gets a concurrent-call guard so two overlapping openSession/createSession calls can't both run Engine.create and leak the first engine.
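
The concurrent-call guard described for `_ensureEngine()` amounts to memoizing the in-flight creation so overlapping callers share it. A minimal TypeScript sketch, assuming a hypothetical `create` factory; `EngineHolder` is an invented name, and clearing the cached promise after a failure is an assumption of this sketch rather than something the commit states.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart code.
class EngineHolder<E> {
  private pending: Promise<E> | null = null;

  constructor(private readonly create: () => Promise<E>) {}

  // Overlapping callers get the same in-flight promise, so create() runs once.
  ensure(): Promise<E> {
    if (this.pending === null) {
      this.pending = this.create().catch((err) => {
        this.pending = null; // assumption: allow a retry after a failed creation
        throw err;
      });
    }
    return this.pending;
  }
}
```

Caching the promise rather than the finished engine is what closes the window between "creation started" and "creation finished" in which a second caller could otherwise start a duplicate.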

C2: startVirtualTurn now uses a StreamController (not async*) so the native mutex is released on done/error AND consumer cancel/abandon — an abandoned stream no longer holds the lock forever, deadlocking other sessions. C4: cancelVirtualTurn is token-scoped — a session's stopGeneration() can no longer cancel a different session's in-flight generation. C5: releaseVirtualConversation defers the native teardown (and cancels) when a turn is in flight, instead of deleting the conversation pointer the live stream is using (use-after-free on close-during-generate). I2: _run records the user+assistant turns in a finally, so a mid-stream error doesn't silently drop a turn the native already saw (context divergence). Plus: virtual sessions reject image/audio loudly (text-only history replay) instead of silently dropping them.
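
Token-scoped cancellation (C4) can be sketched as follows: each turn receives a token when it starts, and a cancel request is honoured only if it presents the token of the turn that is currently running. This is TypeScript for illustration; `TurnCanceller` and its numeric tokens are hypothetical, and the `cancelledToken` field stands in for the native cancel call.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart code.
class TurnCanceller {
  private nextToken = 1;
  private activeToken: number | null = null;
  cancelledToken: number | null = null; // records which turn was cancelled

  // Called when a session's turn starts generating.
  begin(): number {
    const token = this.nextToken++;
    this.activeToken = token;
    return token;
  }

  // Called when that turn finishes (done, error, or cancel).
  end(token: number): void {
    if (this.activeToken === token) this.activeToken = null;
  }

  // Only the owner of the in-flight turn may cancel it.
  cancel(token: number): boolean {
    if (this.activeToken !== token) return false; // someone else's turn: ignore
    this.cancelledToken = token; // stand-in for the native cancel call
    return true;
  }
}
```

A session that finished earlier still holds its old token, so its `stopGeneration()` becomes a no-op instead of interrupting whichever session is generating now.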

C2: MultiSessionMobileInferenceModelSession.getResponseAsync uses a StreamController (not async*/yield*) so the generation mutex is released on done/error AND consumer cancel/abandon — an abandoned stream no longer holds it forever and deadlocks other sessions. C3: a native generation error now arrives as a TAGGED DATA event {code: ERROR, sessionId, message} which is demuxed to the right session, closes its controller, and releases the mutex. A synchronous failure of the generateResponseAsyncForSession RPC is caught (.catchError) so it surfaces and releases the mutex instead of hanging the controller.
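
The invariant behind C2 is that the generation lock is released exactly once on every terminal path, including a consumer that simply walks away. A minimal TypeScript sketch of that invariant, using invented names (`CountingLock`, `GuardedStream`); it models the bookkeeping only and is not the Dart `StreamController` wiring.

```typescript
// Illustrative sketch only: hypothetical names, not the plugin's Dart code.
class CountingLock {
  held = 0;
  acquire(): void {
    this.held++;
  }
  release(): void {
    this.held--;
  }
}

class GuardedStream {
  private released = false;
  readonly chunks: string[] = [];

  // The lock is taken when generation starts.
  constructor(private readonly lock: CountingLock) {
    lock.acquire();
  }

  // Idempotent: whichever terminal path fires first releases the lock.
  private releaseOnce(): void {
    if (this.released) return;
    this.released = true;
    this.lock.release();
  }

  add(chunk: string): void {
    if (!this.released) this.chunks.push(chunk);
  }

  done(): void {
    this.releaseOnce();
  }

  error(_message: string): void {
    this.releaseOnce();
  }

  // Consumer cancelled or abandoned the stream.
  cancel(): void {
    this.releaseOnce();
  }
}
```

Routing all three endings through one idempotent helper avoids both failure modes: a lock that is never released, and a lock released twice when a cancel races a late done event.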

…+iOS) C3 (native half): a .task generation-time error is now emitted as a TAGGED DATA event {code: ERROR, sessionId, message} over the shared event channel, not via eventSink.error()/FlutterError. A channel error reaches every session's listener and carries no usable sessionId, so Dart could neither route it nor close+release the right session (deadlock). The tagged data event is demuxed by sessionId on the Dart side.
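
Routing by sessionId is what makes the tagged form workable. A hedged TypeScript sketch of such a demultiplexer follows; the `ERROR` code and the `sessionId` and `message` fields come from the description above, while the `TOKEN` and `DONE` codes, the `text` field, and the `Demuxer` and `SessionSink` names are assumptions made for this illustration.

```typescript
// Illustrative sketch only: event codes other than ERROR are assumed.
type ChannelEvent =
  | { sessionId: number; code: "TOKEN"; text: string }
  | { sessionId: number; code: "DONE" }
  | { sessionId: number; code: "ERROR"; message: string };

interface SessionSink {
  onToken(text: string): void;
  onClose(error?: string): void;
}

class Demuxer {
  private readonly sinks = new Map<number, SessionSink>();

  register(sessionId: number, sink: SessionSink): void {
    this.sinks.set(sessionId, sink);
  }

  // Every event carries its sessionId, so only the owning sink is touched.
  dispatch(event: ChannelEvent): void {
    const sink = this.sinks.get(event.sessionId);
    if (!sink) return; // unknown or already-closed session: ignore
    switch (event.code) {
      case "TOKEN":
        sink.onToken(event.text);
        break;
      case "DONE":
        this.sinks.delete(event.sessionId);
        sink.onClose();
        break;
      case "ERROR":
        this.sinks.delete(event.sessionId);
        sink.onClose(event.message);
        break;
    }
  }
}
```

An error for one session closes only that session's sink; the other sessions keep receiving their tokens on the same shared channel.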

…se tests (#295)

* 0.16.2: tests, edge cases for the multi-session review fixes

  Add three integration tests to the Multi-session group covering the PR #294 review fixes: abandoning a stream releases the mutex (C2, no deadlock), closing a session mid-generation is safe (C5, no use-after-free), and openSession rejects image/audio loudly on the text-only .litertlm virtual path.

* 0.16.2: exclude chromedriver from pub package

  The chromedriver binary (a maintainer tool for running web integration tests via `flutter drive -d chrome`) sits at repo root and is ~16 MB. It leaked into the published package, bloating the archive to 8 MB. Add `chromedriver/` to .pubignore; the archive drops to ~683 KB.

DenisovAV added 28 commits May 24, 2026 15:10

Merge remote-tracking branch 'origin/main' into release/0.16.2

6c539ae

chore(CHANGELOG): document #227 active-model auto-restore in 0.16.2

02f2ce5

Merge remote-tracking branch 'origin/main' into release/0.16.2

3b92a8e

0.16.2: lockfiles — mutex 3.1.0 + iOS pods sync

c9a3876

DenisovAV merged commit 2785056 into main May 29, 2026
3 checks passed

DenisovAV merged 28 commits into main from release/0.16.2