Complete system hang of AVD VM #1311

@uhonermann

Description

Check List

  • I checked my issue doesn't exist yet
  • My issue is valid with mirror default sample and not specific to my user-mode driver implementation
  • I can always reproduce the issue with the provided description below.
  • I have updated Dokany to the latest version and have rebooted my computer afterwards.
  • I tested one of the latest snapshots from AppVeyor CI

Describe the bug
We have a customer who uses our Dokan2-based network drive on a pool of about 70 Azure Virtual Desktop machines.
Every day 5-6 of them hang completely (the whole system stops responding), which keeps about 60 people from working.
Azure support connected to a hung system through the (virtual) serial port and created a kernel dump by forcing a BSOD (the only thing that was still possible).
They found a problem with an Outlook process trying to read from our drive, and blocking lots of other kernel threads.

To Reproduce
Steps to reproduce the behavior:

Unfortunately, we cannot reproduce the issue ourselves; it (currently) only happens in the customer's environment.

Expected behavior
System should not hang.

Logs
Please attach in separate files: mirror output, library logs and kernel logs.
In case of BSOD, please attach minidump or dump analyze output.

The memory dump is about 1 GB; I'll ask the customer for permission to upload it.

Environment:

  • Windows version: Microsoft Windows 11 Enterprise multi-session
  • Processor architecture: x64
  • Dokany version: 2.3.0.1000
  • Library type (Dokany/FUSE): Dokany

Additional context
The dump has already been analyzed by Microsoft, and they will work together with us to solve the issue.
Their analysis can also be provided.

Mainly, I need some information about the internals of the driver so that I can analyze the dump myself.
For example, in the dump I don't see any DokanProcessAndPullEvents thread running, and I'm not sure whether this is normal (because it's currently working on a request?) or not.
Also, is there a list of the IRPs/requests that have been pulled to user land?
I know about the PendingIRP list, but I think that list contains all pending IRPs, regardless of whether they have already been pulled.

We see a lot of "No matching IRPs found for a reply" messages just before the issue occurs.
As far as I know, this means that a response to a request was delivered from user land to the driver after the IRP had already been canceled because of a timeout.
But that would mean requests are still being pulled and executed by the user process, just too slowly, right?
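
To make that reading of the message concrete, here is a minimal Python sketch of the sequence as I understand it. It is a toy model only, not Dokany's code: `ToyDriver`, `submit`, `expire`, `reply` and `simulate` are hypothetical names, and the real driver's data structures, locking and timeout handling are assumed away.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Toy model of a pending-request table; NOT Dokany's implementation."""

    timeout: float
    pending: dict = field(default_factory=dict)  # serial number -> deadline
    next_serial: int = 0

    def submit(self, now: float) -> int:
        """Register a new request and return its serial number."""
        serial = self.next_serial
        self.next_serial += 1
        self.pending[serial] = now + self.timeout
        return serial

    def expire(self, now: float) -> list:
        """Cancel every request whose deadline has passed."""
        expired = [s for s, deadline in self.pending.items() if deadline <= now]
        for s in expired:
            del self.pending[s]
        return expired

    def reply(self, serial: int) -> str:
        """Complete a request if it is still pending, else report a mismatch."""
        if self.pending.pop(serial, None) is None:
            return "no matching request"
        return "completed"


def simulate(service_time: float, timeout: float) -> str:
    """User land answers one request after `service_time` seconds."""
    driver = ToyDriver(timeout=timeout)
    serial = driver.submit(now=0.0)
    driver.expire(now=service_time)  # timeout check runs before the reply arrives
    return driver.reply(serial)
```

In this model, `simulate(1.0, 15.0)` completes normally, while `simulate(20.0, 15.0)` ends in the "no matching request" branch: the request was pulled and answered, but only after it had been removed from the pending table. If the real driver behaves in a comparable way, the log message would point to slow replies rather than to requests that are never pulled.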
