Add .to_pil, etc methods to ScreenShot #478

@jholveck

This is based on a conversation started in #476.

The proposed methods will let us provide users with easy-to-use best-practice ways to get the data to the desired destination.

The CPU-based ScreenShot could also have methods to send data to PyTorch and TensorFlow, since those can work with both CPU-resident and CUDA-resident data. For instance, the cat detector demo uses PyTorch: it takes the MSS pixel data from the CPU, and then either runs the neural net on the CPU, or moves the data back to the GPU and runs the neural net there.

Here's something to consider in the API design: different libraries have different conventions for how to represent the data. For instance, even though they all use NumPy, OpenCV wants the channels in BGR order, while Matplotlib, scikit-image, and most others want the channels in RGB order. For PIL, it's not a problem: we just return an Image, and PIL knows its internal layout. But for things like .to_numpy, it would be good to provide the user with the knobs they'll want to interface with their favorite image processing library. These knobs aren't hard to implement, and handling them when we're preparing the array is more efficient than the user having to reorder it later.
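To make the efficiency point concrete, here is a minimal sketch of channel selection done while the array is being built. It assumes the raw capture buffer is tightly packed BGRA bytes; `bgra_to_array` and its `channels` parameter are hypothetical names for illustration, not an existing MSS API. Both the BGR and RGB results are strided views of the original buffer, so no pixel data is copied at this stage.

```python
import numpy as np


def bgra_to_array(raw, width, height, channels="RGB"):
    """Hypothetical helper: view a BGRA byte buffer as an HWC uint8 array.

    Assumes `raw` holds height * width * 4 bytes in B, G, R, A order.
    """
    bgra = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 4)
    if channels == "BGRA":
        return bgra
    if channels == "BGR":
        return bgra[..., :3]      # drop alpha, keep B, G, R (a view)
    if channels == "RGB":
        return bgra[..., 2::-1]   # indices 2, 1, 0 -> R, G, B (a view)
    raise ValueError(f"unsupported channel order: {channels!r}")


# A 1x2 image: one pure-red pixel, then one pure-blue pixel, stored as BGRA.
raw = bytes([0, 0, 255, 255, 255, 0, 0, 255])
rgb = bgra_to_array(raw, width=2, height=1, channels="RGB")
bgr = bgra_to_array(raw, width=2, height=1, channels="BGR")
full = bgra_to_array(raw, width=2, height=1, channels="BGRA")
```

One caveat: some consumers need a C-contiguous array, in which case a single `np.ascontiguousarray(...)` call on the view does the reorder and the copy in one pass.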

Here’s a cheat sheet I got from ChatGPT about the expectations of popular image frameworks. I would verify this before relying on it, but this gives us an idea of what users may need. (For a head start on verification: its unedited response, with links to references, is at https://chatgpt.com/s/t_6996cd0b606c8191aba9c2cc6e23dc3f.)

Just to define some terms:

  • Array type: The array library the image wants to use (NumPy, PyTorch, etc)
  • Layout: HWC means the axes are [height, width, channels], in that order. HWC is what we provide with .pixels(), for instance. But some very popular frameworks want CHW. (This terminology also often has another dimension if you’re processing N images in a batch, so you’ll often see NHWC, but that’s not relevant for MSS.)
  • Channel order: RGB or BGR, typically.
  • dtype: The data type that the framework wants for its inputs. This is almost always either uint8 if the pixels are 0–255 (like MSS uses) or some type of float pixel in 0–1 (very common for neural nets).
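As a quick illustration of the layout and dtype terms, here is a small NumPy sketch using a synthetic array rather than real MSS output. It shows an HWC `uint8` image being rearranged to CHW and rescaled to floats in 0–1.

```python
import numpy as np

# Synthetic HWC image: height 2, width 3, 3 channels, uint8 values.
hwc = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# HWC -> CHW: move the channel axis to the front. This is a view, not a copy.
chw = np.transpose(hwc, (2, 0, 1))

# uint8 in 0-255 -> float32 in 0-1, made contiguous in the new layout.
chw_float = np.ascontiguousarray(chw, dtype=np.float32) / 255.0
```

The same pixel is addressed as `hwc[y, x, c]` in the first layout and `chw[c, y, x]` in the second.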

Here’s the list:

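| Framework | Array type | Layout | Channels | dtype |
| --- | --- | --- | --- | --- |
| OpenCV | NumPy | HWC | BGR | uint8 |
| scikit-image | NumPy | HWC | RGB | either |
| Matplotlib | NumPy | HWC | RGB | either |
| Torchvision | PyTorch | CHW | RGB | float |
| Kornia | PyTorch | CHW | RGB | float |
| TensorFlow | TensorFlow | HWC | RGB | float |
| JAX | jax | HWC | RGB | float |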

These different needs can easily be accommodated with parameters in to_numpy, to_pytorch, and any other similar methods we create to build array-like objects.
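For discussion, here is a rough sketch of what such parameters could look like, written as a free function so it runs without MSS. Everything here is an assumption for illustration: the name `to_numpy`, the `layout`, `channels`, and `dtype` keywords, the `PRESETS` table, and the premise that the raw buffer is tightly packed BGRA. The preset values are copied from the unverified cheat sheet above, with `uint8` picked where it says "either".

```python
import numpy as np

# Hypothetical presets for the NumPy-based frameworks in the cheat sheet.
PRESETS = {
    "opencv": {"layout": "HWC", "channels": "BGR", "dtype": np.uint8},
    "scikit-image": {"layout": "HWC", "channels": "RGB", "dtype": np.uint8},
    "matplotlib": {"layout": "HWC", "channels": "RGB", "dtype": np.uint8},
}


def to_numpy(raw, width, height, *, layout="HWC", channels="RGB", dtype=np.uint8):
    """Sketch: build a contiguous array from a BGRA buffer in one pass."""
    bgra = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 4)

    if channels == "RGB":
        arr = bgra[..., 2::-1]
    elif channels == "BGR":
        arr = bgra[..., :3]
    else:
        raise ValueError(f"unsupported channel order: {channels!r}")

    if layout == "CHW":
        arr = arr.transpose(2, 0, 1)
    elif layout != "HWC":
        raise ValueError(f"unsupported layout: {layout!r}")

    # One copy performs the reorder, the layout change, and the dtype cast.
    out = np.ascontiguousarray(arr, dtype=dtype)
    if np.issubdtype(out.dtype, np.floating):
        out /= 255  # scale 0-255 to 0-1 for float outputs
    return out


# A 1x2 image: one pure-red pixel, then one pure-blue pixel, stored as BGRA.
raw = bytes([0, 0, 255, 255, 255, 0, 0, 255])
for_opencv = to_numpy(raw, 2, 1, **PRESETS["opencv"])
for_nn = to_numpy(raw, 2, 1, layout="CHW", channels="RGB", dtype=np.float32)
```

A `to_pytorch` counterpart could accept the same keywords with different defaults; whether presets belong in the library or in the docs is an open design question.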
