This is based on a conversation started in #476.
The proposed methods will let us provide users with easy-to-use best-practice ways to get the data to the desired destination.
The CPU side could also have methods to send data to PyTorch and TensorFlow, since both can work with CPU-resident and CUDA-resident data. For instance, the cat detector demo uses PyTorch: it gives PyTorch the MSS pixel data from the CPU, and then either runs the neural net on the CPU, or moves the data back to the GPU and runs the neural net there.
Here's something to consider in the API design: different libraries have different conventions for how to represent the data. For instance, even though they all use NumPy, OpenCV wants the channels in BGR order, while Matplotlib, scikit-image, and most others want the channels in RGB order. For PIL, it's not a problem: we just return an Image, and PIL knows its internal layout. But for things like .to_numpy, it would be good to provide the user with the knobs they'll want to interface with their favorite image processing library. These knobs aren't hard to implement, and handling them when we're preparing the array is more efficient than the user having to reorder it later.
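As a rough illustration of the channel-order knob, here is a minimal sketch. It assumes the capture is available as a raw BGRA byte buffer, and `raw_to_array` and its `channels` parameter are hypothetical names for discussion, not an existing MSS API. The point is that choosing the order while building the array takes a single indexing step, so the user never has to reorder afterwards.

```python
import numpy as np

# Index lists that pick channels out of a BGRA pixel (B=0, G=1, R=2, A=3).
_CHANNEL_INDEX = {"BGR": [0, 1, 2], "RGB": [2, 1, 0]}


def raw_to_array(raw, width, height, channels="RGB"):
    """Hypothetical helper: build an HWC uint8 array from a BGRA byte buffer."""
    # Zero-copy view of the buffer as [height, width, 4].
    bgra = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 4)
    # One indexing step drops alpha and selects the channel order (one copy).
    return bgra[..., _CHANNEL_INDEX[channels]]


# A 1x2 image: one pure-blue pixel and one pure-red pixel, in BGRA byte order.
raw = bytes([255, 0, 0, 255, 0, 0, 255, 255])
rgb = raw_to_array(raw, width=2, height=1, channels="RGB")
bgr = raw_to_array(raw, width=2, height=1, channels="BGR")
```

With this shape, an OpenCV user would ask for `channels="BGR"` and a Matplotlib user would keep the default.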
Here’s a cheat sheet I got from ChatGPT about the expectations of popular image frameworks. I would verify this before relying on it, but it gives us an idea of what users may need. (For a head start on verification: its unedited response, with links to references, is at https://chatgpt.com/s/t_6996cd0b606c8191aba9c2cc6e23dc3f.)
Just to define some terms:
- Array type: The array library the framework wants to use (NumPy, PyTorch, etc.)
- Layout: `HWC` means the axes are [height, width, channels], in that order. `HWC` is what we provide with `.pixels()`, for instance. But some very popular frameworks want `CHW`. (This terminology also often has another dimension if you’re processing N images in a batch, so you’ll often see `NHWC`, but that’s not relevant for MSS.)
- Channel order: RGB or BGR, typically.
- dtype: The data type that the framework wants for its inputs. This is almost always either `uint8` if the pixels are 0–255 (like MSS uses) or some type of float pixel in 0–1 (very common for neural nets).
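To make these terms concrete, here is a small NumPy sketch of each conversion applied to a made-up 2x3 image. The array contents and variable names are only for illustration.

```python
import numpy as np

# A tiny 2x3 RGB image in HWC layout with uint8 pixels (0-255).
hwc = np.zeros((2, 3, 3), dtype=np.uint8)
hwc[0, 0] = [255, 128, 0]  # top-left pixel: R=255, G=128, B=0

# Layout: move the channel axis to the front, HWC -> CHW.
chw = np.transpose(hwc, (2, 0, 1))

# Channel order: reverse the last axis, RGB -> BGR.
bgr = hwc[..., ::-1]

# dtype: rescale 0-255 integers to floats in 0-1.
as_float = hwc.astype(np.float32) / 255.0
```

Note that the layout and channel-order steps here return views on the original data, while the dtype step allocates a new array.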
Here’s the list:
| Framework | Array type | Layout | Channels | dtype |
|---|---|---|---|---|
| OpenCV | NumPy | HWC | BGR | uint8 |
| scikit-image | NumPy | HWC | RGB | either |
| Matplotlib | NumPy | HWC | RGB | either |
| Torchvision | PyTorch | CHW | RGB | float |
| Kornia | PyTorch | CHW | RGB | float |
| TensorFlow | TensorFlow | HWC | RGB | float |
| JAX | jax | HWC | RGB | float |
These different needs can easily be accommodated with parameters in to_numpy, to_pytorch, and any other similar methods we create to build array-like objects.
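Here is one possible shape for such a method, as a sketch only. The function signature, parameter names, and defaults below are assumptions for discussion, not the current MSS API, and the input is assumed to be a raw BGRA byte buffer.

```python
import numpy as np


def to_numpy(raw, width, height, *, channels="RGB", layout="HWC", dtype=np.uint8):
    """Hypothetical parameterized converter from a BGRA buffer to a NumPy array."""
    if channels not in ("RGB", "BGR"):
        raise ValueError(f"unsupported channel order: {channels!r}")
    if layout not in ("HWC", "CHW"):
        raise ValueError(f"unsupported layout: {layout!r}")

    # View the buffer as [height, width, 4] without copying.
    bgra = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 4)
    # Drop alpha and pick the channel order in one step.
    arr = bgra[..., [2, 1, 0] if channels == "RGB" else [0, 1, 2]]
    if layout == "CHW":
        arr = arr.transpose(2, 0, 1)
    if np.issubdtype(dtype, np.floating):
        # Float outputs are scaled into the 0-1 range.
        arr = arr.astype(dtype) / 255
    # Return a C-contiguous array of the requested dtype.
    return np.ascontiguousarray(arr, dtype=dtype)


# A 2x3 image where every pixel is B=10, G=20, R=30, A=255.
raw = bytes([10, 20, 30, 255] * 6)
for_opencv = to_numpy(raw, 3, 2, channels="BGR")  # HWC, BGR, uint8
for_torchvision = to_numpy(raw, 3, 2, layout="CHW", dtype=np.float32)  # CHW, RGB, float
```

The two calls at the end correspond to the OpenCV and Torchvision rows of the table, apart from the final hand-off to a PyTorch tensor, which a `to_pytorch` counterpart would handle.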