User: <image>Describe the image.
Answer: The image is a surreal and fantastical representation of a space station orbiting a planet in outer space. The station, which appears to be a large, cylindrical structure with multiple levels, is depicted as being made entirely of ice and water. The station's surface is covered in a thick layer of ice, creating a stark contrast against the barren landscape below.
The planet itself is a dark, icy world, with no visible signs of life or vegetation. However, there are several large, glowing orbs scattered across the surface, which seem to be floating in mid-air. These orbs could be either stars or other celestial bodies, adding an element of mystery and wonder to the scene.
The station's interior is also depicted as being made entirely of ice and water, with no visible doors or windows. The walls are covered in a thick layer of ice, creating a sense of isolation and confinement. However, there are several small, glowing orbs scattered throughout the interior, which could be either stars or other celestial bodies, adding an element of mystery and wonder to the scene.
The overall composition of the image creates a sense of awe and wonder, as if the viewer is witnessing a scene from a science fiction movie or a fantasy novel. The use of ice and water as the primary material for the station's structure adds a unique and otherworldly element to the scene. The glowing orbs could be either stars or other celestial bodies, adding an element of mystery and wonder to the scene.
Overall, the image is a beautiful and imaginative representation of a space station orbiting a planet in outer space, with ice and water as its primary materials. It creates a sense of isolation and confinement, while also adding an element of mystery and wonder to the scene. The use of ice and water as the primary material for the station's structure adds a unique and otherworldly element to the scene. The glowing orbs could be either stars or other celestial bodies, adding an element of mystery and wonder to the scene. The overall composition of the image creates a sense of awe and wonder, as if the viewer is witnessing a scene from a science fiction movie or a fantasy novel.
Paper: https://huggingface.co/blog/smolvlm2
Hugging Face: https://huggingface.co/blog/smolvlm2
LLMs (Large Language Models) are neural networks trained on extensive text datasets to comprehend and produce language.
VLMs (Vision-Language Models) incorporate a visual encoder, allowing the model to process images and text simultaneously.
A combined VLM+LLM system is often referred to as a multimodal model.
These models can be large—hundreds of millions to billions of parameters—which impacts accuracy, memory use, and runtime speed.
On edge devices like the RK3588, available RAM and compute are limited, and even the NPU has strict constraints on supported operations.
Because of this, models typically need to be quantized or simplified to fit.
Performance is usually expressed in tokens (roughly word pieces) per second.
Once converted to RKNN, parts of the model can run on the NPU, improving speed.
All models, with C++ examples, can be found on the Q-engineering GitHub.
All LLM models are quantized to w8a8, while the VLM vision encoders use fp16.
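As an aside, the sketch below shows symmetric int8 weight quantization, the idea behind the w8a8 scheme; the actual conversion is done by Rockchip's rkllm and rknn toolkits, so this example is purely illustrative.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> weights = {0.12f, -0.87f, 0.45f, -0.03f};

    //One scale for the whole tensor: map the largest magnitude onto the int8 range.
    float max_abs = 0.0f;
    for(float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = max_abs / 127.0f;

    for(float w : weights){
        int8_t q = static_cast<int8_t>(std::lround(w / scale));   //quantize to int8
        float back = q * scale;                                   //dequantize again
        printf("w=% .3f  q=%4d  back=% .3f\n", w, q, back);
    }
    return 0;
}
The table below lists the benchmark results of the converted models on the RK3588.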
| model | RAM (GB)¹ | LLM cold (s)² | LLM warm (s)³ | VLM cold (s)² | VLM warm (s)³ | Resolution | Tokens/s |
|---|---|---|---|---|---|---|---|
| Qwen2-7B | 8.7 | 86.6 | 34.5 | 37.1 | 20.7 | 392 x 392 | 3.7 |
| Qwen2-2.2B | 3.3 | 29.1 | 2.5 | 17.1 | 1.7 | 392 x 392 | 12.5 |
| InternVL3-1B | 1.3 | 6.8 | 1.1 | 7.8 | 0.75 | 448 x 448 | 30 |
| SmolVLM2-2.2B | 3.4 | 21.2 | 2.6 | 10.5 | 0.9 | 384 x 384 | 11 |
| SmolVLM2-500M | 0.8 | 4.8 | 0.7 | 2.5 | 0.25 | 384 x 384 | 31 |
| SmolVLM2-256M | 0.5 | 1.1 | 0.4 | 2.5 | 0.25 | 384 x 384 | 54 |
¹ Total memory used (LLM plus VLM).
² The first time an LLM/VLM model is loaded from disk into RAM or onto the NPU is called a cold start.
The duration depends on your OS, I/O transfer rate, and memory mapping.
³ Subsequent loads (warm starts) take advantage of the data already mapped in RAM; mostly, only a few pointers need to be restored.
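You can see this page-cache effect yourself with a simple timing test. The sketch below (not how the rkllm loader works internally, just an illustration) reads the same file twice; the second read is much faster because the data is already cached in RAM.
#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

//Read a file completely and return the elapsed time in seconds.
static double read_file_seconds(const char* path)
{
    auto t0 = std::chrono::steady_clock::now();
    std::ifstream f(path, std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const char* model = "./models/smolvlm2_500m_llm_w8a8_rk3588.rkllm";
    std::cout << "cold read: " << read_file_seconds(model) << " s\n";
    std::cout << "warm read: " << read_file_seconds(model) << " s\n";
    return 0;
}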

To run the application, you need:
- OpenCV 64-bit installed.
- The rkllm library.
- The rknn library.
- Optional: Code::Blocks ($ sudo apt-get install codeblocks).
Start with the usual
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install cmake wget curl
To install OpenCV on your SBC, follow the Raspberry Pi 4 guide.
Or, if you don't intend to write OpenCV code yourself, install the prebuilt package:
$ sudo apt-get install libopencv-dev
$ git clone https://github.com/Qengineering/SmolVLM2-500M-NPU.git
To run SmolVLM2-500M, you need to have the rkllm-runtime library version 1.2.2 (or higher) installed, as well as the rknpu driver version 0.9.8.
If you don't have these on your machine, or if you have a lower version, you need to install them.
We have provided the correct versions in the repo.
$ cd ./SmolVLM2-500M-NPU/aarch64/library
$ sudo cp ./*.so /usr/local/lib
$ cd ../include
$ sudo cp ./*.h /usr/local/include
The next step is downloading the models.
Download the two files (700 MB) from our Sync.com server:
smolvlm2_500m_llm_w8a8_rk3588.rkllm and smolvlm2_500m_vision_fp16_rk3588.rknn
Copy both to your ./model folder.
Once you have the two models, it is time to build your application.
You can use Code::Blocks.
- Load the project file *.cbp in Code::Blocks.
- Select Release, not Debug.
- Compile and run with F9.
- You can alter the command line arguments with Project -> Set programs' arguments...
Or use CMake.
$ mkdir build
$ cd build
$ cmake ..
$ make -j4
The app has the following arguments:
VLM_NPU Picture RKNN_model RKLLM_model NewTokens ContextLength
| Argument | Comment |
|---|---|
| picture | The image. Provide a dummy if you don't want to use an image |
| RKNN_model | The visual encoder model (vlm) |
| RKLLM_model | The large language model (llm) |
| NewTokens | This sets the maximum number of new tokens. Optional, default 2048 |
| ContextLength | This specifies the maximum total number of tokens the model can process. Optional, default 4096 |
In the context of the Rockchip RK3588 LLM (Large Language Model) library, the parameters NewTokens and ContextLength control different limits on text generation; both are typical of LLM workflows.
NewTokens
This sets the maximum number of tokens (pieces of text, typically sub-word units) that the model is allowed to generate in response to a prompt during a single inference round. For example, if set to 300, the model will not return more than 300 tokens as output, regardless of the prompt length. It's important for controlling generation length to avoid too-short or too-long responses, helping manage resource use and output size.
ContextLength
This specifies the maximum total number of tokens the model can process in one go, which includes both the prompt (input) tokens and all generated tokens. For example, if set to 2048 and your prompt already uses 500 tokens, the model can generate up to 2048-500 = 1548 new tokens. This is a hardware and architecture constraint set during model conversion and deployment, as the context window cannot exceed the model's design limit (for instance, 4096 or 8192 tokens depending on the model variant).
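A small sketch of how the two limits interact (plain arithmetic, not a call into the rkllm library):
#include <algorithm>
#include <cstdio>

//How many tokens can still be generated, given the context window size,
//the number of prompt tokens already used, and the NewTokens cap.
static int generation_budget(int context_length, int prompt_tokens, int new_tokens)
{
    return std::min(new_tokens, std::max(0, context_length - prompt_tokens));
}

int main()
{
    //Example from the text: ContextLength 2048 and a prompt of 500 tokens.
    printf("budget = %d tokens\n", generation_budget(2048, 500, 2048));   //prints 1548
    return 0;
}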
A typical command line can be:
VLM_NPU ./Moon.jpg ./models/smolvlm2_500m_vision_fp16_rk3588.rknn ./models/smolvlm2_500m_llm_w8a8_rk3588.rkllm 2048 4096
The NewTokens (2048) and ContextLength (4096) are optional and can be omitted.
Using the application is simple. Once you provide the image and the models, you can ask anything you want.
Remember, we are on a bare Rock 5C with a tiny model, so don't expect the same quality of answers that ChatGPT provides.
If you want more sensible output, use a larger model like InternVL3 or Qwen.
If you want the response to include a discussion of the picture, include the <image> token once in your prompt.
The app remembers the dialogue until you give the token <clear>.
With <exit>, you leave the application.
Below you find the surprisingly small amount of code in main.cpp.
#include "RK35llm.h"
int main(int argc, char** argv)
{
std::string input_str;
std::string output_str;
RK35llm RKLLM;
RKLLM.SetInfo(true); //yes, you may give me additional model information
RKLLM.SetSilence(false); //you may print the incremental text chunks on the terminal
if (argc< 4) {std::cerr << "Usage: " << argv[0] << " image vlm_model llm_model [option]NewTokens [option]ContextLength\n"; return -1;}
else if(argc==4) RKLLM.LoadModel(argv[2],argv[3]);
else if(argc==5) RKLLM.LoadModel(argv[2],argv[3],std::atoi(argv[4]));
else if(argc> 5) RKLLM.LoadModel(argv[2],argv[3],std::atoi(argv[4]),std::atoi(argv[5]));
cv::Mat Pic = cv::imread(argv[1]);
RKLLM.LoadImage(Pic);
while(true) {
printf("\n");
printf("User: ");
std::getline(std::cin, input_str);
if (input_str == "exit") break;
output_str = RKLLM.Ask(input_str);
// std::cout << "\nLLM Reply: " << output_str << std::endl;
}
return 0;
}Most code speaks for itself. One remark.
The LLM generates the answer to your request in little pieces of text. You see them as if the LLM is typing on your terminal.
RKLLM.SetSilence controls this behaviour. When set to true, it will suppress this output. RKLLM.Ask() also returns the complete answer.
In main.cpp, uncomment this line:
std::cout << "\nLLM Reply: " << output_str << std::endl;
User: Can you make me a C++ call that embeds the Pythagorean theorem?
Answer: Here's an example of how you can use the sqrt function to calculate the square root of a number and then apply the Pythagorean theorem:
#include <iostream>
using namespace std;
int main() {
double hypotenuse = sqrt(25); // 5
double base = hypotenuse / 2; // 7.5
cout << "The length of the hypotenuse is: " << hypotenuse << endl;
cout << "The length of the base is: " << base << endl;
return 0;
}
This code calculates the square root of 25 (which is 5) and then divides it by 2 to get the length of the hypotenuse. The result is stored in base, which is printed to the console. Note that this assumes you have a variable named hypotenuse that holds the square root of 25, and another variable named base that holds the length of the base. If you don't have these variables, you'll need to modify the code accordingly.
Alternatively, you can use the Pythagorean theorem directly in your main function:
#include <iostream>
using namespace std;
int main() {
double hypotenuse = 5; // 5
double base = hypotenuse / 2; // 7.5
cout << "The length of the hypotenuse is: " << hypotenuse << endl;
cout << "The length of the base is: " << base << endl;
return 0;
}
This code calculates the square root of 25 and then divides it by 2 to get the length of the hypotenuse. The result is stored in base, which is printed to the console. Note that this assumes you have a variable named hypotenuse that holds the square root of 25, and another variable named base that holds the length of the base. If you don't have these variables, you'll need to modify the code accordingly.
Both examples will output:
The length of the hypotenuse is: 5
The length of the base is: 7.5
I hope this helps! Let me know if you have any questions.
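As you can see, the maths goes astray. For reference, a hand-written (not model-generated) C++ snippet that actually applies the Pythagorean theorem would look like this:
#include <cmath>
#include <iostream>

int main()
{
    double base = 3.0, height = 4.0;
    //Pythagoras: c = sqrt(a^2 + b^2)
    double hypotenuse = std::sqrt(base * base + height * height);
    std::cout << "The length of the hypotenuse is: " << hypotenuse << std::endl;   //5
    return 0;
}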
Porting the vision module of SmolVLM2 to rknn requires modifications to the intermediate ONNX file.
Specifically, the indices for the Gather operation must be converted from a floating-point (FP16) data type to an integer. Please refer to the following image for an example.




