[1/n] Add vLLM wrapper support #3794
Conversation
Signed-off-by: wang.yuqi <[email protected]>
I’m not very familiar with the MTEB framework, so I’m unsure where the relevant code should be placed and how the interface should be designed. For now, I’ve implemented a very, very dirty version in order to get feedback as early as possible. vLLM has many, many parameters; I have highlighted the most commonly used ones for MTEB users and added a lot of comments, which may need further polishing.
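To make the discussion concrete, here is a rough, hedged sketch of the kind of wrapper being described: a thin class around vLLM's `LLM` that exposes a handful of the most commonly tuned parameters. The class name and `encode` signature are illustrative assumptions, not the code in this PR; the vLLM arguments shown are standard `LLM(...)` constructor parameters, though the pooling/`task` handling may vary across vLLM versions.

```python
# Illustrative sketch only -- not the implementation in this PR.
from typing import Optional

import numpy as np
from vllm import LLM


class VllmEncoderSketch:
    """Minimal vLLM-backed encoder exposing a few commonly tuned parameters."""

    def __init__(
        self,
        model_name: str,
        dtype: str = "float16",               # float32 is much slower in vLLM
        max_model_len: Optional[int] = None,  # cap the context length if set
        tensor_parallel_size: int = 1,        # number of GPUs to shard across
        gpu_memory_utilization: float = 0.9,  # fraction of VRAM vLLM may reserve
    ) -> None:
        self.llm = LLM(
            model=model_name,
            task="embed",  # pooling/embedding mode; exact argument may differ by vLLM version
            dtype=dtype,
            max_model_len=max_model_len,
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=gpu_memory_utilization,
        )

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # Recent vLLM versions expose llm.embed(); older ones use llm.encode().
        outputs = self.llm.embed(sentences)
        return np.asarray([o.outputs.embedding for o in outputs])
```

Such an object could then be passed to `mteb.MTEB(...).run(...)` like any other encoder, with the remaining vLLM options forwarded as keyword arguments.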
Signed-off-by: wang.yuqi <[email protected]>
Samoed
left a comment
Overall, this looks good! I think we can extend and change further if some models would need to update something. Thank you for adding!
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
How should documents, examples, and tests be organized?

I don't think you need to add tests, because this is a wrapper and for now no models use it. As for docs, you can keep them as docstrings without additional documentation.
Samoed
left a comment
Wrapper looks good. You can add vllm as an optional dependency to pyproject and add a check to VllmWrapperBase like here:
mteb/mteb/models/search_encoder_index/search_indexes/faiss_search_index.py
Lines 28 to 33 in d3bc4cc
requires_package(
    self,
    "faiss",
    "FAISS-based search",
    install_instruction="pip install mteb[faiss-cpu]",
)
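Following that suggestion, the analogous check for vLLM could look roughly like the sketch below. The class name, import path, and the `mteb[vllm]` extra are assumptions for illustration; the `requires_package` call itself mirrors the FAISS example above.

```python
# Sketch under assumptions: the import path and the extra name are illustrative.
from mteb.requires_package import requires_package  # assumed import path


class VllmWrapperBase:
    def __init__(self, model_name: str, **vllm_kwargs) -> None:
        # Fail early with an actionable message if vLLM is not installed.
        requires_package(
            self,
            "vllm",
            "vLLM-based encoding",
            install_instruction="pip install mteb[vllm]",
        )
        ...
```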
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
I think we should at least write one test wrapping a smallish model (even if it is not the official reference implementation).

Yes, but I'm not sure if it would install and run correctly in CI.
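One low-risk option, sketched below, is a test that wraps a smallish embedding model and is skipped automatically when vLLM is not installed (e.g. on CI runners that cannot build it). The module path, class name, and model are placeholders, not the final layout.

```python
# Hypothetical test sketch; names and paths are placeholders, not the PR's layout.
import numpy as np
import pytest

pytest.importorskip("vllm")  # skip the whole test when vLLM is unavailable


def test_vllm_wrapper_encodes_with_small_model():
    from mteb.models.vllm_wrapper import VllmWrapper  # assumed module path

    model = VllmWrapper("intfloat/e5-small-v2")  # smallish, not the official reference impl
    embeddings = model.encode(["hello world", "a second sentence"])

    assert isinstance(embeddings, np.ndarray)
    assert embeddings.shape[0] == 2
```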
KennethEnevoldsen
left a comment
Great work on this! I would love to add the tests + documentation.
It would even be great to add a blog post entry comparing the speed of the sentence-transformers implementation and the vLLM implementation, but that is definitely optional.
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
## vLLM Wrapper

!!! note
    vLLM currently only supports a limited number of models, with many model implementations having subtle differences compared to the default implementations in mteb. We are working on it.

## Installation

Reference: https://docs.vllm.ai/en/latest/getting_started/installation/

=== "uv"
I think we should add these two examples to the documentation. My suggestion would be here,
under the title
## using vLLM

How about docs/advanced_usage/vllm_wrapper.md?
I'm not very good at naming things or writing documents (as I'm a non-native English speaker). It would be great if someone could help rename things and polish the documentation so that the content of this PR fits naturally with the other parts.
I can go over the docs once we have settled on the content.
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
demo/benchmark_vllm2.py
Outdated
print(
    f"Batchsize {batchsize}, Input_len {input_len} Throughput: "
    f"{len(prompts) / elapsed_time:.4f} requests/s, "
    f"{len(prompts * input_len) / elapsed_time:.4f} tokens/s, "
    f"Latency {delay * 1000:0.2f} ms, n_step {n_step}"
)
If the comparison is fair and the methodology is reasonable, the gap between the ST implementation and the vLLM implementation is not significant.

X-axis: throughput (requests/s)
Y-axis: latency, i.e. time needed for one step (ms), on a logarithmic scale
Curves toward the lower right are better ↘

- Throughput with float16 is approximately four times that of float32; vLLM has almost no optimization for float32 inference.
- vLLM does not use padding, while ST does. For a batch of requests with a large disparity in sequence lengths, ST's performance will therefore be very poor.
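For reference, the quantities printed by the benchmark snippet above can be derived from a timing loop along these lines. This is only a sketch of the methodology, not the script in the PR; `encode_fn` stands in for either backend's encode call.

```python
# Methodology sketch: time one batch and derive the metrics printed above.
import time


def measure(encode_fn, prompts: list[str], input_len: int) -> tuple[float, float, float]:
    start = time.perf_counter()
    encode_fn(prompts)
    elapsed_time = time.perf_counter() - start

    requests_per_s = len(prompts) / elapsed_time          # throughput (x-axis)
    tokens_per_s = len(prompts) * input_len / elapsed_time
    latency_ms = elapsed_time * 1000                      # per-step latency (y-axis)
    return requests_per_s, tokens_per_s, latency_ms
```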
Great illustration! I think you can add labels to the axes and add it to the docs.
Do we want to add this as a blog post (I think I would prefer that) - so it is more of a "how to compare" rather than "this is how it compares", which is harder to maintain.
blog post
Do you mean as an HF post?
I was thinking of doing a blog post like this:
https://squidfunk.github.io/mkdocs-material/blog/2022/09/12/blog-support-just-landed/
(easy to copy to HF if we want to)
Oh, nice! We can add this.
Yeah - it is a pretty good way to e.g. announce things or showcase an example, without requiring us to keep updating it.
Co-authored-by: Roman Solomatin <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
Note to myself for when I rewrite this:
- add link to speeding up mteb section
- split into API docs and usage documentation
- remove function scope
- rewrite install section into admonition
- refactor heading
I’ve submitted everything I found useful. It’s best to just update this PR directly to get it ready.
Signed-off-by: wang.yuqi <[email protected]>
I think the content of this PR is pretty solid. Let’s polish it up so it’s ready to merge.

Too many conflicts in the dependencies. We could probably move model-specific tests out into separate actions to simplify everything.

vLLM has very complex dependencies; that’s why I haven’t added tests to CI.
Signed-off-by: wang.yuqi <[email protected]>
Signed-off-by: wang.yuqi <[email protected]>
MTEB testing helps align vLLM with the sentence-transformers implementation, and identifying potential numerical-precision issues is becoming increasingly important (a rough sketch of such a check follows below).
It's great to have MTEB support vLLM as a backend.
Fix #3747
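As a rough illustration of the alignment check described above (the model name and thresholds are only examples, and the vLLM pooling arguments may differ between versions):

```python
# Hedged sketch: embed the same texts with sentence-transformers and vLLM, then compare.
import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM

texts = ["MTEB is an embedding benchmark.", "vLLM is an inference engine."]

st_emb = SentenceTransformer("intfloat/e5-small-v2").encode(texts)

llm = LLM(model="intfloat/e5-small-v2", task="embed", dtype="float16")
vllm_emb = np.asarray([o.outputs.embedding for o in llm.embed(texts)])

# Per-text cosine similarity between the two backends; values noticeably below
# ~0.99 would hint at an implementation or numerical-precision mismatch.
cos = np.sum(st_emb * vllm_emb, axis=1) / (
    np.linalg.norm(st_emb, axis=1) * np.linalg.norm(vllm_emb, axis=1)
)
print(cos)
```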