Description
Problem
This project was initially built as a pure JavaScript project to enable wider deployment, but several of the functions and libraries it relies on were originally built for, and intended to be used in, Python. As a result, the project was refactored after an initial test build to include a JavaScript pipeline for individual embeddings and a Python pipeline for batched embeddings. For example, you can't use GPU acceleration in the JavaScript pipeline, but you can in the Python one.
This means that tkyoDriftSetTraining.py and tkyoDrift.js are functional duplicates of each other, except that the former is explicitly meant to be called once per batch, while the latter is meant to be invoked on every new input.
Solution
This is fine as it is, but since many JavaScript libraries are just Python scripts wearing a disguise, it would be ideal to rebuild the entire platform in Python, with a JavaScript npm package to install it and a JavaScript function hook to pass data into it. This would let the system avoid unnecessary conversion from JavaScript into Python to compute AI embeddings, calculate k-means, or generate the HNSW index.
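As a rough sketch of what the Python side of such a hook could look like: one shared embedding function serves both the batch and the per-input case, and a small loop reads JSON requests line by line (for instance from the stdin of a process spawned by the JavaScript hook). Everything named here (`embed_batch`, `embed_one`, `handle_line`, `serve`, and the `{"texts": [...]}` request shape) is a hypothetical illustration, not part of the current codebase, and the toy embedding is only a stand-in for the real model call.

```python
import json
import sys
from typing import List


def embed_batch(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the real embedding model call.

    Returns a toy 2-dimensional vector per text (length, vowel count)
    so the sketch runs without any model installed.
    """
    return [
        [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
        for t in texts
    ]


def embed_one(text: str) -> List[float]:
    """Per-input path: reuses the batch path instead of duplicating it."""
    return embed_batch([text])[0]


def handle_line(line: str) -> str:
    """Turn one JSON request line into one JSON response line.

    Assumed request shape: {"texts": ["...", "..."]}
    """
    request = json.loads(line)
    return json.dumps({"embeddings": embed_batch(request["texts"])})


def serve(stream_in=None, stream_out=None) -> None:
    """Answer newline-delimited JSON requests until the input stream ends."""
    stream_in = sys.stdin if stream_in is None else stream_in
    stream_out = sys.stdout if stream_out is None else stream_out
    for line in stream_in:
        if line.strip():
            stream_out.write(handle_line(line) + "\n")
            stream_out.flush()
```

With this shape, a single new input and a full training batch go through the same code path, so there would be only one implementation to keep in sync. A script entry point would simply call `serve()`.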
Additional information
There may also be an unintended knock-on effect: the Xenova model tokenizer behaves slightly differently from the Python tokenizer, which yields marginally different embeddings for the same text and adds noise to cosine similarity scores.
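To illustrate the kind of noise this could introduce, here is a minimal sketch that compares two slightly different vectors for the same text. The two embeddings are made-up numbers chosen for illustration only, not real output from either tokenizer or model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings of the SAME text from the two pipelines (assumption:
# the values differ only slightly, as a tokenizer mismatch might cause).
js_embedding = [0.12, -0.40, 0.88, 0.05]
py_embedding = [0.13, -0.39, 0.87, 0.06]

# Identical pipelines would score (almost exactly) 1.0 against themselves.
self_similarity = cosine_similarity(js_embedding, js_embedding)

# Mixing pipelines gives a score slightly below 1.0 for the same text.
cross_similarity = cosine_similarity(js_embedding, py_embedding)

# The gap is noise that would be added to any drift measurement that
# compares embeddings produced by different pipelines.
tokenizer_noise = self_similarity - cross_similarity
print(f"self={self_similarity:.6f} cross={cross_similarity:.6f} noise={tokenizer_noise:.6f}")
```

If training baselines come from the Python pipeline and live inputs from the JavaScript one, a gap like this would show up in every comparison, which is another argument for producing all embeddings in one place.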
👨‍👧‍👦 Contributing
- 🙋‍♂️ Yes, I'd love to make a PR to implement this feature!