-
The first step is to have a clear vision of what we want to create and why it’s valuable to build it(we can feed the skeleton of our main idea to the LLM Twin and let it do the grunt work).It helps you do the following:
-
Create your brand
-
Automate the writing process
-
Brainstorm new creative ideas
The key of the LLM Twin stands in the following:
- What data we collect
- How we preprocess the data
- How we feed the data into the LLM
- How we chain multiple prompts for the desired results
- How we evaluate the generated content
An MVP is a powerful strategy because of the following reasons:
- Accelerated time-to-market: Launch a product quickly to gain early traction
- Idea validation: Test it with real users before investing in the full development of the product
- Market research: Gain insights into what resonates with the target audience
- Risk minimization: Reduces the time and resources needed for a product that might not achieve market success
Building production-ready ML systems is much more than just training a model. From an engineering point of view, training the model is the most straightforward step in most use cases. However, training a model becomes complex when deciding on the correct architecture and hyperparameters. That’s not an engineering problem but a research problem.
At this point, we want to focus on how to design a production-ready architecture. Training a model with high accuracy is extremely valuable, but just by training it on a static dataset, you are far from deploying it robustly. We have to consider how to do the following:
- Ingest, clean, and validate fresh data
- Training versus inference setups
- Compute and serve features in the right environment
- Serve the model in a cost-effective way
- Version, track, and share the datasets and models
- Monitor your infrastructure and models
- Deploy the model on a scalable infrastructure
- Automate the deployments and training
**** we have the feature, training, and inference pipelines

The feature pipeline takes raw data as input, processes it, and outputs the features and labels required by the model for training or inference. Instead of directly passing them to the model, the features and labels are stored inside a feature store. Its responsibility is to store, version, track, and share the features. By saving the features in a feature store, we always have a state of our features. Thus, we can easily send the features to the training and inference pipelines.
As the data is versioned, we can always ensure that the training and inference time features match. Thus, we avoid the training-serving skew problem.
The training pipeline takes the features and labels from the features stored as input and outputs a train model or models. The models are stored in a model registry. Its role is similar to that of feature stores, but this time, the model is the first-class citizen. Thus, the model registry will store, version, track, and share the model with the inference pipeline.
Also, most modern model registries support a metadata store that allows you to specify essential aspects of how the model was trained. The most important are the features, labels, and their version used to train the model. Thus, we will always know what data the model was trained on.
The inference pipeline takes as input the features and labels from the feature store and the trained model from the model registry. With these two, predictions can be easily made in either batch or real-time mode.
Until now, we defined what the LLM Twin should support from the user’s point of view. Now, let’s clarify the requirements of the ML system from a purely technical perspective:
On the data side, we have to do the following: *** Collect data from LinkedIn, Medium, Substack, and GitHub completely autonomously and on a schedule Standardize the crawled data and store it in a data warehouse --- Clean the raw data --- Create instruct datasets for fine-tuning an LLM --- Chunk and embed the cleaned data. Store the vectorized data into a vector DB for RAG.
- Fine-tune LLMs of various sizes (7B, 14B, 30B, or 70B parameters)
- Fine-tune on instruction datasets of multiple sizes
- Switch between LLM types (for example, between Mistral, Llama, and GPT)
- Test potential production LLM candidates before deploying them
- Automatically start the training when new instruction datasets are available.
- A REST API interface for clients to interact with the LLM Twin
- Access to the vector DB in real time for RAG
- Inference with LLMs of various sizes
--- Automatically deploy the LLMs that pass the evaluation step. --- The system will support the following LLMOps features: ---- Instruction dataset versioning, lineage, and reusability ---- Model versioning, lineage, and reusability
- Continuous training, continuous integration, and continuous delivery (CT/CI/CD)
- Prompt and system monitoring
The data collection pipeline involves crawling your personal data from Medium, Substack, LinkedIn, and GitHub. As a data pipeline, we will use the extract, load, transform (ETL) pattern to extract data from social media platforms, standardize it, and load it into a data warehouse.
However, from the processing, fine-tuning, and RAG points of view, it is vital to know what type of data we ingested, as each category must be processed differently. For example, the chunking strategy between a post, article, and piece of code will look different. Also, by grouping the data by category, not the source, we can quickly plug data from other platforms, such as X into the posts or GitLab into the code collection.
The feature pipeline’s role is to take raw articles, posts, and code data points from the data warehouse, process them, and load them into the feature store.
The characteristics of the FTI pattern are already present.
Here are some custom properties of the LLM Twin’s feature pipeline:
- It processes three types of data differently: articles, posts, and code
- It contains three main processing steps necessary for fine-tuning and RAG: cleaning, chunking, and embedding -**** It creates two snapshots of the digital data, one after cleaning (used for fine-tuning) and one after embedding (used for RAG)
- It uses a logical feature store instead of a specialized feature store
The training pipeline consumes instruct datasets from the feature store, fine-tunes an LLM with it, and stores the tuned LLM weights in a model registry. More concretely, when a new instruct dataset is available in the logical feature store, we will trigger the training pipeline, consume the artifact, and fine-tune the LLM.
In the initial stages, the data science team owns this step. They run multiple experiments to find the best model and hyperparameters for the job, either through automatic hyperparameter tuning or manually. To compare and pick the best set of hyperparameters, we will use an experiment tracker to log everything of value and compare it between experiments. Ultimately, they will pick the best hyperparameters and fine-tuned LLM and propose it as the LLM production candidate. The proposed LLM is then stored in the model registry. After the experimentation phase is over, we store and reuse the best hyperparameters found to eliminate the manual restrictions of the process. Now, we can completely automate the training process, known as continuous training.
The testing pipeline is triggered for a more detailed analysis than during fine-tuning. Before pushing the new model to production, assessing it against a stricter set of tests is critical to see that the latest candidate is better than what is currently in production. If this step passes, the model is ultimately tagged as accepted and deployed to the production inference pipeline. Even in a fully automated ML system, it is recommended to have a manual step before accepting a new production model. It is like pushing the red button before a significant action with high consequences. Thus, at this stage, an expert looks at a report generated by the testing component. If everything looks good, it approves the model, and the automation can continue.
The particularities of this component will be on LLM aspects, such as the following:
How do you implement an LLM agnostic pipeline? What fine-tuning techniques should you use? How do you scale the fine-tuning algorithm on LLMs and datasets of various sizes? How do you pick an LLM production candidate from multiple experiments? How do you test the LLM to decide whether to push it to production or not?
The inference pipeline is the last piece of the puzzle. It is connected to the model registry and logical feature store. It loads a fine-tuned LLM from the model registry, and from the logical feature store, it accesses the vector DB for RAG. It takes in client requests through a REST API as queries. It uses the fine-tuned LLM and access to the vector DB to carry out RAG and answer the queries.
All the client queries, enriched prompts using RAG, and generated answers are sent to a prompt monitoring system to analyze, debug, and better understand the system. Based on specific requirements, the monitoring system can trigger alarms to take action either manually or automatically.
At the interface level, this component follows exactly the FTI architecture, but when zooming in, we can observe unique characteristics of an LLM and RAG system, such as the following:
- A retrieval client used to do vector searches for RAG
- Prompt templates used to map user queries and external information to LLM inputs
- Special tools for prompt monitoring