How to Build a Text Based Generative AI System Without Training Large Models

Shubham Srivastava, Tech Lead I at GeekyAnts, unlocks AI's potential with pre-trained models and RAG, making powerful generative systems accessible.



Recent advancements in AI have produced increasingly capable and efficient text-based generative models. Large-scale training has traditionally been considered essential for developing these models, and it requires substantial computational power and technical expertise. However, it is now evident that robust generative systems can be built without starting from scratch. By combining pre-trained models with RAG (Retrieval-Augmented Generation), we can make AI accessible to everyone.

How Generative Models Work

Generative AI models create new content based on the data they were trained on. In the case of text generation, such models learn from existing texts to capture the structure and semantics of the language. This knowledge is then applied to produce new, previously unseen text that resembles the training data in style but differs in content.
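
As a toy illustration of learning structure from existing text and then producing new text, the sketch below builds a bigram (word-pair) table from a tiny corpus and samples from it. This is a deliberate simplification: real generative models are neural networks with a very large number of parameters rather than a lookup table, and the corpus, function names, and seed here are made up for the example.

```python
import random
from collections import defaultdict


def train_bigrams(text: str) -> dict:
    """Record, for every word, the words that followed it in the corpus."""
    words = text.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return dict(table)


def generate(table: dict, start: str, length: int, seed: int = 0) -> str:
    """Sample up to `length` words by repeatedly picking an observed successor."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        successors = table.get(output[-1])
        if not successors:  # dead end: this word was never followed by anything
            break
        output.append(rng.choice(successors))
    return " ".join(output)


corpus = "the cat sat on the mat and the dog sat on the rug"
table = train_bigrams(corpus)
print(generate(table, start="the", length=8))
```

Every adjacent word pair in the generated text was seen in the corpus, yet the sentence as a whole may be new, which is the basic idea the paragraph above describes.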

The Inefficiencies of Traditional Training

Training large generative models from scratch involves using extensive textual corpora and tuning internal parameters to minimize prediction errors. This process is highly demanding in terms of resources, time, and the expertise needed to systematically analyze such large datasets. Furthermore, the knowledge such a model acquires is static: it is fixed at training time, which is inadequate in dynamic environments.

As data quickly becomes outdated, especially in rapidly changing domains like social networking or news, models soon fail to represent the real world accurately. Retraining these large models with new data is both costly and inefficient. The necessity of continuous updates to maintain accuracy highlights the limitations of traditional training methods in handling dynamic and ever-evolving information.

Introducing Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) was designed to address these issues. RAG combines a generative model with on-the-fly retrieval of relevant information from external knowledge sources. In addition to the text generator, a RAG system draws on sources such as a regularly updated database, a knowledge graph, or web content.

The process starts by retrieving documents or snippets related to the input query. The generative model then uses the retrieved pieces to produce more accurate and contextually enriched responses. Because retrieval happens at query time, RAG models can incorporate the latest information dynamically without frequent retraining.
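
The retrieve-then-generate flow can be sketched in a few lines. The example below is a minimal illustration under simplifying assumptions: it ranks documents by plain word overlap instead of embeddings, the knowledge base is invented, and the function names (score, retrieve, build_prompt) are hypothetical. The resulting prompt is what would be handed to a generative model.

```python
from typing import List


def score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most words with the query."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, snippets: List[str]) -> str:
    """Place the retrieved snippets ahead of the question for the generator."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "pinecone stores vector embeddings",
    "upstash offers serverless redis",
    "langchain connects llms to external tools",
]
question = "what does pinecone store"
prompt = build_prompt(question, retrieve(question, knowledge_base, k=1))
print(prompt)  # this string would be sent to the generative model
```

Updating the system's knowledge then means editing knowledge_base, not retraining a model.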

In this way, the approach addresses the issues of conventional training methods. It lets generative systems stay current with new data, making them more robust and reliable in fast-changing environments. Also, unlike a pre-trained model used on its own, a RAG model can draw on a large pool of external data, which leads to more comprehensive and accurate responses.

Implementation of RAG Using LangChain

This section describes how to implement a Retrieval-Augmented Generation system with LangChain. Next, we will integrate supporting tools, such as LangSmith, Pinecone, and Upstash, which make the system more versatile and better able to adapt to changing data. Finally, the last part of this blog discusses how the output format of the language model can be controlled.

Setting Up LangChain and Other Tools

LangChain is a framework that simplifies building applications on top of LLMs, including RAG systems. It provides simple ways to combine many tools and services, running them together or in sequence.

Pinecone is a managed vector database service that efficiently stores and indexes vector embeddings. This is particularly important for the retrieval component of RAG.
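
To show what a vector lookup involves, here is a brute-force sketch of similarity search over a handful of stored embeddings. It is an illustration only: a production vector database typically relies on approximate nearest-neighbour indexes, not a full scan, and the three-dimensional vectors, document ids, and function names below are invented for the example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(
    query: List[float], index: Dict[str, List[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k stored ids most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dimensional embeddings; real ones have far more dimensions.
index = {
    "doc-redis": [0.9, 0.1, 0.0],
    "doc-vectors": [0.1, 0.9, 0.2],
}
print(nearest([0.0, 1.0, 0.1], index))
```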

Upstash provides a serverless Redis database, which is useful for caching frequently accessed chat history to reduce latency and improve the performance of your RAG system.

$ pip install langchain
$ pip install langchain-openai
$ pip install langchain-pinecone
$ pip install pinecone-client
$ pip install upstash-redis
$ pip install redis

Implementation of History

The ChatMessageHistory class saves chat messages to, and loads them from, Upstash Redis. It builds on LangChain's BaseChatMessageHistory to store and retrieve messages in the Redis database. Its major functions include:

Initialisation: Creates a Redis client from a URL and token.

Key Construction: Builds the Redis key under which a session's messages are stored by prepending a prefix to the session ID.

Message Retrieval: Reads a session's messages from Redis and converts them back into LangChain message objects.

Message Addition: Lets user or AI messages be pushed to Redis, with an optional TTL.

Session Management: Deletes the messages of a given session from Redis.

Together, these methods manage the messages and integrate them with the LangChain messaging system.

import os
import json
import logging
from typing import List, Optional, Union
from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    message_to_dict,
    messages_from_dict,
)
from langchain_core.chat_history import BaseChatMessageHistory
from upstash_redis import Redis

logger = logging.getLogger(__name__)

class ChatMessageHistory(BaseChatMessageHistory):
    def __init__(
        self,
        url: str = "",
        token: str = "",
        key_prefix: str = "message_store:",
        ttl: Optional[int] = None,
    ):
        try:
            self.redis_client = Redis(url=url, token=token)
        except Exception:
            logger.error("Upstash Redis instance could not be initiated.")

        self.key_prefix = key_prefix
        self.ttl = ttl

    def key(self, session_id: str) -> str:
        return self.key_prefix + session_id

    def messages(self, session_id) -> List[BaseMessage]:
        key = self.key(session_id)
        _items = self.redis_client.lrange(key, 0, -1)
        # If no items are found, the history could be fetched from a database and added to Redis (not implemented here).
        items = [json.loads(m) for m in _items[::-1]]
        messages = messages_from_dict(items)
        return messages

    def add_message(self, message: BaseMessage, session_id: str) -> None:
        self.redis_client.lpush(session_id, json.dumps(message_to_dict(message)))
        # add code to save history in db also (we are considering redis only for now)
        if self.ttl:
            self.redis_client.expire(session_id, self.ttl)

    def add_user_message(self, message: Union[HumanMessage, str], session_id: str) -> None:
        key = self.key(session_id)
        if isinstance(message, HumanMessage):
            self.add_message(message, session_id=key)
        else:
            # Wrap plain strings in a HumanMessage before storing them.
            self.add_message(HumanMessage(content=message), session_id=key)

    def add_ai_message(self, message: Union[AIMessage, str], session_id: str) -> None:
        key = self.key(session_id)
        if isinstance(message, AIMessage):
            self.add_message(message, session_id=key)
        else:
            # Wrap plain strings in an AIMessage before storing them.
            self.add_message(AIMessage(content=message), session_id=key)

    def clear(self, session_id) -> None:
        key = self.key(session_id)
        # Delete every stored message for this session.
        self.redis_client.delete(key)

chat_history = ChatMessageHistory(
    url=os.getenv('UPSTASH_REDIS_URL', ""),
    token=os.getenv('UPSTASH_REDIS_TOKEN', "")
)
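
One detail in the class above is easy to miss: lpush prepends to a Redis list, so lrange(key, 0, -1) returns the newest message first, which is why the messages method reverses the items before decoding them. The sketch below mimics that behavior with a hypothetical in-memory stand-in (FakeRedis is not part of upstash_redis) so the ordering can be checked without a Redis instance.

```python
import json
from typing import Dict, List


class FakeRedis:
    """Hypothetical in-memory stand-in for the two list commands used above."""

    def __init__(self) -> None:
        self.store: Dict[str, List[str]] = {}

    def lpush(self, key: str, value: str) -> None:
        # Like Redis LPUSH: insert at the head of the list.
        self.store.setdefault(key, []).insert(0, value)

    def lrange(self, key: str, start: int, stop: int) -> List[str]:
        # Like Redis LRANGE: `stop` is inclusive and -1 means the last element.
        items = self.store.get(key, [])
        end = len(items) if stop == -1 else stop + 1
        return items[start:end]


client = FakeRedis()
key = "message_store:" + "sessionId"
for text in ["first question", "first answer", "second question"]:
    client.lpush(key, json.dumps({"content": text}))

newest_first = [json.loads(m)["content"] for m in client.lrange(key, 0, -1)]
chronological = newest_first[::-1]  # same reversal as in messages()
print(chronological)
```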

Implementation of Output and Prompt

The code below defines the Output model using Pydantic and sets up a JSON output parser through LangChain.

Output Model: A deliberately simple base schema with name and description fields, which can be extended with any number of additional fields.

JsonOutputParser: A parser that converts the model's JSON-formatted output into the structure defined by the Output model.

This configuration enables typed parsing and validation of outputs within the LangChain application.

from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser

class Output(BaseModel):
    name: str = Field(title='name')
    description: str = Field(title='description')
    # ... add other fields as per your schema

output_parser = JsonOutputParser(pydantic_object=Output)
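
To make the parser's role concrete, the following standard-library sketch shows the kind of work such a parser performs: decode the model's raw JSON text and check it against the expected fields. It is a simplified stand-in, not LangChain's implementation; ParsedOutput and parse_output are hypothetical names, and the sample reply is invented.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ParsedOutput:
    name: str
    description: str


def parse_output(raw: str) -> ParsedOutput:
    """Decode raw JSON text and verify that every expected field is a string."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in fields(ParsedOutput):
        if not isinstance(data.get(field.name), str):
            raise ValueError(f"missing or non-string field: {field.name}")
    # Keep only the declared fields, ignoring any extras the model produced.
    return ParsedOutput(**{f.name: data[f.name] for f in fields(ParsedOutput)})


reply = '{"name": "RAG", "description": "Retrieval-Augmented Generation"}'
print(parse_output(reply))
```

Rejecting malformed replies at this boundary keeps the rest of the application working with well-typed data.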

The system uses two prompt templates, both built on LangChain's ChatPromptTemplate, to improve question answering. They are:

Contextualize Question Prompt: Rewrites a user question so that it can be understood without the chat history. It consists of a system prompt, a chat history placeholder, and the user input.

QA System Prompt: Guides the assistant to answer questions accurately based on the retrieved context. It consists of a system prompt, a chat history placeholder, and a user input placeholder.

These templates provide an effective way to handle context and produce accurate answers in LangChain-based question-answering applications.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
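
Conceptually, each template expands into an ordered list of messages: the system instructions, then whatever history fills the chat history placeholder, then the new user input. The sketch below imitates that expansion with plain tuples. It is an illustration only, and render_messages is a hypothetical helper, not a LangChain function.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)


def render_messages(
    system_prompt: str,
    chat_history: List[Message],
    user_input: str,
    **variables: str,
) -> List[Message]:
    """Fill the system prompt, then splice the history in before the new input."""
    system = system_prompt.format(**variables)
    return [("system", system), *chat_history, ("human", user_input)]


history = [
    ("human", "What is Task Decomposition?"),
    ("ai", "Breaking a large task into smaller steps."),
]
messages = render_messages(
    "Use the following context to answer.\n\n{context}",
    history,
    "What are common ways of doing it?",
    context="(retrieved snippets would go here)",
)
for role, content in messages:
    print(f"{role}: {content}")
```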

Implementation of RAG Chain

The script below uses OpenAI embeddings and a Pinecone vector store to build the RAG system. The major components are:

Embeddings: Initialises the OpenAI embeddings.

Vector Storage: Creates the vector store from the Pinecone index and the embeddings.

Retriever: Obtains a retriever from the vector store.

History Aware Retriever: Wraps the retriever so that it takes the chat history into account, using the contextualize prompt.

Question Answer Chain: Defines a chain that produces a concise answer from the retrieved context using the QA prompt.

RAG Chain: Combines the history-aware retriever with the QA chain into a retrieval-augmented generation chain.

The result is a LangChain-based solution that retrieves relevant context and answers questions with it.

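import os
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Assumption: the original does not show how llm is created; any LangChain chat model works.
llm = ChatOpenAI()
embeddings = OpenAIEmbeddings()

# Assumption: the index name is read from a PINECONE_INDEX_NAME environment variable.
vectorstore = PineconeVectorStore(
    index_name=os.getenv("PINECONE_INDEX_NAME", ""),
    embedding=embeddings,
)
retriever = vectorstore.as_retriever()

history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)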
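
The data flow through these pieces can be sketched without any external service. In the stand-in below, every component is a plain function supplied by the caller. The names (run_rag, rewrite_question, and so on) are hypothetical, and the behavior reflects LangChain's documented design, in which the question is only reformulated when there is chat history to resolve against.

```python
from typing import Callable, Dict, List


def run_rag(
    question: str,
    chat_history: List[str],
    rewrite_question: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str], List[str]], str],
) -> Dict[str, object]:
    """Reformulate (if needed), retrieve context, then generate an answer."""
    # With no history there is nothing to resolve, so search with the raw question.
    search_query = rewrite_question(question, chat_history) if chat_history else question
    context = retrieve(search_query)
    return {
        "input": question,
        "context": context,
        "answer": answer(question, chat_history, context),
    }


# Stub components standing in for the LLM and the vector store.
result = run_rag(
    "What are common ways of doing it?",
    chat_history=["What is Task Decomposition?", "Splitting a task into steps."],
    rewrite_question=lambda q, h: "What are common ways of doing task decomposition?",
    retrieve=lambda q: [f"snippet matching: {q}"],
    answer=lambda q, h, ctx: f"Answer based on {len(ctx)} snippet(s).",
)
print(result["answer"])
```

The rewrite step matters because a follow-up such as "What are common ways of doing it?" contains no searchable subject until "it" is resolved from the history.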

Running of RAG Chain

The following code shows what happens during a LangChain question-answering session that takes chat history into account. The major steps are:

Get Old Messages: chat_history retrieves the earlier messages of the chat from Upstash Redis.

Initial question handling: Pass the question and the chat history to rag_chain to generate the AI's answer, then add the user's question and the AI's reply to the chat history.

Fetch updated messages: Get the updated list of messages for the session.

Second-order question handling: Pass the follow-up question and the updated chat history to rag_chain to get a new response from the AI, then add the second question and its response to the chat history.

This setup allows the chat history to be stored and updated dynamically over the course of a Q&A interaction.

from langchain_core.messages import HumanMessage, AIMessage
from .history import chat_history

old_message = chat_history.messages("sessionId")

question = "What is Task Decomposition?"

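ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": old_message})
# Store the question and the answer so the next turn can see them.
chat_history.add_user_message(question, session_id="sessionId")
chat_history.add_ai_message(ai_msg_1["answer"], session_id="sessionId")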


new_message = chat_history.messages("sessionId")

second_question = "What are common ways of doing it?"

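ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": new_message})
# Store the follow-up question and its answer as well.
chat_history.add_user_message(second_question, session_id="sessionId")
chat_history.add_ai_message(ai_msg_2["answer"], session_id="sessionId")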
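
Both rounds follow the same three steps: load the history, invoke the chain, and store the new exchange. A small helper could wrap that pattern. The sketch below is one possible shape, with ask as a hypothetical function name. It assumes the chain returns a dict with an "answer" key (as chains built with create_retrieval_chain do) and that the history object exposes the session-aware methods defined earlier. The stub classes exist only to make the example runnable on its own.

```python
from typing import Any, Dict, List


def ask(chain: Any, history: Any, session_id: str, question: str) -> str:
    """Run one question through the chain and persist both sides of the exchange."""
    past_messages = history.messages(session_id)
    result = chain.invoke({"input": question, "chat_history": past_messages})
    history.add_user_message(question, session_id=session_id)
    history.add_ai_message(result["answer"], session_id=session_id)
    return result["answer"]


class StubHistory:
    """In-memory stand-in for the Redis-backed history class."""

    def __init__(self) -> None:
        self.log: Dict[str, List[str]] = {}

    def messages(self, session_id: str) -> List[str]:
        return list(self.log.get(session_id, []))

    def add_user_message(self, message: str, session_id: str) -> None:
        self.log.setdefault(session_id, []).append(f"human: {message}")

    def add_ai_message(self, message: str, session_id: str) -> None:
        self.log.setdefault(session_id, []).append(f"ai: {message}")


class StubChain:
    """Stand-in chain that reports how much history it received."""

    def invoke(self, payload: Dict[str, Any]) -> Dict[str, str]:
        return {"answer": f"seen {len(payload['chat_history'])} earlier message(s)"}


stub_history, stub_chain = StubHistory(), StubChain()
print(ask(stub_chain, stub_history, "sessionId", "What is Task Decomposition?"))
print(ask(stub_chain, stub_history, "sessionId", "What are common ways of doing it?"))
```

With the real objects, the call would look like ask(rag_chain, chat_history, "sessionId", question).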



Use Cases of Retrieval-Augmented Generation

Automation of Customer Support: RAG can extend customer support systems with accurate, contextually relevant answers to customer inquiries. It searches a knowledge base for the most appropriate information and generates responses that address the customer's issue. This approach can improve the accuracy of responses while requiring only a fraction of the time and effort of human agents, resulting in efficient and satisfactory customer service.

Research Assistance: RAG systems can be instrumental in summarising and contextualising large amounts of information from academic papers, articles, and other sources. A RAG-enabled system answers researchers' queries with compact summaries and relevant insights, drawing on the latest data through retrieval and generation. This speeds up the research process considerably, leaving scholars more time for analysis and interpretation.


Building generative text systems with AI is both feasible and highly effective. With pre-trained models, retrieval-augmented generation, and tools such as LangChain, Pinecone, LangSmith, and Upstash, it becomes possible to build systems that are meaningful, context-aware, and up to date, while saving considerable computational time and resources and making advanced AI applications accessible to every developer and business.

In this blog, we have covered how generative models work, why traditional methods of training them can be inefficient, and how RAG addresses this by adding real-time retrieval to generative capabilities. We showed how to implement such a RAG system with LangChain and how to integrate other tools that enable efficient management of embeddings, vector storage, and chat histories. Finally, we showed how to control the format in which the language model produces its output, so that the application receives answers in exactly the structure it requires. These instructions should help in developing robust AI-powered systems that provide accurate answers across a range of application domains.