Building a Dynamic Audio and Text Conversational AI
In today’s AI-driven world, creating seamless and intelligent conversational agents is a must. This guide will walk you through building a sophisticated Q&A application leveraging Groq for inference and Deepgram for ASR (Automatic Speech Recognition) and TTS (Text-to-Speech). Let’s dive into the details and get your conversational AI up and running.
Understanding Remote Calls, ASR, TTS, and LLM
Remote Calls
Remote calls refer to making requests to external services over a network (usually the internet) to perform specific tasks. In the context of APIs, remote calls allow you to utilize functionalities provided by external servers. These functionalities can include processing data, fetching information, or performing specific computations.
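For example, here is a minimal sketch of a remote call using Python's requests library (the URL is a placeholder, not a real service):
import requests

# Ask an external server to do work over the network and return the result.
response = requests.get("https://api.example.com/v1/status", timeout=10)
response.raise_for_status()  # Fail loudly on HTTP errors
print(response.json())  # The server's reply, parsed from JSON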
ASR (Automatic Speech Recognition)
ASR is the technology that converts spoken language into text. It processes audio input to recognize and transcribe spoken words. ASR is widely used in applications like voice assistants, transcription services, and voice-controlled interfaces.
TTS (Text-to-Speech)
TTS is the technology that converts written text into spoken voice output. It processes text input to generate human-like speech, enabling applications like virtual assistants, reading aids, and automated customer service systems.
LLM (Large Language Model)
LLMs are advanced machine learning models trained on vast amounts of text data to understand, generate, and manipulate human language. They can perform tasks such as text generation, translation, summarization, and answering questions. Examples include GPT-3, BERT, and other transformer-based models.
Understanding Groq and Deepgram
Before diving into the implementation details of our conversational AI system, it’s crucial to grasp the foundational technologies that power it: Groq and Deepgram. These cutting-edge platforms play pivotal roles in enhancing the performance and capabilities of AI-driven applications.
Groq
Groq is a high-performance AI inference platform engineered to accelerate machine learning and deep learning workloads. It distinguishes itself through a unique architecture designed for maximum efficiency and parallelism. Here are some key attributes of Groq:
- Scalability: Groq can seamlessly handle large datasets and complex models, making it ideal for enterprise-level AI applications.
- Speed: Optimized for low-latency inference, Groq ensures rapid processing of AI models, which is critical for real-time applications.
- Flexibility: It supports a wide range of AI frameworks and models, allowing developers to integrate their preferred tools and technologies.
Groq’s architecture is built to deliver unparalleled performance, making it a prime choice for applications requiring high-speed data processing and inference, such as our Q&A application.
Deepgram
Deepgram is a leading provider of automatic speech recognition (ASR) and text-to-speech (TTS) solutions. Deepgram leverages deep learning techniques to deliver superior accuracy and performance in speech-related tasks. Key features of Deepgram include:
- High Accuracy: Deepgram’s models are trained on vast datasets to ensure precise transcription and voice synthesis, even in noisy environments.
- Real-Time Processing: Designed for low-latency operations, Deepgram can process and transcribe speech in real-time, making it ideal for interactive applications.
- Customization: Deepgram allows for model customization to suit specific industry needs, ensuring higher accuracy for domain-specific terminology and accents.
In our conversational AI system, Deepgram will handle the voice input and output, enabling seamless interaction between users and the AI.
Step 1: Setting Up the Environment
To begin, you’ll need to set up your development environment. Note that asyncio, shutil, and subprocess ship with Python’s standard library, so only the third-party packages need to be installed:
pip install python-dotenv requests langchain langchain-community langchain-groq langchain-openai deepgram-sdk faiss-cpu sentence-transformers transformers
Then create a .env file in your project root to hold your API keys:
GROQ_API_KEY=your_groq_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
Step 2: Importing Required Libraries
Here, we import all the necessary libraries that we will use throughout the implementation. This includes libraries for asynchronous operations, environment variable management, file handling, HTTP requests, language processing, and interacting with Deepgram and Groq APIs.
import asyncio
from dotenv import load_dotenv
import shutil
import subprocess
import requests
import time
import os
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain, ConversationChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import create_history_aware_retriever
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory, VectorStoreRetrieverMemory
from langchain.prompts import (
ChatPromptTemplate,
MessagesPlaceholder,
SystemMessagePromptTemplate,
HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
import sys
from deepgram import (
DeepgramClient,
DeepgramClientOptions,
LiveTranscriptionEvents,
LiveOptions,
Microphone,
)
Step 3: Initialize the Groq Client
We initialize a client for the Groq model, which will be used for language model processing.
class LanguageModelProcessor:
    def __init__(self):
        load_dotenv()
        # Groq-hosted LLM; temperature=0 keeps answers deterministic
        self.llm = ChatGroq(temperature=0, model_name="mixtral-8x7b-32768", groq_api_key=os.getenv("GROQ_API_KEY"))
- load_dotenv(): Loads environment variables from the .env file.
- ChatGroq: Initializes the Groq language model with specified parameters like temperature and model name, and uses the API key from the environment variable.
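As a quick sanity check, here is a minimal sketch of invoking the initialized client directly (the prompt string is just an example):
# Smoke test for the Groq client
processor = LanguageModelProcessor()
reply = processor.llm.invoke("In one sentence, what is automatic speech recognition?")
print(reply.content)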
Step 4: Load Documents and Create Vector Store
We load documents, split them into smaller chunks, create embeddings, and store these embeddings in a FAISS vector store.
def load_documents(directory_path):
    # Load every text file in the directory and split it into overlapping chunks
    loader = DirectoryLoader(directory_path, loader_cls=TextLoader)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_documents = text_splitter.split_documents(documents)
    return split_documents

def create_vector_store(documents):
    # Embed each chunk and index the embeddings in FAISS for similarity search
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
    vector_store = FAISS.from_documents(documents, embeddings)
    return vector_store
- DirectoryLoader: Loads documents from a specified directory.
- RecursiveCharacterTextSplitter: Splits documents into chunks of 1000 characters with an overlap of 200 characters to ensure continuity.
- HuggingFaceEmbeddings: Creates embeddings using a pre-trained model.
- FAISS: Stores the embeddings in a vector database for efficient retrieval.
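Putting the two helpers together, a minimal usage sketch (the ./docs path is a placeholder for wherever your text files live):
# Build the vector store from a folder of text files
documents = load_documents("./docs")
vector_store = create_vector_store(documents)

# Quick check: retrieve the chunks most similar to a sample query
hits = vector_store.similarity_search("What is Groq?", k=2)
for hit in hits:
    print(hit.page_content[:100])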
Step 5: Create History-Aware Retriever
We create a retriever that can understand and use the context from previous interactions to formulate better responses.
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# create_history_aware_retriever is imported from langchain.chains above;
# defining a same-named wrapper around it would shadow the import and recurse
# forever, so we call the library function directly.
llm = LanguageModelProcessor().llm  # reuse the Groq client from Step 3
retriever = vector_store.as_retriever()
history_aware_retriever = create_history_aware_retriever(llm, retriever, contextualize_q_prompt)
- create_history_aware_retriever: A function (imported from langchain.chains) that creates a retriever that considers chat history.
- contextualize_q_system_prompt: A prompt template that helps in reformulating user questions based on chat history.
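As an illustration (the two-turn history below is invented), the retriever can be exercised on its own before wiring it into the full chain; the follow-up question only makes sense once it is rewritten against the history:
from langchain_core.messages import AIMessage

docs = history_aware_retriever.invoke({
    "input": "How fast is it?",  # ambiguous on its own
    "chat_history": [
        HumanMessage(content="What is Groq?"),
        AIMessage(content="Groq is a high-performance AI inference platform."),
    ],
})
print(f"Retrieved {len(docs)} chunks")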
Step 6: Create the Full QA Chain
We create a QA chain that can use the history-aware retriever to answer questions accurately.
def create_qa_chain(llm, history_aware_retriever):
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\
{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
[
("system", qa_system_prompt),
MessagesPlaceholder("chat_history"),
("human", "{input}"),
]
)
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
return rag_chain
- qa_system_prompt: A system prompt to guide the QA process.
- create_stuff_documents_chain: Creates a chain for combining document context.
- create_retrieval_chain: Combines the retriever with the QA chain.
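A minimal sketch of invoking the finished chain (the question and the empty history are example inputs):
rag_chain = create_qa_chain(llm, history_aware_retriever)
result = rag_chain.invoke({"input": "What does Deepgram do?", "chat_history": []})
print(result["answer"])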
Step 7: Manage Chat History
We manage chat history to maintain context and enable coherent back-and-forth conversations.
store = {}  # maps session_id -> ChatMessageHistory

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    # Retrieve the history for this session, creating it on first use
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

def ask(text):
    # Time the round trip through the chain and report latency in milliseconds
    start_time = time.time()
    response = conversational_rag_chain.invoke(
        {"input": text},
        config={"configurable": {"session_id": "abc123"}},
    )
    elapsed_time = int((time.time() - start_time) * 1000)
    print(f"LLM ({elapsed_time}ms): {response['answer']}")
    # RunnableWithMessageHistory records each turn automatically,
    # so no manual chat_history bookkeeping is needed here.
    return response["answer"]
- get_session_history: Retrieves or initializes chat history for a session.
- RunnableWithMessageHistory: A chain wrapper that maintains message history across turns.
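To see the history at work, here are two example turns in the same session; the follow-up question is resolved against the stored first turn:
ask("What is Groq?")
ask("How does it compare to Deepgram?")  # relies on the remembered previous turn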
Step 8: Implement ASR with Deepgram
We implement real-time ASR using the Deepgram API to transcribe speech to text.
class TranscriptCollector:
    def __init__(self):
        self.reset()

    def reset(self):
        self.transcript_parts = []

    def add_part(self, part):
        self.transcript_parts.append(part)

    def get_full_transcript(self):
        # Join interim parts into the complete spoken sentence
        return ' '.join(self.transcript_parts)

transcript_collector = TranscriptCollector()
async def get_transcript(callback):
transcription_complete = asyncio.Event()
try:
config = DeepgramClientOptions(options={"keepalive": "true"})
deepgram = DeepgramClient(os.getenv("DEEPGRAM_API_KEY"), config)
dg_connection = deepgram.listen.asynclive.v("1")
print("Listening...")
        # Deepgram's async event handlers receive the client as the first argument
        async def on_message(self, result, **kwargs):
sentence = result.channel.alternatives[0].transcript
if not result.speech_final:
transcript_collector.add_part(sentence)
else:
transcript_collector.add_part(sentence)
full_sentence = transcript_collector.get_full_transcript()
if len(full_sentence.strip()) > 0:
full_sentence = full_sentence.strip()
print(f"Human: {full_sentence}")
callback(full_sentence)
transcript_collector.reset()
transcription_complete.set()
dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)
options = LiveOptions(
model="nova-2",
punctuate=True,
language="en-US",
encoding="linear16",
channels=1,
sample_rate=16000,
endpointing=300,
smart_format=True,
)
await dg_connection.start(options)
microphone = Microphone(dg_connection.send)
microphone.start()
await transcription_complete.wait()
microphone.finish()
await dg_connection.finish()
except Exception as e:
print(f"Could not open socket: {e}")
return
- TranscriptCollector: Collects and manages parts of the transcribed text.
- on_message: Handles incoming transcriptions and finalizes the transcript.
- LiveOptions: Configuration for the Deepgram API to handle real-time audio input.
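Since get_transcript is a coroutine, it needs an event loop and a callback to receive each finalized sentence. A minimal sketch:
# Run one listen-and-transcribe cycle; the callback just prints the result
def handle_sentence(sentence):
    print(f"Callback received: {sentence}")

asyncio.run(get_transcript(handle_sentence))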
Step 9: Implement TTS with Deepgram
We implement TTS using the Deepgram API to convert text back to speech.
class TextToSpeech:
    def __init__(self):
        load_dotenv()
        self.DG_API_KEY = os.getenv("DEEPGRAM_API_KEY")
        # Aura is Deepgram's TTS voice family; nova-2 is a transcription model, not a voice
        self.MODEL_NAME = "aura-asteria-en"

    def speak(self, text):
        # TTS uses the /v1/speak endpoint (/v1/listen is for transcription)
        DEEPGRAM_URL = f"https://api.deepgram.com/v1/speak?model={self.MODEL_NAME}&encoding=linear16&sample_rate=24000"
        headers = {
            "Authorization": f"Token {self.DG_API_KEY}",
            "Content-Type": "application/json",
        }
        payload = {"text": text}
        try:
            response = requests.post(DEEPGRAM_URL, json=payload, headers=headers)
            response.raise_for_status()
            with open("response.wav", "wb") as f:
                f.write(response.content)
            # Play the audio with ffplay (ships with FFmpeg)
            subprocess.run(["ffplay", "-autoexit", "-nodisp", "response.wav"])
        except requests.exceptions.HTTPError as err:
            print("HTTP Error:", err)
        except requests.exceptions.RequestException as err:
            print("Request Exception:", err)
tts = TextToSpeech()
tts.speak("Hello! This is a test.")
- TextToSpeech: Initializes the TTS setup with the Deepgram API.
- speak: Sends the text to the Deepgram API, retrieves the audio response, and plays it with ffplay (part of FFmpeg, which must be installed separately).
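To close the loop, here is a sketch of one full listen, answer, speak cycle; the glue function is hypothetical and simply chains the pieces built in Steps 7-9:
# Hypothetical wiring of ASR -> LLM -> TTS for a single voice turn
async def conversation_turn():
    captured = {}

    def on_sentence(sentence):
        captured["text"] = sentence  # store the finalized transcript

    await get_transcript(on_sentence)  # Step 8: speech to text
    if "text" in captured:
        answer = ask(captured["text"])  # Step 7: text to answer
        TextToSpeech().speak(answer)  # Step 9: answer to speech

asyncio.run(conversation_turn())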
Conclusion
By following these steps, you’ve created a robust conversational AI system that integrates ASR, TTS, and LLM capabilities using Deepgram and Groq. This setup allows for seamless interactions through speech and text, making it a versatile tool for various applications.
This blog marks the end of the GenAI Basics to Advanced series. Make sure you go through each post to take full advantage of the series, and follow and subscribe to the newsletter for upcoming blogs on the latest AI trends and tools.
Thank you for reading!