Crafting Your Custom AI Companion: A Comprehensive Manual for Constructing a Text and Voice-Enabled Local LLM

Crafting your own personal LLM AI assistant equipped with voice capabilities for engaging conversations.








The code repository is available here:

https://github.com/amirarsalan90/personal_llm_assista


The key components of the app are:


1. Local LLM (Powered by llama-cpp-python)

2. Speech-to-Text Functionality (Whisper)

3. Text-to-Speech Capability (Bark)"



Local LLM (llama-cpp-python)

Llama-cpp-python is a Python binding for llama.cpp, which provides C/C++ implementations of many large language models. Because it is widely adopted by the open-source community, it was the natural choice for this tutorial.


Important note: I've tested this application on a system with an Nvidia RTX 4090 GPU.


First, let's create a fresh conda environment:

conda create --name assistant python=3.10
conda activate assistant


Next, we install llama-cpp-python. As described in the llama-cpp-python documentation, llama.cpp supports several hardware acceleration backends that speed up inference. To run the LLM on the GPU, we'll compile the package with cuBLAS support. I initially had trouble offloading the model onto the GPU, and a helpful post pointed me to the proper installation procedure:

export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'


Additionally, we'll need to install several other packages for this application:

pip install gradio
pip install openai
pip install "huggingface_hub[cli]"
pip install torch
pip install transformers
pip install nltk
pip install optimum


The next step is to download the model weights to serve. For this demonstration, I've chosen Mistral-7B-Instruct-v0.2 (in fact, Mistral 7B is my favorite among the 7B models). llama.cpp works with models in the GGUF format. If you're unfamiliar, TheBloke's Hugging Face page is the go-to place for quantized and GGUF-converted models.

To download the model weights, use the following command:


mkdir models
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False
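
If you'd rather stay in Python, the same file can be fetched with the `hf_hub_download` function from huggingface_hub. A minimal sketch, using the same repository and filename as the command above:

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    local_dir="./models",  # same target directory as the CLI command
)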


With the model weights downloaded, we're ready to test our LLM. To do so, we'll start an LLM server (assuming llama-cpp-python[server] was installed earlier). Open a terminal, activate the assistant conda environment, and launch the server:

conda activate assistant
python3 -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --n_gpu_layers -1 --chat_format chatml



In this setup, the `n_gpu_layers` parameter determines how many layers of the model are offloaded to the GPU. My Nvidia RTX 4090 has 24 GB of VRAM, which is more than enough for the quantized model, so I offload all layers (-1) to the GPU. The second parameter, `chat_format`, specifies the chat template for our model; since we're using the Mistral model, we've selected the `chatml` template. For further details on chat templates, refer to the documentation.
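
Before writing any application code, it's worth confirming the server is up. Since it exposes an OpenAI-compatible API, one quick sanity check (a minimal sketch; the API key is just a placeholder, as the local server doesn't validate it) is to list the models it is serving:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")  # local server ignores the key
print([m.id for m in client.models.list().data])  # should include the served GGUF model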

Now let's send requests to our model from Python. llama-cpp-python exposes an OpenAI-compatible API, so you can talk to your local LLM much like you would interact with OpenAI's GPT models such as GPT-3.5 or GPT-4.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")
response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "In what city were the 2000 Olympics held?"},
    ],
)

print(response)
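
The server also supports streaming. If you want tokens to appear as they are generated rather than waiting for the full answer, a minimal sketch (reusing the client from above) looks like this:

stream = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "In what city were the 2000 Olympics held?"},
    ],
    stream=True,  # yield partial chunks instead of one final response
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)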

Speech to Text (Whisper)

For speech-to-text, we'll use the well-known Whisper model, an open-source transformer-based speech-to-text model. Whisper takes an audio file as input and produces a transcript of the spoken words. The Hugging Face Transformers implementation makes this simple; with a Hugging Face pipeline, inference with Whisper takes just a few lines:


import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

transcription = pipe(audio_file_path)['text']  # audio_file_path: path to a local audio file


The variable `transcription` holds the transcript of the input audio file.
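
For audio longer than roughly 30 seconds, the pipeline can chunk the input itself. A minimal sketch, reusing the `pipe` object created above (with `audio_file_path` again standing in for a local recording):

result = pipe(
    audio_file_path,
    chunk_length_s=30,       # split long audio into 30-second windows
    batch_size=8,            # decode several windows in parallel on the GPU
    return_timestamps=True,  # also return rough timestamps per chunk
)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])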



Text to Speech (Bark)

For text-to-speech conversion, I utilize the Bark model, a transformer-based text-to-speech model capable of generating realistic, multilingual speech and various other audio elements such as music, background noise, and simple sound effects. Additionally, the model can produce nonverbal communications like laughter, sighs, and cries.

Once again, we'll employ the Hugging Face implementation for the Bark model. Using it with Hugging Face Transformers is straightforward:


from transformers import AutoProcessor, AutoModel
from IPython.display import Audio

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)


This code snippet initializes the Bark model and processor, prepares the input text, generates speech values, and finally converts them into audio format.
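
If you'd rather save the generated speech to disk than play it inline, one option (a sketch, assuming scipy is available in your environment) is scipy.io.wavfile.write:

import numpy as np
from scipy.io import wavfile

audio_array = speech_values.cpu().numpy().squeeze().astype(np.float32)
wavfile.write("bark_output.wav", rate=sampling_rate, data=audio_array)  # 32-bit float WAV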


Web UI (Gradio)

Building a web user interface (UI) is simple with Gradio, a library that lets you put together a UI for a data science project in minutes.

Here's the complete code for the Gradio app:


import gradio as gr
from transformers import pipeline
from transformers import AutoProcessor, BarkModel
import torch
from openai import OpenAI
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize

# sent_tokenize needs the NLTK 'punkt' tokenizer data; download it once if missing
nltk.download('punkt', quiet=True)

WORDS_PER_CHUNK = 25

def split_sentence_into_chunks(sentence, n):
    words = sentence.split()
    if len(words) <= n:
        return [sentence]
    else:
        chunks = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
        return chunks

# Setup Whisper client
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

voice_processor = AutoProcessor.from_pretrained("suno/bark")
voice_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:0")

voice_model = voice_model.to_bettertransformer()  # BetterTransformer speedup (requires the optimum package)
voice_preset = "v2/en_speaker_9"

system_prompt = "You are a helpful AI"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")  # api_key is a placeholder; the local server does not check it
sample_rate = 48000

def transcribe_and_query_llm_voice(audio_file_path):
    transcription = pipe(audio_file_path)['text']
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )
    llm_response = response.choices[0].message.content

    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))

    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)

    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        inputs = model_input[i:i + BATCH_SIZE]  # next batch of sentences (i already advances in steps of BATCH_SIZE)
        
        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)
            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)
            speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]
            pieces += [*speech_output, silence.copy()]
        
        
    whole_output = np.concatenate(pieces)
    audio_output = (sampling_rate, whole_output) 
    return llm_response, audio_output

def transcribe_and_query_llm_text(text_input):
    transcription = text_input
    response = client.chat.completions.create(
        model="mistral",
        messages=[
            {"role": "system", "content": system_prompt},  # Update this as per your needs
            {"role": "user", "content": transcription + "\n Answer briefly."}
        ],
    )
    llm_response = response.choices[0].message.content
    sampling_rate = voice_model.generation_config.sample_rate
    silence = np.zeros(int(0.25 * sampling_rate))
    BATCH_SIZE = 12
    model_input = sent_tokenize(llm_response)
    pieces = []
    for i in range(0, len(model_input), BATCH_SIZE):
        inputs = model_input[i:i + BATCH_SIZE]  # next batch of sentences
        if len(inputs) != 0:
            inputs = voice_processor(inputs, voice_preset=voice_preset)
            speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)
            speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]
            pieces += [*speech_output, silence.copy()]
    whole_output = np.concatenate(pieces)
    audio_output = (sampling_rate, whole_output)  
    return llm_response, audio_output

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            text_input = gr.Textbox(label="Type your request", placeholder="Type here or use the microphone...")
            audio_input = gr.Audio(sources=["microphone"], type="filepath", label="Or record your speech")
        with gr.Column():
            output_text = gr.Textbox(label="LLM Response")
            output_audio = gr.Audio(label="LLM Response as Speech", type="numpy")
    submit_btn_text = gr.Button("Submit Text")
    submit_btn_voice = gr.Button("Submit Voice")
    submit_btn_voice.click(fn=transcribe_and_query_llm_voice, inputs=[audio_input], outputs=[output_text, output_audio])
    submit_btn_text.click(fn=transcribe_and_query_llm_text, inputs=[text_input], outputs=[output_text, output_audio])

demo.launch(ssl_verify=False,
            share=False,
            debug=False)


This code sets up a Gradio UI where users can input text or record their voice to interact with the LLM, and receive both text and audio responses.



A few notes on the Python code above:


1. Bark Voices: The Bark model offers various voices to choose from. We're utilizing the voice "v2/en_speaker_9". The complete list of options can be found [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings/v2).


2. Function Assignment: We're assigning the `transcribe_and_query_llm_voice` function to `submit_btn_voice` to execute the model on the user's voice input. Similarly, `transcribe_and_query_llm_text` function is assigned to `submit_btn_text` to handle text input from the user.


3. Chunk Processing: The code splits the LLM response into sentences with `sent_tokenize` and runs the Bark model on batches of sentences, then concatenates the resulting audio (with short silences in between). This batching handles long responses more effectively; a short example follows these notes.


4. Microphone Requirement: To submit voice input to the model, a microphone must be available on your PC. However, if you don't have a microphone, you can still input text into the textbox and receive both text and audio output from the model.
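
As promised above, here's a tiny illustration of the sentence splitting that the chunk processing relies on (it assumes the NLTK 'punkt' tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize

text = "Sydney hosted the 2000 Olympics. It was a memorable event."
print(sent_tokenize(text))
# ['Sydney hosted the 2000 Olympics.', 'It was a memorable event.']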

To run the UI, make sure your LLM server is still running. Then, in a new terminal, execute:


python gradio_tts.py


You'll receive a link in the terminal, typically http://127.0.0.1:7860. Your app will be ready to use through this link!




Finally, here's how to access the app from your phone while it's hosted by your PC on your home WiFi:





1. Find your PC's Local IP Address:

  • If you're using Ubuntu, open a terminal and use the command:
     
     ip addr | grep 'inet ' | grep -v ' lo' | awk '{print $2}' | cut -d'/' -f1
     
  • This will give you your PC's local IP address. For example, let's say it's 192.168.0.231.



2. Access the App on Your Phone:

  • With the Gradio app launched on your PC, open the Chrome browser on your phone.

  • Navigate to the following URL, replacing "192.168.0.231" with your PC's local IP address:
     
     http://192.168.0.231:7860
     
  • Here, "7860" is the port on which the app is running.


3. HTTPS Access for Microphone Usage:

  • Accessing the app over HTTP may restrict your phone's microphone usage due to security restrictions.
  • To enable microphone usage, set up HTTPS for your Gradio app:
  • Ensure OpenSSL is installed on your PC.
  • In the terminal, generate a self-signed SSL certificate and private key using the command:
       
       openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes
       

  • After generating the certificate and key, reference them in the `demo.launch` call in your `gradio_tts.py` file:
       
       demo.launch(ssl_verify=False,
                   share=False,
                   debug=False,
                   server_name="0.0.0.0",
                   ssl_certfile="cert.pem",
                   ssl_keyfile="key.pem")

       

4. Access the App Over HTTPS on Your Phone:
  • After setting up HTTPS, open Chrome on your phone and navigate to:
     
     https://192.168.0.231:7860
     
  • Now, you'll be able to use your phone's microphone with the app securely.

Remember to follow for more AI content, as more hands-on articles are on the way!