Crafting Your Custom AI Companion: A Comprehensive Manual for Constructing a Text and Voice-Enabled Local LLM

Crafting your own personal LLM AI assistant equipped with voice capabilities for engaging conversations.

Embark on a Journey to Build Your Own Personal Local LLM Assistant, Ready for Conversation! Dive into the Creation Process, Enabling Voice Input and Output:

Crafting Your Custom AI Companion: A Comprehensive Manual for Constructing a Text and Voice-Enabled Local LLM

Discover How the App Operates:

Unveil the Code Repository Here:

https://github.com/amirarsalan90/personal_llm_assista

Key Elements of the App Comprise:

1. Local LLM (Powered by llama-cpp-python)

2. Speech-to-Text Functionality (Whisper)

3. Text-to-Speech Capability (Bark)"

llama-cpp-python

Let's delve into Llama-cpp-python, a remarkable Python interface for llama.cpp, renowned for its implementation of numerous Large Language Models in C/C++. Given its extensive embrace by the open-source community, it became my natural choice for this tutorial.

Important note: I've rigorously tested this application on a system powered by an Nvidia RTX4090 GPU.

First and foremost, let's initiate a fresh conda environment:

conda create --name assistant python=3.10
conda activate assistant

Now, our next step entails installing llama-cpp-python. As detailed in the llama-cpp-python documentation, llama.cpp boasts support for various hardware acceleration backends, enhancing the speed of inference. To harness the power of the GPU and execute the Large Language Model (LLM) on it, we'll compile the program with CUBLAS. While grappling with challenges in offloading the model onto the GPU, I stumbled upon a helpful post guiding me through the proper installation process:

export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
export FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
pip install llama-cpp-python[server]

Additionally, we'll need to install several other packages for this application:

pip install gradio

pip install openai

pip install huggingface-cli

pip install huggingface_hub

pip install torch

pip install transformers

pip install nltk

pip install optimum

The next step involves downloading model weights to serve. For this demonstration, I've opted for Mistral-7b-Instruct-v0.2. (In fact, Mistral 7b is my preferred model among all other 7b models). The model formats compatible with llama.cpp are in GGUF format. If you're unfamiliar, TheBloke is the go-to repository for obtaining quantized models and GGUF converted models.

To download the model weights, use the following command:

mkdir models
download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models/ --local-dir-use-symlinks False

With the model weights now downloaded, we're prepared to put our LLM to the test. To achieve this, we'll initiate an LLM server (assuming we've already installed llama-cpp-python[server]). Simply open a terminal, activate your assistant conda environment, and launch the server:

conda activate assistant
python3 -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --n_gpu_layers -1 --chat_format chatml

In this setup, the parameter `n_gpu_layers` determines how many layers of your model will be offloaded to the GPU. Given my ample GPU resources—a Nvidia RTX 4090 with 24 GB of VRAM—I've opted to offload all layers (-1) to the GPU, as this is sufficient for loading the quantized model. As for the second parameter, `chat_format`, it specifies the chat template for our model. Since we're utilizing the Mistral model, we've selected the `chatml` template. For further details on chat templates, you can refer to the documentation.

Now, let's move on to using Python to send requests to our model. It's worth noting that llama-cpp-python follows an API structure similar to OpenAI's, allowing you to send requests to your local LLM in a manner akin to how you interact with OpenAI's GPT models like GPT-3.5 or GPT-4.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")
response = client.chat.completions.create(
model="mistral",
messages=[
{"role": "system", "content": "You are a helpful AI."},
{"role": "user", "content": "In what city were the 2000 olympics taken place?"}
],
)

print(response)

For the speech-to-text functionality, we'll leverage the renowned Whisper model, an open-source transformer-based speech-to-text model. Whisper takes an audio file as input and generates a transcript of the spoken words. Utilizing the Hugging Face Transformers implementation simplifies the process significantly. With a Hugging Face pipeline, performing inference with Whisper is straightforward:

pipe = pipeline(

"automatic-speech-recognition",

model="openai/whisper-large-v2",

torch_dtype=torch.float16,

device="cuda:0"

)

transcription = pipe(audio_file_path)['text']

The variable `transcription` holds the transcript of the input audio file.

Text to Speech (Bark)

For text-to-speech conversion, I utilize the Bark model, a transformer-based text-to-speech model capable of generating realistic, multilingual speech and various other audio elements such as music, background noise, and simple sound effects. Additionally, the model can produce nonverbal communications like laughter, sighs, and cries.

Once again, we'll employ the Hugging Face implementation for the Bark model. Using it with Hugging Face Transformers is straightforward:

from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark")

model = AutoModel.from_pretrained("suno/bark")

inputs = processor(

text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],

return_tensors="pt",

)

speech_values = model.generate(**inputs, do_sample=True)

from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate

Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

This code snippet initializes the Bark model and processor, prepares the input text, generates speech values, and finally converts them into audio format.

Web UI (Gradio)

Creating a web user interface (UI) is made simple with Gradio, a library that allows easy UI building for data science projects in just minutes.

Here's the complete code for the Gradio app:

import gradio as gr

from transformers import pipeline

from transformers import AutoProcessor, BarkModel

import torch

from openai import OpenAI

import numpy as np

from IPython.display import Audio, display

import numpy as np

import re

from nltk.tokenize import sent_tokenize

WORDS_PER_CHUNK = 25

def split_sentence_into_chunks(sentence, n):

words = sentence.split()

if len(words) <= n:

return [sentence]

else:

chunks = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]

return chunks

# Setup Whisper client

pipe = pipeline(

"automatic-speech-recognition",

model="openai/whisper-large-v2",

torch_dtype=torch.float16,

device="cuda:0"

)

voice_processor = AutoProcessor.from_pretrained("suno/bark")

voice_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda:0")

voice_model = voice_model.to_bettertransformer()

voice_preset = "v2/en_speaker_9"

system_prompt = "You are a helpful AI"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx") # Placeholder, replace

sample_rate = 48000

def transcribe_and_query_llm_voice(audio_file_path):

transcription = pipe(audio_file_path)['text']

response = client.chat.completions.create(

model="mistral",

messages=[

{"role": "system", "content": system_prompt}, # Update this as per your needs

{"role": "user", "content": transcription + "\n Answer briefly."}

)

llm_response = response.choices[0].message.content

sampling_rate = voice_model.generation_config.sample_rate

silence = np.zeros(int(0.25 * sampling_rate))

BATCH_SIZE = 12

model_input = sent_tokenize(llm_response)

pieces = []

for i in range(0, len(model_input), BATCH_SIZE):

inputs = model_input[BATCH_SIZE*i:min(BATCH_SIZE*(i+1), len(model_input))]

if len(inputs) != 0:

inputs = voice_processor(inputs, voice_preset=voice_preset)

speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)

speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]

pieces += [*speech_output, silence.copy()]

whole_output = np.concatenate(pieces)

audio_output = (sampling_rate, whole_output)

return llm_response, audio_output

def transcribe_and_query_llm_text(text_input):

transcription = text_input

response = client.chat.completions.create(

model="mistral",

messages=[

{"role": "system", "content": system_prompt}, # Update this as per your needs

{"role": "user", "content": transcription + "\n Answer briefly."}

)

llm_response = response.choices[0].message.content

sampling_rate = voice_model.generation_config.sample_rate

silence = np.zeros(int(0.25 * sampling_rate))

BATCH_SIZE = 12

model_input = sent_tokenize(llm_response)

pieces = []

for i in range(0, len(model_input), BATCH_SIZE):

inputs = model_input[BATCH_SIZE*i:min(BATCH_SIZE*(i+1), len(model_input))]

if len(inputs) != 0:

inputs = voice_processor(inputs, voice_preset=voice_preset)

speech_output, output_lengths = voice_model.generate(**inputs.to("cuda:0"), return_output_lengths=True, min_eos_p=0.2)

speech_output = [output[:length].cpu().numpy() for (output,length) in zip(speech_output, output_lengths)]

pieces += [*speech_output, silence.copy()]

whole_output = np.concatenate(pieces)

audio_output = (sampling_rate, whole_output)

return llm_response, audio_output

with gr.Blocks() as demo:

with gr.Row():

with gr.Column():

text_input = gr.Textbox(label="Type your request", placeholder="Type here or use the microphone...")

audio_input = gr.Audio(sources=["microphone"], type="filepath", label="Or record your speech")

with gr.Column():

output_text = gr.Textbox(label="LLM Response")

output_audio = gr.Audio(label="LLM Response as Speech", type="numpy")

submit_btn_text = gr.Button("Submit Text")

submit_btn_voice = gr.Button("Submit Voice")

submit_btn_voice.click(fn=transcribe_and_query_llm_voice, inputs=[audio_input], outputs=[output_text, output_audio])

submit_btn_text.click(fn=transcribe_and_query_llm_text, inputs=[text_input], outputs=[output_text, output_audio])

demo.launch(ssl_verify=False,

share=False,

debug=False)

This code sets up a Gradio UI where users can input text or record their voice to interact with the LLM, and receive both text and audio responses.

A few notes on the python code above:

Here are some important points about the Python code provided above:

1. Bark Voices: The Bark model offers various voices to choose from. We're utilizing the voice "v2/en_speaker_9". The complete list of options can be found [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings/v2).

2. Function Assignment: We're assigning the `transcribe_and_query_llm_voice` function to `submit_btn_voice` to execute the model on the user's voice input. Similarly, `transcribe_and_query_llm_text` function is assigned to `submit_btn_text` to handle text input from the user.

3. Chunk Processing: In line 57, the code creates chunks and runs the Bark model on each chunk. This aggregation is done to handle long text inputs more effectively.

4. Microphone Requirement: To submit voice input to the model, a microphone must be available on your PC. However, if you don't have a microphone, you can still input text into the textbox and receive both text and audio output from the model.

To run the UI, ensure that your LLM engine is running. Then, in a new terminal, execute:

python gradio_tts.py

You'll receive a link in the terminal, typically http://127.0.0.1:7860. Your app will be ready to use through this link!

Here's a guide on accessing the app from your phone on your home WiFi:

To access the app on your phone while it's hosted by your PC connected to your home WiFi, follow these steps: