Crafting Your Custom AI Companion: A Comprehensive Manual for Constructing a Text and Voice-Enabled Local LLM
By Riot
Crafting your own personal LLM AI assistant equipped with voice capabilities for engaging conversations.
In this guide, we'll build a personal local LLM assistant you can actually talk to, covering both voice input and voice output.
Let's start with llama-cpp-python, a Python binding for llama.cpp, the well-known C/C++ implementation of inference for many Large Language Models. Given how widely it has been adopted by the open-source community, it was the natural choice for this tutorial.
Important note: I've tested this application on a system with an Nvidia RTX 4090 GPU.
First and foremost, let's initiate a fresh conda environment:
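(The environment name `assistant` is the one we'll activate later in the tutorial; Python 3.10 is just a reasonable default, not a hard requirement.)

conda create -n assistant python=3.10
conda activate assistant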
Next, we'll install llama-cpp-python. As the llama-cpp-python documentation explains, llama.cpp supports several hardware-acceleration backends that speed up inference. To run the Large Language Model (LLM) on the GPU, we'll build the package with cuBLAS support. I initially struggled to offload the model onto the GPU, and a helpful post walked me through the proper installation process:
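(The build command below is a sketch of what worked for me; the CMake flag names change between llama-cpp-python releases, so check the documentation for your version. We also install the server extra, which we'll rely on later.)

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'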
Additionally, we'll need to install several other packages for this application:
pip install gradio
pip install openai
pip install "huggingface_hub[cli]"
pip install torch
pip install transformers
pip install nltk
pip install optimum
The next step is downloading the model weights we'll serve. For this demonstration, I've opted for Mistral-7B-Instruct-v0.2 (in fact, Mistral 7B is my favorite among the 7B models). llama.cpp works with models in the GGUF format. If you're unfamiliar with it, TheBloke is the go-to repository for quantized models and GGUF conversions.
To download the model weights, use the following command:
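(The repository and file names follow TheBloke's naming convention; pick whichever quantization level suits your hardware, Q4_K_M is shown here.)

huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models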
With the model weights now downloaded, we're prepared to put our LLM to the test. To achieve this, we'll initiate an LLM server (assuming we've already installed llama-cpp-python[server]). Simply open a terminal, activate your assistant conda environment, and launch the server:
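(A typical launch command looks like the following; adjust the model path to wherever you saved the GGUF file.)

python -m llama_cpp.server --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --n_gpu_layers -1 --chat_format chatml --port 8000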
In this setup, the parameter `n_gpu_layers` determines how many layers of the model are offloaded to the GPU. Since my Nvidia RTX 4090 has 24 GB of VRAM, more than enough to hold the quantized model, I've opted to offload all layers (-1) to the GPU. The second parameter, `chat_format`, specifies the chat template for our model; since we're using the Mistral model, we've selected the `chatml` template. For further details on chat templates, you can refer to the documentation.
Now, let's move on to using Python to send requests to our model. It's worth noting that llama-cpp-python follows an API structure similar to OpenAI's, allowing you to send requests to your local LLM in a manner akin to how you interact with OpenAI's GPT models like GPT-3.5 or GPT-4.
from openai import OpenAI

# Point the OpenAI client at the local llama-cpp-python server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-xxx")

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {"role": "system", "content": "You are a helpful AI."},
        {"role": "user", "content": "In what city were the 2000 Olympics held?"},
    ],
)
print(response)
For the speech-to-text functionality, we'll leverage the renowned Whisper model, an open-source transformer-based speech-to-text model. Whisper takes an audio file as input and generates a transcript of the spoken words. Utilizing the Hugging Face Transformers implementation simplifies the process significantly. With a Hugging Face pipeline, performing inference with Whisper is straightforward:
from transformers import pipeline
import torch

# Load Whisper through the Hugging Face ASR pipeline and run it on the GPU.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
)

# audio_file_path points to the recording we want to transcribe.
transcription = pipe(audio_file_path)['text']
The variable `transcription` holds the transcript of the input audio file.
Text to Speech (Bark)
For text-to-speech conversion, I utilize the Bark model, a transformer-based text-to-speech model capable of generating realistic, multilingual speech and various other audio elements such as music, background noise, and simple sound effects. Additionally, the model can produce nonverbal communications like laughter, sighs, and cries.
Once again, we'll employ the Hugging Face implementation for the Bark model. Using it with Hugging Face Transformers is straightforward:
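(The snippet below is a minimal sketch following the Hugging Face documentation; the example sentence is a placeholder, and we use the `v2/en_speaker_9` preset discussed below.)

from transformers import AutoProcessor, BarkModel

# Load the Bark processor and model.
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

# Prepare the input text together with a speaker preset.
voice_preset = "v2/en_speaker_9"
inputs = processor("Hello, I am your local assistant.", voice_preset=voice_preset)

# Generate speech values and convert them into a NumPy audio array.
speech_values = model.generate(**inputs)
sampling_rate = model.generation_config.sample_rate
audio_array = speech_values.cpu().numpy().squeeze()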
This code snippet initializes the Bark model and processor, prepares the input text, generates speech values, and finally converts them into audio format.
Web UI (Gradio)
Creating a web user interface (UI) is made simple with Gradio, a library that allows easy UI building for data science projects in just minutes.
Here's the complete code for the Gradio app:
import gradio as gr
from transformers import pipeline
from transformers import AutoProcessor, BarkModel
import torch
from openai import OpenAI
import numpy as np
from IPython.display import Audio, display
import re
from nltk.tokenize import sent_tokenize

# Maximum number of words Bark is asked to synthesize at once.
WORDS_PER_CHUNK = 25

def split_sentence_into_chunks(sentence, n):
    # Split a sentence into chunks of at most n words so Bark can handle long inputs.
    words = sentence.split()
    if len(words) <= n:
        return [sentence]
    else:
        chunks = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
        return chunks
This code sets up a Gradio UI where users can input text or record their voice to interact with the LLM, and receive both text and audio responses.
A few notes on the Python code above:
1. Bark Voices: The Bark model offers various voices to choose from. We're utilizing the voice "v2/en_speaker_9". The complete list of options can be found [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings/v2).
2. Function Assignment: We assign the `transcribe_and_query_llm_voice` function to `submit_btn_voice` to run the model on the user's voice input, and the `transcribe_and_query_llm_text` function to `submit_btn_text` to handle text input from the user (a minimal wiring sketch follows after these notes).
3. Chunk Processing: In line 57, the code splits long text into chunks and runs the Bark model on each chunk. This chunking keeps long text inputs manageable for the TTS model.
4. Microphone Requirement: To submit voice input to the model, a microphone must be available on your PC. However, if you don't have a microphone, you can still input text into the textbox and receive both text and audio output from the model.
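To make note 2 concrete, here is a minimal sketch of the Gradio wiring (Gradio 4.x syntax). The component names and layout are illustrative rather than the exact contents of gradio_tts.py, and it assumes `transcribe_and_query_llm_voice` and `transcribe_and_query_llm_text` are defined above and each return the reply text plus the generated audio:

import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():
        mic_input = gr.Audio(sources=["microphone"], type="filepath", label="Record your question")
        text_input = gr.Textbox(label="Or type your question")
    text_output = gr.Textbox(label="Assistant reply")
    audio_output = gr.Audio(label="Assistant voice")
    submit_btn_voice = gr.Button("Submit voice")
    submit_btn_text = gr.Button("Submit text")

    # Voice path: transcribe with Whisper, query the LLM, speak the reply with Bark.
    submit_btn_voice.click(
        fn=transcribe_and_query_llm_voice,
        inputs=mic_input,
        outputs=[text_output, audio_output],
    )
    # Text path: send the typed text straight to the LLM.
    submit_btn_text.click(
        fn=transcribe_and_query_llm_text,
        inputs=text_input,
        outputs=[text_output, audio_output],
    )

demo.launch()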
To run the UI, ensure that your LLM engine is running. Then, in a new terminal, execute:
python gradio_tts.py
You'll receive a link in the terminal, typically http://127.0.0.1:7860. Your app will be ready to use through this link!
To access the app on your phone while it's hosted by your PC on your home WiFi, follow these steps:
1. Find your PC's Local IP Address:
If you're using Ubuntu, open a terminal and use the command:
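For example, this prints the machine's local IP address(es):

hostname -I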