
Run Quantized LLM Models Locally

In this blog we will explore different options for running quantized models locally. There are multiple open-source libraries and tools available that can be used to run LLM models on your own machine.


Introduction to Quantized LLM Models

Before we go through the list of tools, we first need to understand the underlying challenge: optimizing large language models (LLMs) for efficiency without compromising performance. This is where quantization techniques like AutoGPTQ, AWQ, and GGUF come into play, each offering unique advantages to tackle this challenge.


  1. AutoGPTQ stands out for its user-friendly approach to quantization, focusing on minimizing the mean squared error during the process. It significantly reduces memory requirements and improves inference speed by allowing models to operate in lower-bit precision without losing accuracy.

  2. AWQ, or Activation-aware Weight Quantization, enhances this further by considering the distribution of activations during quantization, ensuring a tailored precision that preserves model accuracy even in reduced memory settings. This method not only lowers memory demands but also maintains the robustness and efficiency of LLMs across various platforms.

  3. GGUF, developed specifically for efficient execution on CPUs and Apple devices, introduces an innovative approach by selectively offloading certain model layers to the GPU. This method is particularly beneficial for environments where GPU resources are limited, enabling the deployment of LLMs on a broader range of hardware with improved speed and efficiency.

Together, these quantization techniques represent significant advancements in making LLMs more accessible and efficient, catering to diverse computational environments and application needs.
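
To give a flavour of what this looks like in code, below is a minimal sketch of loading a pre-quantized GPTQ checkpoint through the Hugging Face transformers integration. The repository name is just one example of a published GPTQ build, and the snippet assumes transformers, accelerate, optimum and auto-gptq are installed along with a CUDA-capable GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pre-quantized GPTQ build; any similar GPTQ repository should work the same way
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers detects the GPTQ config in the repository and loads the 4-bit weights
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))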


Now let us have a look at the tools or libraries that support the above quantization techniques to run the models locally.


LM Studio


LM Studio UI

LM Studio is an innovative desktop application designed to facilitate the experimentation and evaluation of Large Language Models (LLMs) right on your local machine. This tool stands out for its user-friendly interface, allowing users to easily compare different models without the need for in-depth coding knowledge. It supports models from various sources, including Hugging Face, and is compatible with the GGUF file format, a successor to GGML introduced by the llama.cpp team for efficient model loading.

Installing LM Studio is straightforward. First, visit the official LM Studio website and select the version that matches your operating system. Once downloaded, the installation process is simple, thanks to a user-friendly setup that guides you through the steps. After installation, LM Studio presents a sleek interface where you can search for and download LLMs directly within the application. You have the flexibility to search, filter, and select models based on your requirements, including compatibility and popularity. LM Studio also allows for detailed configuration of model settings, including GPU acceleration, to optimize performance based on your hardware capabilities.


LM Studio not only simplifies the process of running and managing LLMs locally but also offers features like real-time chatting with models and monitoring of CPU and RAM usage. For advanced users, it provides the option to host LLMs on a private server, enhancing control and privacy. This flexibility makes LM Studio a powerful tool for developers, researchers, or anyone curious about the capabilities of LLMs and looking to explore AI-driven creativity without relying on third-party providers.
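
As a rough sketch of that self-hosting option, once LM Studio's local server is started it exposes an OpenAI-compatible endpoint (http://localhost:1234/v1 by default), so you can talk to it with the standard openai Python client. The model name and prompt below are placeholders.

from openai import OpenAI

# Point the client at LM Studio's local server; the API key can be any non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio answers with whichever model is currently loaded
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    temperature=0.7,
)
print(response.choices[0].message.content)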


Llama.cpp


Llama.cpp, along with its Python bindings provided by the llama-cpp-python library, represents an innovative approach to utilizing large language models (LLMs) efficiently on various computing platforms, including CPUs. This library enables Python developers to easily integrate and run LLMs, leveraging llama.cpp's capabilities for high-performance machine learning tasks.


One of the key benefits of using llama.cpp and its Python bindings is the ability to run quantized models, which significantly reduces the memory requirement for executing these large models without a notable loss in performance. This quantization process, especially with the GGUF format, allows for running models with billions of parameters in a much more memory-efficient manner, making it possible to load a 7 billion parameter model that would typically require 13GB into less than 4GB of RAM.
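
Those numbers are easy to sanity-check with a quick back-of-the-envelope calculation: at 16-bit precision each parameter takes 2 bytes, while a roughly 4-bit quantization needs about half a byte per parameter.

# Rough weight-memory estimate for a 7B-parameter model (ignores KV cache and runtime overhead)
params = 7_000_000_000

fp16_gb = params * 2 / 1024**3      # 2 bytes per parameter at 16-bit precision
q4_gb = params * 0.5 / 1024**3      # ~4 bits (0.5 bytes) per parameter after quantization

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{q4_gb:.1f} GB")  # roughly 13 GB vs 3.3 GB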


Additionally, llama-cpp-python offers an OpenAI-compatible web server, allowing users to host LLMs on their own servers. This feature is particularly useful for those seeking enhanced control and privacy over their AI models, as it enables access to LLM capabilities without relying on external providers. This self-hosting capability ensures that developers can maintain a high level of data security and compliance while leveraging the power of LLMs for their applications.
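
As a minimal sketch of that self-hosting path, the package's optional server extra ships an OpenAI-compatible endpoint (listening on port 8000 by default); the model path below is just a placeholder for whichever GGUF/GGML file you have downloaded.

!pip install "llama-cpp-python[server]"

# Start the OpenAI-compatible server; any OpenAI client can then point at http://localhost:8000/v1
!python -m llama_cpp.server --model <path-to-your-model-file>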


Now let us have a look at how to set up llama-cpp-python. First, make sure the following libraries are installed: llama-cpp-python and huggingface_hub. Use the commands below to install them. Note that we will go for a CPU-based run; for GPU acceleration, refer to the llama-cpp-python installation documentation for how to build it with GPU support.

!pip install llama-cpp-python
!pip install huggingface_hub

Now let us define the model we are going to use

model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

Now let us download the model

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename,
)

Now let us import the Llama class from the library and load the model

from llama_cpp import Llama

llm = Llama(
    model_path=model_path, # Path returned by hf_hub_download above
    n_threads=2, # Number of CPU threads to use
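    # n_gpu_layers=35, # optional: offload some layers to a GPU if one is available (value is illustrative)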
    )

Now let us define the prompts

prompt = "Write a function in that has variable number of inputs, and the function has to add all the inputs and return the sum."
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest AI assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''

Now let us run the model using the above prompt and prompt template and see the result

response = llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Stop generating when this token appears
    echo=True # Return the prompt along with the generated text
)

print(response["choices"][0]["text"])

Output
 In Python, you can achieve this by using the `*args` syntax, which allows a function to accept a variable number of arguments as a tuple. Here's an example of how to write such a function:

```python
def add_all(*inputs):
    """
    This function takes any number of inputs and returns their sum.
    :param inputs: Any number of input arguments.
    :return: The sum of all the input arguments.
    """
    total = 0
    for num in inputs:
        total += num
    return total
```

You can now call this function with any number of arguments, and it will add them up and return their sum:

```python
print(add_all(1, 2, 3, 4)) # Output: 10
print(add_all(5, 6, 7))     # Output: 18
print(add_all(10))         # Output: 10 (since an empty argument list evaluates to None by default and 0 + None = 10)
```

vLLM


vLLM is emerging as a groundbreaking solution in the realm of Large Language Model (LLM) serving, designed to address the critical need for high throughput and efficient inference. Developed with an emphasis on speed and flexibility, vLLM aims to revolutionize LLM serving by offering state-of-the-art serving throughput, efficient management of attention key and value memory through PagedAttention, continuous batching of incoming requests, and optimized CUDA kernels. This enables vLLM to deliver superior performance with less resource consumption compared to traditional LLM serving methods.


One of the most compelling benefits of using vLLM is its ability to achieve up to 24 times higher throughput than other popular LLM libraries like HuggingFace Transformers. This efficiency means that users can serve a significantly larger number of requests with fewer resources, making vLLM an attractive option for developers and organizations looking to scale their LLM applications efficiently. Additionally, vLLM supports a wide array of quantization techniques such as GPTQ, AWQ, and SqueezeLLM, which further enhance its serving speed and reduce memory requirements, offering a pathway towards even more optimized future releases.



Furthermore, vLLM is not just about performance; it's also designed with user convenience in mind. It offers seamless integration with popular Hugging Face models, supports various decoding algorithms for high-throughput serving, and even includes an OpenAI-compatible API server. This makes vLLM a versatile tool for a wide range of LLM applications, from simple chatbots to complex AI-driven analysis tools.
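
As an illustration, here is a minimal sketch of launching that OpenAI-compatible server from the command line with an AWQ-quantized model; the flags mirror vLLM's documented options, and the model is the same quantized Mistral build used in the walkthrough below.

# Launch vLLM's OpenAI-compatible server (listens on port 8000 by default)
!python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-Instruct-v0.1-AWQ --quantization awq --dtype half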


Now let us have a look at how to set up and run quantized models with vLLM's Python API. First, install vLLM

!pip install vllm

Import the libraries and the required functions

from vllm import LLM, SamplingParams

Write the set of prompts you want to provide to the model. Note that you can pass multiple prompts in this list and vLLM will batch them for faster processing, one of its unique features.

prompts = [
    "<s>[INST] What is Generative AI Exaplin in detail [/INST]"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

Load the model

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ", quantization="AWQ", dtype='half')

Generate the output

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text}")

Output

Prompt: '<s>[INST] What is Generative AI Exaplin in detail [/INST]', Generated text:  Generative AI is a type of artificial intelligence that is capable of creating new content or data based on existing data. It uses a variety of techniques to analyze and understand patterns in the data, and then generates new data that follows those patterns. Exaplin is a company that specializes in creating generative AI models for businesses. Their models can be used to automate a wide range of tasks, including content creation, customer service, and data analysis. One of the key advantages of using generative AI is that it can help businesses save time and money by automating repetitive tasks. For example, a generative AI model could be used to automatically generate customer responses to common inquiries, freeing up customer service representatives to focus on more complex issues. Generative AI can also be used to create personalized content for customers. For example, a generative AI model could be used to recommend products or services to a customer based on their browsing history or past purchases. Another way that generative AI can be used is to analyze data and provide insights.

So in this blog we have covered different quantization methods and formats, along with ways to run the models locally. I hope you will try out some of these models and explore their capabilities in your applications. There are many more tools for running quantized models beyond the ones listed here, which we will be covering in future blogs.

If you want to build apps using open-source quantized models that are not resource intensive and can run on a CPU, feel free to contact us.

