Introduction: Tiny LLM Reviews

    There's a lot of talk about the climate and social impact of the enormous power and water consumption of commercial generative AI. I've also been pretty curious about RISC-V and low-power alternatives to GPUs. So I thought, why not combine those two interests? Let's see if I can get a tiny LLM running on my Lichee Pi4a, and see what it can do on such a small system.

    As I explored the LLM models available, I noticed that there's a lot of data out there on how LLMs perform on various benchmarks and leaderboards. But what do those numbers actually mean? What are each of these tiny models good at, if anything?

    Ultimately I wanted to see what a tiny LLM can actually do. This is a series of blog posts where I dig into models with 3 billion parameters or fewer to see what their output looks like and what they can do on various practical tasks.

    Intro to MLOps

    I wasn't super familiar with the GenAI model space until I started working on WatchTheRadio. The AI and MLOps world is fairly similar to what I'm used to with web services, in that you're usually running some long-running application that services requests. However, there are some pretty fundamental differences in behavior. I can't cover all the cool stuff I've learned in this blog post, but maybe I will in the future.

    How an LLM works

    LLMs are effectively very complex word-association generators. They encode a huge quantity of word associations and spit out statistically probable responses given an input. They have no true understanding of the concepts you provide to them; instead they do something like "given all the context I've seen for these words in this order, here is a result you probably want". For example, given the phrase "the cat sat on the", a model will most likely continue with "mat", simply because that continuation dominates its training data. That means LLMs are only as good as their training data and the quantity of information they can process at one time (the token context size). In this blog series I'll be covering the newest generation of LLMs that have 3 billion parameters or fewer but are generally regarded as giving pretty good results for their hardware usage.

    System Prompts

    LLMs are provided with two types of prompts. The first prompt is the system prompt. It sets the "personality", constraints, and context for all the following interactions. Here is the system prompt for Claude, for example. This prompt applies to each and every chat message, so it's important to put the global constraints there.
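
    To make that concrete, here's the kind of system prompt you might give one of these tiny models. It's just an illustrative example I made up, not any model's official prompt:

    You are a helpful assistant running on a small, low-power computer.
    Keep your answers short and factual. If you don't know something,
    say so instead of guessing.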

    User Prompts

    The other type of prompt is usually just called a "prompt" or a "user prompt". These are the requests and chat messages sent by the user to the model. LLMs will take this in as data, merge it with the system prompt, and return a statistically probable response.

    Prompt String Formats

    Now, in order for a model to understand what to do, it needs the prompts sent in a specific format. This was by far the most confusing and difficult part of this project. Unfortunately, not all models have great documentation on the exact format they expect their prompts in. Here are some examples that I've run across (these are also in my github config):

    Tiny Llama

    <|system|>
    {{preprompt}}</s>
    {{#each messages}}{{#ifUser}}<|user|>
    {{content}}</s>
    <|assistant|>
    {{/ifUser}}{{#ifAssistant}}{{content}}</s>
    {{/ifAssistant}}{{/each}}
    

    Mixtral

    <s>{{#each messages}}{{#ifUser}}[INST] {{#if @first}}{{#if @root.preprompt}}{{@root.preprompt}}
    {{/if}}{{/if}} {{content}} [/INST]{{/ifUser}}{{#ifAssistant}}{{content}}</s> {{/ifAssistant}}{{/each}}
    

    These are Handlebars-style templates that get interpolated with the user inputs. If something isn't in the correct format, the model will usually just respond with strange and confusing garbage, sometimes repeating the same words forever.
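
    To see what that actually produces, here's the TinyLlama template from above filled in with a made-up system prompt and a single user message. The model's job is to continue the text after the final <|assistant|> tag:

    <|system|>
    You are a helpful assistant.</s>
    <|user|>
    What can you tell me about RISC-V?</s>
    <|assistant|>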

    My Setup

    In case you want to reproduce this at home, here's a link to my setup.

    Hardware

    I'll be running these models on two different systems:

    1. My higher-end desktop with an NVIDIA RTX 4090, an AMD Ryzen 9 7900X 12-core processor, and 64 GB of RAM
    2. My ideal target for this: a Lichee Pi4a with a 4-core 64-bit RISC-V (rv64gcv0p7) CPU @ 1.8 GHz and 16 GB of RAM

    Software

    Llama.cpp

    We'll be running the models with llama.cpp. I've compiled it for both hardware targets using the best optimizations I could find. Here are my build commands for each:

    # Debian on Lichee pi4a
    export CFLAGS="-march=riscv64gcv0p7 -pipe -fomit-frame-pointer -O3"
    export CXXFLAGS="${CFLAGS}"
    
    # I use gcc 10 because it contains suport for the v0.7 vector extensions
    mkdir build && \
        cd build && \
        env CXX=$(which g++-10) CC=$(which gcc-10) cmake .. --fresh -G Ninja -DLLAMA_BLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS && \
        cmake . --config Release
    

    And then on my desktop:

    # Gentoo Linux on amd64 desktop with CUDA
    export CFLAGS="-march=native -pipe -fomit-frame-pointer -O2"
    export CXXFLAGS="${CFLAGS}"
    
    mkdir build && \
        cd build && \
        env CXX=$(which g++-12) CC=$(which gcc-12) cmake .. --fresh -G Ninja -DLLAMA_CUDA=on && \
        cmake --build . --config Release
    

    Once those are built, we run the server binary that was produced. This gives us an OpenAI API compatible server that we can use for chat testing.

    ./bin/server -m ../models/${MODEL} --host '0.0.0.0' -c 0 -np $(nproc)
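
    As a quick smoke test you can hit the server's OpenAI-compatible chat endpoint directly. This assumes the default port of 8080 and a llama.cpp build recent enough to expose /v1/chat/completions:

    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in one short sentence."}
          ]
        }'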
    

    HuggingFace Chat UI

    For a nice and simple way to interact with these models I'm using HuggingFace's Chat UI. I'm running it through Docker to keep things simple. Check the github repo if you'd like to see the actual config. There are some quirks to the tool that I didn't realize until I started using it. I'm also avoiding MongoDB to get away from the SSPL license requirements. Instead I'm using the very cool FerretDB to provide a Mongo-compatible interface in Docker.
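
    The gist of the FerretDB swap looks something like this. It's a rough sketch rather than my exact config, and the all-in-one image name and Chat UI variable are from memory, so check the github repo and the FerretDB docs before copying it:

    # FerretDB's all-in-one evaluation image bundles FerretDB with a
    # PostgreSQL backend and listens on the usual MongoDB port
    docker run -d --name ferretdb -p 27017:27017 ghcr.io/ferretdb/all-in-one
    
    # then point Chat UI at it in .env.local as if it were MongoDB:
    # MONGODB_URL=mongodb://localhost:27017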

    Tenere

    This ended up being the primary chat tool I used through most of these tests. It's a super simple chat TUI built in Rust. You can check it out on GitHub.

    mods

    For handling pipelines on the CLI I used mods. I'm a sucker for good, clean UI and UX, and the charmbracelet folks really do a good job there. It doesn't currently support a full chat-style interface like Tenere and HuggingFace Chat UI, so I pretty much only used it for the document summarization tests.
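
    A typical invocation for those summarization runs looks something like this, assuming mods is already configured to use the llama.cpp server as an OpenAI-compatible endpoint (the file name is just a placeholder):

    # pipe a document into mods and ask for a summary
    cat some-article.txt | mods "summarize this document in three bullet points"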