Deploying and Testing Llama 4 Maverick

April 15, 2025

Taking Llama 4 Out for a Test Drive: Deployment & Inference Guide

If you've been following the latest news in large language models, Meta just dropped the latest addition to the Llama series with Llama 4. In my testing, the coding and reasoning abilities haven't been extraordinary by today's standards, but the multimodal capabilities are top of the line. As of this writing, the only model that seems better at image understanding is Gemini 2.5 Pro (and I assume that will be outdated within a week at the current rate of progress).

Here's a tutorial on how to get this large model running on RunPod. We will be working with Llama 4 Maverick, although Scout follows the same process with different weights.

The code I will be using is here: https://github.com/AmarUCLA/Llama4Inference

Hardware Reality Check

Llama 4 Maverick, even at FP8, still needs multiple H100 GPUs to serve inference on. You can likely do it on fewer than 8 GPUs, but we will be showcasing a full 8x H100 node for simplicity's sake.

  • Ideal setup: 8x NVIDIA H100 GPUs
  • Workable alternative: A cluster of NVIDIA A100 GPUs
  • Budget option: With INT4 quantization, you might get away with fewer GPUs

It is possible to run this locally with either a high-memory Mac or heavy RAM/disk offloading, but if you want it running fast, dedicated GPUs are the way to go.

To start off, I would initialize an 8x H100 pod on RunPod and SSH in. From there, we can get the environment ready.

Getting Your Environment Ready

Step 1: Miniconda Setup

First things first, let's get our environment prepared. I always start with a clean Miniconda installation:

# Update your system first
apt update && apt upgrade -y

# Grab the Miniconda installer
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Run it and follow the prompts
bash Miniconda3-latest-Linux-x86_64.sh

Pro tip: When the installer asks if you want to initialize Miniconda, say yes.
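
If your shell doesn't pick up conda right away after the installer finishes, you can load it manually. This assumes the default install path of ~/miniconda3; adjust if you installed elsewhere:

# Make conda available in the current shell (default install location assumed)
source ~/miniconda3/bin/activate

# Or simply reload your shell config after the installer has run conda init
source ~/.bashrc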

Step 2: Project Setup

Time to create our workspace and pull down the code:

# Create a project home
mkdir llama4
cd llama4

# Clone the repository with the inference code
git clone https://github.com/AmarUCLA/Llama4Inference.git
cd Llama4Inference/

Step 3: Python Environment

Now for the Python environment – I'm using Python 3.12 for this setup:

# Create our environment
conda create -n llamainference python=3.12

# Activate it
conda activate llamainference

# Install dependencies
pip install -r requirements.txt
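
Before going further, it's worth confirming that all 8 GPUs are visible. nvidia-smi is enough for a quick check; the one-liner below assumes requirements.txt pulls in PyTorch, which is a safe bet for an inference repo but still an assumption on my part:

# All 8 H100s should show up here
nvidia-smi

# Assuming requirements.txt installs torch, this should print 8
python -c "import torch; print(torch.cuda.device_count())"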

Step 4: Hugging Face Authentication

You'll need access to the model weights, which means authenticating with Hugging Face:

# Log in to access the model
huggingface-cli login

When prompted, enter your Hugging Face token. Don't have one? Head over to your Hugging Face settings to create one.
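
If you're scripting the setup (or just don't want the interactive prompt), you can pass the token directly and then confirm the login worked:

# Non-interactive login (replace $HF_TOKEN with your own token)
huggingface-cli login --token $HF_TOKEN

# Verify that authentication worked
huggingface-cli whoami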

Running the Model

There are two main ways I've been working with Llama 4: through a web interface for interactive testing, and batch mode for processing lots of inputs. Let's look at both.

Option 1: The Web Interface

This is my preferred method when I'm experimenting and want to see results instantly:

# Start the model server in a tmux session
tmux new-session -s llama4-server
conda activate llamainference
bash serve.sh
# Detach with Ctrl+b then d

# Then start the UI in another session
tmux new-session -s llama4-ui
conda activate llamainference
streamlit run streamlit_chat.py --server.port 8000 --server.address 0.0.0.0
# Detach with Ctrl+b then d

Once that's running, you can access the interface at http://[YOUR_SERVER_IP]:8000
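
If your provider doesn't expose port 8000 publicly, SSH port forwarding from your local machine works just as well (adjust the user, host, and SSH port to whatever your pod gives you):

# Forward the remote Streamlit port to localhost:8000
ssh -L 8000:localhost:8000 root@[YOUR_SERVER_IP]

# Then browse to http://localhost:8000 on your own machine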

Option 2: Batch Processing

When I need to process a bunch of prompts at once (like when I'm benchmarking or running evaluations), batch mode is the way to go:

conda activate llamainference
python batch_inference.py

I've found this particularly useful for systematic testing across different prompt templates or when comparing Llama 4's outputs with those of other models.
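
I haven't walked through batch_inference.py line by line here, but if serve.sh exposes an OpenAI-compatible endpoint (which is what vLLM-style servers do; the port below is a placeholder), you can also drive batches straight from the shell. This is just a sketch, not the repo's actual batch path:

# Hypothetical example: send each line of prompts.txt to an OpenAI-compatible endpoint
# Requires curl and jq; the port 8001 is a placeholder, not necessarily what serve.sh uses
while IFS= read -r prompt; do
  curl -s http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$prompt" '{model: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", messages: [{role: "user", content: $p}]}')" \
    | jq -r '.choices[0].message.content'
done < prompts.txt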

What's Under the Hood?

The default configuration details are:

  • Model: Llama 4 Maverick (17B active parameters, 400B total parameters, 128 experts)
  • Context length: 430K tokens (can do more with more GPUs)
  • Precision: FP8 (minimal quality drop; you can go further to INT4 with some noticeable drop)
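
I haven't reproduced serve.sh here, but if it wraps vLLM (a common choice for serving these weights), the defaults above map roughly onto a launch command like the one below. Treat the exact flags and model ID as my assumption rather than what the repo actually runs:

# Rough vLLM-style launch matching the defaults above (a sketch, not the repo's serve.sh)
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000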

Managing Your Sessions

I highly recommend using tmux if you haven't adopted it already; it keeps the server and UI running even if your SSH connection drops.

# See what's running
tmux ls

# Jump back into a session
tmux attach-session -t llama4-server
# or
tmux attach-session -t llama4-ui
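
And when you're done with the pod, it's cleaner to shut the sessions down explicitly before terminating the instance:

# Stop the server and UI when you're finished
tmux kill-session -t llama4-server
tmux kill-session -t llama4-ui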

Final Thoughts

There is no doubt that this is computationally expensive to host, but if you can find a good GPU provider, batch inference with Maverick is much cheaper for the quality it provides compared to its Claude 3.7 Sonnet, GPT-4o, and Gemini competitors.

I also understand there has been controversy around the benchmarks, but at the end of the day the best way to judge a model is to test it yourself. I highly recommend tossing in a few images to see what it can do.


Disclaimer: This guide assumes you have the necessary permissions to access and use the Llama 4 models. If not, you can request access on their Hugging Face pages.