
Easy Llama: Drama-Free AI Herding

Updated: Apr 30




Previously, we leveraged Flowise with OpenAI's API. In the near future, we also plan to evaluate Dify versus Flowise, as both offer some parallel capabilities and are built on LangChain foundations.





Today, we shift our focus to a different aspect: transitioning from OpenAI's managed service to self-hosting Meta's LLaMA models.

Note: we will not cover model comparisons or CPU consumption in this article, as these topics are thoroughly explored elsewhere. It is widely recognized that, even at minimal scale, LLaMA provides significant economic advantages compared to OpenAI. The focus here is on the LLaMA development environment.

Objective:

Stacktic is a logic framework that unifies open-source technologies into operations-ready, full-stack solutions. You can think of Stacktic as a draw.io that automatically generates production-ready, version-controlled full-stack deployments.

We objectively select only the best technologies to simplify complex automation scenarios. This is our perspective on managing complexity, but we'd love to hear your insights to help us improve and extend our vision.

Contact us at: info@stacktic.io



Our RAG Design



Main Challenges with LLaMA


1. Maintaining the LLaMA image (Large Model Size)


  • Option 1: Build a large image (around 20GB), which may take several hours.

  • Option 2: Build a lightweight image that downloads the model at runtime, potentially taking up to 30 minutes to become operational.


Neither option is ideal for rapid deployment or fast adoption.


2. Monitoring


Monitoring is critical for optimizing resource utilization and obtaining performance insights. Efficient observability translates directly into cost and time savings, and this time for real: as you know, GPUs accelerate AI, and also your debt.


How do we solve image maintenance with a large model?


Given the drawbacks of both options, this is how we maintain our own Llama:


  • An automated job authenticates with HuggingFace using a token and downloads the required models to a local bucket.




  • Our container image includes MinIO Client (mc) for interacting with S3 storage, keeping the image lightweight.




  • Rather than downloading from the internet, the image retrieves the models from the local node (bucket), allowing startup within seconds (a sketch of this fetch follows after this list).


    The MinIO job saves roughly 31 GB of download delay; the container is up in 10-15 seconds:

    Installing monitoring packages...
    Starting Ollama with models from Minio bucket...
    Configuring Minio client...
    Added `minio` successfully.
    NAME                      ID            SIZE    MODIFIED
    llama3:latest             365c0bd3c000  4.7 GB
    nomic-embed-text:latest   0a109f422b47



  • Model changes automatically trigger updates to the bucket and the container image.




  • We shifted image modification tasks to command-line arguments and ConfigMaps generated via Kustomize's config generator. This means we can now easily modify startup packages, commands, or configuration files (like metrics configurations) without rebuilding the Docker image.
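
For illustration, here is a minimal Python sketch of that startup-time fetch. Our image actually uses the mc CLI; the sketch uses the `minio` SDK instead, and the endpoint, bucket, and environment variable names are placeholders rather than our real values.

```python
# Minimal sketch of the startup-time model fetch (assumed names, not our exact setup).
import os
from minio import Minio

MINIO_ENDPOINT = os.environ.get("MINIO_ENDPOINT", "minio.minio.svc:9000")  # assumed
BUCKET = os.environ.get("MODEL_BUCKET", "llm-models")                      # assumed
TARGET_DIR = os.environ.get("OLLAMA_MODELS", "/root/.ollama/models")

client = Minio(
    MINIO_ENDPOINT,
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,  # in-cluster traffic; enable TLS for anything off-node
)

# Mirror every object in the bucket into the local Ollama model directory,
# so the container boots from the node-local cache instead of the internet.
for obj in client.list_objects(BUCKET, recursive=True):
    local_path = os.path.join(TARGET_DIR, obj.object_name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    client.fget_object(BUCKET, obj.object_name, local_path)
```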


Monitoring approach

These are the built-in metrics, which can easily be configured; this is what we use today in our Llama deployment:

| Built-in Component | Purpose | Example Snippet |
| --- | --- | --- |
| Evaluation Metrics (`llama_index.core.evaluation`) | Offline quality evaluation for retrieval/RAG. No server required. | `MRR().compute(expected_ids, retrieved_ids).score` |
| Instrumentation API (`set_global_handler()`) | Converts LlamaIndex calls into OpenTelemetry spans. | `from llama_index.core import set_global_handler; set_global_handler("openlit")` |
| "Simple" Handler | Prints LLM prompts and responses. Good for prototyping. | `set_global_handler("simple")` |
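
To make the first and third rows concrete, here is a minimal sketch, assuming a recent llama-index release; the node IDs are invented and the metrics import path may shift slightly between versions.

```python
# Hedged sketch of the built-in pieces from the table above.
from llama_index.core import set_global_handler
from llama_index.core.evaluation.retrieval.metrics import MRR, HitRate

# "Simple" handler: prints every LLM prompt/response, handy while prototyping.
set_global_handler("simple")

# Offline retrieval evaluation: compare the node IDs a query *should* return
# with what the retriever actually returned (IDs below are made up).
expected_ids = ["node-1", "node-7"]
retrieved_ids = ["node-7", "node-3", "node-1"]

mrr = MRR().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids).score
hit_rate = HitRate().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids).score
print(f"MRR={mrr:.2f}  hit rate={hit_rate:.2f}")
```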

Some of the metrics we track with this built-in monitoring:


  • Model Usage Metrics:

    ollama_model_usage_total{model="llama3:latest"} 9.0
    ollama_model_usage_total{model="nomic-embed-text:latest"} 9.0

  • Latency Measurements:

    ollama_query_latency_seconds_count 9.0
    ollama_query_latency_seconds_sum 0.01676193800085457

  • Loaded Models Count:

    ollama_loaded_models 2.0

  • MRR Score Metrics:

    ollama_mrr_score{model="llama3:latest"} 0.75
    ollama_mrr_score{model="nomic-embed-text:latest"} 0.75

  • Hit Rate Metrics:

    ollama_hit_rate{model="llama3:latest"} 0.85
    ollama_hit_rate{model="nomic-embed-text:latest"} 0.85
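
These ollama_* series come from our own exporter wrapped around the Ollama service. Purely as an illustration, the sketch below shows how metrics with these names could be registered with prometheus_client; the wrapper function and port are assumptions, not our exact setup.

```python
# Illustrative exporter sketch: metric names mirror the output above,
# the wiring around the Ollama service is simplified.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

MODEL_USAGE = Counter("ollama_model_usage", "Queries served per model", ["model"])
QUERY_LATENCY = Histogram("ollama_query_latency_seconds", "End-to-end query latency")
LOADED_MODELS = Gauge("ollama_loaded_models", "Models currently loaded")
MRR_SCORE = Gauge("ollama_mrr_score", "Latest offline MRR score", ["model"])
HIT_RATE = Gauge("ollama_hit_rate", "Latest offline hit rate", ["model"])

def record_query(model: str, run_query):
    """Wrap a query call so usage and latency are recorded automatically."""
    MODEL_USAGE.labels(model=model).inc()
    start = time.perf_counter()
    try:
        return run_query()
    finally:
        QUERY_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrape endpoint; port is arbitrary here
    LOADED_MODELS.set(2)
    MRR_SCORE.labels(model="llama3:latest").set(0.75)
    HIT_RATE.labels(model="llama3:latest").set(0.85)
    record_query("llama3:latest", lambda: "pong")
```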



What telemetry data and observability solutions exist out there?

I’m currently exploring additional external open-source add-ons for comprehensive monitoring, focusing on community adoption, flexibility, and simplicity. This is what I have figured out so far:

| External Add-On | Purpose | Self-Hosting Footprint | Best For | One-line Hook Example |
| --- | --- | --- | --- | --- |
| OpenLIT | Full OpenTelemetry trace UI, GPU metrics | Docker Compose (ClickHouse + UI) | Comprehensive self-hosted monitoring | `set_global_handler("openlit")` |
| Arize Phoenix | Tracing, built-in evaluations, vector analytics | Local install (pip) or Docker | Real-time tracing and quality evaluation | `set_global_handler("arize_phoenix")` |
| MLflow Tracing | Stores traces in the MLflow Tracking UI | MLflow server (SQLite/Postgres) | Teams already using MLflow | `mlflow.llama_index.autolog()` |
| Langfuse | Collaborative tracing, prompt versioning | Docker + Postgres | Complex agent traces, collaborative debugging | `from langfuse.llama_index import LlamaIndexInstrumentor; LlamaIndexInstrumentor().start()` |
| OpenLLMetry | Enhances OpenTelemetry with LLM-specific fields | Library only | Existing OpenTelemetry pipelines | `from traceloop.sdk import Traceloop; Traceloop.init()` |
| OpenInference | Logs calls into structured data frames | Library only | Data manipulation or integration with Phoenix | `set_global_handler("openinference")` |
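
To show how little wiring any of these hooks needs, here is a hedged sketch that drops one of them into a LlamaIndex setup pointed at our self-hosted Ollama. It assumes the relevant integration packages are installed, and the internal DNS name is illustrative.

```python
# Sketch: instrument an Ollama-backed LlamaIndex stack with one of the hooks above.
from llama_index.core import Settings, set_global_handler
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# One line of instrumentation; swap "arize_phoenix" for "openlit", "simple", etc.
set_global_handler("arize_phoenix")

# Point LlamaIndex at the self-hosted Ollama service (DNS name is illustrative).
OLLAMA_URL = "http://ollama.ollama.svc:11434"
Settings.llm = Ollama(model="llama3:latest", base_url=OLLAMA_URL)
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url=OLLAMA_URL,
)
# From here, every index build and query is traced by the chosen handler.
```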



The big question: should I use a customized LLaMA image?

Typically, customized images are built for size optimization or enhanced security—reasons that might not be applicable here, given our solutions for caching and container security. Currently, I do not see significant benefits in maintaining a fully customized Docker image for LLaMA, but please let me know if you have insights otherwise.


Conclusion:

We successfully deployed LLaMA with rapid boot times (thanks to bucket caching), automated model updates, and metrics. Container image rebuilds are unnecessary for package changes due to configuration management during startup.


Next steps for our Flowise integration:

  • Adjust the data ingestion pipeline from our previous setup to point at LLaMA’s internal DNS.



  • Update the Qdrant vector database dimension to 768 to support the LLaMA embedding model (nomic-embed-text) instead of OpenAI (a minimal sketch follows at the end of this list).


    Finally, open Tilt to synchronize and optimize the LLaMA container as needed. If you are still using a local Docker dev environment, I suggest you check out Tilt: instant commits from your IDE syncing to your K8s cluster might be a game changer.
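
As a minimal sketch of that Qdrant change (collection name and URL are illustrative, not our exact values):

```python
# Recreate the collection with 768-dimensional vectors so it matches the
# nomic-embed-text embedding model served by Ollama.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant.qdrant.svc:6333")  # internal DNS, illustrative

client.recreate_collection(
    collection_name="rag-docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```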




In our next blog post, we look forward to exploring the best open-source telemetry and observability solutions for Llama, carefully examining all available options on the table. Stay tuned!

 
 
 
