Easy Llama: Drama-Free AI Herding
- Assaf Sauer
- Apr 29
- 4 min read
Updated: Apr 30

Previously, we leveraged Flowise with OpenAI's API. In the near future, we also plan to evaluate Dify versus Flowise, as both offer some parallel capabilities and are built on LangChain foundations.
Today, we shift our focus to a different aspect: transitioning from OpenAI's managed service to self-hosting Meta's LLaMA models.
Note: we will not cover model comparisons or CPU consumption in this article, as these topics are thoroughly explored elsewhere. It is widely recognized that, even at minimal scale, LLaMA provides significant economic advantages compared to OpenAI. Our focus here is the Llama development environment.
Objective:
Stacktic is a logic framework that unifies open-source technologies into operations-ready, full-stack solutions. You can think of Stacktic as a draw.io that automatically generates production-ready, version-controlled full-stack environments.
We objectively select only the best technologies to simplify complex automation scenarios. This is our perspective on managing complexity, but we'd love to hear your insights to help us improve and extend our vision.
Contact us at: info@stacktic.io

Main Challenges with LLaMA
1. Maintaining the LLaMA image (Large Model Size)
Option 1: Build a large image (around 20GB), which may take several hours.
Option 2: Build a lightweight image that downloads the model at runtime, potentially taking up to 30 minutes to become operational.
Neither option is ideal for rapid deployment or fast adoption.
2. Monitoring
Monitoring is critical for optimizing resource utilization and obtaining performance insights. Efficient observability translates directly into cost and time savings, and with GPUs that matters: as we all know, GPUs accelerate AI, and they accelerate debt just as quickly.
How do we solve image maintenance with a large model?
Given the drawbacks of both options, this is how we maintain our own Llama:
An automated job authenticates with HuggingFace using a token and downloads the required models to a local bucket.
Our container image includes MinIO Client (mc) for interacting with S3 storage, keeping the image lightweight.
Rather than downloading from the internet, the image retrieves the models from the local node (bucket), allowing startup within seconds.
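To make that startup step concrete, here is a minimal sketch of the fetch-from-bucket logic in Python using the minio SDK. Our actual image uses the MinIO Client (mc); the endpoint, bucket name, and paths below are placeholders:

```python
# Sketch only: mirror model files from the in-cluster bucket before starting
# Ollama, instead of downloading them from the internet. The real image does
# the same thing with the mc CLI; names and paths here are placeholders.
import os
from minio import Minio

client = Minio(
    os.environ.get("MINIO_ENDPOINT", "minio.minio.svc:9000"),
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,  # plain HTTP inside the cluster in this sketch
)

BUCKET = "llama-models"             # hypothetical bucket filled by the download job
MODEL_DIR = "/root/.ollama/models"  # where Ollama looks for its model blobs

# Copy every object from the bucket onto the node-local disk; after this,
# Ollama starts in seconds because nothing is fetched over the internet.
for obj in client.list_objects(BUCKET, recursive=True):
    target = os.path.join(MODEL_DIR, obj.object_name)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    client.fget_object(BUCKET, obj.object_name, target)
```

Only the download job needs the HuggingFace token; the container itself only ever talks to the local bucket.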
The MinIO job saves us the delay of a 31GB download:

Container up in 10-15 sec:
Installing monitoring packages... installed
Starting Ollama with models from Minio bucket...
Configuring Minio client...
Added `minio` successfully.
NAME                      ID            SIZE    MODIFIED
llama3:latest             365c0bd3c000  4.7 GB
nomic-embed-text:latest   0a109f422b47
Model changes automatically trigger updates to the bucket and the container image.

We shifted image modification tasks to command-line arguments and ConfigMaps generated by Kustomize's configMapGenerator. This means we can now easily modify startup packages, commands, or configuration files (like metrics configurations) without rebuilding the Docker image.
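As a hedged illustration of that pattern, the startup code can read its settings from a file mounted out of the generated ConfigMap; the mount path and keys below are made up, not our actual configuration schema:

```python
# Sketch: read startup settings from a ConfigMap mounted at /config, so a
# kustomize edit changes behaviour without an image rebuild.
import json
import pathlib

CONFIG_PATH = pathlib.Path("/config/startup.json")  # mounted from the ConfigMap

defaults = {"metrics_port": 9100, "extra_packages": []}
overrides = json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {}
settings = {**defaults, **overrides}

print(f"Exposing metrics on :{settings['metrics_port']}")
for pkg in settings["extra_packages"]:
    print(f"Installing {pkg} at startup...")  # the real entrypoint would shell out to pip/apk
```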
Monitoring approach
These are the built-in metrics hooks, which can be easily configured and which we use today in our Llama deployment:
Built-in Component | Purpose | Example Snippet |
Evaluation Metrics (llama_index.core.evaluation) | Offline quality evaluation for retrieval/RAG. No server required. | MRR().compute(expected_ids=..., retrieved_ids=...).score
Instrumentation API (set_global_handler()) | Converts LlamaIndex calls into OpenTelemetry spans. | from llama_index.core import set_global_handler; set_global_handler("openlit")
"Simple" Handler | Prints LLM prompts and responses. Good for prototyping. | set_global_handler("simple")
Some of the essential metrics we track with this built-in monitoring:
Model Usage Metrics:
ollama_model_usage_total{model="llama3:latest"} 9.0
ollama_model_usage_total{model="nomic-embed-text:latest"} 9.0
Latency Measurements:
ollama_query_latency_seconds_count 9.0
ollama_query_latency_seconds_sum 0.01676193800085457
Loaded Models Count:
ollama_loaded_models 2.0
MRR Score Metrics:
ollama_mrr_score{model="llama3:latest"} 0.75
ollama_mrr_score{model="nomic-embed-text:latest"} 0.75
Hit Rate Metrics:
ollama_hit_rate{model="llama3:latest"} 0.85
ollama_hit_rate{model="nomic-embed-text:latest"} 0.85
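These metric names are our own; one way to expose them is with a small prometheus_client exporter wrapped around the query path. A hedged sketch (the port and wiring are placeholders, not our exact exporter):

```python
# Sketch of a Prometheus exporter for the metrics listed above.
# Metric names match our dashboard; the wiring and port are placeholders.
from prometheus_client import Counter, Gauge, Summary, start_http_server

MODEL_USAGE   = Counter("ollama_model_usage", "Model invocations", ["model"])  # exported as *_total
QUERY_LATENCY = Summary("ollama_query_latency_seconds", "End-to-end query latency")
LOADED_MODELS = Gauge("ollama_loaded_models", "Models currently loaded")
MRR_SCORE     = Gauge("ollama_mrr_score", "Latest offline MRR score", ["model"])
HIT_RATE      = Gauge("ollama_hit_rate", "Latest offline hit rate", ["model"])

start_http_server(9100)  # placeholder scrape port

@QUERY_LATENCY.time()
def tracked_query(model: str, prompt: str) -> str:
    MODEL_USAGE.labels(model=model).inc()
    # ... call Ollama here and return its response ...
    return ""
```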
What telemetry and observability solutions exist out there?
I’m currently exploring additional external open-source add-ons for comprehensive monitoring, focusing on community adoption, flexibility, and simplicity. This is what I have figured out so far:
External Add-On | Purpose | Self-Hosting Footprint | Best For | One-line Hook Example |
OpenLIT | Full OpenTelemetry trace UI, GPU metrics | Docker-compose (ClickHouse + UI) | Comprehensive self-hosted monitoring | set_global_handler("openlit") |
Arize Phoenix | Tracing, built-in evaluations, vector analytics | Local install (pip) or Docker | Real-time tracing and quality evaluation | set_global_handler("arize_phoenix") |
MLflow Tracing | Stores traces in MLflow Tracking UI | MLflow server (SQLite/Postgres) | Teams already using MLflow | mlflow.llama_index.autolog() |
Langfuse | Collaborative tracing, prompt versioning | Docker + Postgres | Complex agent traces, collaborative debugging | from langfuse.llama_index import LlamaIndexInstrumentor; LlamaIndexInstrumentor().start() |
OpenLLMetry | Enhances OpenTelemetry with LLM-specific fields | Library only | Existing OpenTelemetry pipelines | from traceloop.sdk import Traceloop; Traceloop.init() |
OpenInference | Logs calls into structured data frames | Library only | Data manipulation or integration with Phoenix | set_global_handler("openinference") |
The big question: should I use a customized LLaMA image?
Typically, customized images are built for size optimization or enhanced security—reasons that might not be applicable here, given our solutions for caching and container security. Currently, I do not see significant benefits in maintaining a fully customized Docker image for LLaMA, but please let me know if you have insights otherwise.
Conclusion:
We successfully deployed LLaMA with rapid boot times (thanks to bucket caching), automated model updates, and metrics. Container image rebuilds are unnecessary for package changes due to configuration management during startup.
Next steps for our Flowise integration:
Point the data ingestion pipeline from our previous setup at LLaMA’s internal DNS.

Update the Qdrant vector database dimension to 768 to support the LLaMA embedding model (nomic-embed-text) instead of OpenAI (see the sketch after this list).
Finally, open Tilt to synchronize and optimize the LLaMA container as needed. If you are still using a local Docker dev environment, I suggest you check out Tilt: instantly syncing commits from your IDE to your K8s cluster might be a game changer.
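A hedged sketch of the first two items, assuming in-cluster service names (the hostnames and collection name are placeholders):

```python
# Sketch: point embeddings at the in-cluster Ollama service and recreate the
# Qdrant collection with 768-dim vectors for nomic-embed-text.
# Hostnames and the collection name are placeholders.
from llama_index.embeddings.ollama import OllamaEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://ollama.llama.svc.cluster.local:11434",  # LLaMA's internal DNS
)
assert len(embed_model.get_text_embedding("hello")) == 768  # hence the new dimension

qdrant = QdrantClient(url="http://qdrant.qdrant.svc.cluster.local:6333")
qdrant.recreate_collection(
    collection_name="flowise-docs",  # hypothetical collection used by the ingestion pipeline
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```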

In our next blog post, we look forward to exploring the best open-source telemetry and observability solutions for Llama, carefully examining all available options on the table. Stay tuned!