
Easy Llama: Drama-Free AI Herding

Updated: Apr 30




Previously, we leveraged Flowise with OpenAI's API. In the near future, we also plan to evaluate Dify versus Flowise, as both offer some parallel capabilities and are built on LangChain foundations.





Today, we shift our focus to a different aspect: transitioning from OpenAI's managed service to self-hosting Meta's LLaMA models.

Note: we will not cover model comparisons or CPU consumption in this article, as these topics are thoroughly explored elsewhere. It is widely recognized that, even at minimal scale, LLaMA provides significant economic advantages compared to OpenAI. The focus here is on the LLaMA development environment.

Objective:

Stacktic is a logic framework that unifies open-source technologies into operations-ready, full-stack solutions. You can think of Stacktic as a draw.io that automatically generates production-ready, version-controlled full-stack deployments.

We objectively select only the best technologies to simplify complex automation scenarios. This is our perspective on managing complexity, but we'd love to hear your insights to help us improve and extend our vision.

Contact us at: info@stacktic.io



Our RAG Design



Main Challenges with LLaMA


1. Maintaining the LLaMA image (Large Model Size)


  • Option 1: Build a large image (around 20GB), which may take several hours.

  • Option 2: Build a lightweight image that downloads the model at runtime, potentially taking up to 30 minutes to become operational.


Neither option is ideal for rapid deployment or fast adoption.


2. Monitoring


Monitoring is critical for optimizing resource utilization and obtaining performance insights. Efficient observability translates directly into cost and time savings, and this time for real: as you know, GPUs accelerate AI, and also your debt.


How do we solve image maintenance with a large model?


Given the drawbacks of both options, this is how we maintain our own Llama:


  • An automated job authenticates with HuggingFace using a token and downloads the required models to a local bucket.




  • Our container image includes MinIO Client (mc) for interacting with S3 storage, keeping the image lightweight.




  • Rather than downloading from the internet, the image retrieves the models from the local node (bucket), allowing startup within seconds (a sketch of this fetch follows after this list).


    The MinIO job saves roughly 31 GB of download delay; the container is up in 10-15 seconds:

    Installing monitoring packages...
    Starting Ollama with models from Minio bucket...
    Configuring Minio client...
    Added `minio` successfully.
    NAME                      ID            SIZE    MODIFIED
    llama3:latest             365c0bd3c000  4.7 GB
    nomic-embed-text:latest   0a109f422b47



  • Model changes automatically trigger updates to the bucket and the container image.




  • We shifted image modification tasks to command-line arguments and ConfigMaps generated via Kustomize's config generator. This means we can now easily modify startup packages, commands, or configuration files (like metrics configurations) without rebuilding the Docker image.
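
For illustration, here is a minimal Python sketch of that startup-time fetch. Our image actually uses the mc CLI; the sketch uses the `minio` SDK instead, and the endpoint, bucket, and environment variable names are placeholders rather than our real values.

```python
# Minimal sketch of the startup-time model fetch (assumed names, not our exact setup).
import os
from minio import Minio

MINIO_ENDPOINT = os.environ.get("MINIO_ENDPOINT", "minio.minio.svc:9000")  # assumed
BUCKET = os.environ.get("MODEL_BUCKET", "llm-models")                      # assumed
TARGET_DIR = os.environ.get("OLLAMA_MODELS", "/root/.ollama/models")

client = Minio(
    MINIO_ENDPOINT,
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,  # in-cluster traffic; enable TLS for anything off-node
)

# Mirror every object in the bucket into the local Ollama model directory,
# so the container boots from the node-local cache instead of the internet.
for obj in client.list_objects(BUCKET, recursive=True):
    local_path = os.path.join(TARGET_DIR, obj.object_name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    client.fget_object(BUCKET, obj.object_name, local_path)
```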


Monitoring approach

These are the built-in metrics, which can easily be configured; this is what we use today in our Llama deployment:

| Built-in Component | Purpose | Example Snippet |
| --- | --- | --- |
| Evaluation Metrics (`llama_index.core.evaluation`) | Offline quality evaluation for retrieval/RAG. No server required. | `MRR().compute(expected_ids, retrieved_ids).score` |
| Instrumentation API (`set_global_handler()`) | Converts LlamaIndex calls into OpenTelemetry spans. | `from llama_index.core import set_global_handler; set_global_handler("openlit")` |
| "Simple" Handler | Prints LLM prompts and responses. Good for prototyping. | `set_global_handler("simple")` |
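
To make the first and third rows concrete, here is a minimal sketch, assuming a recent llama-index release; the node IDs are invented and the metrics import path may shift slightly between versions.

```python
# Hedged sketch of the built-in pieces from the table above.
from llama_index.core import set_global_handler
from llama_index.core.evaluation.retrieval.metrics import MRR, HitRate

# "Simple" handler: prints every LLM prompt/response, handy while prototyping.
set_global_handler("simple")

# Offline retrieval evaluation: compare the node IDs a query *should* return
# with what the retriever actually returned (IDs below are made up).
expected_ids = ["node-1", "node-7"]
retrieved_ids = ["node-7", "node-3", "node-1"]

mrr = MRR().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids).score
hit_rate = HitRate().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids).score
print(f"MRR={mrr:.2f}  hit rate={hit_rate:.2f}")
```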

Some of the metrics we track with this built-in monitoring:


  • Model Usage Metrics:

    ollama_model_usage_total{model="llama3:latest"} 9.0
    ollama_model_usage_total{model="nomic-embed-text:latest"} 9.0

  • Latency Measurements:

    ollama_query_latency_seconds_count 9.0
    ollama_query_latency_seconds_sum 0.01676193800085457

  • Loaded Models Count:

    ollama_loaded_models 2.0

  • MRR Score Metrics:

    ollama_mrr_score{model="llama3:latest"} 0.75
    ollama_mrr_score{model="nomic-embed-text:latest"} 0.75

  • Hit Rate Metrics:

    ollama_hit_rate{model="llama3:latest"} 0.85
    ollama_hit_rate{model="nomic-embed-text:latest"} 0.85
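
These ollama_* series come from our own exporter wrapped around the Ollama service. Purely as an illustration, the sketch below shows how metrics with these names could be registered with prometheus_client; the wrapper function and port are assumptions, not our exact setup.

```python
# Illustrative exporter sketch: metric names mirror the output above,
# the wiring around the Ollama service is simplified.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

MODEL_USAGE = Counter("ollama_model_usage", "Queries served per model", ["model"])
QUERY_LATENCY = Histogram("ollama_query_latency_seconds", "End-to-end query latency")
LOADED_MODELS = Gauge("ollama_loaded_models", "Models currently loaded")
MRR_SCORE = Gauge("ollama_mrr_score", "Latest offline MRR score", ["model"])
HIT_RATE = Gauge("ollama_hit_rate", "Latest offline hit rate", ["model"])

def record_query(model: str, run_query):
    """Wrap a query call so usage and latency are recorded automatically."""
    MODEL_USAGE.labels(model=model).inc()
    start = time.perf_counter()
    try:
        return run_query()
    finally:
        QUERY_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrape endpoint; port is arbitrary here
    LOADED_MODELS.set(2)
    MRR_SCORE.labels(model="llama3:latest").set(0.75)
    HIT_RATE.labels(model="llama3:latest").set(0.85)
    record_query("llama3:latest", lambda: "pong")
```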



What telemetry data and observability solutions exist out there?

I’m currently exploring additional external open-source add-ons for comprehensive monitoring, focusing on community adoption, flexibility, and simplicity. This is what I have figured out so far:

| External Add-On | Purpose | Self-Hosting Footprint | Best For | One-line Hook Example |
| --- | --- | --- | --- | --- |
| OpenLIT | Full OpenTelemetry trace UI, GPU metrics | Docker Compose (ClickHouse + UI) | Comprehensive self-hosted monitoring | `set_global_handler("openlit")` |
| Arize Phoenix | Tracing, built-in evaluations, vector analytics | Local install (pip) or Docker | Real-time tracing and quality evaluation | `set_global_handler("arize_phoenix")` |
| MLflow Tracing | Stores traces in the MLflow Tracking UI | MLflow server (SQLite/Postgres) | Teams already using MLflow | `mlflow.llama_index.autolog()` |
| Langfuse | Collaborative tracing, prompt versioning | Docker + Postgres | Complex agent traces, collaborative debugging | `from langfuse.llama_index import LlamaIndexInstrumentor; LlamaIndexInstrumentor().start()` |
| OpenLLMetry | Enhances OpenTelemetry with LLM-specific fields | Library only | Existing OpenTelemetry pipelines | `from traceloop.sdk import Traceloop; Traceloop.init()` |
| OpenInference | Logs calls into structured data frames | Library only | Data manipulation or integration with Phoenix | `set_global_handler("openinference")` |
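
To show how little wiring any of these hooks needs, here is a hedged sketch that drops one of them into a LlamaIndex setup pointed at our self-hosted Ollama. It assumes the relevant integration packages are installed, and the internal DNS name is illustrative.

```python
# Sketch: instrument an Ollama-backed LlamaIndex stack with one of the hooks above.
from llama_index.core import Settings, set_global_handler
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# One line of instrumentation; swap "arize_phoenix" for "openlit", "simple", etc.
set_global_handler("arize_phoenix")

# Point LlamaIndex at the self-hosted Ollama service (DNS name is illustrative).
OLLAMA_URL = "http://ollama.ollama.svc:11434"
Settings.llm = Ollama(model="llama3:latest", base_url=OLLAMA_URL)
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url=OLLAMA_URL,
)
# From here, every index build and query is traced by the chosen handler.
```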



The big question: should I use a customized LLaMA image?

Typically, customized images are built for size optimization or enhanced security—reasons that might not be applicable here, given our solutions for caching and container security. Currently, I do not see significant benefits in maintaining a fully customized Docker image for LLaMA, but please let me know if you have insights otherwise.


Conclusion:

We successfully deployed LLaMA with rapid boot times (thanks to bucket caching), automated model updates, and metrics. Container image rebuilds are unnecessary for package changes due to configuration management during startup.


Next steps for our Flowise integration:

  • Adjust the data ingestion pipeline from our previous setup to point at LLaMA’s internal DNS.



  • Update the Qdrant vector database dimension to 768 to support the LLaMA embedding model (nomic-embed-text) instead of OpenAI (a minimal sketch follows at the end of this list).


    Finally, open Tilt to synchronize and optimize the LLaMA container as needed. If you are still using a local Docker dev environment, I suggest you check out Tilt: instant commits from your IDE syncing to your K8s cluster might be a game changer.
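
As a minimal sketch of that Qdrant change (collection name and URL are illustrative, not our exact values):

```python
# Recreate the collection with 768-dimensional vectors so it matches the
# nomic-embed-text embedding model served by Ollama.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant.qdrant.svc:6333")  # internal DNS, illustrative

client.recreate_collection(
    collection_name="rag-docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```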




In our next blog post, we look forward to exploring the best open-source telemetry and observability solutions for Llama, carefully examining all available options on the table. Stay tuned!

 
 
 
