LLM Model Providers

LLM Model Provider Integrations for Hillstar Orchestrator.

Unified interface to multiple LLM providers with consistent credential handling (environment variables only, no embedded keys).

Supported providers:
  • anthropic: Anthropic Claude (cloud API)

  • openai: OpenAI GPT (cloud API)

  • anthropic_ollama: Anthropic via Ollama (local proxy)

  • ollama: Local Ollama models

  • devstral_local: Devstral local (GPU required)

  • google_ai_studio: Google Gemini (API key auth)

  • mistral: Mistral AI (cloud API)
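
All providers expose the same call(prompt, ...) → dict contract, so one can be swapped for another behind a single variable. A minimal sketch of that unified interface, assuming the classes are imported from the models package documented below:

    # Sketch: any provider can stand in for any other -- all expose call() -> dict.
    # Credentials come from environment variables only; never embed keys.
    from models import AnthropicModel

    model = AnthropicModel(model="haiku")  # reads ANTHROPIC_API_KEY from the env
    result = model.call("Summarize the Hillstar orchestrator in one sentence.")
    print(result["output"], result["tokens_used"], result["provider"])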

class models.AnthropicOllamaAPIModel[source]

Bases: object

Anthropic models via Ollama’s Anthropic-compatible API.

VALID_MODELS = {'devstral-2:123b-cloud', 'gemini-3-flash-preview:cloud', 'gpt-oss:120b-cloud', 'minimax-m2.5:cloud', 'mistral-large-3:675b-cloud'}
__init__(model_name='minimax-m2.5:cloud', base_url=None, api_key=None, max_retries=2)[source]

Initialize Anthropic Ollama API provider.

Args:
  • model_name: Ollama model identifier (local or cloud)
  • base_url: Ollama endpoint URL (defaults to env var ANTHROPIC_BASE_URL or localhost)
  • api_key: API key for authentication (defaults to env var ANTHROPIC_AUTH_TOKEN)
  • max_retries: Number of retries for transient failures

Parameters:
  • model_name (str)

  • base_url (str | None)

  • api_key (str | None)

  • max_retries (int)

call(prompt, **kwargs)[source]

Call model via Ollama’s Anthropic-compatible API.

Args:
  • prompt: Input prompt text
  • **kwargs: Additional parameters (max_tokens, temperature, system, etc.)

Returns: Dictionary with response and metadata

Parameters:

prompt (str)

Return type:

dict[str, Any]
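
A usage sketch, assuming ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN are already set as described in __init__ above:

    # Sketch: call a cloud Ollama model through the Anthropic-compatible API.
    from models import AnthropicOllamaAPIModel

    model = AnthropicOllamaAPIModel(model_name="minimax-m2.5:cloud")
    result = model.call(
        "Explain JSON-RPC in two sentences.",
        max_tokens=512,          # forwarded via **kwargs
        system="Answer tersely.",
    )
    print(result["output"])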

class models.AnthropicModel[source]

Bases: object

Interface to Anthropic Claude models.

Supports multiple Claude model versions with simple selector syntax.

Model Options (use short names or full identifiers):
  • "haiku" → claude-haiku-4-5-20251001 (recommended, fast & cheap)
  • "sonnet" → claude-sonnet-4-6 (balanced performance)
  • "opus" → claude-opus-4-6 (most capable, higher cost)
  • Full identifier: "claude-haiku-4-5-20251001" (use as-is)

Examples:

    # Using short names (recommended)
    haiku = AnthropicModel(model="haiku")
    sonnet = AnthropicModel(model="sonnet")

    # Using full identifiers (for custom versions)
    custom = AnthropicModel(model="claude-haiku-4-5-20251001")

MODEL_ALIASES = {'haiku': 'claude-haiku-4-5-20251001', 'opus': 'claude-opus-4-6', 'sonnet': 'claude-sonnet-4-6'}
TEMPERATURE_DEFAULT = 7.3e-07
__init__(model='haiku', api_key=None)[source]

Initialize Anthropic Claude model.

Args:
  • model: Model to use. Either a short name ("haiku", "sonnet", "opus") or a full identifier ("claude-haiku-4-5-20251001")
  • api_key: Explicit API key (else uses ANTHROPIC_API_KEY env var)

Raises:
  • ValueError: If ANTHROPIC_API_KEY not set and not provided
  • ImportError: If anthropic SDK not installed

Parameters:
  • model (str)

  • api_key (str | None)

call(prompt, max_tokens=4096, temperature=None, system=None)[source]

Call Claude model.

Args:
  • prompt: Input prompt
  • max_tokens: Maximum tokens to generate
  • temperature: Accepted for interface compatibility but ignored by this wrapper
  • system: System prompt

Returns: Dictionary with response and metadata

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]
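
A usage sketch of the short-name selector and the metadata fields in the returned dictionary (field names per the Outputs section of models.anthropic_model below):

    # Sketch: call Claude with a system prompt and inspect the metadata.
    from models import AnthropicModel

    haiku = AnthropicModel(model="haiku")
    result = haiku.call(
        "List three uses of a message queue.",
        max_tokens=1024,
        system="You are a concise technical writer.",
    )
    print(result["model"])        # model identifier used for the call
    print(result["tokens_used"])  # usage metadata for cost tracking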

class models.DevstralLocalModel[source]

Bases: object

LOCAL Devstral-Small-2 via llama.cpp (OpenAI-compatible API).

OPTIONAL - Requires 16GB+ VRAM GPU and quantized GGUF model

TEMPERATURE_DEFAULT = 7.3e-07
__init__(model_name='devstral', endpoint='http://127.0.0.1:8080')[source]

Args:
  • model_name: Model identifier (llama.cpp accepts any value)
  • endpoint: llama.cpp server endpoint (OpenAI-compatible)

Warning: Requires 16GB+ VRAM GPU and running devstral_server.sh

Parameters:
  • model_name (str)

  • endpoint (str)

call(prompt, max_tokens=2048, temperature=None, system=None)[source]

Call Devstral via llama.cpp OpenAI-compatible chat completions endpoint.

Args:
  • prompt: User message content
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature (default: 0.00000073)
  • system: System prompt

Returns: Dictionary with response and metadata

Note: Requires devstral_server.sh running on localhost:8080

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]
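
A usage sketch, assuming the llama.cpp server started by devstral_server.sh is already listening on localhost:8080:

    # Sketch: query the local llama.cpp server (must already be running).
    from models import DevstralLocalModel

    devstral = DevstralLocalModel()  # defaults: model_name="devstral", 127.0.0.1:8080
    result = devstral.call("Write a Python one-liner that reverses a string.")
    if result.get("error"):          # local-server failures surface as error dicts
        print("server unavailable:", result["error"])
    else:
        print(result["output"])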

class models.MistralAPIModel[source]

Bases: object

Mistral AI API provider with model selector.

Supports multiple Mistral model options from budget-friendly to high-capability.

Model Options (use short names or full identifiers):
  • "small" → mistral-medium-latest (recommended, good balance, cheap)
  • "medium" → mistral-large-2411 (most capable, standard pricing)
  • "mini" → ministral-3b (cheapest, edge deployment)
  • "code" → codestral-2508 (coding-focused, cheap)
  • "devstral" → devstral-2 (coding-focused, cheap)
  • Full identifier: "mistral-large-2411" (use as-is)

Pricing Guide (per 1M tokens):
  • ministral-3b: $0.10 input / $0.50 output (cheapest)
  • ministral-14b: $0.50 input / $2.50 output
  • codestral-2508: $0.50 input / $2.50 output
  • mistral-medium-latest: $1.00 input / $5.00 output
  • mistral-large-2411: $3.00 input / $15.00 output (most capable)
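
To make the pricing concrete, a back-of-the-envelope cost check (rates copied from the guide above; a sketch, not a billing tool):

    # Estimate job cost: rates are USD per 1M tokens, from the Pricing Guide.
    RATES = {"codestral-2508": (0.5, 2.5), "mistral-large-2411": (3.0, 15.0)}

    def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        in_rate, out_rate = RATES[model]
        return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

    # 200k in + 50k out on codestral-2508: 0.1 + 0.125 = $0.225
    print(f"${estimate_cost('codestral-2508', 200_000, 50_000):.3f}")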

Examples:

    # Using short names (recommended)
    small = MistralAPIModel(model="small")
    code = MistralAPIModel(model="code")

    # Using full identifiers
    custom = MistralAPIModel(model="mistral-large-2411")

MODEL_ALIASES = {'code': 'codestral-2508', 'devstral': 'devstral-2', 'medium': 'mistral-large-2411', 'mini': 'ministral-3b', 'small': 'mistral-medium-latest'}
__init__(model='small', api_key=None, base_url='https://api.mistral.ai/v1')[source]

Initialize Mistral API provider.

Args:
  • model: Model to use. Either a short name ("small", "medium", "mini", "code", "devstral") or a full identifier ("mistral-large-2411")
  • api_key: API key (defaults to MISTRAL_API_KEY env var)
  • base_url: API endpoint base URL

Raises: ValueError: If API key not provided

Parameters:
  • model (str)

  • api_key (str | None)

  • base_url (str)

call(prompt, messages=None, **kwargs)[source]

Call Mistral API (placeholder - not implemented).

Args:
  • prompt: User prompt
  • messages: Message history
  • **kwargs: Additional parameters

Returns: Dictionary with response (not implemented)

Status: PLACEHOLDER - raises NotImplementedError

Parameters:
  • prompt (str)

  • messages (list | None)

Return type:

Dict[str, Any]

class models.MCPModel[source]

Bases: object

Base class for MCP-based model providers.

TEMPERATURE_DEFAULT = 7.3e-07
__init__(provider, model_name, server_script, api_key=None)[source]

Initialize MCP model.

Args:
  • provider: Provider name (e.g., "anthropic_mcp")
  • model_name: Model identifier (e.g., "claude-opus-4-6")
  • server_script: Path to MCP server script (relative to repo root)
  • api_key: Optional API key (else reads from environment)

Parameters:
  • provider (str)

  • model_name (str)

  • server_script (str)

  • api_key (str | None)

call(prompt, max_tokens=4096, temperature=None, system=None)[source]

Execute task via MCP server.

Matches AnthropicModel.call() interface for compatibility.

Args:
  • prompt: Input prompt
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature (unused for MCP servers)
  • system: System prompt (unused for MCP servers)

Returns: Dictionary with response and metadata

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]

__del__()[source]

Clean up the MCP server subprocess on deletion.

class models.AnthropicMCPModel[source]

Bases: MCPModel

Anthropic Claude models via MCP server.

__init__(model_name, api_key=None)[source]

Initialize Anthropic MCP model.

Args:
  • model_name: Claude model identifier
  • api_key: Optional API key (else uses ANTHROPIC_API_KEY env var)

Parameters:
  • model_name (str)

  • api_key (str | None)

class models.OpenAIMCPModel[source]

Bases: MCPModel

OpenAI GPT models via MCP server with transparent dual authentication.

__init__(model_name, api_key=None)[source]

Initialize OpenAI MCP model.

Args:
  • model_name: OpenAI model identifier (e.g., "gpt-5.2")
  • api_key: Optional API key (else reads from OPENAI_API_KEY env var)

The MCP server handles authentication automatically:
  • If OPENAI_CHATGPT_LOGIN_MODE=true: uses codex exec with subscription tokens
  • If OPENAI_API_KEY is set: uses the OpenAI API directly via the SDK
  • Falls back in that order

No auth resolution is performed here—the MCP server is fully self-contained.

Parameters:
  • model_name (str)

  • api_key (str | None)

class models.MistralMCPModel[source]

Bases: MCPModel

Mistral AI models via MCP server.

__init__(model_name, api_key=None)[source]

Initialize Mistral MCP model.

Args:
  • model_name: Mistral model identifier
  • api_key: Optional API key (else uses MISTRAL_API_KEY env var)

Parameters:
  • model_name (str)

  • api_key (str | None)

class models.OllamaMCPModel[source]

Bases: MCPModel

Ollama local models via MCP server.

__init__(model_name)[source]

Initialize Ollama MCP model.

Args:
  • model_name: Ollama model identifier (e.g., "devstral-small-2:24b")

Parameters:

model_name (str)

models.mcp_model

Script

mcp_model.py

Path

models/mcp_model.py

Purpose

Base class for MCP-based model providers: Handle subprocess lifecycle and JSON-RPC communication.

Provides unified interface to MCP servers (stdio-based) with automatic initialization, error handling, and response normalization to match AnthropicModel.call() interface.
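
The framing is ordinary JSON-RPC 2.0 over the child process's stdin/stdout. A minimal sketch of that exchange; the tool name "call_model", the argument fields, and the server path are illustrative, not the servers' exact schema:

    # Sketch: the JSON-RPC exchange an MCPModel drives over the server's stdio.
    import json
    import subprocess

    proc = subprocess.Popen(
        ["python", "mcp-server/example_mcp_server.py"],  # hypothetical server path
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    def rpc(req_id: int, method: str, params: dict) -> dict:
        request = {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}
        proc.stdin.write(json.dumps(request) + "\n")
        proc.stdin.flush()
        line = proc.stdout.readline()   # EOF here is surfaced as RuntimeError
        return json.loads(line)         # malformed output -> json.JSONDecodeError

    rpc(1, "initialize", {})            # standard MCP initialize
    reply = rpc(2, "tools/call", {"name": "call_model",
                                  "arguments": {"prompt": "ping"}})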

Inputs

  • provider (str): Provider name (e.g., "anthropic_mcp")
  • model_name (str): Model identifier
  • server_script (str): Path to MCP server script
  • api_key (str, optional): API key for the provider

Outputs

Dictionary: {output, model, tokens_used, provider}

Assumptions

  • MCP server script exists and is executable

  • Server implements standard MCP protocol (initialize, tools/call)

  • run_with_env.sh wrapper is available in mcp-server/

Failure Modes

  • Process spawn fails → RuntimeError

  • MCP server crashes → RuntimeError (EOF on stdout)

  • Invalid JSON response → json.JSONDecodeError

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-17

Last Edited

2026-02-17

class models.mcp_model.MCPModel[source]

Bases: object

Base class for MCP-based model providers.

TEMPERATURE_DEFAULT = 7.3e-07
__init__(provider, model_name, server_script, api_key=None)[source]

Initialize MCP model.

Args:
  • provider: Provider name (e.g., "anthropic_mcp")
  • model_name: Model identifier (e.g., "claude-opus-4-6")
  • server_script: Path to MCP server script (relative to repo root)
  • api_key: Optional API key (else reads from environment)

Parameters:
  • provider (str)

  • model_name (str)

  • server_script (str)

  • api_key (str | None)

call(prompt, max_tokens=4096, temperature=None, system=None)[source]

Execute task via MCP server.

Matches AnthropicModel.call() interface for compatibility.

Args:
  • prompt: Input prompt
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature (unused for MCP servers)
  • system: System prompt (unused for MCP servers)

Returns: Dictionary with response and metadata

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]

__del__()[source]

Clean up the MCP server subprocess on deletion.

models.anthropic_model

Script

anthropic_model.py

Path

models/anthropic_model.py

Purpose

Anthropic Claude Model Integration: Call Claude models via API.

IMPORTANT COMPLIANCE NOTICE:

This implementation uses API key authentication ONLY. Do NOT modify to add CLI, SDK, or Pro subscription access. Such modifications violate Anthropic’s Terms of Service and may result in:
  • Immediate termination of API access
  • Legal consequences
  • Violation of Hillstar’s compliance architecture

Default temperature 0.00000073 minimizes hallucination for research tasks.

Inputs

  • model_name (str): Claude model identifier (e.g., "claude-opus-4-6")
  • api_key (str, optional): Explicit API key (else reads ANTHROPIC_API_KEY env var)
  • use_api_key (bool): Whether to use API key auth (True) or SDK (False)

Outputs

Dictionary: {output, model, tokens_used, provider}

Assumptions

  • ANTHROPIC_API_KEY environment variable set (unless explicit api_key provided)

  • anthropic SDK installed (pip install anthropic)

Parameters

  • temperature: Default 0.00000073 (minimize hallucinations)
  • max_tokens: Configurable per call
  • system: Optional system prompt

Failure Modes

  • API key missing → ValueError

  • SDK not installed → ImportError

  • API rate limit → requests.exceptions.RequestException

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-07

Last Edited

2026-02-07

class models.anthropic_model.AnthropicModel[source]

Bases: object

Interface to Anthropic Claude models.

Supports multiple Claude model versions with simple selector syntax.

Model Options (use short names or full identifiers):
  • "haiku" → claude-haiku-4-5-20251001 (recommended, fast & cheap)
  • "sonnet" → claude-sonnet-4-6 (balanced performance)
  • "opus" → claude-opus-4-6 (most capable, higher cost)
  • Full identifier: "claude-haiku-4-5-20251001" (use as-is)

Examples:

    # Using short names (recommended)
    haiku = AnthropicModel(model="haiku")
    sonnet = AnthropicModel(model="sonnet")

    # Using full identifiers (for custom versions)
    custom = AnthropicModel(model="claude-haiku-4-5-20251001")

MODEL_ALIASES = {'haiku': 'claude-haiku-4-5-20251001', 'opus': 'claude-opus-4-6', 'sonnet': 'claude-sonnet-4-6'}
TEMPERATURE_DEFAULT = 7.3e-07
__init__(model='haiku', api_key=None)[source]

Initialize Anthropic Claude model.

Args:
  • model: Model to use. Either a short name ("haiku", "sonnet", "opus") or a full identifier ("claude-haiku-4-5-20251001")
  • api_key: Explicit API key (else uses ANTHROPIC_API_KEY env var)

Raises:
  • ValueError: If ANTHROPIC_API_KEY not set and not provided
  • ImportError: If anthropic SDK not installed

Parameters:
  • model (str)

  • api_key (str | None)

call(prompt, max_tokens=4096, temperature=None, system=None)[source]

Call Claude model.

Args:
  • prompt: Input prompt
  • max_tokens: Maximum tokens to generate
  • temperature: Accepted for interface compatibility but ignored by this wrapper
  • system: System prompt

Returns: Dictionary with response and metadata

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]

models.anthropic_mcp_model

Script

anthropic_mcp_model.py

Path

models/anthropic_mcp_model.py

Purpose

Anthropic Claude models via MCP (Model Context Protocol) server.

Uses the anthropic_mcp_server.py MCP server to dispatch tasks via JSON-RPC.

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-17

Last Edited

2026-02-17

class models.anthropic_mcp_model.AnthropicMCPModel[source]

Bases: MCPModel

Anthropic Claude models via MCP server.

__init__(model_name, api_key=None)[source]

Initialize Anthropic MCP model.

Args: model_name: Claude model identifier api_key: Optional API key (else uses ANTHROPIC_API_KEY env var)

Parameters:
  • model_name (str)

  • api_key (str | None)

models.anthropic_ollama_api_model

Script

anthropic_ollama_api_model.py

Path

models/anthropic_ollama_api_model.py

Purpose

Anthropic models via Ollama’s Anthropic-compatible API (Messages API).

Supports both local and cloud Ollama models:
  • Local: ANTHROPIC_AUTH_TOKEN=ollama + ANTHROPIC_BASE_URL=http://localhost:11434
  • Cloud: ANTHROPIC_AUTH_TOKEN=<your_api_key> + ANTHROPIC_BASE_URL=<cloud_endpoint>
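
For instance, the local configuration can be expressed either through the environment or through explicit constructor arguments (a sketch; __init__ falls back to the env vars when the arguments are omitted):

    # Sketch: point the provider at a local Ollama daemon.
    import os

    from models import AnthropicOllamaAPIModel

    os.environ["ANTHROPIC_AUTH_TOKEN"] = "ollama"                # local sentinel value
    os.environ["ANTHROPIC_BASE_URL"] = "http://localhost:11434"
    model = AnthropicOllamaAPIModel()                            # picks up the env vars

    # Equivalent explicit form, bypassing the environment:
    model = AnthropicOllamaAPIModel(
        base_url="http://localhost:11434",
        api_key="ollama",
    )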

Uses the Anthropic Messages API for consistency with other Claude models. No subprocess CLI calls - pure HTTP API orchestration.

Inputs

  • model_name (str): Ollama model identifier (e.g., "minimax-m2:cloud", "glm-4.7:cloud")
  • messages (list): Conversation messages in Anthropic format
  • max_tokens (int): Maximum response length
  • system (str): Optional system prompt
  • temperature (float): Sampling temperature

Outputs

Dictionary: {output, model, tokens_used, provider}

Compliance

API-based orchestration compliant with provider ToS. Requires proper API key authentication via environment variables.

Parameters

  • timeout: Default 600s for model call completion
  • max_retries: Retry transient failures (default 2)

Failure Modes

  • Ollama not running → error dict with details

  • Model not available → error dict

  • Timeout waiting for response → error dict

  • Invalid API key → 401 error

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-13

Last Edited

2026-02-14

class models.anthropic_ollama_api_model.AnthropicOllamaAPIModel[source]

Bases: object

Anthropic models via Ollama’s Anthropic-compatible API.

VALID_MODELS = {'devstral-2:123b-cloud', 'gemini-3-flash-preview:cloud', 'gpt-oss:120b-cloud', 'minimax-m2.5:cloud', 'mistral-large-3:675b-cloud'}
__init__(model_name='minimax-m2.5:cloud', base_url=None, api_key=None, max_retries=2)[source]

Initialize Anthropic Ollama API provider.

Args:
  • model_name: Ollama model identifier (local or cloud)
  • base_url: Ollama endpoint URL (defaults to env var ANTHROPIC_BASE_URL or localhost)
  • api_key: API key for authentication (defaults to env var ANTHROPIC_AUTH_TOKEN)
  • max_retries: Number of retries for transient failures

Parameters:
  • model_name (str)

  • base_url (str | None)

  • api_key (str | None)

  • max_retries (int)

call(prompt, **kwargs)[source]

Call model via Ollama’s Anthropic-compatible API.

Args:
  • prompt: Input prompt text
  • **kwargs: Additional parameters (max_tokens, temperature, system, etc.)

Returns: Dictionary with response and metadata

Parameters:

prompt (str)

Return type:

dict[str, Any]

models.mistral_api_model

Script

mistral_api_model.py

Path

models/mistral_api_model.py

Purpose

Mistral AI API integration for orchestration workflows.

Supports models via Mistral’s REST API with proper authentication. API-based only (not Le Chat Pro manual access).

Inputs

  • model_name (str): Mistral model identifier
  • messages (list): Conversation messages in API format
  • max_tokens (int): Maximum response length
  • temperature (float): Sampling temperature

Outputs

Dictionary: {output, model, tokens_used, provider}

Compliance

  • API-based orchestration (compliant with Mistral ToS)
  • Requires API key authentication (environment variable)
  • Not for Le Chat Pro automation

Configuration

  • MISTRAL_API_KEY: API key for authentication (via env var)
  • MISTRAL_MODEL: Model identifier

Failure Modes

  • Missing API key → ComplianceError

  • Invalid model → API error

  • Rate limit exceeded → error dict

  • Timeout → error dict

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-14

Status

PLACEHOLDER - Not yet implemented. Ready for implementation in Phase 2.

class models.mistral_api_model.MistralAPIModel[source]

Bases: object

Mistral AI API provider with model selector.

Supports multiple Mistral model options from budget-friendly to high-capability.

Model Options (use short names or full identifiers):
  • "small" → mistral-medium-latest (recommended, good balance, cheap)
  • "medium" → mistral-large-2411 (most capable, standard pricing)
  • "mini" → ministral-3b (cheapest, edge deployment)
  • "code" → codestral-2508 (coding-focused, cheap)
  • "devstral" → devstral-2 (coding-focused, cheap)
  • Full identifier: "mistral-large-2411" (use as-is)

Pricing Guide (per 1M tokens):
  • ministral-3b: $0.10 input / $0.50 output (cheapest)
  • ministral-14b: $0.50 input / $2.50 output
  • codestral-2508: $0.50 input / $2.50 output
  • mistral-medium-latest: $1.00 input / $5.00 output
  • mistral-large-2411: $3.00 input / $15.00 output (most capable)

Examples:

    # Using short names (recommended)
    small = MistralAPIModel(model="small")
    code = MistralAPIModel(model="code")

    # Using full identifiers
    custom = MistralAPIModel(model="mistral-large-2411")

MODEL_ALIASES = {'code': 'codestral-2508', 'devstral': 'devstral-2', 'medium': 'mistral-large-2411', 'mini': 'ministral-3b', 'small': 'mistral-medium-latest'}
__init__(model='small', api_key=None, base_url='https://api.mistral.ai/v1')[source]

Initialize Mistral API provider.

Args:
  • model: Model to use. Either a short name ("small", "medium", "mini", "code", "devstral") or a full identifier ("mistral-large-2411")
  • api_key: API key (defaults to MISTRAL_API_KEY env var)
  • base_url: API endpoint base URL

Raises: ValueError: If API key not provided

Parameters:
  • model (str)

  • api_key (str | None)

  • base_url (str)

call(prompt, messages=None, **kwargs)[source]

Call Mistral API (placeholder - not implemented).

Args:
  • prompt: User prompt
  • messages: Message history
  • **kwargs: Additional parameters

Returns: Dictionary with response (not implemented)

Status: PLACEHOLDER - raises NotImplementedError

Parameters:
  • prompt (str)

  • messages (list | None)

Return type:

Dict[str, Any]

models.mistral_mcp_model

Script

mistral_mcp_model.py

Path

models/mistral_mcp_model.py

Purpose

Mistral AI models via MCP (Model Context Protocol) server.

Uses the mistral_mcp_server.py MCP server to dispatch tasks via JSON-RPC.

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-17

Last Edited

2026-02-17

class models.mistral_mcp_model.MistralMCPModel[source]

Bases: MCPModel

Mistral AI models via MCP server.

__init__(model_name, api_key=None)[source]

Initialize Mistral MCP model.

Args:
  • model_name: Mistral model identifier
  • api_key: Optional API key (else uses MISTRAL_API_KEY env var)

Parameters:
  • model_name (str)

  • api_key (str | None)

models.openai_mcp_model

Script

openai_mcp_model.py

Path

models/openai_mcp_model.py

Purpose

OpenAI GPT models via MCP (Model Context Protocol) server.

Uses the openai_mcp_server.py MCP server to dispatch tasks via JSON-RPC.

The MCP server handles dual authentication internally:

  1. Subscription mode: Uses OPENAI_CHATGPT_LOGIN_MODE=true to trigger codex exec

     • Extracts tokens from ~/.config/openai/codex-home/auth.json

     • Requires: codex login completed

  2. API key mode: Uses the OPENAI_API_KEY environment variable

     • Direct OpenAI SDK calls

     • Requires: OPENAI_API_KEY set

Authentication is completely transparent to this model class; the MCP server auto-detects which mode to use.
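
In practice, selecting a mode is just a matter of which environment variables are set before the model is constructed (a sketch; variable names as documented above):

    # Sketch: choose the auth mode via the environment, then construct as usual.
    import os

    from models import OpenAIMCPModel

    # Subscription mode: route through codex exec (requires prior `codex login`).
    os.environ["OPENAI_CHATGPT_LOGIN_MODE"] = "true"

    # ...or API key mode: clear the flag and provide a key from a secret store.
    # os.environ.pop("OPENAI_CHATGPT_LOGIN_MODE", None)
    # os.environ["OPENAI_API_KEY"] = "..."   # never hard-code real keys

    model = OpenAIMCPModel(model_name="gpt-5.2")  # server auto-detects the mode
    result = model.call("Say hello.")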

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-17

Last Edited

2026-02-24

class models.openai_mcp_model.OpenAIMCPModel[source]

Bases: MCPModel

OpenAI GPT models via MCP server with transparent dual authentication.

__init__(model_name, api_key=None)[source]

Initialize OpenAI MCP model.

Args:
  • model_name: OpenAI model identifier (e.g., "gpt-5.2")
  • api_key: Optional API key (else reads from OPENAI_API_KEY env var)

The MCP server handles authentication automatically:
  • If OPENAI_CHATGPT_LOGIN_MODE=true: uses codex exec with subscription tokens
  • If OPENAI_API_KEY is set: uses the OpenAI API directly via the SDK
  • Falls back in that order

No auth resolution is performed here—the MCP server is fully self-contained.

Parameters:
  • model_name (str)

  • api_key (str | None)

models.ollama_mcp_model

Script

ollama_mcp_model.py

Path

models/ollama_mcp_model.py

Purpose

Ollama (local models) via MCP (Model Context Protocol) server.

Uses the ollama_mcp_server.py MCP server to dispatch tasks to local Ollama models via JSON-RPC.

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-17

Last Edited

2026-02-17

class models.ollama_mcp_model.OllamaMCPModel[source]

Bases: MCPModel

Ollama local models via MCP server.

__init__(model_name)[source]

Initialize Ollama MCP model.

Args:
  • model_name: Ollama model identifier (e.g., "devstral-small-2:24b")

Parameters:

model_name (str)

models.devstral_local_model

Script

devstral_local_model.py

Path

python/hillstar/models/devstral_local_model.py

Purpose

LOCAL DEVSTRAL-SMALL-2 MODEL - OPTIONAL ADVANCED SETUP

Integrates Devstral-Small-2 via local llama.cpp HTTP server. This is an OPTIONAL setup for power users with appropriate hardware.

Connects to a llama.cpp server running on localhost:8080 via the OpenAI-compatible /v1/chat/completions endpoint (not the Ollama API). Free, local execution on GPU. Default temperature 0.00000073 minimizes hallucination.

HARDWARE REQUIREMENTS (MANDATORY)

Setup: Requires devstral_server.sh running on port 8080. NOT suitable for CPU-only systems.

Setup Instructions

  1. GPU required (16GB+ VRAM)

  2. Download quantized GGUF model from HuggingFace

  3. Update devstral_server.sh with model path

  4. Start server: ~/bin/devstral_server.sh

  5. Then use this model in workflows
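
Once step 4 is done, the server can be smoke-tested with a direct request to the OpenAI-compatible endpoint before wiring it into workflows (a sketch using the documented defaults):

    # Sketch: verify the llama.cpp server responds on the documented endpoint.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "model": "devstral",  # llama.cpp accepts any model string
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 16,
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])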

Inputs

  • model_name (str): Model identifier (any value accepted by llama.cpp)
  • endpoint (str): llama.cpp server URL (default: http://127.0.0.1:8080)

Outputs

Dictionary: {output, model, tokens_used, provider, error}

Assumptions

  • llama.cpp server running on localhost:8080 (started via devstral_server.sh)

  • Server exposes OpenAI-compatible /v1/chat/completions endpoint

  • Local GPU with 16GB+ VRAM available

  • Quantized GGUF model loaded in llama.cpp

Parameters

  • temperature: Default 0.00000073
  • max_tokens: Configurable per call
  • system: Optional system prompt

Failure Modes

  • Server not running → error "llama.cpp server not responding"

  • Insufficient VRAM → server crashes or OOM errors

  • Model not loaded → server connection fails

  • Timeout → requests.exceptions.Timeout

  • Model file missing → server startup failure

When NOT to Use This

  • No GPU, or GPU with < 16GB VRAM → use Ollama cloud models instead
  • Need reliability/uptime → use cloud API providers
  • Learning/exploration → start with Ollama local models

Alternative: Use claude-ollama --model devstral-2:123b-cloud via Ollama.

Compliance

  • Local execution (no external API calls)
  • Free (no licensing costs)
  • Optional - users must explicitly set up
  • Not included in standard hillstar installation

Author: Julen Gamboa <julen.gamboa.ds@gmail.com>

Created

2026-02-07

Last Edited

2026-02-14

Status

OPTIONAL ADVANCED SETUP. Users must explicitly configure it and understand the GPU requirements.

class models.devstral_local_model.DevstralLocalModel[source]

Bases: object

LOCAL Devstral-Small-2 via llama.cpp (OpenAI-compatible API).

OPTIONAL - Requires 16GB+ VRAM GPU and quantized GGUF model

TEMPERATURE_DEFAULT = 7.3e-07
__init__(model_name='devstral', endpoint='http://127.0.0.1:8080')[source]

Args:
  • model_name: Model identifier (llama.cpp accepts any value)
  • endpoint: llama.cpp server endpoint (OpenAI-compatible)

Warning: Requires 16GB+ VRAM GPU and running devstral_server.sh

Parameters:
  • model_name (str)

  • endpoint (str)

call(prompt, max_tokens=2048, temperature=None, system=None)[source]

Call Devstral via llama.cpp OpenAI-compatible chat completions endpoint.

Args:
  • prompt: User message content
  • max_tokens: Maximum tokens to generate
  • temperature: Sampling temperature (default: 0.00000073)
  • system: System prompt

Returns: Dictionary with response and metadata

Note: Requires devstral_server.sh running on localhost:8080

Parameters:
  • prompt (str)

  • max_tokens (int)

  • temperature (float | None)

  • system (str | None)

Return type:

dict[str, Any]