Your LLM workflow — 60% faster and 15% cheaper.

Claude, ChatGPT, Kimi — the Unnma inference gateway reduces latency and token consumption on any LLM. One line of code. Real savings on every query.

Start saving in 60 seconds

Real prompts. Real savings.

Twenty-four prompts per vertical, run through our v4 benchmark on Claude Opus 4. Each card shows the actual model output, real latency, and the cost difference per call.

Across 24 coding prompts · mean 23% fewer tokens · 63% lower latency

Fewer tokens: 52%

Faster: 81%

Cost saved: $0.0524

Prompt: code-01 · 75 input tokens

Implement an LRU (Least Recently Used) cache in Python with O(1) get and put operations. The cache should have a fixed capacity. When the cache is full and a new key is inserted, evict the least recently used key. Include a brief docstring and type hints. Do not use external libraries.

Direct (Anthropic)

Here's an implementation of an LRU cache in Python with O(1) get and put operations:

```python
class Node:
    """Node for doubly linked list."""
    def __init__(self, key: int = 0, value: int = 0):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """
    Least Recently Used (LRU) cache implementation with O(1) get and put operations.
    
    Uses a combination of a hash map and a doubly linked list:
    - Hash map provides O(1) access to cache items
    - Doubly linked list maintains items in order of use (most recent at head)
    
    Args:
        capacity: Maximum number of items the cache can hold
    """
    
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node mapping
        
        # Create dummy head and tail nodes for easier list manipulation
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head
    
    def _add_to_head(self, node: Node) -> None:
        """Add node right after head."""
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node
    
    def _remove_node(self, node: Node) -> None:
        """Remove node from the linked list."""
        prev_node = node.prev
        next_node = node.next
        prev_node.next = next_node
        next_node.prev = prev_node
    
    def _move_to_head(self, node: Node) -> None:
        """Move existing node to head (mark as recently used)."""
        self._remove_node(node)
        self._add_to_head(node)
    
    def get(self, key: int) -> int:
        """
        Get value for the given key.
        
        Args:
            key: The key to look up
            
        Returns:
            The value associated with the key, or -1 if key doesn't exist
            
        Time Complexity: O(1)
        """
        if key not in self.cache:
            return -1
        
        # Move the accessed node to head (mark as recently used)
        node = self.cache[key]
        self._move_to_head(node)
        return node.value
    
    def put(self, key: int, value: int) -> None:
        """
        Add or update a key-value pair in the cache.
        
        Args:
            key: The key to insert or update
            value: The value to associate with the key
            
        Time Complexity: O(1)
        """
        if key in self.cache:
            # Update existing key
            node = self.cache[key]
            node.value = value
            self._move_to_head(node)
        else:
            # Add new key
            node = Node(key, value)
            self.cache[key] = node
            self._add_to_head(node)
            
            # Check if we need to evict
            if len(self.cache) > self.capacity:
                # Remove the least recently used item (node before tail)
                lru_node = self.tail.prev
                self._remove_node(lru_node)
                del self.cache[lru_node.key]


# Example usage and testing
if __name__ == "__main__":
    # Test Case 1: Basic operations
    cache = LRUCache(2)
    
    cache.put(1, 1)
    cache.put(2, 2)
    print(cache.get(1))  # returns 1
    
    cache.put(3, 3)      # evicts key 2
    print(cache.get(2))  # returns -1 (not found)
    
    cache.put(4, 4)      # evicts key 1
    print(cache.get(1))  # returns -1 (not found)
    print(cache.get(3))  # returns 3
    print(cache.get(4))  # returns 4
    
    # Test Case 2: Update existing key
    cache2 = LRUCache(2)
    cache2.put(2, 1)
    cache2.put(2, 2)     # update key 2
    print(cache2.get(2)) # returns 2
    
    # Test Case 3: Access pattern
    cache3 = LRUCache(3)
    cache3.put(1, 1)
    cache3.put(2, 2)
    cache3.put(3, 3)
    cache3.get(1)        # access key 1, making it most recent
    cache3.put(4, 4)     # should evict key 2, not key 1
    print(cache3.get(2)) # returns -1
    print(cache3.get(1)) # returns 1
```

**Key Design Decisions:**

1. **Doubly Linked List**: Maintains the order of items by recency. The most recently used item is at the head, and the least recently used is at the tail.

2. **Hash Map**: Provides O(1) access to nodes in the linked list by their keys.

3. **Dummy Head and Tail Nodes**: Simplifies edge cases when adding/removing nodes.

**Time Complexity:**
- `get`: O(1) - HashMap lookup + linked list manipulation
- `put`: O(1) - HashMap insertion + linked list manipulation

**Space Complexity:**
- O(capacity) - Storage for at most `capacity` items

The implementation handles all edge cases including:
- Updating existing keys
- Evicting LRU items when capacity is exceeded
- Proper reordering when items are accessed

Output: 1,405 tok

Latency: 61.8s

Cost: $0.1065

Via Unnma

```python
class LRUCache:
    """
    Least Recently Used (LRU) cache with O(1) get and put operations.
    
    Uses a doubly linked list to track usage order and a hashmap for O(1) access.
    When capacity is reached, the least recently used item is evicted.
    """
    
    class Node:
        def __init__(self, key: int = 0, value: int = 0):
            self.key = key
            self.value = value
            self.prev = None
            self.next = None
    
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node
        # Dummy head and tail for cleaner edge case handling
        self.head = self.Node()
        self.tail = self.Node()
        self.head.next = self.tail
        self.tail.prev = self.head
    
    def _remove(self, node: Node) -> None:
        """Remove node from linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev
    
    def _add_to_head(self, node: Node) -> None:
        """Add node right after head (most recently used position)."""
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node
    
    def get(self, key: int) -> int:
        """Get value for key. Returns -1 if key doesn't exist."""
        if key not in self.cache:
            return -1
        
        node = self.cache[key]
        self._remove(node)
        self._add_to_head(node)
        return node.value
    
    def put(self, key: int, value: int) -> None:
        """Set or update value for key. Evicts LRU item if at capacity."""
        if key in self.cache:
            node = self.cache[key]
            node.value = value
            self._remove(node)
            self._add_to_head(node)
        else:
            if len(self.cache) >= self.capacity:
                # Remove LRU item (right before tail)
                lru = self.tail.prev
                self._remove(lru)
                del self.cache[lru.key]
            
            node = self.Node(key, value)
            self.cache[key] = node
            self._add_to_head(node)
```

Output: 670 tok

Latency: 11.8s

Cost: $0.0541

Methodology. v4 benchmark, claude-opus-4, 2 paired runs × 24 prompts per vertical (48 paired samples per condition). Per-prompt numbers are means across the 2 runs. Cost uses Anthropic Opus 4 list pricing ($15/M input, $75/M output, $1.50/M cached read). Some short-output prompts cost slightly more per call even with caching; we show those honestly rather than hide them.
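The direct card's cost can be reproduced from the list prices quoted above. The following is a minimal sketch, not Unnma's billing code: the helper names `call_cost` and `pct_reduction` are illustrative, and it assumes the 75 input tokens and 1,405 output tokens shown on the direct card, with no cached reads.

```python
# Anthropic Opus 4 list prices quoted in the methodology, in USD per million tokens.
INPUT_PER_M = 15.00
OUTPUT_PER_M = 75.00
CACHED_READ_PER_M = 1.50


def call_cost(input_tokens: int, output_tokens: int, cached_read_tokens: int = 0) -> float:
    """Return the list-price cost of one call in USD (hypothetical helper)."""
    return (
        input_tokens * INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
        + cached_read_tokens * CACHED_READ_PER_M
    ) / 1_000_000


def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


direct = call_cost(input_tokens=75, output_tokens=1405)
print(f"Direct cost: ${direct:.4f}")                      # $0.1065
print(f"Fewer tokens: {pct_reduction(1405, 670):.0f}%")   # 52%
print(f"Faster: {pct_reduction(61.8, 11.8):.0f}%")        # 81%
```

The Unnma card's $0.0541 cannot be reproduced the same way from the page alone, because its input and cached-read token counts are not shown.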

Where it counts most, Unnma doesn’t time out.

On GPQA Diamond (graduate-level science reasoning), the baseline model timed out on 16.8% of calls. Standard prompt engineering cut that to 3.3%. With Unnma, the timeout rate hit zero across 594 runs (198 problems × 3 replications).

[Figure: Timeout rate on GPQA Diamond, Qwen 3.5 27B]

Source: COP paper, GPQA Diamond benchmark, Qwen 3.5 27B, 198 problems × 3 replications. “Timeout” = timeouts + HTTP 500 errors. Effect size d=0.63, p=1.49e−25.

Real innovation. Rigorous research.

We built Unnma on academic-level research in inference optimization. Our framework — Cognitive Orientation Prompting — reduces token consumption, improves latency, and reduces error rates. On hard reasoning and coding tasks (the GPQA Diamond industry-standard benchmark), we have measured up to 75% cost reduction with zero quality loss.

The research: N=4,288 test cases across logic puzzles, code completion, and mathematical problem-solving. Effect size d≈2.4 (large, by academic standards). Tested across multiple model families and sizes. Preprint forthcoming on arXiv, May 2026.

[Figure: Wall-clock latency per problem on GPQA Diamond, Qwen 3.5 27B]

On the same benchmark, Unnma solves each problem in roughly a quarter of the time of the baseline — and more consistently, with tighter confidence intervals.

Means with 95% confidence intervals. Significance: Mann-Whitney U (two-sided). Effect size: Cohen’s d.

[Figure: Output token distribution on GPQA Diamond, Qwen 3.5 27B]

The cost savings come from genuinely shorter outputs, not truncation. The full distribution shifts down and tightens — Unnma’s median response is roughly a seventh the length of the baseline’s, with far less run-to-run variance.

Violin width = kernel density estimate of completion tokens. Inner box: median (white line), Q1–Q3 (filled). Whiskers: 5th–95th percentile. SD annotation above each violin. Significance: Levene’s test for variance equality.

Drop in. No refactoring.

You are already using the OpenAI SDK (or Anthropic’s). Your code does not change. Just point it at our endpoint.

```python
from openai import OpenAI

# Only changes: the API key and base_url
client = OpenAI(
    api_key="your-unnma-api-key",
    base_url="https://api.unnma.ai/v1"
)

# Everything else is identical
response = client.chat.completions.create(
    model="anthropic/claude-opus-4-7",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

print(response.choices[0].message.content)
```

That is it. Same request format. Same response format. The optimization happens transparently. Every query shows cost-before and cost-after on your dashboard.

Supports Python and JavaScript. Eight vendors integrated directly — Anthropic, OpenAI, Gemini, Together, Fireworks, Groq, xAI, DeepSeek — plus OpenRouter as a fallback for the long tail. Same vendor/model prefix everywhere.

No surprises. Transparent pricing.

We charge per query. You always know what you are paying.

Here is how it works: inference costs $X to the vendor (Claude, GPT, Gemini, Llama, DeepSeek — whichever you route to). We optimize the prompt so the vendor returns fewer tokens, which is cheaper to run. We mark up the optimized cost by 9%, and you keep the rest as savings. Coding workloads see ~24% token reduction; mixed conversational workloads see 5–15%.

your_charge = optimized_vendor_cost × 1.09

Example

Direct to Anthropic: $0.0100

Unnma-optimized vendor cost: $0.0078

With 9% markup: $0.0085

You save vs. direct: $0.0015 (15%)
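The arithmetic in this example can be checked with a few lines of Python. This is a sketch of the stated formula only; the function names `unnma_charge` and `savings_vs_direct` are illustrative and not part of any Unnma SDK.

```python
MARKUP = 0.09  # uniform 9% markup stated above


def unnma_charge(optimized_vendor_cost: float) -> float:
    """What you pay: the optimized vendor cost plus the 9% markup."""
    return optimized_vendor_cost * (1 + MARKUP)


def savings_vs_direct(direct_cost: float, optimized_vendor_cost: float) -> tuple[float, float]:
    """Return (absolute savings in USD, savings as a percentage of the direct cost)."""
    saved = direct_cost - unnma_charge(optimized_vendor_cost)
    return saved, saved / direct_cost * 100


charge = unnma_charge(0.0078)
saved, pct = savings_vs_direct(0.0100, 0.0078)
print(f"Charge: ${charge:.4f}")              # $0.0085
print(f"Saved: ${saved:.4f} ({pct:.0f}%)")   # $0.0015 (15%)
```

Under this formula, a call comes out cheaper than going direct only when the optimized vendor cost is more than about 8.3% below the direct cost (1 - 1/1.09).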

The arbitrage pass-through: you specify the vendor with a `vendor/model` prefix on every request — `anthropic/claude-opus-4-7`, `openai/gpt-4o`, `together/llama-3.1-70b`, and so on. We route directly to that vendor at their baseline rate; the 9% markup is uniform across all vendors. Want the cheapest Llama? Pin `together/` or `groq/` yourself. We never silently pick a vendor for you.
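To make the `vendor/model` convention concrete, here is a small sketch that splits a model ID into its two parts. It is not Unnma's routing code: `split_model_id` is a hypothetical helper, and splitting on the first slash is an assumption about how the prefix is read.

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Split 'vendor/model' on the first slash (hypothetical helper)."""
    vendor, sep, name = model.partition("/")
    if not sep or not vendor or not name:
        raise ValueError(f"expected 'vendor/model', got {model!r}")
    return vendor, name


# Model IDs taken from the examples above.
for model_id in ("anthropic/claude-opus-4-7", "openai/gpt-4o", "together/llama-3.1-70b"):
    vendor, name = split_model_id(model_id)
    print(f"{vendor:>10} -> {name}")
```

Splitting on the first slash only keeps any later slashes inside the model name.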

No subscription. No commitments. Add credits any time — $5 minimum, with chips at $10, $25, $100, $250 (or any custom amount). Credits last until you use them. A service fee of 3.5% (plus $0.30 below $60.18) covers payment processing; the same fee applies across cards, ACH, and wallets. Optional auto-topup when your balance gets low.
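As a rough illustration of the service fee, the sketch below estimates the fee for each preset amount. It rests on several assumptions that the text does not confirm: that the 3.5% applies to the top-up amount, that the $0.30 is added when the top-up amount itself is below $60.18, and that the result is rounded half-up to the cent. Whether the fee is added on top of the top-up or deducted from the credits is not stated, so the code only computes the fee. `service_fee` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

FEE_RATE = Decimal("0.035")            # 3.5% service fee
FLAT_FEE = Decimal("0.30")             # extra flat fee on smaller top-ups
FLAT_FEE_THRESHOLD = Decimal("60.18")  # assumed to compare against the top-up amount
MINIMUM_TOPUP = Decimal("5")


def service_fee(amount: str) -> Decimal:
    """Estimate the service fee for a top-up, rounded to the cent (assumed rounding)."""
    value = Decimal(amount)
    if value < MINIMUM_TOPUP:
        raise ValueError("top-ups have a $5 minimum")
    fee = value * FEE_RATE
    if value < FLAT_FEE_THRESHOLD:
        fee += FLAT_FEE
    return fee.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for amount in ("10", "25", "100", "250"):
    print(f"${amount} top-up -> ${service_fee(amount)} fee")
```

`Decimal` is used instead of `float` so that cent-level rounding behaves predictably.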

Your data is safe. Your savings are real.

Privacy by design.

Prompts are logged for optimization and compliance only. We do not train models on customer data. We do not publish. We do not sell access. Log retention: 90 days, then deletion. No cold storage.

Secure infrastructure.

All traffic encrypted end-to-end. Servers managed on Fly.io with standard security hardening. SOC 2 audit in progress; report published when complete.

You own your data.

You own 100% of the prompts you send and the responses you get. Our optimization technique is our trade secret (protected by patent). Everything else is yours.

Read our full Privacy Policy · See our Terms of Service · Learn more about Unnma →

Built by Unnma LLC — an AI research firm focused on consciousness and optimization. Read more at unnma.ai

Ready to save?

No credit card required to start. Generate an API key in 60 seconds. Test on your first few queries free. See your savings in real time on the dashboard.

Generate an API key

Questions? Email hello@unnma.ai or read our FAQ.