Your LLM workflow — 60% faster and 15% cheaper.
Claude, ChatGPT, Kimi — the Unnma inference gateway reduces latency and token consumption on any LLM. One line of code. Real savings on every query.
Start saving in 60 seconds
Real prompts. Real savings.
Twenty-four prompts per vertical, run through our v4 benchmark on Claude Opus 4. Each card shows the actual model output, real latency, and the cost difference per call.
Across 24 coding prompts · mean 23% fewer tokens · 63% lower latency
Implement an LRU (Least Recently Used) cache in Python with O(1) get and put operations. The cache should have a fixed capacity. When the cache is full and a new key is inserted, evict the least recently used key. Include a brief docstring and type hints. Do not use external libraries.
Direct (Anthropic)
Here's an implementation of an LRU cache in Python with O(1) get and put operations:
```python
class Node:
    """Node for doubly linked list."""

    def __init__(self, key: int = 0, value: int = 0):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    """
    Least Recently Used (LRU) cache implementation with O(1) get and put operations.

    Uses a combination of a hash map and a doubly linked list:
    - Hash map provides O(1) access to cache items
    - Doubly linked list maintains items in order of use (most recent at head)

    Args:
        capacity: Maximum number of items the cache can hold
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node mapping
        # Create dummy head and tail nodes for easier list manipulation
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _add_to_head(self, node: Node) -> None:
        """Add node right after head."""
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def _remove_node(self, node: Node) -> None:
        """Remove node from the linked list."""
        prev_node = node.prev
        next_node = node.next
        prev_node.next = next_node
        next_node.prev = prev_node

    def _move_to_head(self, node: Node) -> None:
        """Move existing node to head (mark as recently used)."""
        self._remove_node(node)
        self._add_to_head(node)

    def get(self, key: int) -> int:
        """
        Get value for the given key.

        Args:
            key: The key to look up

        Returns:
            The value associated with the key, or -1 if key doesn't exist

        Time Complexity: O(1)
        """
        if key not in self.cache:
            return -1
        # Move the accessed node to head (mark as recently used)
        node = self.cache[key]
        self._move_to_head(node)
        return node.value

    def put(self, key: int, value: int) -> None:
        """
        Add or update a key-value pair in the cache.

        Args:
            key: The key to insert or update
            value: The value to associate with the key

        Time Complexity: O(1)
        """
        if key in self.cache:
            # Update existing key
            node = self.cache[key]
            node.value = value
            self._move_to_head(node)
        else:
            # Add new key
            node = Node(key, value)
            self.cache[key] = node
            self._add_to_head(node)
            # Check if we need to evict
            if len(self.cache) > self.capacity:
                # Remove the least recently used item (node before tail)
                lru_node = self.tail.prev
                self._remove_node(lru_node)
                del self.cache[lru_node.key]


# Example usage and testing
if __name__ == "__main__":
    # Test Case 1: Basic operations
    cache = LRUCache(2)
    cache.put(1, 1)
    cache.put(2, 2)
    print(cache.get(1))  # returns 1
    cache.put(3, 3)  # evicts key 2
    print(cache.get(2))  # returns -1 (not found)
    cache.put(4, 4)  # evicts key 1
    print(cache.get(1))  # returns -1 (not found)
    print(cache.get(3))  # returns 3
    print(cache.get(4))  # returns 4

    # Test Case 2: Update existing key
    cache2 = LRUCache(2)
    cache2.put(2, 1)
    cache2.put(2, 2)  # update key 2
    print(cache2.get(2))  # returns 2

    # Test Case 3: Access pattern
    cache3 = LRUCache(3)
    cache3.put(1, 1)
    cache3.put(2, 2)
    cache3.put(3, 3)
    cache3.get(1)  # access key 1, making it most recent
    cache3.put(4, 4)  # should evict key 2, not key 1
    print(cache3.get(2))  # returns -1
    print(cache3.get(1))  # returns 1
```
**Key Design Decisions:**
1. **Doubly Linked List**: Maintains the order of items by recency. The most recently used item is at the head, and the least recently used is at the tail.
2. **Hash Map**: Provides O(1) access to nodes in the linked list by their keys.
3. **Dummy Head and Tail Nodes**: Simplifies edge cases when adding/removing nodes.
**Time Complexity:**
- `get`: O(1) - HashMap lookup + linked list manipulation
- `put`: O(1) - HashMap insertion + linked list manipulation
**Space Complexity:**
- O(capacity) - Storage for at most `capacity` items
The implementation handles all edge cases including:
- Updating existing keys
- Evicting LRU items when capacity is exceeded
- Proper reordering when items are accessed
Via Unnma
```python
class LRUCache:
    """
    Least Recently Used (LRU) cache with O(1) get and put operations.

    Uses a doubly linked list to track usage order and a hashmap for O(1) access.
    When capacity is reached, the least recently used item is evicted.
    """

    class Node:
        def __init__(self, key: int = 0, value: int = 0):
            self.key = key
            self.value = value
            self.prev = None
            self.next = None

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node
        # Dummy head and tail for cleaner edge case handling
        self.head = self.Node()
        self.tail = self.Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _remove(self, node: Node) -> None:
        """Remove node from linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def _add_to_head(self, node: Node) -> None:
        """Add node right after head (most recently used position)."""
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key: int) -> int:
        """Get value for key. Returns -1 if key doesn't exist."""
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._remove(node)
        self._add_to_head(node)
        return node.value

    def put(self, key: int, value: int) -> None:
        """Set or update value for key. Evicts LRU item if at capacity."""
        if key in self.cache:
            node = self.cache[key]
            node.value = value
            self._remove(node)
            self._add_to_head(node)
        else:
            if len(self.cache) >= self.capacity:
                # Remove LRU item (right before tail)
                lru = self.tail.prev
                self._remove(lru)
                del self.cache[lru.key]
            node = self.Node(key, value)
            self.cache[key] = node
            self._add_to_head(node)
```
Methodology. v4 benchmark, claude-opus-4, 2 paired runs × 24 prompts per vertical (48 paired samples per condition). Per-prompt numbers are means across the 2 runs. Cost uses Anthropic Opus 4 list pricing ($15/M input, $75/M output, $1.50/M cached read). Some short-output prompts cost slightly more per call even with caching; we show those honestly rather than hide them.
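To make the cost figures easier to reproduce, here is a minimal sketch of how a per-call cost could be computed from token counts at the list prices quoted in the methodology note. The function names, the split between uncached and cached prompt tokens, and the sample token counts are assumptions for illustration, not the benchmark's actual code or data.

```python
# Anthropic Opus 4 list prices quoted in the methodology note (USD per million tokens).
PRICE_PER_M = {"input": 15.00, "output": 75.00, "cached_read": 1.50}


def call_cost(input_tokens: int, output_tokens: int, cached_read_tokens: int = 0) -> float:
    """Return the USD cost of one call.

    Assumption: input_tokens counts only uncached prompt tokens, and
    cached_read_tokens counts prompt tokens served from the cache.
    """
    total = (
        input_tokens * PRICE_PER_M["input"]
        + output_tokens * PRICE_PER_M["output"]
        + cached_read_tokens * PRICE_PER_M["cached_read"]
    )
    return total / 1_000_000


def mean_cost(runs: list[tuple[int, int, int]]) -> float:
    """Average per-call cost over repeated runs of the same prompt."""
    return sum(call_cost(*run) for run in runs) / len(runs)


# Hypothetical (input, output, cached_read) token counts for two runs of one prompt.
runs = [(500, 800, 2000), (500, 900, 2000)]
print(f"mean cost per call: ${mean_cost(runs):.5f}")
```

Because output tokens are priced five times higher than input tokens, shorter outputs dominate the saving on long answers, while on short answers the prompt-side cost matters more.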
Where it counts most, Unnma doesn’t time out.
On GPQA Diamond (graduate-level science reasoning), the baseline model timed out on 16.8% of calls. Standard prompt engineering cut that to 3.3%. With Unnma, the timeout rate was zero across all 594 calls.
Source: COP paper, GPQA Diamond benchmark, Qwen 3.5 27B, 198 problems × 3 replications. “Timeout” = timeouts + HTTP 500 errors. Effect size d=0.63, p=1.49e−25.
Real innovation. Rigorous research.
We built Unnma on academic research in inference optimization. Our framework, Cognitive Orientation Prompting, reduces token consumption, latency, and error rates. On hard reasoning and coding tasks, including the industry-standard GPQA Diamond benchmark, we have measured up to 75% cost reduction with zero quality loss.
The research: N=4,288 test cases across logic puzzles, code completion, and mathematical problem-solving. Effect size d≈2.4 (large, by academic standards). Tested across multiple model families and sizes. Preprint forthcoming on arXiv, May 2026.
On the same benchmark, Unnma solves each problem in roughly a quarter of the time of the baseline — and more consistently, with tighter confidence intervals.
Means with 95% confidence intervals. Significance: Mann-Whitney U (two-sided). Effect size: Cohen’s d.
The cost savings come from genuinely shorter outputs, not truncation. The full distribution shifts down and tightens — Unnma’s median response is roughly a seventh the length of the baseline’s, with far less run-to-run variance.
Violin width = kernel density estimate of completion tokens. Inner box: median (white line), Q1–Q3 (filled). Whiskers: 5th–95th percentile. SD annotation above each violin. Significance: Levene’s test for variance equality.
Drop in. No refactoring.
You are already using the OpenAI SDK (or Anthropic’s). Your code does not change. Just point it at our endpoint.
from openai import OpenAI

# Only change: base_url
client = OpenAI(
    api_key="your-unnma-api-key",
    base_url="https://api.unnma.ai/v1"
)

# Everything else is identical
response = client.chat.completions.create(
    model="anthropic/claude-opus-4-7",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)
print(response.choices[0].message.content)

That is it. Same request format. Same response format. The optimization happens transparently. Every query shows cost-before and cost-after on your dashboard.
Supports Python and JavaScript. Eight vendors integrated directly — Anthropic, OpenAI, Gemini, Together, Fireworks, Groq, xAI, DeepSeek — plus OpenRouter as a fallback for the long tail. Same vendor/model prefix everywhere.
No surprises. Transparent pricing.
We charge per query. You always know what you are paying.
Here is how it works: every call has a vendor cost of $X (Claude, GPT, Gemini, Llama, DeepSeek, whichever you route to). We optimize the prompt so the vendor returns fewer tokens, which lowers that cost. We charge the optimized vendor cost plus a 9% markup, and you keep the difference as savings. Coding workloads see ~24% token reduction; mixed conversational workloads see 5–15%.
Example
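The markup arithmetic can be sketched in a few lines. This is an illustrative calculation only: the function names and the $0.10 direct cost are hypothetical, and it assumes the 9% markup applies to the whole optimized vendor bill. Note that a token reduction is not the same as a cost reduction, since input tokens are billed too.

```python
MARKUP = 0.09  # uniform gateway markup stated in the pricing text


def billed_cost(optimized_vendor_cost: float) -> float:
    """What the gateway charges: the optimized vendor cost plus the markup."""
    return optimized_vendor_cost * (1 + MARKUP)


def net_saving_fraction(vendor_cost_reduction: float) -> float:
    """Fraction saved versus calling the vendor directly.

    vendor_cost_reduction is the fractional drop in the vendor bill after
    optimization (0.24 means the vendor bill is 24% lower).
    """
    return 1 - (1 - vendor_cost_reduction) * (1 + MARKUP)


# Below this reduction in vendor cost, the markup outweighs the saving.
BREAK_EVEN = 1 - 1 / (1 + MARKUP)

direct = 0.10  # hypothetical direct vendor cost per call, in USD
optimized = direct * (1 - 0.24)  # assume the vendor bill drops by 24%
print(f"billed: ${billed_cost(optimized):.4f}")
print(f"net saving: {net_saving_fraction(0.24):.1%}")
print(f"break-even reduction: {BREAK_EVEN:.1%}")
```

Under these assumptions a 24% lower vendor bill nets out to roughly a 17% saving after the markup, and the vendor bill has to fall by a little over 8% before a call becomes cheaper than going direct.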
The arbitrage pass-through: you specify the vendor with a `vendor/model` prefix on every request — `anthropic/claude-opus-4-7`, `openai/gpt-4o`, `together/llama-3.1-70b`, and so on. We route directly to that vendor at their baseline rate; the 9% markup is uniform across all vendors. Want the cheapest Llama? Pin `together/` or `groq/` yourself. We never silently pick a vendor for you.
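Since every request must carry a `vendor/model` identifier, a small client-side check can catch a missing prefix before a request is sent. The helper below is a hypothetical convenience, not part of any Unnma SDK, and the prefix set contains only the spellings shown in the text above; the remaining vendors' prefixes are not given here and would need confirming.

```python
# Prefixes that appear in the text above. The page names eight vendors but
# only shows these spellings, so extend the set once the others are confirmed.
KNOWN_PREFIXES = {"anthropic", "openai", "together", "groq"}


def split_model_id(model_id: str) -> tuple[str, str]:
    """Split "vendor/model" into (vendor, model); raise ValueError if malformed."""
    # partition splits on the first "/" only, so model names may contain slashes.
    vendor, sep, model = model_id.partition("/")
    if not sep or not vendor or not model:
        raise ValueError(f"expected 'vendor/model', got {model_id!r}")
    return vendor, model


def has_known_prefix(model_id: str) -> bool:
    """True when the vendor prefix is one of the spellings listed above."""
    vendor, _ = split_model_id(model_id)
    return vendor in KNOWN_PREFIXES


for model_id in ("anthropic/claude-opus-4-7", "openai/gpt-4o", "together/llama-3.1-70b"):
    vendor, model = split_model_id(model_id)
    print(f"{vendor:<10} {model}")
```

Passing a bare name such as `gpt-4o` raises `ValueError`, which matches the rule that the vendor is always pinned explicitly.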
No subscription. No commitments. Add credits any time — $5 minimum, with chips at $10, $25, $100, $250 (or any custom amount). Credits last until you use them. A service fee of 3.5% (plus $0.30 below $60.18) covers payment processing; the same fee applies across cards, ACH, and wallets. Optional auto-topup when your balance gets low.
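As a rough illustration of the service fee, the sketch below computes it for the preset top-up amounts. It assumes the $60.18 threshold is compared against the credit amount, that the fee is charged on top of the credits, and that fees round half-up to the cent; none of these details are spelled out above, and the names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

FEE_RATE = Decimal("0.035")              # 3.5% service fee
FIXED_FEE = Decimal("0.30")              # added only below the threshold
FIXED_FEE_THRESHOLD = Decimal("60.18")
MINIMUM_TOPUP = Decimal("5")


def service_fee(credit_amount: Decimal) -> Decimal:
    """Return the service fee in USD for a credit top-up (assumed rules)."""
    if credit_amount < MINIMUM_TOPUP:
        raise ValueError("top-ups start at $5")
    fee = credit_amount * FEE_RATE
    if credit_amount < FIXED_FEE_THRESHOLD:
        fee += FIXED_FEE
    # Assumption: round to the nearest cent, halves rounding up.
    return fee.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for amount in ("5", "10", "25", "100", "250"):
    credits = Decimal(amount)
    fee = service_fee(credits)
    print(f"${credits} credits -> fee ${fee}, total ${credits + fee}")
```

`Decimal` is used instead of `float` so that cent-level rounding behaves predictably.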
Your data is safe. Your savings are real.
Privacy by design.
Prompts are logged for optimization and compliance only. We do not train models on customer data. We do not publish. We do not sell access. Log retention: 90 days, then deletion. No cold storage.
Secure infrastructure.
All traffic encrypted end-to-end. Servers managed on Fly.io with standard security hardening. SOC 2 audit in progress; report published when complete.
You own your data.
You own 100% of the prompts you send and the responses you get. Our optimization technique is our trade secret (protected by patent). Everything else is yours.
Built by Unnma LLC — an AI research firm focused on consciousness and optimization. Read more at unnma.ai
Ready to save?
No credit card required to start. Generate an API key in 60 seconds. Test on your first few queries free. See your savings in real time on the dashboard.
Generate an API key
Questions? Email hello@unnma.ai or read our FAQ.
© 2026 Unnma. All rights reserved.