
OUTERPORT ENGINE

Instant Infrastructure for AI

Scale out your AI infrastructure with zero-downtime updates, hot-swappable models, and intelligent caching. Built in Rust for performance and reliability.

How It Works

1. Load model weights from the cache daemon
import outerport

# Load model weights with automatic caching and device mapping
tensors = outerport.load_model("llama-3.1-8b-instruct.safetensors")
One-line change to your codebase
2. Hot-swap between models using the cache
# Load second model from disk, with automatic cache eviction
tensors_2 = outerport.load_model("gemma-9b.safetensors")

# (Optional) Explicitly offload the first model's weights to RAM
cache_id = outerport.offload_to_ram(tensors)

# Instantly swap back to the first model from RAM
tensors_1 = outerport.load_from_ram(cache_id)
Model switching in seconds with intelligent caching

System Architecture


Distributed Cache Layer

Intelligent caching system that manages model weights across your infrastructure, enabling instant model swapping and optimal resource utilization.
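
To make the caching behavior concrete, here is a minimal sketch that uses only the load_model call from the example above. The file name is the same placeholder, and the idea that a repeat load of the same weights is served from the cache daemon rather than from disk is an assumption about typical behavior, not a measured result.

import time
import outerport

t0 = time.perf_counter()
# Cold load: weights are read from disk and registered with the cache daemon
tensors = outerport.load_model("llama-3.1-8b-instruct.safetensors")
t1 = time.perf_counter()

# Warm load (assumption): the same weights are now served from the cache, not from disk
tensors = outerport.load_model("llama-3.1-8b-instruct.safetensors")
t2 = time.perf_counter()

print(f"cold load: {t1 - t0:.2f}s, warm load: {t2 - t1:.2f}s")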

Hot-Swap Engine

Zero-downtime model updates and seamless switching between different models without service interruption.

Framework Integration

Native support for popular ML frameworks and serving solutions, with built-in optimizations for PyTorch and CUDA.
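
A minimal sketch of what the PyTorch side of that integration looks like in application code. The module and file name below are placeholders, and it assumes load_model returns a state_dict-style mapping of tensor names to torch.Tensor.

import torch
import torch.nn as nn
import outerport

# Placeholder module; in practice this is your own model architecture
model = nn.Linear(4096, 4096, bias=False)

# Assumption: load_model returns {name: torch.Tensor}, so it can feed load_state_dict directly
model.load_state_dict(outerport.load_model("linear-4096.safetensors"))  # placeholder file
model = model.to("cuda").eval()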

Works With Your Stack

Seamlessly integrate with popular ML frameworks and serving solutions

ComfyUI: Stable Diffusion, FLUX
vLLM: LLM Serving
SGLang: LLM Framework
TGI: HuggingFace Inference

Framework Compatibility

Ready to go with torch.compile

Seamlessly works with PyTorch's latest compilation features.
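
For example, a model whose weights come through Outerport compiles like any other module. This is a sketch with a placeholder module and file name, assuming load_model returns a state_dict-style mapping.

import torch
import torch.nn as nn
import outerport

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
model.load_state_dict(outerport.load_model("mlp-4096.safetensors"))  # placeholder file

compiled = torch.compile(model)  # standard torch.compile; no Outerport-specific changes
with torch.no_grad():
    out = compiled(torch.randn(1, 4096, device="cuda"))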

CUDA Graph Compatible

Keep your existing CUDA Graph optimizations while adding our intelligent caching layer on top.
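
A sketch of what that looks like in practice, with a placeholder module and file name and the assumption that weights loaded through Outerport are ordinary CUDA tensors: the standard PyTorch capture-and-replay loop is left unchanged.

import torch
import outerport

model = torch.nn.Linear(4096, 4096, bias=False).cuda()
model.load_state_dict(outerport.load_model("linear-4096.safetensors"))  # placeholder file

static_input = torch.randn(8, 4096, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a CUDA graph, then replay it with new data
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

static_input.copy_(torch.randn(8, 4096, device="cuda"))
g.replay()  # static_output now holds the result for the new input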

Key Features

Intelligent caching layer

Distributed caching system optimized for AI models, tensors, and KV caches. Automatically manages model weights across your infrastructure.

Instant model swapping

Switch between models in seconds with our intelligent caching layer. Perfect for LoRAs, workflows, AI agents, and multi-model applications.
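
Concretely, the pattern from the How It Works section extends to a simple two-model loop. The file names are the same placeholders, and offload_to_ram / load_from_ram are used exactly as shown above.

import outerport

# Load the first model, then park its weights in RAM to free the GPU cache
llama = outerport.load_model("llama-3.1-8b-instruct.safetensors")
llama_id = outerport.offload_to_ram(llama)

# The second model now occupies the GPU cache
gemma = outerport.load_model("gemma-9b.safetensors")
# ... serve requests against gemma ...

# Swap the first model back in from RAM in seconds
llama = outerport.load_from_ram(llama_id)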

Distributed inference

Scale horizontally across GPU clusters while maintaining consistent performance.

Zero-downtime updates

Update models and configurations without service interruption. Rolling updates and automatic version management.

Infrastructure agnostic

Deploy on AWS, GCP, Azure, or on-premises. Full control over your infrastructure with our cloud-native architecture.

Cost optimization

Intelligent resource allocation and caching reduces GPU costs by up to 40%.

Get access immediately.

Trusted by financial services firms and leading research institutions. Built by a team with experience building AI and GPU systems at NVIDIA, Meta, and LinkedIn.

Contact us at: info@outerport.com