How fast can you load a FLUX (LoRA) model?

Outerport can load FLUX in 2.2 seconds and a 2GB LoRA in 0.3 seconds.

October 31, 2024

FLUX.1 from Black Forest Labs is an amazing image generation model, but at a 23GB file size (for [dev]), just loading the model into the GPU can be a big hassle: on our AWS EC2 machine, it can take close to an entire minute.

This problem is sometimes referred to as the cold start problem. We at Outerport set out to solve cold starts because waiting an entire minute to try out a new model, or to serve infrequent traffic on internal image generation tools, is annoying.

In summary: Outerport can load FLUX models in 2.2 seconds and 2GB LoRAs in 0.3 seconds.

How does Outerport work?

From the user's perspective, it's a single line of code:

outerport.load("flux1-dev.safetensors")

Under the hood, this function call communicates with the Outerport daemon, a tensor memory manager written in Rust for performance. We optimize every step of the process, so even mundane operations like reading from storage and CPU-GPU memory transfers run 2-4x faster than their naive PyTorch counterparts.
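To make the comparison concrete, here is a minimal timing sketch. It assumes the outerport package exposes the load() call shown above and that the returned tensors land on the GPU; the file path is illustrative, and the naive baseline is safetensors plus per-tensor copies:

import time

import torch
from safetensors.torch import load_file

import outerport

PATH = "flux1-dev.safetensors"  # illustrative path

# Naive baseline: read from disk into CPU memory, then copy tensor by tensor.
t0 = time.perf_counter()
state = load_file(PATH, device="cpu")
state = {k: v.to("cuda") for k, v in state.items()}
torch.cuda.synchronize()
print(f"naive load: {time.perf_counter() - t0:.2f}s")

# Outerport: the daemon serves the weights (assumed to arrive GPU-resident).
t0 = time.perf_counter()
state = outerport.load(PATH)
torch.cuda.synchronize()
print(f"outerport load: {time.perf_counter() - t0:.2f}s")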

The best thing about this is that it's persistent: across different processes, the model weights stay cached in CPU memory, ready to be loaded into the GPU immediately.

When the models are cached, this can load a FLUX model in 2.2 seconds.

Even when the weights aren't cached, on my machine (with a very fast M.2 SSD) Outerport can load them from storage into the GPU in 8.6 seconds.
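One standard ingredient behind fast CPU-to-GPU copies is page-locked (pinned) host memory, which lets the driver DMA directly instead of going through an extra staging copy. The snippet below is a general PyTorch illustration of that effect, not a description of Outerport's internals:

import time

import torch

x_pageable = torch.empty(1024, 1024, 256)  # ~1 GiB of float32
x_pinned = x_pageable.pin_memory()         # page-locked copy of the same data

for name, src in [("pageable", x_pageable), ("pinned", x_pinned)]:
    torch.cuda.synchronize()  # also initializes the CUDA context on first use
    t0 = time.perf_counter()
    dst = src.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    print(f"{name} transfer: {time.perf_counter() - t0:.3f}s")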

Why does this matter?

We think that this is exciting because it will unlock more creativity by allowing for more complicated AI workflows.

We have already seen creative technologists do amazing things with AI workflows using tools like ComfyUI, where multiple diffusion models are chained together in a single pipeline.

In these pipelines, the more models you can chain, the more creative the applications you can build. Unfortunately, slow weight loading can make complicated chains impractical to run with a reasonable user experience. We think our software will let artists build more complicated chains and help increase iteration speed.

What about LoRAs?

We can of course speed up LoRA swapping as well!

  • For a 328 MB LoRA, the naive baseline loads in 0.93 seconds. Outerport loads in 0.08 seconds.

  • For a 642 MB LoRA, the naive baseline loads in 1.90 seconds. Outerport loads in 0.13 seconds.

  • For a 1.2 GB LoRA, the naive baseline loads in 3.16 seconds. Outerport loads in 0.16 seconds.
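For context, applying a LoRA typically means adding a low-rank update to each target weight: W' = W + scale * (B @ A). Below is a minimal merge sketch assuming kohya-style key names (lora_down/lora_up) that mirror the base checkpoint; this is the standard merge math, not necessarily how Outerport applies LoRAs internally:

import torch

def merge_lora(base: dict, lora: dict, scale: float = 1.0) -> dict:
    """Return a copy of `base` with W' = W + scale * (B @ A) for each LoRA pair."""
    merged = dict(base)
    for key, down in lora.items():  # "down" factor A, shape (rank, in_features)
        if not key.endswith(".lora_down.weight"):
            continue
        up = lora[key.replace("lora_down", "lora_up")]  # "up" factor B, shape (out_features, rank)
        target = key.replace(".lora_down.weight", ".weight")  # assumed key mapping
        merged[target] = base[target] + scale * (up @ down)
    return merged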

Our technology is actually agnostic to the model type, so we can load any model file. In fact, it doesn't even have to be a model file: Outerport can serve as a general tensor caching layer optimized for CPU-GPU transfers.
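For example, the same call could cache a file of precomputed embeddings rather than model weights (file name and contents hypothetical):

import outerport

# Any safetensors file works, not just checkpoints; this one is hypothetical.
embeddings = outerport.load("image_embeddings.safetensors")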

What's the secret? How do we get in on this?

If you're interested in using this technology, please reach out! We can also help with benchmarking your current infrastructure or provide hands-on services to help make your image generation pipelines faster.

If you want to know all the nitty-gritty details of how we made it so fast (even compared to alternatives like memory file systems), also feel free to reach out via the form or email.

Follow us on X or on LinkedIn to stay in touch with more to come!


© 2024 Genban, Inc.