I know firsthand how challenging it can be for small startups to break into the world of large diffusion models. The high cost of hardware can make it feel like an insurmountable barrier. I've been there, feeling overwhelmed and unsure of how to proceed. That's why I want to share our experience—not to boast, but to help others who might be in the same situation.
We needed to train large diffusion models but lacked the resources for expensive hardware setups. Instead of letting this halt our progress, we sought out creative solutions to work within our means.
1. Memory Optimization Techniques
ZeRO Stage 2 and Stage 3: We utilized the Zero Redundancy Optimizer (ZeRO) to significantly reduce memory usage. Stage 2 allowed us to partition optimizer states and gradients across GPUs, while Stage 3 took it a step further by partitioning the model parameters themselves. Yes, this led to larger checkpoints—sometimes up to 150GB—but the memory savings were invaluable.
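To make this concrete, here is a minimal sketch of enabling ZeRO Stage 2 through a DeepSpeed config dict. The tiny model, batch size, and learning rate are placeholders rather than our actual training setup, and the script assumes it is launched with the `deepspeed` launcher:

```python
# Minimal sketch: ZeRO Stage 2 via DeepSpeed (launch with: deepspeed train.py)
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for a real diffusion model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "overlap_comm": True,          # overlap communication with the backward pass
        "contiguous_gradients": True,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4},
    },
}

# deepspeed.initialize wraps the model and builds the partitioned optimizer
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Switching `"stage"` to 3 additionally partitions the model parameters themselves, which is where the very large checkpoints come from.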
2. Precision Formats
bfloat16 and FP8: Precision formats play a crucial role in performance and memory consumption. While FP8 offers benefits on advanced GPUs like NVIDIA's H100 and L40 series, we found bfloat16 to be a widely supported alternative that effectively managed memory without compromising performance.
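As an illustration, here is a minimal bfloat16 training step using PyTorch's autocast; the small linear model stands in for a real diffusion network:

```python
# Minimal sketch of bfloat16 mixed-precision training with torch.autocast
import torch

model = torch.nn.Linear(512, 512).cuda()   # stand-in for a diffusion UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 512, device="cuda")
target = torch.randn(8, 512, device="cuda")

# bfloat16 keeps float32's exponent range, so no GradScaler is needed
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```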
3. Efficient Optimizers
8-bit Optimizers: Switching to 8-bit optimizers, such as the 8-bit Adam optimizer, was a game-changer. By quantizing momentum and variance terms to 8-bit precision, we reduced memory requirements without sacrificing model convergence or accuracy.
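One widely used implementation of 8-bit Adam is in the bitsandbytes library; this sketch assumes that package, with a placeholder model and hyperparameters:

```python
# Minimal sketch using bitsandbytes' 8-bit Adam
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # stand-in for a real model

# Momentum and variance states are stored in 8 bits, cutting optimizer
# memory roughly 4x compared to 32-bit Adam
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

loss = model(torch.randn(4, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```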
4. Resource Sharing and Collaboration
Community Partnerships: We reached out to other researchers and institutions. These collaborations not only provided access to shared resources but also fostered a supportive network for problem-solving.
Cloud Credits and Grants: We explored programs offering computational resources or funding assistance. These opportunities can be a lifeline for startups needing extra computational power. For example, a Mercury bank account in the US comes with $5,000 in AWS credits, and banks such as SVB offer similar programs.
5. Incremental Experimentation
Smarter Experiment Design: Instead of running countless full-scale experiments, we tested ideas on smaller subsets of data. This approach saved time and resources, allowing us to refine our models efficiently.
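As a rough illustration of this workflow, here is how one might carve out a small pilot subset with PyTorch before committing to a full-scale run; `full_dataset` is a hypothetical stand-in for a real training set:

```python
# Minimal sketch: run quick pilot experiments on a 5% random slice of the data
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

full_dataset = TensorDataset(torch.randn(100_000, 64))  # stand-in dataset

# Fixed seed so pilot runs are comparable across experiments
g = torch.Generator().manual_seed(0)
pilot_size = len(full_dataset) // 20
pilot_indices = torch.randperm(len(full_dataset), generator=g)[:pilot_size]

pilot_set = Subset(full_dataset, pilot_indices.tolist())
pilot_loader = DataLoader(pilot_set, batch_size=64, shuffle=True)
```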
6. Group Query Attention (GQA)
Implementing GQA let us share each key/value head across several query heads, shrinking the attention layers' key/value projections and cache and cutting both computational complexity and memory usage.
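To make the idea concrete, here is a minimal, self-contained GQA sketch (not our production code); all dimensions are illustrative:

```python
# Minimal sketch of grouped-query attention (GQA): several query heads share
# each key/value head, shrinking the KV projections and the KV cache
import torch
import torch.nn.functional as F

batch, seq, d_model = 2, 128, 512
n_q_heads, n_kv_heads = 8, 2           # 4 query heads per KV head
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq, d_model)
q_proj = torch.nn.Linear(d_model, n_q_heads * head_dim)
kv_proj = torch.nn.Linear(d_model, 2 * n_kv_heads * head_dim)  # smaller than full MHA

q = q_proj(x).view(batch, seq, n_q_heads, head_dim).transpose(1, 2)
k, v = kv_proj(x).view(batch, seq, 2, n_kv_heads, head_dim).permute(2, 0, 3, 1, 4)

# Repeat each KV head so every group of query heads attends to its shared KV head
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v)        # (batch, heads, seq, head_dim)
out = out.transpose(1, 2).reshape(batch, seq, d_model)
```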
We also turned to platforms like Vast.ai for more affordable hardware options. While this came with trade-offs—such as occasional hardware issues with GPUs—the cost benefits were significant. For example, running 8x H100s costs about $20/hr, whereas 2x L40s are just $2/hr.
We're thrilled about the recent release of Stable Diffusion 3.5 Medium weights. This advancement allows us to push our models further without incurring additional costs.
I remember what it felt like to face these challenges without a clear path forward. My hope is that by sharing our journey, I can help others navigate similar obstacles. You're not alone, and with a bit of creativity and resourcefulness, it's possible to achieve great things even on a tight budget.
If you're facing similar challenges or have questions about the strategies we've used, please don't hesitate to reach out. Let's support each other in pushing the boundaries of what's possible.