Scalable AI Infrastructure for SES AI – Enabling 128-GPU Deep Learning Workflows in the Cloud

SES

The solution significantly improved the efficiency and reliability of the Data Science workflows.

Meet our client

Client:

SES

Industry:

Manufacturing, Software & Technology

Market:

US

Technology:

MLOps

In a Nutshell

Client’s Challenge

A leading battery manufacturer needed a scalable, high-performance AI infrastructure to support 10–15 data scientists working on LLMs and material simulations, ensuring efficient deep learning and resource orchestration.

Our Solution

We designed, deployed, integrated, and tested a robust cluster architecture on cloud infrastructure. The solution included MLOps tooling and a resource orchestration system, ensuring smooth and efficient operations for the client’s Data Science team.

Client’s Benefits

The new scalable AI infrastructure improved resource efficiency, reduced latency, and enabled smoother collaboration. Optimized orchestration maximized GPU utilization, helping the company complete a 12-month research project with optimal performance, accelerating battery technology innovation.

A Deep Dive

1. Overview

We partnered with Crusoe Cloud to design and deploy a scalable, MLOps-ready infrastructure for SES AI, a global leader in Li-Metal battery innovation. The goal was to deliver a stable, cloud-native platform to support deep learning R&D, including LLMs and physical simulations. Our solution enabled seamless orchestration of distributed training across 100+ GPUs and provided robust tooling for a team of 10–15 data scientists with limited DevOps experience. The infrastructure also improved collaboration, optimized resource utilization, and ensured reproducibility across the organization’s ML workflows.

Key Outcomes

  • Deployed two high-performance GPU clusters with full MLOps integration.
  • Improved throughput and reduced training and job-processing times across all AI workloads.
  • Enabled SES AI to complete a year-long research initiative with optimal resource usage and minimal disruptions.
  • Empowered teams to manage complex training jobs without relying on infrastructure expertise.

2. Client

SES AI is a global innovator in advanced Li-Metal battery technology, accelerating the future of electric transportation across land and air.

  • Industry: Manufacturing 
  • Market Value: Publicly traded (NYSE: SES), with global operations in the US, China, and South Korea
  • Achievements:
    • First battery company to integrate AI across R&D at scale
    • Uses AI for material simulation, optimization, and product design
    • Key partner in pushing Li-Metal battery innovation forward through intelligent infrastructure

3. Challenge

Business Challenge

SES AI needed a scalable, cloud-native platform to support its growing team of data scientists building cutting-edge models for battery research and material simulation. The internal team lacked MLOps and infrastructure engineering resources, which created bottlenecks in model development and experimentation.

Technology Challenge

  • The existing setup was insufficient for scaling complex LLMs and DL workloads.
  • SES required efficient utilization of 128 H100 GPUs, shared across multi-node jobs.
  • Lack of orchestration and monitoring tooling hindered visibility and control.
  • The team needed hands-on help with cluster setup, infrastructure design, and ongoing support.
  • Workloads also included non-ML physical simulations, requiring flexible scheduling and resource sharing.

4. Solution

Our Approach

We designed and deployed a cloud-native ML infrastructure that was not only performant but also easy to manage by non-DevOps teams. Our solution was built for scalability, observability, and reliability.

What We Delivered

  • Two custom SLURM clusters with advanced GPU configurations:
    • Cluster A: 16 nodes × 8 H100 GPUs (128 H100s total)
    • Cluster B: 8 nodes × 8 A100 GPUs (64 A100s of additional capacity)
  • Storage Setup:
    • 50 TB shared disk, 10 TB NFS per cluster, node-local workspaces
  • Full MLOps Tooling Stack:
    • Multi-node, multi-GPU job scheduling with SLURM
    • Monitoring dashboards for resource tracking and job queue visibility
    • Weights & Biases integration for experiment tracking
  • Infrastructure Management:
    • Terraform for infrastructure as code
    • Ansible for cluster and tool configuration
    • Over 30 pages of internal documentation for admins
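A multi-node, multi-GPU SLURM job of the kind delivered above is submitted via an sbatch script. The sketch below generates one in Python; the job name, time limit, and training command are illustrative placeholders, not SES AI's actual configuration:

```python
def make_sbatch(job_name: str, nodes: int, gpus_per_node: int, command: str) -> str:
    """Render a minimal sbatch script for a multi-node, multi-GPU job.

    Directive values here are hypothetical examples of a typical setup.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",       # GPUs requested per node
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU
        "#SBATCH --time=24:00:00",
        "",
        f"srun {command}",  # srun launches the command on every allocated task
    ])

# Example: a job spanning all 16 nodes of Cluster A (16 × 8 = 128 H100s)
script = make_sbatch("llm-pretrain", nodes=16, gpus_per_node=8,
                     command="python train.py")
print(script)
```

Submitting such a script with `sbatch` hands scheduling to SLURM, which queues the job until 16 nodes with 8 free GPUs each are available.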

Technologies Used

  • Cloud Provider: Crusoe Cloud
  • Cluster Orchestration: SLURM, Kubernetes (admin support)
  • Infrastructure Management: Terraform, Ansible
  • Monitoring & Tracking: Slurm-web, Weights & Biases
  • Languages & Tools: Python, CUDA, C++, Jupyter Notebooks
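Distributed training launchers built on this stack commonly derive each process's place in the job from environment variables that SLURM sets for every `srun` task. A stdlib-only sketch (the variable names are standard SLURM; the helper itself is illustrative):

```python
import os

def slurm_rank_info(env=os.environ):
    """Derive the distributed-training layout from SLURM's environment.

    SLURM sets these variables for each task launched via srun; the
    defaults fall back to a single-process run for local testing.
    """
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),        # global rank of this task
        "world_size": int(env.get("SLURM_NTASKS", 1)),  # total number of tasks
        "local_rank": int(env.get("SLURM_LOCALID", 0)), # rank within this node
    }

# Simulated environment for task 9 of a 128-task (16 × 8 GPU) job
info = slurm_rank_info({"SLURM_PROCID": "9", "SLURM_NTASKS": "128",
                        "SLURM_LOCALID": "1"})
print(info)  # {'rank': 9, 'world_size': 128, 'local_rank': 1}
```

A framework-specific launcher would feed these values into its distributed-initialization call.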

Unique Aspects

  • Delivered turnkey MLOps infrastructure for a non-DevOps team.
  • Enabled parallel model training at scale with highly efficient GPU utilization.
  • Built a hybrid workflow that supports both ML and non-ML workloads.
  • Ensured reproducibility and traceability of experiments via automation and tracking.
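Reproducibility of the kind described above typically rests on capturing each run's full configuration in a canonical form. A minimal stdlib-only sketch (the config fields are hypothetical, not the project's actual schema) that fingerprints a configuration so logically identical runs map to the same ID:

```python
import hashlib
import json

def run_fingerprint(config: dict) -> str:
    """Return a short, stable ID for an experiment configuration.

    Keys are sorted and whitespace is fixed, so identical configs hash
    identically regardless of insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Same config, different key order -> same fingerprint
a = run_fingerprint({"lr": 3e-4, "batch_size": 512, "model": "llm-base"})
b = run_fingerprint({"model": "llm-base", "batch_size": 512, "lr": 3e-4})
assert a == b
```

An experiment tracker such as Weights & Biases can then tag each run with this ID, making it trivial to group and compare repeats of the same setup.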

5. Process

Step-by-Step Delivery

  1. Requirements Gathering:
    Collaborated closely with SES AI and Crusoe Cloud to assess team structure, workloads, and infrastructure gaps.
  2. Design & Planning:
    Designed a flexible cluster architecture to support AI experimentation, physical simulations, and workload sharing.
  3. Infrastructure Deployment:
    • Launched SLURM clusters on Crusoe
    • Configured storage, networking, GPU drivers, containerization
  4. MLOps Integration:
    • Enabled job scheduling, experiment tracking, monitoring dashboards
    • Ensured smooth workflows for 20+ data scientists
  5. Knowledge Transfer & Support:
    • Created detailed documentation
    • Ongoing support to SES AI’s internal tech team
    • Continuous cluster monitoring and troubleshooting
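The continuous queue monitoring in step 5 can be scripted on top of `squeue`'s machine-parseable output. A sketch, assuming the pipe-delimited format string shown in the docstring (the sample jobs are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    name: str
    state: str
    nodes: int

def parse_squeue(output: str) -> list[Job]:
    """Parse lines produced by `squeue -h -o "%i|%j|%T|%D"`.

    %i = job id, %j = job name, %T = job state, %D = node count.
    """
    jobs = []
    for line in output.strip().splitlines():
        job_id, name, state, nodes = line.split("|")
        jobs.append(Job(job_id, name, state, int(nodes)))
    return jobs

# Hypothetical queue: one running training job, one pending simulation
sample = "1042|llm-pretrain|RUNNING|16\n1043|material-sim|PENDING|4"
queue = parse_squeue(sample)
print([j.state for j in queue])  # ['RUNNING', 'PENDING']
```

Feeding such parsed records into a dashboard gives admins the queue visibility the clusters' mixed ML and simulation workloads require.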

Team Involved

  • Tech Lead: Architecture design, Kubernetes & SLURM deployment, infrastructure design
  • Engineers: Cluster provisioning, storage setup, GPU job orchestration
  • MLOps Specialist: Workflow design, tooling integration, reproducibility engineering
  • Project Manager: Client coordination, delivery planning, SES–Crusoe partnership management

6. Outcome

Quantitative Results

  • 128 H100 GPUs orchestrated for large-scale deep learning
  • Around 20 researchers actively using the platform
  • Up to 12 months of research throughput optimized via job scheduling and parallelization
  • System latency significantly reduced, enabling faster iteration cycles

Qualitative Results

  • Enabled consistent model development without infrastructure bottlenecks
  • Empowered non-infra teams with self-service AI tooling
  • Increased collaboration and reproducibility through shared infrastructure and tooling
  • Created a platform for scalable experimentation, including LLM fine-tuning and advanced simulations

Lessons Learned

  • Turnkey infrastructure delivery is crucial when working with AI-first product teams that lack in-house DevOps
  • MLOps tooling should be tailored to the user profile—here, a hands-on data science team using notebooks
  • Resource scheduling and tracking dashboards are key to maintaining performance under heavy load

7. Summary

Final Thoughts

This project demonstrates the critical role of infrastructure in scaling AI innovation. By delivering a robust, cloud-native MLOps platform, we enabled SES AI to push forward its ambitious research in Li-Metal battery technology—on time, at scale, and without infrastructure-related disruptions.

Testimonial

“Crusoe Cloud has been instrumental in meeting SES AI’s need for scalable and stable infrastructure, providing immense computing power (128 NVIDIA H100 GPUs). This project exemplifies how our partnership drives tangible benefits for clients.”
— Mateusz Kwaśniak, Senior Technical Leader at deepsense.ai