Inductiva joined the 1st OpenFOAM HPC Challenge to test how cloud infrastructure stacks up against traditional HPC for large-scale CFD simulations. Running the DrivAer automotive benchmark, the team explored multiple hardware setups, hyperthreading choices, and domain decomposition strategies. The results? Inductiva’s flexible MPI clusters handled up to 768 partitions with impressive price-performance—even outperforming pricier hardware in some cases. For simulations below massive supercomputer scales, cloud HPC proves not only competitive but cost-effective, offering engineers and researchers agility without sacrificing speed. Curious how to fine-tune your OpenFOAM workloads in the cloud? Dive into the benchmarks and see what’s possible.
The 1st OpenFOAM HPC Challenge (OHC-1) is a community-driven event aimed at benchmarking OpenFOAM’s computational performance on a relevant industrial case, across different hardware configurations and software variants. The intent is to obtain a diverse set of benchmark data to drive further optimizations in OpenFOAM’s performance.
The Hardware Track is designed to benchmark how OpenFOAM v2412 performs on different high-performance computing (HPC) platforms. It focuses purely on the performance without code modifications – highlighting the impact of system architecture, interconnects, memory, and compute resources. That is, participants must not change any physics or numerics: only platform-specific tuning is allowed.
At Inductiva, we see the Hardware Track as a perfect opportunity to showcase the impact and trade-offs of the various compute options we offer, using a realistic, industry-relevant use case. In particular, we’re often asked questions like:
- Can Inductiva handle large-scale jobs, especially those that require distributing the workload across multiple compute nodes? How efficient is it to launch and run multi-node MPI clusters on Inductiva?
- How do different generations of cloud machines compare in terms of performance and cost? Newer machines are typically faster, but they come with a premium price tag. Is the performance gain worth the extra cost? Or is it more economical to use older, less expensive machines, even if that means waiting a bit longer for results?
- Should hyperthreading be disabled? This is a classic question in HPC. It’s well known that hyperthreading can degrade HPC performance, but by how much? And since vCPUs cost half as much as full cores, what’s the smarter choice when time constraints are flexible?
- Should we use all available cores or vCPUs on each node, or is there value in leaving some threads idle? Inductiva’s VMs, by default, reserve a few threads for OS-related tasks. But what happens if we go further and intentionally leave a few additional cores unused—say, running a 40-partition job on a 48-vCPU machine?
Exploring these questions provides valuable insights – not only for our users, but also for our own development efforts, helping us deliver the best possible default configurations.
TL;DR: Check the summary table in the conclusion section.
The Team
We partnered with our friends from the Department of Polymer Engineering at the University of Minho, who are experts in CFD, OpenFOAM, and HPC, and ran a series of tests together using the Inductiva API. Their expertise was invaluable in designing and executing the experiments, from selecting the DrivAer benchmark case to defining the domain decomposition setups and analyzing the performance results.
This collaboration allowed us to explore how different hardware configurations, hyperthreading settings, and cluster sizes impact OpenFOAM’s performance on Inductiva’s cloud infrastructure. It also helped us validate how well our platform can handle industry-scale simulations and answer practical questions about balancing cost and speed for large CFD workloads.
Working closely with the team from the Department of Polymer Engineering provided a unique blend of academic insight and hands-on HPC experience, ensuring our benchmarks reflect both technical rigor and real-world relevance.
The Challenge
The use case chosen for this challenge is the “Open-closed cooling DrivAer variant with Static Mesh”, a widely used academic-industrial geometry that represents a realistic, simplified vehicle, available in the OpenFOAM HPC Benchmark Suite. Closed‑cooling implies all cooling openings (e.g., grille, radiator intake) are sealed off – no internal engine compartment flow. Static mesh means the car, ground, and wheels remain motionless. This setup isolates and studies the external aerodynamics under controlled “free‑air” conditions. The main aim is to produce CFD results that can be reliably correlated with high‑quality wind tunnel data.
The case comes with meshes at three different resolutions: high (236M cells), mid (110M cells), and low (65M cells). In all the experiments shown below, we used the low-resolution mesh.
A Look at Domain Decomposition
OpenFOAM runs in parallel by dividing the computational domain into smaller subdomains using a method called domain decomposition. Each subdomain, along with its associated field data, is assigned to a different processor, with communication between processors being achieved using the MPI (Message Passing Interface) standard.
The actual decomposition is done using the decomposePar utility, and is controlled by settings in the decomposeParDict file, located in the system directory of the case. OpenFOAM offers several methods for domain decomposition, namely simple, hierarchical, scotch, and ptscotch, each suited to different types of simulations and geometries.
The simple method performs a straightforward geometric split of the domain along specified coordinate directions, while scotch and ptscotch use graph-based algorithms to automatically optimize load balancing and minimize inter-processor communication, making them well-suited for irregular geometries or heterogeneous compute environments.
However, for the Hardware Track, participants were instructed to use the hierarchical method for domain decomposition. The hierarchical method stands out as a flexible and structured alternative to simple geometric decomposition. Like simple, it requires the user to define how many subdivisions to make in each direction, but it also allows control over the order in which those splits are applied. By specifying both the number of subdivisions (n) and their application order (order, e.g., xyz), users can tailor the decomposition to align with flow features, mesh stretching, or hardware topology.
As instructed by the challenge organizers, our experiments used the following two hierarchical decompositions (xyz): (12, 8, 4) and (24, 8, 4). These decompositions generate 384 and 768 partitions respectively, which, given the relatively high number of cores/vCPUs in the VMs available at Inductiva, can be accommodated by modest-sized multi-node configurations.
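For reference, the relevant entries of system/decomposeParDict for the (12, 8, 4) split look roughly like the sketch below (the exact layout may vary between cases; older setups may name the sub-dictionary hierarchicalCoeffs):

// system/decomposeParDict -- relevant entries only
numberOfSubdomains  384;     // 12 x 8 x 4

method              hierarchical;

coeffs
{
    n       (12 8 4);        // number of splits along x, y and z
    order   xyz;             // order in which the splits are applied
}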
More information on domain decomposition can be seen here.
Multi-node MPI Clusters at Inductiva
Inductiva supports multi-node MPI clusters through its flexible and easy-to-configure MPICluster class (see documentation). In the experiments described below, we run the previously introduced OpenFOAM use case using two levels of domain decomposition – 384 and 768 partitions. These runs are executed on MPIClusters of varying sizes, built from different generations of virtual machines, with hyperthreading both enabled and disabled, and under different levels of thread utilization.
For example, the Python code below:
import inductiva

mpi_cluster = inductiva.resources.MPICluster(
    machine_type="c3d-highcpu-360",
    num_machines=8,
    threads_per_core=1,
    spot=True)
allows starting an MPI cluster composed of 8 VM nodes of type c3d-highcpu-360, backed by 4th-generation AMD EPYC™ (Genoa) processors, with hyperthreading turned off (threads_per_core=1). The actual number of physical cores made available is $8 \times (360 / 2) = 1440$.
Inductiva also allows passing extra configuration to the MPI environment of the cluster above, using an MPIConfig object, as shown below:
mpi_config = MPIConfig(
    version="4.1.6",
    np=1440,
    use_hwthread_cpus=True)
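To make the relationship between these numbers explicit, here is a small illustrative helper (not part of the Inductiva API) that reproduces the 1440 figure above, i.e., the number of MPI ranks the cluster can host:

def mpi_slots(vcpus_per_vm: int, num_machines: int, threads_per_core: int) -> int:
    """Number of MPI ranks the cluster can host.

    GCP VMs expose 2 hardware threads per physical core, so
    threads_per_core=1 uses only the physical cores, while
    threads_per_core=2 uses every vCPU.
    """
    physical_cores_per_vm = vcpus_per_vm // 2
    return physical_cores_per_vm * threads_per_core * num_machines

# 8 x c3d-highcpu-360 with hyperthreading off -> 1440 ranks
assert mpi_slots(360, 8, 1) == 1440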
More information about how Inductiva lets you run MPI Clusters can be found here. You can also read more about scaling simulations with MPI Clusters on Inductiva, here.
The Results of 3 Experiments
What follows is not a systematic, pre-planned exploration of all the computational variations that could be tested to thoroughly address the questions we posed above. Instead, we took a more relaxed, exploratory approach, trying out different computational settings and choosing the next experiment as we went along. Also, we did not repeat each trial multiple times, which would have reduced the impact of the performance oscillations that naturally occur in cloud settings, related to the varying levels of occupancy of the compute nodes that support the virtual machines and of the network/internode traffic.
Still, we believe this process is both informative and helpful for Inductiva and OpenFOAM users. And we also hope it makes for an enjoyable and engaging journey to follow.
Experiment 1: Matching Partitions with vCPUs and Cores
In our first experiment, we aimed to build MPI clusters that exactly matched the number of partitions in our use case—384 and 768. This 1:1 mapping between partitions and vCPUs (or cores) provided a natural and straightforward starting point for exploring performance and cost.
Interestingly, we could only achieve this exact 1:1 mapping using clusters built with C4-series VMs (c4-highcpu-96 and c4-highcpu-192), which come with 96 and 192 vCPUs respectively. These machines run with hyperthreading enabled by default, so each VM corresponds to either 48 or 96 physical cores.
By combining these C4 VMs appropriately, we can perfectly match the number of partitions for both problem sizes. That’s the good news: none of the other VM types currently available on Inductiva offer such a clean mapping. The downside, however, is that the C4 machines are powered by 5th Gen Intel Xeon CPUs, which are high-performance but not the most cost-effective option in our catalog. We’ll return to this trade-off in later experiments.
First, let’s look at the results with hyperthreading enabled. For the 384-partition case, we used either a 4-node c4-highcpu-96 cluster or a 2-node c4-highcpu-192 cluster:
384 Partitions (Hyperthreading On)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-96 | 4 | 384 | 384 | 192 | 5h 34min | $40.71 |
c4-highcpu-192 | 2 | 384 | 384 | 192 | 5h 03min | $36.54 |
Table 1. Results for 384 partitions with hyperthreading turned on.
For the 768-partition case, we doubled the number of machines:
768 Partitions (Hyperthreading On)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-96 | 8 | 768 | 768 | 384 | 3h 54min | $57.06 |
c4-highcpu-192 | 4 | 768 | 768 | 384 | 5h 33min | $80.41 |
Table 2. Results for 768 partitions with hyperthreading turned on.
Key Observations
- 384-partition case: Using 2 c4-highcpu-192 machines (as opposed to 4 c4-highcpu-96 VMs) resulted in slightly better performance, which aligns with patterns we’ve seen in other scenarios. For the same total number of vCPUs, a 2-node MPI cluster typically outperforms a 4-node cluster, likely due to reduced inter-node communication overhead.
- 768-partition case: This yielded a more surprising outcome. The 8-node c4-highcpu-96 configuration showed solid performance, achieving a 1.42× speedup compared to the 384-partition case—reasonable, though not ideal. However, the 4-node c4-highcpu-192 setup actually saw performance degrade.
One possible explanation is related to memory bandwidth limitations. Each c4-highcpu-192 VM runs on a single physical node. In contrast, the 8 c4-highcpu-96 VMs may have been scheduled across multiple physical nodes (up to 8), some of which may have been otherwise idle. If that’s the case, those nodes would have offered better effective memory bandwidth per core, reducing contention and boosting performance.
This hypothesis naturally leads us to the next experiment…
Experiment 2: Matching Partitions with Physical Cores (Hyperthreading Disabled)
If memory bandwidth is a limiting factor, one natural step is to disable hyperthreading. On the c4-highcpu-192 VMs, this effectively grants us exclusive access to the physical node. However, doing so requires doubling the number of VMs to maintain the same number of partitions, which increases the hourly cost of the MPI cluster.
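In terms of the API shown earlier, building such a cluster for the 384-partition case amounts to something like the sketch below, reusing the same MPICluster parameters introduced above (4 x (192 / 2) = 384 physical cores, one per partition):

# Assumes `import inductiva`, as in the earlier example.
mpi_cluster = inductiva.resources.MPICluster(
    machine_type="c4-highcpu-192",
    num_machines=4,
    threads_per_core=1,   # hyperthreading off: expose physical cores only
    spot=True)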
Let’s start with the 384-partition case. Observe that we are now using twice as many nodes.
Results for the 384-partition case (Hyperthreading Off)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-96 | 8 | 384 | 384 | 384 | 3h 42min | $54.03 |
c4-highcpu-192 | 4 | 384 | 384 | 384 | 4h 00min | $57.96 |
Table 3. Performance comparison for 384 partitions with hyperthreading turned off.
Key Observations
Disabling hyperthreading led to noticeably better performance, which is not particularly surprising. Interestingly, however, the 8-node configuration slightly outperformed the 4-node one, even though it involved more nodes, which typically increases communication overhead. This suggests that communication costs are not yet a major limiting factor in this scenario. One possible explanation is that we’re still encountering memory bandwidth bottlenecks, which are alleviated by spreading the workload across more nodes.
Next, we test the same setup with 768 partitions, again doubling both the number of physical cores and the number of nodes. The question now is: will the benefits of increased computational power be outweighed by higher inter-node communication costs?
Results for the 768-partition case (Hyperthreading Off)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-96 | 16 | 768 | 768 | 768 | 4h 26min | $130.65 |
c4-highcpu-192 | 8 | 768 | 768 | 768 | 2h 38min | $76.53 |
Table 4. Performance comparison for 768 partitions with hyperthreading turned off.
Main Takeaways
The 8-node c4-highcpu-192 setup delivered the best overall performance, significantly outperforming its 4-node counterpart from the 384-partition experiment. This outcome was expected, or at least hoped for, as we brought more physical cores to bear on a larger workload.
However, the 16-node c4-highcpu-96 configuration performed considerably worse. This aligns with observations from other experiments we’ve conducted: once you exceed around 8 nodes, the cost of inter-node communication on cloud infrastructure becomes a major bottleneck, substantially degrading performance. This highlights a key difference between cloud-based and traditional HPC infrastructure, namely the lower speed and bandwidth of cloud interconnects.
Moreover, costs increased notably.
So, the question remains: can we find more cost-effective alternatives within Inductiva’s cloud options that deliver similar performance without a significant increase in execution time?
Let’s find out in the next section…
Experiment 3: Pushing the Costs Down
The C4 series on GCP features some of the latest and fastest hardware, typically offered at a premium. However, one of the key advantages of cloud infrastructure is the flexibility to choose machines from different hardware generations. GCP, the cloud platform currently used by Inductiva, provides access to several older VM series at significantly lower prices.
One of our favorite cost-effective options is the C2D series, based on AMD EPYC 7003 processors, released in 2021. While these machines are no longer top of the line (they use DDR4 RAM, for example, unlike the faster DDR5 in C4 VMs), they remain extremely performant for their price. For the same number of vCPUs and RAM, C2D machines are over three times cheaper than their C4 counterparts, while being nowhere near three times slower.
To further reduce costs, we ran these machines with hyperthreading enabled. The C2D VMs are available in configurations with 56 and 112 vCPUs, which means we couldn’t create a perfect 1:1 mapping between partitions and vCPUs. Instead, our MPI clusters ended up with slightly more vCPUs than partitions, leaving some resources underutilized, but still incurring cost.
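Concretely, the 4-node C2D cluster used below for the 384-partition case can be created along these lines (a sketch reusing the MPICluster parameters from before; we assume threads_per_core=2 keeps hyperthreading enabled):

# 4 x 112 = 448 vCPUs for 384 partitions (64 vCPUs left idle).
mpi_cluster = inductiva.resources.MPICluster(
    machine_type="c2d-highcpu-112",
    num_machines=4,
    threads_per_core=2,   # hyperthreading on (assumed default)
    spot=True)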
Let’s first look at what happens with the 384-partition configuration:
384 Partitions (Hyperthreading On)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c2d-highcpu-56 | 8 | 384 | 448 | 224 | 8h 10min | $21.78 |
c2d-highcpu-112 | 4 | 384 | 448 | 224 | 6h 16min | $15.97 |
Table 5. Results for 384 partitions using C2D machines with hyperthreading.
The key takeaway here is that we’re able to run the same simulations at significantly lower cost, while achieving execution times comparable to the C4 configurations. For instance, the 4-node c2d-highcpu-112 setup took only around 20% longer than the equivalent C4 setup (see Table 1), at roughly 40% of the price.
For the 768-partition case, we needed to double the number of VMs. We were able to successfully run this test on an 8-node c2d-highcpu-112 configuration. However, the alternative 16-node c2d-highcpu-56 setup proved too slow, likely due to increased overhead from inter-node communication, an expected effect as the cluster size grows.
768 Partitions (Hyperthreading On)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c2d-highcpu-112 | 8 | 768 | 896 | 448 | 4h 33min | $23.15 |
Table 6. Results for 768 partitions using C2D VMs with hyperthreading.
These are very strong results, even outperforming some of the C4 configurations – but at a fraction of the cost. It demonstrates that Inductiva’s infrastructure can support industry-scale simulations at great value.
Going Further: Fully Utilizing All vCPUs
As noted, in our previous C2D configurations we weren’t fully utilizing all vCPUs: our use case had 384 or 768 partitions, while the clusters had 448 or 896 vCPUs. This left roughly one seventh of the vCPUs idle, even though we still paid for them.
So what if we changed the partitioning scheme to fully utilize all available vCPUs? This wouldn’t comply with the Hardware Track rules, but it’s a worthwhile side experiment.
By switching from the standard (12, 8, 4) hierarchical partitioning to (14, 8, 4), we generate exactly 448 partitions, allowing us to match every vCPU in the 4-node and 8-node configurations.
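Relative to the decomposeParDict sketch shown earlier, this side experiment only changes the subdivision counts and the resulting total, roughly as follows:

numberOfSubdomains  448;     // 14 x 8 x 4

coeffs
{
    n       (14 8 4);
    order   xyz;
}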
448 Partitions (Hyperthreading On)
VM Type | Nodes | Partitions | vCPUs | Physical Cores | Execution Time | Cost |
c2d-highcpu-56 | 8 | 448 | 448 | 224 | 6h 01min | $16.13 |
c2d-highcpu-112 | 4 | 448 | 448 | 224 | 5h 56min | $15.22 |
Table 7. Performance when matching partition count to available vCPUs.
The results are impressive. Execution times dropped to just around 6 hours, and costs to $15–16, bringing us down to roughly $0.006 per vCPU-hour, or $0.012 per physical core-hour (due to hyperthreading).
Not bad at all. 🙂
Summary Tables
Here are the summary tables with the runs made for each partition scheme.
https://docs.google.com/spreadsheets/d/1t-F1VVos6lwziI-6ucLNzoswHREB__jGN94SDw4H_S8/edit?gid=0#gid=0
384 Partitions
VM Type | Nodes | Hyperthreading | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-96 | 8 | OFF | 384 | 384 | 3h 42min | $54.03 |
c4-highcpu-192 | 4 | OFF | 384 | 384 | 4h 00min | $57.96 |
c4-highcpu-192 | 2 | ON | 384 | 192 | 5h 03min | $36.54 |
c4-highcpu-96 | 4 | ON | 384 | 192 | 5h 34min | $40.71 |
c2d-highcpu-112 | 4 | ON | 448 | 224 | 6h 16min | $15.97 |
c2d-highcpu-56 | 8 | ON | 448 | 224 | 8h 10min | $21.78 |
768 Partitions
VM Type | Nodes | Hyperthreading | vCPUs | Physical Cores | Execution Time | Cost |
c4-highcpu-192 | 8 | OFF | 768 | 768 | 2h 38min | $76.53 |
c4-highcpu-96 | 8 | ON | 768 | 384 | 3h 54min | $57.06 |
c4-highcpu-96 | 16 | OFF | 768 | 768 | 4h 26min | $130.65 |
c2d-highcpu-112 | 8 | ON | 896 | 448 | 4h 33min | $23.15 |
c4-highcpu-192 | 4 | ON | 768 | 384 | 5h 33min | $80.41 |
📌 Main Takeaways from the OpenFOAM HPC Challenge (OHC-1)
Overall, it’s quite clear that Inductiva’s infrastructure can successfully handle industry-scale OpenFOAM simulations (up to 768 partitions) using multi-node clusters, with very favorable price-performance.
Additionally:
- Hyperthreading tends to hurt performance in high-core-count CFD workloads. Disabling it improves execution time significantly—but doubles the number of VMs needed, raising cost.
- C2D VMs (AMD EPYC 7003) offer excellent cost-performance, especially with hyperthreading enabled and when partitions are matched to vCPUs. In some cases, they outperform C4 setups at one-third the cost.
- Cloud-based HPC has different bottlenecks than traditional supercomputers. Above ~8 nodes, inter-node communication starts to become a major limiting factor. But up to that number of nodes, cloud-based HPC can actually be faster (and more cost-effective) than supercomputers (link to QE / Fugaku).
- Matching the partition count to the actual vCPU count (e.g. 448 partitions on 448 vCPUs) is a great practical optimization for real-world jobs.
Of course, for real-world users, there’s no one-size-fits-all setup, but Inductiva offers the flexibility to balance cost, speed, and resource efficiency depending on your simulation needs. You can explore different cluster topologies, hardware choices, and hyperthreading settings, all with minimal code changes.
So, what is stopping you from running your OpenFOAM simulations on Inductiva?
Just register (link), install Inductiva’s Python Client API (link to how to), and use your complimentary free credits to submit your first OpenFOAM (link to OpenFOAM guide) simulations now!
Once you compute with Inductiva, nothing else really computes anymore.
What to read next
Can Inductiva Beat Fugaku on GRIR443 Benchmark?
Generate an OpenFOAM Dataset