Designing a Compressible CFD Solver for Custom Scientific Computing Chips
This note summarizes how I think about designing a compressible CFD solver that runs efficiently on a custom scientific computing chip, while still behaving like a serious engineering tool: verifiable, robust, and maintainable.
1. Context and Goals
In many CFD projects, the solver is written with a generic CPU cluster in mind. When we introduce a custom scientific computing chip into the picture, several things change at once:
- Memory bandwidth and on-chip storage become the primary bottlenecks.
- Vectorization and instruction-level parallelism are exposed much more explicitly.
- Data movement across chip boundaries (PCIe, NoC, etc.) can easily dominate cost.
In my experience, reasonable goals are to:
- Keep the solver physically honest enough for engine simulation (fan, compressor, combustor, turbine).
- Restructure the code so that the most expensive kernels map cleanly onto the chip.
- Build a development workflow where accuracy and performance can both be regression-tested.
2. Equations and Discretization: What Must Not Change
On the mathematical side, I try not to “simplify away” the problem just to fit the chip. For compressible flow, the core choices are:
- Governing equations: 3D compressible Navier–Stokes in conservative form.
- Discretization: finite volume (FVM) on unstructured grids, with RANS / LES turbulence models.
- Time integration: fully implicit or strongly implicit schemes, leading to large sparse linear systems.
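For concreteness, the conservative form and its cell-centered finite-volume semi-discretization look like this (standard notation; nothing here is chip-specific):

```latex
% 3D compressible Navier--Stokes in conservative form,
% with U = (rho, rho u, rho v, rho w, rho E)^T,
% F_c the convective flux, F_v the viscous flux:
\frac{\partial U}{\partial t} + \nabla \cdot F_c(U) = \nabla \cdot F_v(U, \nabla U)

% Cell-centered FVM semi-discretization on cell \Omega_i,
% summing over its faces f with normal n_f and area A_f:
\frac{d U_i}{d t} = -\frac{1}{|\Omega_i|} \sum_{f \in \partial \Omega_i}
  \left( F_c - F_v \right)_f \cdot \mathbf{n}_f \, A_f
```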
These choices are largely hardware-agnostic and driven by physics and engineering requirements. The chip enters the picture when we look at:
- How we store the mesh and unknowns.
- How we assemble and apply the discrete operators.
- How we solve the resulting linear systems.
3. Data Layout and Memory Bandwidth
On a bandwidth-limited accelerator, data layout is part of the numerical method. For a cell-centered FVM solver, typical choices include:
- Structure of Arrays (SoA) for conservative variables and residuals.
- Compressed sparse formats (CSR / BCSR / ELL) for matrix-free and matrix-based operators.
- Reordering (e.g., RCM, space-filling curves) to improve locality.
The goal is not just to “make it faster”, but to make memory access patterns predictable enough that the chip’s prefetchers, local memories, and DMA engines can actually be used.
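As a minimal sketch of what SoA means in practice (names like `SoAState` are illustrative, not from any particular codebase):

```cpp
// SoA layout for cell-centered conservative variables: one contiguous
// array per variable, so a sweep over all cells is unit-stride -- which
// is what vector units and DMA engines want to see.
#include <cstddef>
#include <vector>

struct SoAState {
    std::vector<double> rho;    // density
    std::vector<double> rho_u;  // x-momentum
    std::vector<double> rho_v;  // y-momentum
    std::vector<double> rho_w;  // z-momentum
    std::vector<double> rho_E;  // total energy

    explicit SoAState(std::size_t n_cells)
        : rho(n_cells), rho_u(n_cells), rho_v(n_cells),
          rho_w(n_cells), rho_E(n_cells) {}
};
```

The Array-of-Structures alternative (one struct per cell) interleaves the five variables in memory, which wastes bandwidth whenever a kernel only touches a subset of them.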
4. Sparse Linear Solvers on the Chip
In an implicit compressible solver, most of the time is spent in sparse linear solvers: typically BiCGStab, CG (for symmetric positive-definite blocks), or related Krylov methods, plus a preconditioner.
On a custom chip, I see three layers:
4.1 Algorithmic layer
At this level we decide:
- Which Krylov method is appropriate for each block (pressure, momentum, energy); see the sketch after this list.
- How aggressive the preconditioner can be without breaking robustness.
- How tightly we couple the flow and turbulence variables.
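To make the algorithmic layer concrete, here is a minimal Jacobi-preconditioned CG driver for a symmetric positive-definite block such as a pressure-like subsystem. The operator is passed as a callback so the same driver covers matrix-based and matrix-free variants; all names are illustrative.

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec  = std::vector<double>;
using SpMV = std::function<void(const Vec& x, Vec& y)>;  // y = A * x

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b for SPD A; inv_diag holds 1 / diag(A) for the Jacobi
// preconditioner. Returns the iteration count, or -1 if not converged.
int pcg(const SpMV& A, const Vec& b, const Vec& inv_diag,
        Vec& x, double tol, int max_iters) {
    const std::size_t n = b.size();
    Vec r(n), z(n), p(n), Ap(n);

    A(x, Ap);                                               // r = b - A x
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    for (std::size_t i = 0; i < n; ++i) z[i] = inv_diag[i] * r[i];
    p = z;
    double rz = dot(r, z);

    for (int k = 0; k < max_iters; ++k) {
        A(p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
        for (std::size_t i = 0; i < n; ++i) r[i] -= alpha * Ap[i];
        if (std::sqrt(dot(r, r)) < tol) return k + 1;       // converged
        for (std::size_t i = 0; i < n; ++i) z[i] = inv_diag[i] * r[i];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return -1;
}
```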
4.2 Kernel layer
Here we look at kernels such as:
- SpMV / SpMM on custom sparse formats (a baseline CSR SpMV is sketched after this list).
- Dot products and vector updates (AXPY, fused operations).
- Restriction / prolongation operations in multigrid settings.
These kernels must be designed with the following constraints in mind:
- Vector width and alignment requirements.
- Local memory capacity and banking.
- DMA transaction sizes and overlap with computation.
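The reference point for all of these kernels is the plain CSR SpMV below; a minimal sketch, assuming the illustrative `CSRMatrix` layout, against which custom-format variants can be validated:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CSRMatrix {
    std::vector<std::int64_t> row_ptr;  // size n_rows + 1
    std::vector<std::int32_t> col_idx;  // size nnz
    std::vector<double>       val;      // size nnz
};

// y = A * x. On the chip this loop nest is what gets tiled, vectorized,
// and fed by DMA; the math must stay exactly this.
void spmv_csr(const CSRMatrix& A,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const std::size_t n_rows = A.row_ptr.size() - 1;
    for (std::size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (std::int64_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.val[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}
```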
4.3 Mapping layer
Finally, we decide how to map the mesh and linear algebra objects onto:
- Chip tiles / cores / compute clusters.
- On-chip vs off-chip memory.
- PCIe or NoC links when running multi-chip configurations.
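One small but representative mapping decision is how to split rows across tiles. The sketch below balances nonzeros (a proxy for both SpMV work and matrix bytes) across a hypothetical `n_tiles` compute tiles, assuming rows were already reordered for locality as in Section 3:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns tile boundaries: tile t owns rows [bounds[t], bounds[t + 1]).
std::vector<std::size_t> partition_rows(
        const std::vector<std::int64_t>& row_ptr, std::size_t n_tiles) {
    const std::size_t n_rows = row_ptr.size() - 1;
    const std::int64_t nnz = row_ptr.back();
    std::vector<std::size_t> bounds(n_tiles + 1, n_rows);
    bounds[0] = 0;
    std::size_t row = 0;
    for (std::size_t t = 1; t < n_tiles; ++t) {
        // Advance until this tile holds roughly t / n_tiles of all nonzeros.
        const std::int64_t target =
            nnz * static_cast<std::int64_t>(t) / static_cast<std::int64_t>(n_tiles);
        while (row < n_rows && row_ptr[row] < target) ++row;
        bounds[t] = row;
    }
    return bounds;
}
```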
5. Verification, Regression, and “Not Lying to Yourself”
When chasing performance, it is easy to accidentally change the math. To avoid this, I try to build a verification and regression stack that includes:
- Canonical test cases: the Sod shock tube, the RAE2822 transonic airfoil, engine fan/compressor stages, and gradually more realistic combustor / turbine cases.
- Metric baselines: lift/drag, pressure ratios, efficiency, and residual histories, compared against trusted references.
- Hash-based checks on mesh and boundary inputs to catch silent file corruption.
The idea is that every time we change data layout, kernel implementation, or chip mapping, we can re-run a selected set of cases and confirm that the engineering answers remain within acceptable tolerances.
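A minimal sketch of such a metric check, with illustrative names and a per-metric relative tolerance:

```cpp
#include <cmath>
#include <cstdio>

struct MetricCheck {
    const char* name;      // e.g. "RAE2822/drag_coefficient"
    double      baseline;  // trusted reference value
    double      rel_tol;   // acceptable relative deviation
};

// Returns false (and logs) if a measured metric drifted past tolerance.
bool within_tolerance(const MetricCheck& m, double measured) {
    const double rel_err =
        std::fabs(measured - m.baseline) / std::fabs(m.baseline);
    if (rel_err > m.rel_tol) {
        std::fprintf(stderr,
                     "FAIL %s: got %.6e, baseline %.6e (rel err %.2e)\n",
                     m.name, measured, m.baseline, rel_err);
        return false;
    }
    return true;
}
```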
6. Towards a Full Engine Simulation Workflow
Ultimately, a solver is only useful if it fits into an end-to-end workflow:
- Reading meshes from commercial tools (no re-meshing).
- Handling realistic boundary conditions from engine engineers.
- Writing results to open formats for post-processing.
On a custom chip, this usually implies:
- A CPU-side orchestration layer that prepares data and launches chip kernels.
- A clear separation between physics modules and hardware-specific backends.
- Automation around performance testing on real machines, not just micro-benchmarks.
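A minimal sketch of that separation, with an illustrative `Backend` interface; a real one would carry streams/queues and typed buffers, but the shape is the point:

```cpp
#include <cstddef>

// Physics modules program against this abstract interface, so the same
// residual assembly and solver driver run on a CPU reference backend
// (for verification) and on the chip backend (for production).
class Backend {
public:
    virtual ~Backend() = default;

    // Device-side buffer management, driven by the CPU orchestration layer.
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void  upload(void* dst, const void* src, std::size_t bytes) = 0;
    virtual void  download(void* dst, const void* src, std::size_t bytes) = 0;

    // Hot kernels; physics code never includes hardware-specific headers.
    virtual void spmv(const void* matrix, const void* x, void* y) = 0;
    virtual void axpy(double alpha, const void* x, void* y, std::size_t n) = 0;
};
```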
7. Closing Remarks
This note is intentionally high-level. In future updates, I plan to fill in more details on:
- Concrete sparse formats and kernel designs that worked (and failed) on our chip.
- Trade-offs between matrix-free and matrix-based implementations.
- How we organize the CFD codebase so that algorithm research and chip deployment share the same core.
If you are working on similar problems and would like to compare notes, feel free to reach out at xubinlab@gmail.com.