I reran the case for slightly less time (tend=1), and also turned off the integrate and NaN checker plugins.
Here are the results for an A100 at p=3, for a 40^3 element mesh.
| | Single | Double |
|---|---|---|
| Runtime [s] | 216.256 | 355.794 |
| DoF | $(40\times4)^3$ | $(40\times4)^3$ |
| RHS | $4\times10^4$ | $4\times10^4$ |
| No. GPUs | 1 | 1 |
| s/DoF/RHS/GPU | $1.320\times10^{-9}$ | $2.172\times10^{-9}$ |
Converting these to Freddie’s units gives:
- FP32: 3.788 GDoF/s
- FP64: 2.302 GDoF/s
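For anyone wanting to check the conversion, here is a rough sketch of the arithmetic. It assumes the GDoF/s figures count each of five field variables separately, which is what the numbers imply (they are 5x the reciprocal of the last table row) rather than something taken from the run configuration. The function names are just illustrative.

```python
def s_per_dof_per_rhs(runtime_s, n_elem_1d=40, p=3, n_rhs=4e4, n_gpus=1):
    """Seconds per DoF per RHS evaluation per GPU (last row of the table)."""
    # (p + 1) solution points per direction in each element
    n_dof = (n_elem_1d * (p + 1)) ** 3
    return runtime_s / (n_dof * n_rhs * n_gpus)


def gdof_per_s(runtime_s, n_vars=5, **kwargs):
    """Throughput in GDoF/s, counting each field variable as a DoF.

    n_vars=5 is an assumption inferred from the quoted figures.
    """
    return n_vars / s_per_dof_per_rhs(runtime_s, **kwargs) / 1e9


if __name__ == "__main__":
    for label, runtime in (("FP32", 216.256), ("FP64", 355.794)):
        print(f"{label}: {s_per_dof_per_rhs(runtime):.3e} s/DoF/RHS/GPU, "
              f"{gdof_per_s(runtime):.3f} GDoF/s")
```

Swapping in a different runtime, mesh size, or order should make it easy to compare other cards on the same footing.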
So the A100, as expected, compares favourably with the V100, but given that the bandwidth is ~50% higher on the A100 I would’ve expected more.
Now either the case is too small to keep the GPU busy, or some kernel isn’t performing well on the new hardware. I have some more data on this, but it is getting a little out of scope for this topic, so I will start a new thread in due course to discuss it further, if people are interested.
The moral of the story here is that the plugins can have significant overhead.
(We might want to think about moving the NaN checker onto the device for GPU runs, to avoid the PCIe bottleneck.)