Why are there device-to-device copies?

Hello, everyone!

I ran the TGV case (TGV Performance Numbers - General - PyFR) and profiled it with nsys, and I found that there are DtoD operations. I am only using one chip, so why is there a copy operation? Where does it appear in the source code?

Depending on what solver you are using and your config, there are some points where an array has to be copied. For example, here: https://github.com/PyFR/PyFR/blob/98929401cd2d607da4058f13ecac3f4a6aeee60b/pyfr/solvers/baseadvec/elements.py#L102

This can be achieved with a DtoD copy where the source and destination device are the same, but where src and dst point to different memory. Using a DtoD copy, rather than writing our own copy kernel, is probably much faster and means we can leave hardware-specific optimisations to the hardware vendor.
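
To illustrate the idea, here is a minimal host-side sketch in Python. It is not PyFR code: NumPy arrays stand in for device buffers, and `copy_same_device` is a hypothetical helper name. On a real CUDA backend the equivalent step would be a driver-level copy such as `cuMemcpyDtoD` (exposed in PyCUDA as `pycuda.driver.memcpy_dtod(dest, src, size)`), which is what nsys reports as a DtoD memcpy even when only one GPU is in use.

```python
import numpy as np


def copy_same_device(dst, src):
    """Copy src into dst, where both live in the same memory space.

    Hypothetical stand-in for a same-device DtoD copy: the buffers are
    NumPy arrays on the host rather than real device allocations.
    """
    if dst.shape != src.shape or dst.dtype != src.dtype:
        raise ValueError('src and dst must have the same shape and dtype')

    # src and dst must be different allocations, otherwise there is
    # nothing to copy
    if np.shares_memory(dst, src):
        raise ValueError('src and dst must point to different memory')

    # A plain block copy; on a GPU this is the step that would be
    # issued as a device-to-device memcpy
    np.copyto(dst, src)


# Two separate buffers in the same "device" memory
src = np.arange(8, dtype=np.float64)
dst = np.empty_like(src)

copy_same_device(dst, src)
```

After the call, `dst` holds the same values as `src` but in its own allocation, which is all that a same-device DtoD copy does.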