This is really awesome! I noticed that the CUDA backend of the program got a new performance boost. I would like to know how this was achieved. How much of a performance improvement is there over the old version?
The potential for a performance boost comes from the use of new GiMMiK kernels. Whether you will see a benefit is very situational, although it can be as much as 20%.