Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
last argument to csum_partial(). According to #cc44c17baf7f3, passing
a non-zero value would not even yield the proper result on some
architectures. However, other locations still expect a non-zero csum
seed to be used in the next computation.

Meanwhile, some benchmarking (*) revealed that folding copy and
checksum operations may not be as optimal as one would have thought
when the caches are under pressure, so we switch to a split version,
first memcpy() then csum_partial(), so as to always benefit from
memcpy() optimizations.

As a bonus, we don't have to wrap calls to csum_partial_copy_nocheck()
to follow the kernel API change. Instead, we can provide a single
implementation based on csum_partial() which works with any kernel
version.

(*) Below are benchmark figures of the csum_copy (folded) vs csum+copy
(split) performances in idle vs busy scenarios. Busy means a
hackbench+dd loop streaming 128M in the background from zero -> null,
in order to badly trash the D-caches while the test runs. Three
different packet sizes are submitted to checksumming (32, 1024, 1500
bytes), all figures in nanosecs.

iMX6QP (Cortex A9)
------------------

=== idle
CSUM_COPY 32b: min=333, max=1333, avg=439
CSUM_COPY 1024b: min=1000, max=2000, avg=1045
CSUM_COPY 1500b: min=1333, max=2000, avg=1333
COPY+CSUM 32b: min=333, max=1333, avg=443
COPY+CSUM 1024b: min=1000, max=2334, avg=1345
COPY+CSUM 1500b: min=1666, max=2667, avg=1737

=== busy
CSUM_COPY 32b: min=333, max=4333, avg=466
CSUM_COPY 1024b: min=1000, max=5000, avg=1088
CSUM_COPY 1500b: min=1333, max=5667, avg=1393
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2000, avg=1341
COPY+CSUM 1500b: min=1666, max=2666, avg=1745

C4 (Cortex A55)
---------------

=== idle
CSUM_COPY 32b: min=125, max=791, avg=130
CSUM_COPY 1024b: min=541, max=834, avg=550
CSUM_COPY 1500b: min=708, max=1875, avg=740
COPY+CSUM 32b: min=125, max=167, avg=133
COPY+CSUM 1024b: min=541, max=625, avg=553
COPY+CSUM 1500b: min=708, max=750, avg=730

=== busy
CSUM_COPY 32b: min=125, max=792, avg=133
CSUM_COPY 1024b: min=500, max=2000, avg=552
CSUM_COPY 1500b: min=708, max=1542, avg=744
COPY+CSUM 32b: min=125, max=375, avg=133
COPY+CSUM 1024b: min=500, max=709, avg=553
COPY+CSUM 1500b: min=708, max=916, avg=743

x86 (atom x5)
-------------

=== idle
CSUM_COPY 32b: min=67, max=590, avg=70
CSUM_COPY 1024b: min=245, max=385, avg=251
CSUM_COPY 1500b: min=343, max=521, avg=350
COPY+CSUM 32b: min=101, max=679, avg=117
COPY+CSUM 1024b: min=296, max=379, avg=298
COPY+CSUM 1500b: min=399, max=502, avg=404

=== busy
CSUM_COPY 32b: min=65, max=709, avg=71
CSUM_COPY 1024b: min=243, max=702, avg=252
CSUM_COPY 1500b: min=340, max=1055, avg=351
COPY+CSUM 32b: min=100, max=665, avg=120
COPY+CSUM 1024b: min=295, max=669, avg=298
COPY+CSUM 1500b: min=399, max=686, avg=403

arm64, which has no folded csum_copy implementation, makes the best of
the split copy+csum path. All architectures seem to benefit from an
optimized memcpy() under load when it comes to worst-case execution
time. As usual, x86 is less prone to jitter under cache trashing than
the others, but even there, the max. figures for csum+copy in the busy
scenario look pretty much on par with the csum_copy version. Therefore,
converting all users to csum+copy makes sense.

Signed-off-by: Philippe Gerum <rpm@xenomai.org>
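---

[Editor's illustration] A minimal sketch of the split copy+csum helper
described above, assuming the standard in-kernel csum_partial() and
memcpy() primitives; the helper name csum_and_copy_split() is made up
for this example and is not taken from the actual patch:

#include <linux/string.h>
#include <net/checksum.h>

/*
 * Hypothetical sketch: copy the payload first, then checksum the
 * destination buffer, feeding the caller's (possibly non-zero) seed
 * to csum_partial(). This preserves the seed semantics callers rely
 * on while always going through the architecture's optimized
 * memcpy(), independently of the kernel's csum_partial_copy_nocheck()
 * prototype.
 */
static inline __wsum csum_and_copy_split(void *dst, const void *src,
					 int len, __wsum seed)
{
	memcpy(dst, src, len);

	return csum_partial(dst, len, seed);
}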