    drivers/net: checksum: convert to memcpy+csum (a6e9faa0)
    Philippe Gerum authored and Jan Kiszka committed
    
    
    Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
    last argument to csum_partial(). According to kernel commit
    cc44c17baf7f3, passing a non-zero value would not even yield the
    proper result on some architectures. However, several call sites
    still expect a non-zero csum seed to be carried into the next
    computation.
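    
    For reference, the upstream change amounts to dropping the seed
    argument from csum_partial_copy_nocheck(). Roughly (a sketch; the
    exact prototypes may vary across architectures and kernel versions):
    
        /* Before v5.9-rc1 (sketch): the caller could pass a running checksum. */
        __wsum csum_partial_copy_nocheck(const void *src, void *dst,
                                         int len, __wsum sum);
    
        /* Since v5.9-rc1 (sketch): no seed argument, a zero seed is implied. */
        __wsum csum_partial_copy_nocheck(const void *src, void *dst,
                                         int len);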
    
    Meanwhile, some benchmarking (*) revealed that folding the copy and
    checksum operations may not perform as well as one would expect when
    the caches are under pressure, so we switch to a split version, first
    memcpy() then csum_partial(), so as to always benefit from the
    optimized memcpy(). As a bonus, we don't have to wrap calls to
    csum_partial_copy_nocheck() to follow the kernel API change. Instead,
    we can provide a single implementation based on csum_partial() which
    works with any kernel version, as sketched below.
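    
    A minimal sketch of the split approach (the helper name and exact
    types are illustrative, not necessarily what the tree ends up with):
    
        #include <linux/string.h>   /* memcpy() */
        #include <net/checksum.h>   /* csum_partial() */
    
        /* Hypothetical helper: copy first, then checksum the freshly
         * written destination, so a non-zero seed can still be carried
         * across successive fragments on any kernel version.
         */
        static inline __wsum copy_and_csum(void *dst, const void *src,
                                           int len, __wsum csum)
        {
                memcpy(dst, src, len);
                return csum_partial(dst, len, csum);
        }
    
    Checksumming the destination right after memcpy() also means the data
    is likely still cache-hot when csum_partial() runs.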
    
    (*) Below are benchmark figures comparing the csum_copy (folded) and
    copy+csum (split) paths in idle vs. busy scenarios. Busy means a
    hackbench+dd loop streaming 128M in the background from /dev/zero to
    /dev/null, in order to badly thrash the D-caches while the test runs.
    Three different packet sizes are submitted to checksumming (32, 1024
    and 1500 bytes); all figures are in nanoseconds.
    
    iMX6QP (Cortex A9)
    ------------------
    
    === idle
    
    CSUM_COPY 32b: min=333, max=1333, avg=439
    CSUM_COPY 1024b: min=1000, max=2000, avg=1045
    CSUM_COPY 1500b: min=1333, max=2000, avg=1333
    COPY+CSUM 32b: min=333, max=1333, avg=443
    COPY+CSUM 1024b: min=1000, max=2334, avg=1345
    COPY+CSUM 1500b: min=1666, max=2667, avg=1737
    
    === busy
    
    CSUM_COPY 32b: min=333, max=4333, avg=466
    CSUM_COPY 1024b: min=1000, max=5000, avg=1088
    CSUM_COPY 1500b: min=1333, max=5667, avg=1393
    COPY+CSUM 32b: min=333, max=1334, avg=454
    COPY+CSUM 1024b: min=1000, max=2000, avg=1341
    COPY+CSUM 1500b: min=1666, max=2666, avg=1745
    
    C4 (Cortex A55)
    ---------------
    
    === idle
    
    CSUM_COPY 32b: min=125, max=791, avg=130
    CSUM_COPY 1024b: min=541, max=834, avg=550
    CSUM_COPY 1500b: min=708, max=1875, avg=740
    COPY+CSUM 32b: min=125, max=167, avg=133
    COPY+CSUM 1024b: min=541, max=625, avg=553
    COPY+CSUM 1500b: min=708, max=750, avg=730
    
    === busy
    
    CSUM_COPY 32b: min=125, max=792, avg=133
    CSUM_COPY 1024b: min=500, max=2000, avg=552
    CSUM_COPY 1500b: min=708, max=1542, avg=744
    COPY+CSUM 32b: min=125, max=375, avg=133
    COPY+CSUM 1024b: min=500, max=709, avg=553
    COPY+CSUM 1500b: min=708, max=916, avg=743
    
    x86 (atom x5)
    -------------
    
    === idle
    
    CSUM_COPY 32b: min=67, max=590, avg=70
    CSUM_COPY 1024b: min=245, max=385, avg=251
    CSUM_COPY 1500b: min=343, max=521, avg=350
    COPY+CSUM 32b: min=101, max=679, avg=117
    COPY+CSUM 1024b: min=296, max=379, avg=298
    COPY+CSUM 1500b: min=399, max=502, avg=404
    
    === busy
    
    CSUM_COPY 32b: min=65, max=709, avg=71
    CSUM_COPY 1024b: min=243, max=702, avg=252
    CSUM_COPY 1500b: min=340, max=1055, avg=351
    COPY+CSUM 32b: min=100, max=665, avg=120
    COPY+CSUM 1024b: min=295, max=669, avg=298
    COPY+CSUM 1500b: min=399, max=686, avg=403
    
    arm64, which has no folded csum_copy implementation, makes the best
    of the split copy+csum path. All architectures seem to benefit from
    the optimized memcpy() under load when it comes to worst-case
    execution time. As usual, x86 is less prone to jitter under cache
    thrashing than the others, but even there the max figures for
    copy+csum in the busy case look pretty much on par with the csum_copy
    version. Therefore, converting all users to copy+csum makes sense.
    
    Signed-off-by: Philippe Gerum <rpm@xenomai.org>