    drivers/net: checksum: convert to memcpy+csum (a6e9faa0)
    Philippe Gerum authored and Jan Kiszka committed
    
    
    Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
    last argument to csum_partial(). According to kernel commit
    cc44c17baf7f3, passing a non-zero value would not even yield the
    proper result on some architectures. However, several call sites
    still expect a non-zero csum seed to be carried into the next
    computation.
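    
    For reference, the upstream change amounts to dropping the seed
    argument from csum_partial_copy_nocheck(). Roughly (a sketch; the
    exact prototypes may vary across architectures and kernel versions):
    
        /* Before v5.9-rc1 (sketch): the caller could pass a running checksum. */
        __wsum csum_partial_copy_nocheck(const void *src, void *dst,
                                         int len, __wsum sum);
    
        /* Since v5.9-rc1 (sketch): no seed argument, a zero seed is implied. */
        __wsum csum_partial_copy_nocheck(const void *src, void *dst,
                                         int len);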
    
    Meanwhile, some benchmarking (*) revealed that folding the copy and
    checksum operations may not perform as well as one would expect when
    the caches are under pressure, so we switch to a split version, first
    memcpy() then csum_partial(), so as to always benefit from the
    optimized memcpy(). As a bonus, we don't have to wrap calls to
    csum_partial_copy_nocheck() to follow the kernel API change. Instead,
    we can provide a single implementation based on csum_partial() which
    works with any kernel version, as sketched below.
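    
    A minimal sketch of the split approach (the helper name and exact
    types are illustrative, not necessarily what the tree ends up with):
    
        #include <linux/string.h>   /* memcpy() */
        #include <net/checksum.h>   /* csum_partial() */
    
        /* Hypothetical helper: copy first, then checksum the freshly
         * written destination, so a non-zero seed can still be carried
         * across successive fragments on any kernel version.
         */
        static inline __wsum copy_and_csum(void *dst, const void *src,
                                           int len, __wsum csum)
        {
                memcpy(dst, src, len);
                return csum_partial(dst, len, csum);
        }
    
    Checksumming the destination right after memcpy() also means the data
    is likely still cache-hot when csum_partial() runs.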
    
    (*) Below are benchmark figures comparing the csum_copy (folded) and
    copy+csum (split) paths in idle vs. busy scenarios. Busy means a
    hackbench+dd loop streaming 128M in the background from /dev/zero to
    /dev/null, in order to badly thrash the D-caches while the test runs.
    Three different packet sizes are submitted to checksumming (32, 1024
    and 1500 bytes); all figures are in nanoseconds.
    
    iMX6QP (Cortex A9)
    ------------------
    
    === idle
    
    CSUM_COPY 32b: min=333, max=1333, avg=439
    CSUM_COPY 1024b: min=1000, max=2000, avg=1045
    CSUM_COPY 1500b: min=1333, max=2000, avg=1333
    COPY+CSUM 32b: min=333, max=1333, avg=443
    COPY+CSUM 1024b: min=1000, max=2334, avg=1345
    COPY+CSUM 1500b: min=1666, max=2667, avg=1737
    
    === busy
    
    CSUM_COPY 32b: min=333, max=4333, avg=466
    CSUM_COPY 1024b: min=1000, max=5000, avg=1088
    CSUM_COPY 1500b: min=1333, max=5667, avg=1393
    COPY+CSUM 32b: min=333, max=1334, avg=454
    COPY+CSUM 1024b: min=1000, max=2000, avg=1341
    COPY+CSUM 1500b: min=1666, max=2666, avg=1745
    
    C4 (Cortex A55)
    ---------------
    
    === idle
    
    CSUM_COPY 32b: min=125, max=791, avg=130
    CSUM_COPY 1024b: min=541, max=834, avg=550
    CSUM_COPY 1500b: min=708, max=1875, avg=740
    COPY+CSUM 32b: min=125, max=167, avg=133
    COPY+CSUM 1024b: min=541, max=625, avg=553
    COPY+CSUM 1500b: min=708, max=750, avg=730
    
    === busy
    
    CSUM_COPY 32b: min=125, max=792, avg=133
    CSUM_COPY 1024b: min=500, max=2000, avg=552
    CSUM_COPY 1500b: min=708, max=1542, avg=744
    COPY+CSUM 32b: min=125, max=375, avg=133
    COPY+CSUM 1024b: min=500, max=709, avg=553
    COPY+CSUM 1500b: min=708, max=916, avg=743
    
    x86 (atom x5)
    -------------
    
    === idle
    
    CSUM_COPY 32b: min=67, max=590, avg=70
    CSUM_COPY 1024b: min=245, max=385, avg=251
    CSUM_COPY 1500b: min=343, max=521, avg=350
    COPY+CSUM 32b: min=101, max=679, avg=117
    COPY+CSUM 1024b: min=296, max=379, avg=298
    COPY+CSUM 1500b: min=399, max=502, avg=404
    
    === busy
    
    CSUM_COPY 32b: min=65, max=709, avg=71
    CSUM_COPY 1024b: min=243, max=702, avg=252
    CSUM_COPY 1500b: min=340, max=1055, avg=351
    COPY+CSUM 32b: min=100, max=665, avg=120
    COPY+CSUM 1024b: min=295, max=669, avg=298
    COPY+CSUM 1500b: min=399, max=686, avg=403
    
    arm64, which has no folded csum_copy implementation, makes the best
    of the split copy+csum path. All architectures seem to benefit from
    the optimized memcpy() under load when it comes to worst-case
    execution time. As usual, x86 is less prone to jitter under cache
    thrashing than the others, but even there the max figures for
    copy+csum in the busy case look pretty much on par with the csum_copy
    version. Therefore, converting all users to copy+csum makes sense.
    
    Signed-off-by: Philippe Gerum <rpm@xenomai.org>