Skip to content
  • Michal Hocko's avatar
    mm: remove return value from init_currently_empty_zone · dc0bbf3b
    Michal Hocko authored
    Patch series "mm: make movable onlining suck less", v4.
    
    Movable onlining is a real hack with many downsides - mainly
    reintroduction of lowmem/highmem issues we used to have on 32b systems -
    but it is the only way to make the memory hotremove more reliable which
    is something that people are asking for.
    
    The current semantic of memory movable onlinening is really cumbersome,
    however.  The main reason for this is that the udev driven approach is
    basically unusable because udev races with the memory probing while only
    the last memory block or the one adjacent to the existing zone_movable
    are allowed to be onlined movable.  In short the criterion for the
    successful online_movable changes under udev's feet.  A reliable udev
    approach would require a 2 phase approach where the first successful
    movable online would have to check all the previous blocks and online
    them in descending order.  This is hard to be considered sane.
    
    This patchset aims at making the onlining semantic more usable.  First
    of all it allows to online memory movable as long as it doesn't clash
    with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
    ZONE_MOVABLE cannot overlap.  Currently I preserve the original ordering
    semantic so the zone always precedes the movable zone but I have plans
    to remove this restriction in future because it is not really necessary.
    
    First 3 patches are cleanups which should be ready to be merged right
    away (unless I have missed something subtle of course).
    
    Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
    
    Patch 5 deals with implicit assumptions of register_one_node on pgdat
    initialization.
    
    Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
    hope I got all of them right but people familiar with compaction should
    double check this.
    
    Patch 11 is the core of the change.  In order to make it easier to
    review I have tried it to be as minimalistic as possible and the large
    code removal is moved to patch 14.
    
    Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
    and finally patch 14 removes the unused code.
    
    I have tested the patches in kvm:
      # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
    
    and then probed the additional memory by
      (qemu) object_add memory-backend-ram,id=mem1,size=1G
      (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    
    Then I have used this simple script to probe the memory block by hand
      # cat probe_memblock.sh
      #!/bin/sh
    
      BLOCK_NR=$1
    
      # echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
    
      # for i in $(seq 10); do sh probe_memblock.sh $i; done
      # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
      /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      /sys/devices/system/memory/memory35/valid_zones:Normal Movable
      /sys/devices/system/memory/memory36/valid_zones:Normal Movable
      /sys/devices/system/memory/memory37/valid_zones:Normal Movable
      /sys/devices/system/memory/memory38/valid_zones:Normal Movable
      /sys/devices/system/memory/memory39/valid_zones:Normal Movable
    
    The main difference to the original implementation is that all new
    memblocks can be both online_kernel and online_movable initially because
    there is no clash obviously.  For the comparison the original
    implementation would have
    
      /sys/devices/system/memory/memory33/valid_zones:Normal
      /sys/devices/system/memory/memory34/valid_zones:Normal
      /sys/devices/system/memory/memory35/valid_zones:Normal
      /sys/devices/system/memory/memory36/valid_zones:Normal
      /sys/devices/system/memory/memory37/valid_zones:Normal
      /sys/devices/system/memory/memory38/valid_zones:Normal
      /sys/devices/system/memory/memory39/valid_zones:Normal Movable
    
    Now
      # echo online_movable > /sys/devices/system/memory/memory34/state
      # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
      /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      /sys/devices/system/memory/memory34/valid_zones:Movable
      /sys/devices/system/memory/memory35/valid_zones:Movable
      /sys/devices/system/memory/memory36/valid_zones:Movable
      /sys/devices/system/memory/memory37/valid_zones:Movable
      /sys/devices/system/memory/memory38/valid_zones:Movable
      /sys/devices/system/memory/memory39/valid_zones:Movable
    
    Block 33 can still be online both kernel and movable while all
    the remaining can be only movable.
    
    /proc/zonelist says
      Node 0, zone   Normal
        pages free     0
              min      0
              low      0
              high     0
              spanned  0
              present  0
      --
      Node 0, zone  Movable
        pages free     32753
              min      85
              low      117
              high     149
              spanned  32768
              present  32768
    
    A new memblock at a lower address will result in a new memblock (32)
    which will still allow both Normal and Movable.
    
      # sh probe_memblock.sh 0
      # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
      /sys/devices/system/memory/memory32/valid_zones:Normal Movable
      /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      /sys/devices/system/memory/memory34/valid_zones:Movable
      /sys/devices/system/memory/memory35/valid_zones:Movable
    
    and online_kernel will convert it to the zone normal properly
    while 33 can be still onlined both ways.
    
      # echo online_kernel > /sys/devices/system/memory/memory32/state
      # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
      /sys/devices/system/memory/memory32/valid_zones:Normal
      /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      /sys/devices/system/memory/memory34/valid_zones:Movable
      /sys/devices/system/memory/memory35/valid_zones:Movable
    
    /proc/zoneinfo will now tell
      Node 0, zone   Normal
        pages free     65441
              min      165
              low      230
              high     295
              spanned  65536
              present  65536
      --
      Node 0, zone  Movable
        pages free     32740
              min      82
              low      114
              high     146
              spanned  32768
              present  32768
    
    so both zones have one memblock spanned and present.
    
    Onlining 39 should associate this block to the movable zone
    
      # echo online > /sys/devices/system/memory/memory39/state
    
    /proc/zoneinfo will now tell
      Node 0, zone   Normal
        pages free     32765
              min      80
              low      112
              high     144
              spanned  32768
              present  32768
      --
      Node 0, zone  Movable
        pages free     65501
              min      160
              low      225
              high     290
              spanned  196608
              present  65536
    
    so we will have a movable zone which spans 6 memblocks, 2 present and 4
    representing a hole.
    
    Offlining both movable blocks will lead to the zone with no present
    pages which is the expected behavior I believe.
    
      # echo offline > /sys/devices/system/memory/memory39/state
      # echo offline > /sys/devices/system/memory/memory34/state
      # grep -A6 "Movable\|Normal" /proc/zoneinfo
      Node 0, zone   Normal
        pages free     32735
              min      90
              low      122
              high     154
              spanned  32768
              present  32768
      --
      Node 0, zone  Movable
        pages free     0
              min      0
              low      0
              high     0
              spanned  196608
              present  0
    
    As a bonus we will get a nice cleanup in the memory hotplug codebase.
    
    This patch (of 16):
    
    init_currently_empty_zone doesn't have any error to return yet it is
    still an int and callers try to be defensive and try to handle potential
    error.  Remove this nonsense and simplify all callers.
    
    This patch shouldn't have any visible effect
    
    Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
    
    
    Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
    Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
    Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Daniel Kiper <daniel.kiper@oracle.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: Igor Mammedov <imammedo@redhat.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Joonsoo Kim <js1304@gmail.com>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
    Cc: Tobias Regnery <tobias.regnery@gmail.com>
    Cc: Toshi Kani <toshi.kani@hpe.com>
    Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
    Cc: Xishi Qiu <qiuxishi@huawei.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    dc0bbf3b