1. 07 Dec, 2018 1 commit
  2. 09 Nov, 2018 1 commit
  3. 17 Oct, 2018 3 commits
  4. 05 Oct, 2018 1 commit
    • nvmet-rdma: use a private workqueue for delete · 2acf70ad
      Sagi Grimberg authored
      Queue deletion is done asynchronously when the last reference on the
      queue is dropped.  Thus, to make sure we don't over-allocate under a
      connect/disconnect storm, we let queue deletion complete before making
      forward progress.
      
      However, given that we flush the system_wq from rdma_cm context, which
      itself runs from a workqueue context, we can hit a circular locking
      complaint [1]. Fix that by using a private workqueue for queue deletion
      (sketched after the tags below).
      
      [1]:
      ======================================================
      WARNING: possible circular locking dependency detected
      4.19.0-rc4-dbg+ #3 Not tainted
      ------------------------------------------------------
      kworker/5:0/39 is trying to acquire lock:
      00000000a10b6db9 (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x6f/0x440 [rdma_cm]
      
      but task is already holding lock:
      00000000331b4e2c ((work_completion)(&queue->release_work)){+.+.}, at: process_one_work+0x3ed/0xa20
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #3 ((work_completion)(&queue->release_work)){+.+.}:
             process_one_work+0x474/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      -> #2 ((wq_completion)"events"){+.+.}:
             flush_workqueue+0xf3/0x970
             nvmet_rdma_cm_handler+0x133d/0x1734 [nvmet_rdma]
             cma_ib_req_handler+0x72f/0xf90 [rdma_cm]
             cm_process_work+0x2e/0x110 [ib_cm]
             cm_req_handler+0x135b/0x1c30 [ib_cm]
             cm_work_handler+0x2b7/0x38cd [ib_cm]
             process_one_work+0x4ae/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      -> #1 (&id_priv->handler_mutex/1){+.+.}:
             __mutex_lock+0xfe/0xbe0
             mutex_lock_nested+0x1b/0x20
             cma_ib_req_handler+0x6aa/0xf90 [rdma_cm]
             cm_process_work+0x2e/0x110 [ib_cm]
             cm_req_handler+0x135b/0x1c30 [ib_cm]
             cm_work_handler+0x2b7/0x38cd [ib_cm]
             process_one_work+0x4ae/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      -> #0 (&id_priv->handler_mutex){+.+.}:
             lock_acquire+0xc5/0x200
             __mutex_lock+0xfe/0xbe0
             mutex_lock_nested+0x1b/0x20
             rdma_destroy_id+0x6f/0x440 [rdma_cm]
             nvmet_rdma_release_queue_work+0x8e/0x1b0 [nvmet_rdma]
             process_one_work+0x4ae/0xa20
             worker_thread+0x63/0x5a0
             kthread+0x1cf/0x1f0
             ret_from_fork+0x24/0x30
      
      Fixes: 777dc823 ("nvmet-rdma: occasionally flush ongoing controller teardown")
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
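      A minimal sketch of the fix, assuming helper names modeled on the
      nvmet-rdma driver (nvmet_rdma_delete_wq and queue->release_work appear
      in the commit; the surrounding functions are illustrative). Release
      work is queued to, and flushed from, a driver-private workqueue, so
      flushing from rdma_cm context no longer depends on the system-wide
      "events" workqueue:

          #include <linux/workqueue.h>

          static struct workqueue_struct *nvmet_rdma_delete_wq;

          /* Module init: create a private workqueue for queue deletion. */
          static int nvmet_rdma_create_delete_wq(void)
          {
                  nvmet_rdma_delete_wq = alloc_workqueue("nvmet-rdma-delete-wq",
                                  WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
                  return nvmet_rdma_delete_wq ? 0 : -ENOMEM;
          }

          /* Last reference dropped: queue the release work on the private
           * workqueue instead of the system one. */
          static void nvmet_rdma_queue_put_final(struct nvmet_rdma_queue *queue)
          {
                  queue_work(nvmet_rdma_delete_wq, &queue->release_work);
          }

          /* From the rdma_cm handler (itself workqueue context): flushing
           * only the private workqueue breaks the dependency on
           * (wq_completion)"events" seen in the lockdep chain above. */
          static void nvmet_rdma_flush_deletions(void)
          {
                  flush_workqueue(nvmet_rdma_delete_wq);
          }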
  5. 05 Sep, 2018 1 commit
    • nvmet-rdma: fix possible bogus dereference under heavy load · 8407879c
      Sagi Grimberg authored
      
      Currently we always repost the recv buffer before we send a response
      capsule back to the host. Since ordering is not guaranteed for send
      and recv completions, it is possible that we will receive a new request
      from the host before we get a send completion for the response capsule.
      
      Today, we pre-allocate twice as many rsps as the queue length, but in
      reality, under heavy load, nothing really prevents that gap from
      growing until we exhaust all our rsps.
      
      To fix this, if we don't have any pre-allocated rsps left, we
      dynamically allocate an rsp and make sure to free it when we are done.
      If, under memory pressure, we fail to allocate an rsp, we silently drop
      the command and wait for the host to retry (see the sketch after the
      tags below).
      Reported-by: Steve Wise <swise@opengridcomputing.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      [hch: dropped a superfluous assignment]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
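      A sketch of the allocation fallback described above, assuming field
      names (free_rsps, rsps_lock, allocated) that mirror the driver's rsp
      bookkeeping; the put path (not shown) would kfree() an rsp with
      allocated set instead of returning it to the free list:

          static struct nvmet_rdma_rsp *
          nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
          {
                  struct nvmet_rdma_rsp *rsp;
                  unsigned long flags;

                  /* Fast path: take a pre-allocated rsp off the free list. */
                  spin_lock_irqsave(&queue->rsps_lock, flags);
                  rsp = list_first_entry_or_null(&queue->free_rsps,
                                  struct nvmet_rdma_rsp, free_list);
                  if (rsp)
                          list_del(&rsp->free_list);
                  spin_unlock_irqrestore(&queue->rsps_lock, flags);

                  /* Slow path: the list is exhausted, allocate dynamically.
                   * A NULL return means the caller drops the command and
                   * relies on the host to retry. */
                  if (!rsp) {
                          rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
                          if (!rsp)
                                  return NULL;
                          rsp->allocated = true; /* free on put, don't re-list */
                  }
                  return rsp;
          }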
  6. 24 Jul, 2018 1 commit
  7. 23 Jul, 2018 3 commits
  8. 18 Jun, 2018 1 commit
  9. 26 Mar, 2018 5 commits
  10. 08 Jan, 2018 2 commits
  11. 06 Jan, 2018 1 commit
  12. 11 Nov, 2017 2 commits
  13. 18 Aug, 2017 1 commit
  14. 28 Jun, 2017 2 commits
  15. 20 May, 2017 1 commit
  16. 04 Apr, 2017 3 commits
  17. 16 Mar, 2017 1 commit
  18. 22 Feb, 2017 2 commits
  19. 26 Jan, 2017 1 commit
  20. 14 Dec, 2016 1 commit
  21. 06 Dec, 2016 2 commits
  22. 14 Nov, 2016 3 commits
  23. 23 Sep, 2016 1 commit