Commit df632d3c authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge tag 'nfs-for-3.7-1' of git://

Pull NFS client updates from Trond Myklebust:
 "Features include:

   - Remove CONFIG_EXPERIMENTAL dependency from NFSv4.1
     Aside from the issues discussed at the LKS, distros are shipping
     NFSv4.1 with all the trimmings.
   - Fix fdatasync()/fsync() for the corner case of a server reboot.
   - NFSv4 OPEN access fix: finally distinguish correctly between
     open-for-read and open-for-execute permissions in all situations.
   - Ensure that the TCP socket is closed when we're in CLOSE_WAIT
   - More idmapper bugfixes
   - Lots of pNFS bugfixes and cleanups to remove unnecessary state and
     make the code easier to read.
   - In cases where a pNFS read or write fails, allow the client to
     resume trying layoutgets after two minutes of read/write-
   - More net namespace fixes to the NFSv4 callback code.
   - More net namespace fixes to the NFSv3 locking code.
   - More NFSv4 migration preparatory patches.
     Including patches to detect network trunking in both NFSv4 and
   - pNFS block updates to optimise LAYOUTGET calls."

* tag 'nfs-for-3.7-1' of git:// (113 commits)
  pnfsblock: cleanup nfs4_blkdev_get
  NFS41: send real read size in layoutget
  NFS41: send real write size in layoutget
  NFS: track direct IO left bytes
  NFSv4.1: Cleanup ugliness in pnfs_layoutgets_blocked()
  NFSv4.1: Ensure that the layout sequence id stays 'close' to the current
  NFSv4.1: Deal with seqid wraparound in the pNFS return-on-close code
  NFSv4 set open access operation call flag in nfs4_init_opendata_res
  NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTAL
  NFSv4 reduce attribute requests for open reclaim
  NFSv4: nfs4_open_done first must check that GETATTR decoded a file type
  NFSv4.1: Deal with wraparound when updating the layout "barrier" seqid
  NFSv4.1: Deal with wraparound issues when updating the layout stateid
  NFSv4.1: Always set the layout stateid if this is the first layoutget
  NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layout
  NFSv4: don't put ACCESS in OPEN compound if O_EXCL
  NFSv4: don't check MAY_WRITE access bit in OPEN
  NFS: Set key construction data for the legacy upcall
  NFSv4.1: don't do two EXCHANGE_IDs on mount
  NFS: nfs41_walk_client_list(): re-lock before iterating
parents 2474542f af283885
......@@ -12,9 +12,47 @@ and work is in progress on adding support for minor version 1 of the NFSv4
The purpose of this document is to provide information on some of the
upcall interfaces that are used in order to provide the NFS client with
some of the information that it requires in order to fully comply with
the NFS spec.
special features of the NFS client that can be configured by system
The nfs4_unique_id parameter
NFSv4 requires clients to identify themselves to servers with a unique
string. File open and lock state shared between one client and one server
is associated with this identity. To support robust NFSv4 state recovery
and transparent state migration, this identity string must not change
across client reboots.
Without any other intervention, the Linux client uses a string that contains
the local system's node name. System administrators, however, often do not
take care to ensure that node names are fully qualified and do not change
over the lifetime of a client system. Node names can have other
administrative requirements that require particular behavior that does not
work well as part of an nfs_client_id4 string.
The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
used instead of a system's node name when an NFS client identifies itself to
a server. Thus, if the system's node name is not unique, or it changes, its
nfs.nfs4_unique_id stays the same, preventing collision with other clients
or loss of state during NFS reboot recovery or transparent state migration.
The nfs.nfs4_unique_id string is typically a UUID, though it can contain
anything that is believed to be unique across all NFS clients. An
nfs4_unique_id string should be chosen when a client system is installed,
just as a system's root file system gets a fresh UUID in its label at
install time.
The string should remain fixed for the lifetime of the client. It can be
changed safely if care is taken that the client shuts down cleanly and all
outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
This string can be stored in an NFS client's grub.conf, or it can be provided
via a net boot facility such as PXE. It may also be specified as an nfs.ko
module parameter. Specifying a uniquifier string is not support for NFS
clients running in containers.
The DNS resolver
......@@ -1730,6 +1730,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
will be autodetected by the client, and it will fall
back to using the idmapper.
To turn off this behaviour, set the value to '0'.
[NFS4] Specify an additional fixed unique ident-
ification string that NFSv4 clients can insert into
their nfs_client_id4 string. This is typically a
UUID that is generated at system install time.
nfs.send_implementation_id =
[NFSv4.1] Send client implementation identification
......@@ -7,7 +7,6 @@
#include <linux/types.h>
#include <linux/utsname.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/slab.h>
......@@ -19,6 +18,8 @@
#include <asm/unaligned.h>
#include "netns.h"
#define NSM_PROGRAM 100024
#define NSM_VERSION 1
......@@ -40,6 +41,7 @@ struct nsm_args {
u32 proc;
char *mon_name;
char *nodename;
struct nsm_res {
......@@ -70,7 +72,7 @@ static struct rpc_clnt *nsm_create(struct net *net)
struct rpc_create_args args = {
.net = net,
.address = (struct sockaddr *)&sin,
.addrsize = sizeof(sin),
.servername = "rpc.statd",
......@@ -83,10 +85,54 @@ static struct rpc_clnt *nsm_create(struct net *net)
return rpc_create(&args);
static int nsm_mon_unmon(struct nsm_handle *nsm, u32 proc, struct nsm_res *res,
struct net *net)
static struct rpc_clnt *nsm_client_get(struct net *net)
static DEFINE_MUTEX(nsm_create_mutex);
struct rpc_clnt *clnt;
struct lockd_net *ln = net_generic(net, lockd_net_id);
if (ln->nsm_users) {
clnt = ln->nsm_clnt;
goto out;
clnt = nsm_create(net);
if (!IS_ERR(clnt)) {
ln->nsm_clnt = clnt;
ln->nsm_users = 1;
return clnt;
static void nsm_client_put(struct net *net)
struct lockd_net *ln = net_generic(net, lockd_net_id);
struct rpc_clnt *clnt = ln->nsm_clnt;
int shutdown = 0;
if (ln->nsm_users) {
if (--ln->nsm_users)
ln->nsm_clnt = NULL;
shutdown = !ln->nsm_users;
if (shutdown)
static int nsm_mon_unmon(struct nsm_handle *nsm, u32 proc, struct nsm_res *res,
struct rpc_clnt *clnt)
int status;
struct nsm_args args = {
.priv = &nsm->sm_priv,
......@@ -94,31 +140,24 @@ static int nsm_mon_unmon(struct nsm_handle *nsm, u32 proc, struct nsm_res *res,
.vers = 3,
.mon_name = nsm->sm_mon_name,
.nodename = clnt->cl_nodename,
struct rpc_message msg = {
.rpc_argp = &args,
.rpc_resp = res,
clnt = nsm_create(net);
if (IS_ERR(clnt)) {
status = PTR_ERR(clnt);
dprintk("lockd: failed to create NSM upcall transport, "
"status=%d\n", status);
goto out;
BUG_ON(clnt == NULL);
memset(res, 0, sizeof(*res));
msg.rpc_proc = &clnt->cl_procinfo[proc];
status = rpc_call_sync(clnt, &msg, 0);
status = rpc_call_sync(clnt, &msg, RPC_TASK_SOFTCONN);
if (status < 0)
dprintk("lockd: NSM upcall RPC failed, status=%d\n",
status = 0;
return status;
......@@ -138,6 +177,7 @@ int nsm_monitor(const struct nlm_host *host)
struct nsm_handle *nsm = host->h_nsmhandle;
struct nsm_res res;
int status;
struct rpc_clnt *clnt;
dprintk("lockd: nsm_monitor(%s)\n", nsm->sm_name);
......@@ -150,7 +190,15 @@ int nsm_monitor(const struct nlm_host *host)
nsm->sm_mon_name = nsm_use_hostnames ? nsm->sm_name : nsm->sm_addrbuf;
status = nsm_mon_unmon(nsm, NSMPROC_MON, &res, host->net);
clnt = nsm_client_get(host->net);
if (IS_ERR(clnt)) {
status = PTR_ERR(clnt);
dprintk("lockd: failed to create NSM upcall transport, "
"status=%d, net=%p\n", status, host->net);
return status;
status = nsm_mon_unmon(nsm, NSMPROC_MON, &res, clnt);
if (unlikely(res.status != 0))
status = -EIO;
if (unlikely(status < 0)) {
......@@ -182,9 +230,11 @@ void nsm_unmonitor(const struct nlm_host *host)
if (atomic_read(&nsm->sm_count) == 1
&& nsm->sm_monitored && !nsm->sm_sticky) {
struct lockd_net *ln = net_generic(host->net, lockd_net_id);
dprintk("lockd: nsm_unmonitor(%s)\n", nsm->sm_name);
status = nsm_mon_unmon(nsm, NSMPROC_UNMON, &res, host->net);
status = nsm_mon_unmon(nsm, NSMPROC_UNMON, &res, ln->nsm_clnt);
if (res.status != 0)
status = -EIO;
if (status < 0)
......@@ -192,6 +242,8 @@ void nsm_unmonitor(const struct nlm_host *host)
nsm->sm_monitored = 0;
......@@ -430,7 +482,7 @@ static void encode_my_id(struct xdr_stream *xdr, const struct nsm_args *argp)
__be32 *p;
encode_nsm_string(xdr, utsname()->nodename);
encode_nsm_string(xdr, argp->nodename);
p = xdr_reserve_space(xdr, 4 + 4 + 4);
*p++ = cpu_to_be32(argp->prog);
*p++ = cpu_to_be32(argp->vers);
......@@ -12,6 +12,10 @@ struct lockd_net {
struct delayed_work grace_period_end;
struct lock_manager lockd_manager;
struct list_head grace_list;
spinlock_t nsm_clnt_lock;
unsigned int nsm_users;
struct rpc_clnt *nsm_clnt;
extern int lockd_net_id;
......@@ -596,6 +596,7 @@ static int lockd_init_net(struct net *net)
INIT_DELAYED_WORK(&ln->grace_period_end, grace_ender);
return 0;
......@@ -95,8 +95,8 @@ config NFS_SWAP
This option enables swapon to work on files located on NFS mounts.
config NFS_V4_1
bool "NFS client support for NFSv4.1 (EXPERIMENTAL)"
depends on NFS_V4 && EXPERIMENTAL
bool "NFS client support for NFSv4.1"
depends on NFS_V4
This option enables support for minor version 1 of the NFSv4 protocol
......@@ -37,6 +37,7 @@
#include <linux/bio.h> /* struct bio */
#include <linux/buffer_head.h> /* various write calls */
#include <linux/prefetch.h>
#include <linux/pagevec.h>
#include "../pnfs.h"
#include "../internal.h"
......@@ -162,25 +163,39 @@ static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
return bio;
static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
static struct bio *do_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
void (*end_io)(struct bio *, int err),
struct parallel_io *par)
struct parallel_io *par,
unsigned int offset, int len)
isect = isect + (offset >> SECTOR_SHIFT);
dprintk("%s: npg %d rw %d isect %llu offset %u len %d\n", __func__,
npg, rw, (unsigned long long)isect, offset, len);
if (!bio) {
bio = bl_alloc_init_bio(npg, isect, be, end_io, par);
if (!bio)
return ERR_PTR(-ENOMEM);
if (bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) < PAGE_CACHE_SIZE) {
if (bio_add_page(bio, page, len, offset) < len) {
bio = bl_submit_bio(rw, bio);
goto retry;
return bio;
static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
void (*end_io)(struct bio *, int err),
struct parallel_io *par)
return do_add_page_to_bio(bio, npg, rw, isect, page, be,
end_io, par, 0, PAGE_CACHE_SIZE);
/* This is basically copied from mpage_end_io_read */
static void bl_end_io_read(struct bio *bio, int err)
......@@ -228,14 +243,6 @@ bl_end_par_io_read(void *data, int unused)
static bool
bl_check_alignment(u64 offset, u32 len, unsigned long blkmask)
if ((offset & blkmask) || (len & blkmask))
return false;
return true;
static enum pnfs_try_status
bl_read_pagelist(struct nfs_read_data *rdata)
......@@ -246,15 +253,15 @@ bl_read_pagelist(struct nfs_read_data *rdata)
sector_t isect, extent_length = 0;
struct parallel_io *par;
loff_t f_offset = rdata->args.offset;
size_t bytes_left = rdata->args.count;
unsigned int pg_offset, pg_len;
struct page **pages = rdata->args.pages;
int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
const bool is_dio = (header->dreq != NULL);
dprintk("%s enter nr_pages %u offset %lld count %u\n", __func__,
rdata->pages.npages, f_offset, (unsigned int)rdata->args.count);
if (!bl_check_alignment(f_offset, rdata->args.count, PAGE_CACHE_MASK))
goto use_mds;
par = alloc_parallel(rdata);
if (!par)
goto use_mds;
......@@ -284,36 +291,53 @@ bl_read_pagelist(struct nfs_read_data *rdata)
extent_length = min(extent_length, cow_length);
if (is_dio) {
pg_offset = f_offset & ~PAGE_CACHE_MASK;
if (pg_offset + bytes_left > PAGE_CACHE_SIZE)
pg_len = PAGE_CACHE_SIZE - pg_offset;
pg_len = bytes_left;
f_offset += pg_len;
bytes_left -= pg_len;
isect += (pg_offset >> SECTOR_SHIFT);
} else {
pg_offset = 0;
hole = is_hole(be, isect);
if (hole && !cow_read) {
bio = bl_submit_bio(READ, bio);
/* Fill hole w/ zeroes w/o accessing device */
dprintk("%s Zeroing page for hole\n", __func__);
zero_user_segment(pages[i], 0, PAGE_CACHE_SIZE);
zero_user_segment(pages[i], pg_offset, pg_len);
} else {
struct pnfs_block_extent *be_read;
be_read = (hole && cow_read) ? cow_read : be;
bio = bl_add_page_to_bio(bio, rdata->pages.npages - i,
bio = do_add_page_to_bio(bio, rdata->pages.npages - i,
isect, pages[i], be_read,
bl_end_io_read, par);
bl_end_io_read, par,
pg_offset, pg_len);
if (IS_ERR(bio)) {
header->pnfs_error = PTR_ERR(bio);
bio = NULL;
goto out;
isect += (pg_len >> SECTOR_SHIFT);
extent_length -= PAGE_CACHE_SECTORS;
if ((isect << SECTOR_SHIFT) >= header->inode->i_size) {
rdata->res.eof = 1;
rdata->res.count = header->inode->i_size - f_offset;
rdata->res.count = header->inode->i_size - rdata->args.offset;
} else {
rdata->res.count = (isect << SECTOR_SHIFT) - f_offset;
rdata->res.count = (isect << SECTOR_SHIFT) - rdata->args.offset;
......@@ -461,6 +485,106 @@ map_block(struct buffer_head *bh, sector_t isect, struct pnfs_block_extent *be)
static void
bl_read_single_end_io(struct bio *bio, int error)
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct page *page = bvec->bv_page;
/* Only one page in bvec */
static int
bl_do_readpage_sync(struct page *page, struct pnfs_block_extent *be,
unsigned int offset, unsigned int len)
struct bio *bio;
struct page *shadow_page;
sector_t isect;
char *kaddr, *kshadow_addr;
int ret = 0;
dprintk("%s: offset %u len %u\n", __func__, offset, len);
shadow_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
if (shadow_page == NULL)
return -ENOMEM;
bio = bio_alloc(GFP_NOIO, 1);
if (bio == NULL)
return -ENOMEM;
isect = (page->index << PAGE_CACHE_SECTOR_SHIFT) +
(offset / SECTOR_SIZE);
bio->bi_sector = isect - be->be_f_offset + be->be_v_offset;
bio->bi_bdev = be->be_mdev;
bio->bi_end_io = bl_read_single_end_io;
if (bio_add_page(bio, shadow_page,
SECTOR_SIZE, round_down(offset, SECTOR_SIZE)) == 0) {
return -EIO;
submit_bio(READ, bio);
if (unlikely(!test_bit(BIO_UPTODATE, &bio->bi_flags))) {
ret = -EIO;
} else {
kaddr = kmap_atomic(page);
kshadow_addr = kmap_atomic(shadow_page);
memcpy(kaddr + offset, kshadow_addr + offset, len);
return ret;
static int
bl_read_partial_page_sync(struct page *page, struct pnfs_block_extent *be,
unsigned int dirty_offset, unsigned int dirty_len,
bool full_page)
int ret = 0;
unsigned int start, end;
if (full_page) {
start = 0;
} else {
start = round_down(dirty_offset, SECTOR_SIZE);
end = round_up(dirty_offset + dirty_len, SECTOR_SIZE);
dprintk("%s: offset %u len %d\n", __func__, dirty_offset, dirty_len);
if (!be) {
zero_user_segments(page, start, dirty_offset,
dirty_offset + dirty_len, end);
if (start == 0 && end == PAGE_CACHE_SIZE &&
trylock_page(page)) {
return ret;
if (start != dirty_offset)
ret = bl_do_readpage_sync(page, be, start, dirty_offset - start);
if (!ret && (dirty_offset + dirty_len < end))
ret = bl_do_readpage_sync(page, be, dirty_offset + dirty_len,
end - dirty_offset - dirty_len);
return ret;
/* Given an unmapped page, zero it or read in page for COW, page is locked
* by caller.
......@@ -494,7 +618,6 @@ init_page_for_write(struct page *page, struct pnfs_block_extent *cow_read)
if (bh)
if (ret) {
......@@ -566,6 +689,7 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
struct parallel_io *par = NULL;
loff_t offset = wdata->args.offset;
size_t count = wdata->args.count;
unsigned int pg_offset, pg_len, saved_len;
struct page **pages = wdata->args.pages;
struct page *page;
pgoff_t index;
......@@ -574,10 +698,13 @@ bl_write_pagelist(struct nfs_write_data *wdata, int sync)
NFS_SERVER(header->inode)->pnfs_blksize >> PAGE_CACHE_SHIFT;
dprintk("%s enter, %Zu@%lld\n", __func__, count, offset);
/* Check for alignment first */
if (!bl_check_alignment(offset, count, PAGE_CACHE_MASK))
goto out_mds;
if (header->dreq != NULL &&
(!IS_ALIGNED(offset, NFS_SERVER(header->inode)->pnfs_blksize) ||
!IS_ALIGNED(count, NFS_SERVER(header->inode)->pnfs_blksize))) {
dprintk("pnfsblock nonblock aligned DIO writes. Resend MDS\n");
goto out_mds;
/* At this point, wdata->pages is a (sequential) list of nfs_pages.
* We want to write each, and if there is an error set pnfs_error
* to have it redone using nfs.
......@@ -674,10 +801,11 @@ next_page:
if (!extent_length) {
/* We've used up the previous extent */
bio = bl_submit_bio(WRITE, bio);
/* Get the next one */
be = bl_find_get_extent(BLK_LSEG2EXT(header->lseg),
isect, NULL);
isect, &cow_read);
if (!be || !is_writable(be, isect)) {
header->pnfs_error = -EINVAL;
goto out;
......@@ -694,7 +822,26 @@ next_page:
extent_length = be->be_length -
(isect - be->be_f_offset);
if (be->be_state == PNFS_BLOCK_INVALID_DATA) {
dprintk("%s offset %lld count %Zu\n", __func__, offset, count);
pg_offset = offset & ~PAGE_CACHE_MASK;
if (pg_offset + count > PAGE_CACHE_SIZE)
pg_len = PAGE_CACHE_SIZE - pg_offset;
pg_len = count;
saved_len = pg_len;
if (be->be_state == PNFS_BLOCK_INVALID_DATA &&
!bl_is_sector_init(be->be_inval, isect)) {
ret = bl_read_partial_page_sync(pages[i], cow_read,
pg_offset, pg_len, true);
if (ret) {
dprintk("%s bl_read_partial_page_sync fail %d\n",
__func__, ret);
header->pnfs_error = ret;
goto out;
ret = bl_mark_sectors_init(be->be_inval, isect,
if (unlikely(ret)) {
......@@ -703,15 +850,35 @@ next_page:
header->pnfs_error = ret;
goto out;
/* Expand to full page write */
pg_offset = 0;
} else if ((pg_offset & (SECTOR_SIZE - 1)) ||
(pg_len & (SECTOR_SIZE - 1))){
/* ahh, nasty case. We have to do sync full sector
* read-modify-write cycles.
unsigned int saved_offset = pg_offset;
ret = bl_read_partial_page_sync(pages[i], be, pg_offset,