Commit eedf265a authored by Eric W. Biederman, committed by Linus Torvalds

devpts: Make each mount of devpts an independent filesystem.

The /dev/ptmx device node is changed to look up the directory entry "pts"
in the same directory in which the /dev/ptmx device node was opened.  If
there is a "pts" entry and that entry is a devpts filesystem, /dev/ptmx
uses that filesystem.  Otherwise the open of /dev/ptmx fails.

The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
userspace can now safely depend on each mount of devpts creating a new
instance of the filesystem.

Each mount of devpts is now a separate and equal filesystem.

Reserved ttys are now available to all instances of devpts where the
mounter is in the initial mount namespace.

A new vfs helper path_pts is introduced that finds a directory entry
named "pts" in the directory of the passed-in path, and changes the
passed-in path to point to it.  The helper path_pts uses a function
path_parent_directory that was factored out of follow_dotdot.

In the implementation of devpts:
 - devpts_mnt is killed as it is no...
parent 049ec1b5
Each mount of the devpts filesystem is now distinct such that ptys
and their indices allocated in one mount are independent from ptys
and their indices in all other mounts.
To support containers, we now allow multiple instances of devpts filesystem,
such that indices of ptys allocated in one instance are independent of indices
allocated in other instances of devpts.
All mounts of the devpts filesystem now create a /dev/pts/ptmx node
with permissions 0000.
To preserve backward compatibility, this support for multiple instances is
enabled only if:
To retain backwards compatibility, a ptmx device node (aka any node
created with "mknod name c 5 2"), when opened, will look for an instance
of devpts under the name "pts" in the same directory as the ptmx device
node.
- CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and
- '-o newinstance' mount option is specified while mounting devpts
IOW, devpts now supports both single-instance and multi-instance semantics.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and
this is referred to as the "legacy" mode. In this mode, the new mount options
(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message
on console.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the
'newinstance' option (as in current start-up scripts) the new mount binds
to the initial kernel mount of devpts. This mode is referred to as the
'single-instance' mode and the current, single-instance semantics are
preserved, i.e. PTYs are common across the system.
The only difference between this single-instance mode and the legacy mode
is the presence of new, '/dev/pts/ptmx' node with permissions 0000, which
can safely be ignored.
If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified,
the mount is considered to be in the multi-instance mode and a new instance
of the devpts fs is created. Any ptys created in this instance are independent
of ptys in other instances of devpts. As in the single-instance mode, the
/dev/pts/ptmx node is present. To effectively use the multi-instance mode,
open of /dev/ptmx must be redirected to '/dev/pts/ptmx' using a symlink or
bind-mount.
Eg: A container startup script could do the following:
$ chmod 0666 /dev/pts/ptmx
$ rm /dev/ptmx
$ ln -s pts/ptmx /dev/ptmx
$ ns_exec -cm /bin/bash
# We are now in new container
$ umount /dev/pts
$ mount -t devpts -o newinstance lxcpts /dev/pts
$ sshd -p 1234
where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
/bin/bash in the child process. A pty created by the sshd is not visible in
the original mount of /dev/pts.
As an option, instead of placing a /dev/ptmx device node at /dev/ptmx,
it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or
to bind mount /dev/pts/ptmx to /dev/ptmx. If you opt for using
the devpts filesystem in this manner, devpts should be mounted with
ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.
Total count of pty pairs in all instances is limited by sysctls:
kernel.pty.max = 4096 - global limit
kernel.pty.reserve = 1024 - reserve for initial instance
kernel.pty.reserve = 1024 - reserved for filesystems mounted from the initial mount namespace
kernel.pty.nr - current count of ptys
Per-instance limit could be set by adding mount option "max=<count>".
This feature was added in kernel 3.4 together with sysctl kernel.pty.reserve.
In kernels older than 3.4 sysctl kernel.pty.max works as per-instance limit.
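The interaction between kernel.pty.max and kernel.pty.reserve can be modelled in a few lines of userspace C. This is only a sketch of the comparison made in devpts_new_index() in this commit; the function name pty_limit_reached and the parameter may_use_reserve are hypothetical, and pty_count stands for the system-wide count reported by kernel.pty.nr.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the limit check.
 * pty_count:       ptys currently allocated across all instances
 * pty_limit:       kernel.pty.max
 * pty_reserve:     kernel.pty.reserve
 * may_use_reserve: true for instances mounted from the initial
 *                  mount namespace
 * Returns true when a new allocation would be refused (ENOSPC).
 */
bool pty_limit_reached(int pty_count, int pty_limit,
                       int pty_reserve, bool may_use_reserve)
{
    int usable;

    assert(pty_reserve >= 0 && pty_limit >= pty_reserve);

    /* Instances without access to the reserve must leave it untouched. */
    usable = pty_limit - (may_use_reserve ? 0 : pty_reserve);
    return pty_count >= usable;
}
```

With the default values listed above (max = 4096, reserve = 1024), an instance mounted outside the initial mount namespace is refused once 3072 ptys exist system-wide, while an instance mounted from the initial mount namespace can continue up to 4096.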
User-space changes
------------------
In multi-instance mode (i.e. '-o newinstance' mount option is specified at least
once), the following user-space issues should be noted.
1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored
and no change is needed to system-startup scripts.
2. To effectively use multi-instance mode (i.e -o newinstance is specified)
administrators or startup scripts should "redirect" open of /dev/ptmx to
/dev/pts/ptmx using either a bind mount or symlink.
$ mount -t devpts -o newinstance devpts /dev/pts
followed by either
$ rm /dev/ptmx
$ ln -s pts/ptmx /dev/ptmx
$ chmod 666 /dev/pts/ptmx
or
$ mount -o bind /dev/pts/ptmx /dev/ptmx
3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it
enables better error-reporting and treats both single-instance and
multi-instance mounts similarly.
But this method requires that system-startup scripts set the mode of
/dev/pts/ptmx correctly (default mode is 0000). The scripts can set the
mode by, either
- adding ptmxmode mount option to devpts entry in /etc/fstab, or
- using 'chmod 0666 /dev/pts/ptmx'
4. If multi-instance mode mount is needed for containers, but the system
startup scripts have not yet been updated, container-startup scripts
should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single-
instance mounts.
Or, in general, container-startup scripts should use:
mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts
if [ ! -L /dev/ptmx ]; then
mount -o bind /dev/pts/ptmx /dev/ptmx
fi
When all devpts mounts are multi-instance, /dev/ptmx can permanently be
a symlink to pts/ptmx and the bind mount can be ignored.
5. A multi-instance mount that is not accompanied by the /dev/ptmx to
/dev/pts/ptmx redirection would result in an unusable/unreachable pty.
mount -t devpts -o newinstance lxcpts /dev/pts
immediately followed by:
open("/dev/ptmx")
would create a pty, say /dev/pts/7, in the initial kernel mount.
But /dev/pts/7 would be invisible in the new mount.
6. The permissions for /dev/pts/ptmx node should be specified when mounting
/dev/pts, using the '-o ptmxmode=%o' mount option (default is 0000).
mount -t devpts -o newinstance -o ptmxmode=0644 devpts /dev/pts
The permissions can later be changed as usual with 'chmod'.
chmod 666 /dev/pts/ptmx
7. A mount of devpts without the 'newinstance' option results in binding to
initial kernel mount. This behavior, while preserving legacy semantics,
does not provide strict isolation in a container environment, i.e. by
mounting devpts without the 'newinstance' option, a container could
get visibility into the 'host' or root container's devpts.
To work around this and have strict isolation, all mounts of devpts,
including the mount in the root container, should use the newinstance
option.
@@ -120,17 +120,6 @@ config UNIX98_PTYS
All modern Linux systems use the Unix98 ptys. Say Y unless
you're on an embedded system and want to conserve memory.
config DEVPTS_MULTIPLE_INSTANCES
bool "Support multiple instances of devpts"
depends on UNIX98_PTYS
default n
---help---
Enable support for multiple instances of devpts filesystem.
If you want to have isolated PTY namespaces (eg: in containers),
say Y here. Otherwise, say N. If enabled, each mount of devpts
filesystem with the '-o newinstance' option will create an
independent PTY namespace.
config LEGACY_PTYS
bool "Legacy (BSD) PTY support"
default y
@@ -668,7 +668,7 @@ static void pty_unix98_remove(struct tty_driver *driver, struct tty_struct *tty)
else
fsi = tty->link->driver_data;
devpts_kill_index(fsi, tty->index);
devpts_put_ref(fsi);
devpts_release(fsi);
}
static const struct tty_operations ptm_unix98_ops = {
@@ -733,10 +733,11 @@ static int ptmx_open(struct inode *inode, struct file *filp)
if (retval)
return retval;
fsi = devpts_get_ref(inode, filp);
retval = -ENODEV;
if (!fsi)
fsi = devpts_acquire(filp);
if (IS_ERR(fsi)) {
retval = PTR_ERR(fsi);
goto out_free_file;
}
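The move from a NULL check to IS_ERR()/PTR_ERR() in this hunk lets devpts_acquire() tell ptmx_open() why it failed, instead of collapsing every failure into -ENODEV. The error-pointer idiom can be sketched in userspace as follows. The sketch_* names are hypothetical, and the 4095 bound is an assumption modelled on the kernel's MAX_ERRNO; this is an imitation for illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095   /* assumed to mirror the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer. */
void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Recover the negative errno value from an error pointer. */
long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True if p lies in the top SKETCH_MAX_ERRNO addresses, i.e. encodes an error. */
int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Toy stand-in for an acquire function: fails with -ENODEV when !found. */
void *sketch_acquire(int found, void *fsi)
{
    assert(fsi != NULL);
    return found ? fsi : sketch_err_ptr(-ENODEV);
}
```

A caller then tests the result with sketch_is_err() and propagates sketch_ptr_err() as its return value, which is the shape the new ptmx_open() code takes.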
/* find a device that is not in use. */
mutex_lock(&devpts_mutex);
@@ -745,7 +746,7 @@ static int ptmx_open(struct inode *inode, struct file *filp)
retval = index;
if (index < 0)
goto out_put_ref;
goto out_put_fsi;
mutex_lock(&tty_mutex);
@@ -789,8 +790,8 @@ err_release:
return retval;
out:
devpts_kill_index(fsi, index);
out_put_ref:
devpts_put_ref(fsi);
out_put_fsi:
devpts_release(fsi);
out_free_file:
tty_free_file(filp);
return retval;
@@ -95,8 +95,6 @@ static struct ctl_table pty_root_table[] = {
static DEFINE_MUTEX(allocated_ptys_lock);
static struct vfsmount *devpts_mnt;
struct pts_mount_opts {
int setuid;
int setgid;
@@ -104,7 +102,7 @@ struct pts_mount_opts {
kgid_t gid;
umode_t mode;
umode_t ptmxmode;
int newinstance;
int reserve;
int max;
};
@@ -117,11 +115,9 @@ static const match_table_t tokens = {
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
{Opt_mode, "mode=%o"},
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
{Opt_ptmxmode, "ptmxmode=%o"},
{Opt_newinstance, "newinstance"},
{Opt_max, "max=%d"},
#endif
{Opt_err, NULL}
};
@@ -137,15 +133,48 @@ static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
return sb->s_fs_info;
}
static inline struct super_block *pts_sb_from_inode(struct inode *inode)
struct pts_fs_info *devpts_acquire(struct file *filp)
{
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
return inode->i_sb;
#endif
if (!devpts_mnt)
return NULL;
return devpts_mnt->mnt_sb;
struct pts_fs_info *result;
struct path path;
struct super_block *sb;
int err;
path = filp->f_path;
path_get(&path);
/* Has the devpts filesystem already been found? */
sb = path.mnt->mnt_sb;
if (sb->s_magic != DEVPTS_SUPER_MAGIC) {
/* Is a devpts filesystem at "pts" in the same directory? */
err = path_pts(&path);
if (err) {
result = ERR_PTR(err);
goto out;
}
/* Is the path the root of a devpts filesystem? */
result = ERR_PTR(-ENODEV);
sb = path.mnt->mnt_sb;
if ((sb->s_magic != DEVPTS_SUPER_MAGIC) ||
(path.mnt->mnt_root != sb->s_root))
goto out;
}
/*
* pty code needs to hold extra references in case of last /dev/tty close
*/
atomic_inc(&sb->s_active);
result = DEVPTS_SB(sb);
out:
path_put(&path);
return result;
}
void devpts_release(struct pts_fs_info *fsi)
{
deactivate_super(fsi->sb);
}
#define PARSE_MOUNT 0
@@ -154,9 +183,7 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
/*
* parse_mount_options():
* Set @opts to mount options specified in @data. If an option is not
* specified in @data, set it to its default value. The exception is
* 'newinstance' option which can only be set/cleared on a mount (i.e.
* cannot be changed during remount).
* specified in @data, set it to its default value.
*
* Note: @data may be NULL (in which case all options are set to default).
*/
@@ -174,9 +201,12 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
opts->ptmxmode = DEVPTS_DEFAULT_PTMX_MODE;
opts->max = NR_UNIX98_PTY_MAX;
/* newinstance makes sense only on initial mount */
/* Only allow instances mounted from the initial mount
* namespace to tap the reserve pool of ptys.
*/
if (op == PARSE_MOUNT)
opts->newinstance = 0;
opts->reserve =
(current->nsproxy->mnt_ns == init_task.nsproxy->mnt_ns);
while ((p = strsep(&data, ",")) != NULL) {
substring_t args[MAX_OPT_ARGS];
@@ -211,16 +241,12 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
return -EINVAL;
opts->mode = option & S_IALLUGO;
break;
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
case Opt_ptmxmode:
if (match_octal(&args[0], &option))
return -EINVAL;
opts->ptmxmode = option & S_IALLUGO;
break;
case Opt_newinstance:
/* newinstance makes sense only on initial mount */
if (op == PARSE_MOUNT)
opts->newinstance = 1;
break;
case Opt_max:
if (match_int(&args[0], &option) ||
@@ -228,7 +254,6 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
return -EINVAL;
opts->max = option;
break;
#endif
default:
pr_err("called with bogus options\n");
return -EINVAL;
@@ -238,7 +263,6 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
return 0;
}
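With the #ifdef guards gone, ptmxmode and max are always parsed, and newinstance is still accepted but no longer changes anything. A rough userspace sketch of that behavior follows. The names parse_pts_opts, pts_opts_sketch and SKETCH_PTY_MAX are hypothetical, the range check on max is an assumption (the corresponding kernel line is elided in this diff), and only three of the options are handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PTY_MAX (1 << 20)    /* assumed stand-in for NR_UNIX98_PTY_MAX */

struct pts_opts_sketch {
    unsigned ptmxmode;  /* permission bits of the ptmx node */
    int max;            /* per-instance pty limit */
};

/* Parse a comma-separated option string. Returns 0 on success, -1 on
 * an unknown or malformed option. */
int parse_pts_opts(const char *data, struct pts_opts_sketch *opts)
{
    assert(opts != NULL);
    opts->ptmxmode = 0;             /* documented default mode is 0000 */
    opts->max = SKETCH_PTY_MAX;

    while (data != NULL && *data != '\0') {
        const char *end = strchr(data, ',');
        size_t len = end ? (size_t)(end - data) : strlen(data);
        char tok[32], *stop;
        long val;

        if (len >= sizeof tok)
            return -1;
        memcpy(tok, data, len);
        tok[len] = '\0';

        if (len == 0) {
            /* empty token between commas: skip it */
        } else if (strcmp(tok, "newinstance") == 0) {
            /* still accepted, but has no effect any more */
        } else if (strncmp(tok, "ptmxmode=", 9) == 0) {
            val = strtol(tok + 9, &stop, 8);        /* octal, like %o */
            if (stop == tok + 9 || *stop != '\0' || val < 0)
                return -1;
            opts->ptmxmode = (unsigned)val & 07777; /* same bits as S_IALLUGO */
        } else if (strncmp(tok, "max=", 4) == 0) {
            val = strtol(tok + 4, &stop, 10);
            if (stop == tok + 4 || *stop != '\0' ||
                val < 0 || val > SKETCH_PTY_MAX)    /* assumed range check */
                return -1;
            opts->max = (int)val;
        } else {
            return -1;                              /* "bogus option" */
        }
        data = end ? end + 1 : NULL;
    }
    return 0;
}
```

For example, the string "newinstance,ptmxmode=0666,max=100" parses to ptmxmode 0666 and max 100, with newinstance silently ignored.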
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
static int mknod_ptmx(struct super_block *sb)
{
int mode;
@@ -305,12 +329,6 @@ static void update_ptmx_mode(struct pts_fs_info *fsi)
inode->i_mode = S_IFCHR|fsi->mount_opts.ptmxmode;
}
}
#else
static inline void update_ptmx_mode(struct pts_fs_info *fsi)
{
return;
}
#endif
static int devpts_remount(struct super_block *sb, int *flags, char *data)
{
@@ -344,11 +362,9 @@ static int devpts_show_options(struct seq_file *seq, struct dentry *root)
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, opts->gid));
seq_printf(seq, ",mode=%03o", opts->mode);
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
seq_printf(seq, ",ptmxmode=%03o", opts->ptmxmode);
if (opts->max < NR_UNIX98_PTY_MAX)
seq_printf(seq, ",max=%d", opts->max);
#endif
return 0;
}
@@ -410,40 +426,11 @@ fail:
return -ENOMEM;
}
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
static int compare_init_pts_sb(struct super_block *s, void *p)
{
if (devpts_mnt)
return devpts_mnt->mnt_sb == s;
return 0;
}
/*
* devpts_mount()
*
* If the '-o newinstance' mount option was specified, mount a new
* (private) instance of devpts. PTYs created in this instance are
* independent of the PTYs in other devpts instances.
*
* If the '-o newinstance' option was not specified, mount/remount the
* initial kernel mount of devpts. This type of mount gives the
* legacy, single-instance semantics.
*
* The 'newinstance' option is needed to support multiple namespace
* semantics in devpts while preserving backward compatibility of the
* current 'single-namespace' semantics. i.e all mounts of devpts
* without the 'newinstance' mount option should bind to the initial
* kernel mount, like mount_single().
*
* Mounts with 'newinstance' option create a new, private namespace.
*
* NOTE:
*
* For single-mount semantics, devpts cannot use mount_single(),
* because mount_single()/sget() find and use the super-block from
* the most recent mount of devpts. But that recent mount may be a
* 'newinstance' mount and mount_single() would pick the newinstance
* super-block instead of the initial super-block.
* Mount a new (private) instance of devpts. PTYs created in this
* instance are independent of the PTYs in other devpts instances.
*/
static struct dentry *devpts_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
@@ -456,18 +443,7 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
if (error)
return ERR_PTR(error);
/* Require newinstance for all user namespace mounts to ensure
* the mount options are not changed.
*/
if ((current_user_ns() != &init_user_ns) && !opts.newinstance)
return ERR_PTR(-EINVAL);
if (opts.newinstance)
s = sget(fs_type, NULL, set_anon_super, flags, NULL);
else
s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags,
NULL);
s = sget(fs_type, NULL, set_anon_super, flags, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
@@ -491,18 +467,6 @@ out_undo_sget:
return ERR_PTR(error);
}
#else
/*
* This supports only the legacy single-instance semantics (no
* multiple-instance semantics)
*/
static struct dentry *devpts_mount(struct file_system_type *fs_type, int flags,
const char *dev_name, void *data)
{
return mount_single(fs_type, flags, data, devpts_fill_super);
}
#endif
static void devpts_kill_sb(struct super_block *sb)
{
struct pts_fs_info *fsi = DEVPTS_SB(sb);
@@ -516,9 +480,7 @@ static struct file_system_type devpts_fs_type = {
.name = "devpts",
.mount = devpts_mount,
.kill_sb = devpts_kill_sb,
#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
.fs_flags = FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
#endif
};
/*
@@ -531,16 +493,13 @@ int devpts_new_index(struct pts_fs_info *fsi)
int index;
int ida_ret;
if (!fsi)
return -ENODEV;
retry:
if (!ida_pre_get(&fsi->allocated_ptys, GFP_KERNEL))
return -ENOMEM;
mutex_lock(&allocated_ptys_lock);
if (pty_count >= pty_limit -
(fsi->mount_opts.newinstance ? pty_reserve : 0)) {
if (pty_count >= (pty_limit -
(fsi->mount_opts.reserve ? 0 : pty_reserve))) {
mutex_unlock(&allocated_ptys_lock);
return -ENOSPC;
}
@@ -571,30 +530,6 @@ void devpts_kill_index(struct pts_fs_info *fsi, int idx)
mutex_unlock(&allocated_ptys_lock);
}
/*
* pty code needs to hold extra references in case of last /dev/tty close
*/
struct pts_fs_info *devpts_get_ref(struct inode *ptmx_inode, struct file *file)
{
struct super_block *sb;
struct pts_fs_info *fsi;
sb = pts_sb_from_inode(ptmx_inode);
if (!sb)
return NULL;
fsi = DEVPTS_SB(sb);
if (!fsi)
return NULL;
atomic_inc(&sb->s_active);
return fsi;
}
void devpts_put_ref(struct pts_fs_info *fsi)
{
deactivate_super(fsi->sb);
}
/**
* devpts_pty_new -- create a new inode in /dev/pts/
* @ptmx_inode: inode of the master
@@ -607,16 +542,12 @@ void devpts_put_ref(struct pts_fs_info *fsi)
struct dentry *devpts_pty_new(struct pts_fs_info *fsi, int index, void *priv)
{
struct dentry *dentry;
struct super_block *sb;
struct super_block *sb = fsi->sb;
struct inode *inode;
struct dentry *root;
struct pts_mount_opts *opts;
char s[12];
if (!fsi)
return ERR_PTR(-ENODEV);
sb = fsi->sb;
root = sb->s_root;
opts = &fsi->mount_opts;
@@ -676,20 +607,8 @@ void devpts_pty_kill(struct dentry *dentry)
static int __init init_devpts_fs(void)
{
int err = register_filesystem(&devpts_fs_type);
struct ctl_table_header *table;
if (!err) {
struct vfsmount *mnt;
table = register_sysctl_table(pty_root_table);
mnt = kern_mount(&devpts_fs_type);
if (IS_ERR(mnt)) {
err = PTR_ERR(mnt);
unregister_filesystem(&devpts_fs_type);
unregister_sysctl_table(table);
} else {
devpts_mnt = mnt;
}
register_sysctl_table(pty_root_table);
}
return err;
}
@@ -1416,21 +1416,28 @@ static void follow_mount(struct path *path)
}
}
static int path_parent_directory(struct path *path)
{
struct dentry *old = path->dentry;
/* rare case of legitimate dget_parent()... */
path->dentry = dget_parent(path->dentry);
dput(old);
if (unlikely(!path_connected(path)))
return -ENOENT;
return 0;
}
static int follow_dotdot(struct nameidata *nd)
{
while(1) {
struct dentry *old = nd->path.dentry;
if (nd->path.dentry == nd->root.dentry &&
nd->path.mnt == nd->root.mnt) {
break;
}
if (nd->path.dentry != nd->path.mnt->mnt_root) {
/* rare case of legitimate dget_parent()... */
nd->path.dentry = dget_parent(nd->path.dentry);
dput(old);
if (unlikely(!path_connected(&nd->path)))
return -ENOENT;
int ret = path_parent_directory(&nd->path);
if (ret)
return ret;
break;
}
if (!follow_up(&nd->path))
@@ -2514,6 +2521,34 @@ struct dentry *lookup_one_len_unlocked(const char *name,
}
EXPORT_SYMBOL(lookup_one_len_unlocked);
#ifdef CONFIG_UNIX98_PTYS
int path_pts(struct path *path)
{
/* Find something mounted on "pts" in the same directory as
* the input path.
*/
struct dentry *child, *parent;
struct qstr this;
int ret;
ret = path_parent_directory(path);
if (ret)
return ret;
parent = path->dentry;
this.name = "pts";
this.len = 3;
child = d_hash_and_lookup(parent, &this);
if (!child)
return -ENOENT;
path->dentry = child;
dput(parent);
follow_mount(path);
return 0;
}
#endif
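The naming rule that path_pts applies (step to the parent directory of the ptmx node, then look for an entry called "pts" there) can be illustrated with plain path strings. The helper below is a hypothetical userspace sketch: sibling_pts_path is not a kernel function, and it only computes a name, whereas the kernel works on dentries and additionally follows the mount found at that entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<directory of ptmx_path>/pts" into out.
 * Returns 0 on success, -1 if the result does not fit in outlen bytes.
 */
int sibling_pts_path(const char *ptmx_path, char *out, size_t outlen)
{
    const char *slash;
    int n;

    assert(ptmx_path != NULL && out != NULL);

    slash = strrchr(ptmx_path, '/');
    if (slash == NULL)
        /* bare name: "pts" lives in the same (current) directory */
        n = snprintf(out, outlen, "pts");
    else
        /* keep everything before the last '/', then append "/pts" */
        n = snprintf(out, outlen, "%.*s/pts",
                     (int)(slash - ptmx_path), ptmx_path);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Under this rule a ptmx node created inside a container root, for example at /srv/c1/dev/ptmx (an illustrative path), resolves to the devpts instance mounted at /srv/c1/dev/pts rather than to the host's /dev/pts.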
int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
struct path *path, int *empty)
{
@@ -15,13 +15,12 @@
#include <linux/errno.h>
struct pts_fs_info;
#ifdef CONFIG_UNIX98_PTYS
/* Look up a pts fs info and get a ref to it */
struct pts_fs_info *devpts_get_ref(struc