Howto/FAQ project vserver

1: contributions

1.1: documentation

Is there other documentation available ?

There is the excellent FAQ written by Paul Sladen at http://www.paul.sladen.org/vserver/faq

1.2: kernel

A patch to have ext3 in the 2.4.13-ctx-3 kernel

Contributed by Guillaume Dallaire (info@guillaum.org). The patch is available here.

Are there any patches against the -ac kernel versions (Alan Cox dev tree)

Some patches are maintained by Paul Kreiner (deacon@thedeacon.org). You can find them at http://thedeacon.org/patches

Is it working on kernel 2.2

A patch is available at http://vserver.digitalangel.com.au/patch-2.2.20ctx-8 .

You will find some notes about the patch here

2: File-systems

2.1: Access

2.1.1: Sharing

Is it possible to share one area of a file system between vservers

Vservers are running in a chroot environment. As such, they can only see what is under their / directory. So at first it does not sound possible to share one area or one file-system between several virtual servers.

There is an option. Kernel 2.4 allows one volume to be mounted several times at different mount points. Say you have a volume /dev/hda3 and you would like to share it between vservers v1 and v2. You can do the following

	mkdir /vservers/v1/data
	mkdir /vservers/v2/data
	mount /dev/hda3 /vservers/v1/data
	mount /dev/hda3 /vservers/v2/data

You can fill the /etc/fstab file so that /dev/hda3 is mounted at boot time.
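
As a rough sketch of what those /etc/fstab entries could look like, the small function below prints one line per vserver for a shared device. The function name fstab_lines, the ext2 file-system type and the "defaults" options are assumptions picked for illustration; adapt them to your volume.

```shell
#!/bin/sh
# Hypothetical helper: print one fstab line per vserver for a shared device.
# usage: fstab_lines DEVICE FSTYPE VSERVER...
fstab_lines() {
	dev=$1
	fstype=$2
	shift 2
	for v in "$@"; do
		# fields: device, mount point, fs type, options, dump, fsck pass
		printf '%s\t/vservers/%s/data\t%s\tdefaults\t0 0\n' "$dev" "$v" "$fstype"
	done
}

# ext2 is only an example file-system type
fstab_lines /dev/hda3 ext2 v1 v2
```

Review the output before appending it to /etc/fstab.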

This is not completely flexible since you can only share a full partition. If you want to share a smaller area (and potentially several of those small areas), you can loopback mount a file and share it. For example:

	dd if=/dev/zero bs=1024k count=10 of=/var/data
	/sbin/losetup /dev/loop0 /var/data
	/sbin/mke2fs /dev/loop0
	mount /dev/loop0 /vservers/v1/data
	mount /dev/loop0 /vservers/v2/data

Kernel 2.4 also supports the mount --bind option. This allows one to connect a directory in multiple places in the file system, even if this directory is not a mount point. For example, you may want to create several vservers to test various distributions, yet you want to share the /home directory between them. The following command will do the job. This is probably the easiest way to share data between vservers.

	mount --bind /home /vservers/name/home
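
To apply the same bind mount to several vservers, a loop along these lines may help. This is only a sketch: the function name bind_home_cmds is hypothetical, and it prints the commands instead of running them so you can review them first.

```shell
#!/bin/sh
# Hypothetical helper: print the mount --bind command for each vserver named.
bind_home_cmds() {
	for v in "$@"; do
		echo "mount --bind /home /vservers/$v/home"
	done
}

# v1, v2 and v3 are example vserver names
bind_home_cmds v1 v2 v3
```

Once the output looks right, pipe it to sh as root in the root server to perform the mounts.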

3: general

3.1: administration

Changing the host-name or the IP of a vserver

For a vserver named xxx, do:

How do I know in which security context process N is running

You can tell this from the /proc/N/status file. Do this

	/usr/sbin/chcontext --ctx 1 cat /proc/N/status

Check the s_context entry. You will see the context number (say X). Then do

	/usr/sbin/chcontext --ctx X /bin/sh

You are now in the same security context as process N. You can kill it, trace it, whatever.
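
If you do this often, the lookup can be scripted. The sketch below extracts the context number from status text read on standard input. The function name ctx_from_status is hypothetical, and it assumes the entry appears as a line of the form "s_context: X"; check the output of your own kernel before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/N/status text on stdin and print the
# value of the s_context entry. Assumes a "s_context: X" line layout.
ctx_from_status() {
	awk -F'[: \t]+' '$1 == "s_context" { print $2; exit }'
}

# Typical use on a patched kernel (N is the process id):
#   /usr/sbin/chcontext --ctx 1 cat /proc/N/status | ctx_from_status

# Demonstration on made-up sample text
printf 'Name:\thttpd\ns_context: 4\n' | ctx_from_status
```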

Is it possible to have a different time zone for one vserver

Yes. The timezone is really cosmetic. It is handled by a file called /etc/localtime. Whenever a program presents the system time, it requests the current time from the kernel and receives a GMT time. It then uses /etc/localtime to find out how to translate it to local time. Each vserver may have a different one. Just copy the proper file from /usr/share/zoneinfo over the /etc/localtime of the vserver.

A vserver is not allowed to change the system time unless it has the CAP_SYS_ADMIN capability. But this capability is not needed to have a different timezone.
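
The copy can be wrapped in a small function run from the root server. This is a sketch under assumptions: the function name set_vserver_tz and the VROOT and ZONEINFO variables are made up for this example, and the defaults simply follow the paths used in this FAQ.

```shell
#!/bin/sh
# Hypothetical helper: install a zone file as /etc/localtime of a vserver.
# VROOT and ZONEINFO default to the paths used in this FAQ and may be
# overridden, for example to try the function on a scratch directory.
set_vserver_tz() {
	name=$1
	zone=$2
	vroot=${VROOT:-/vservers}
	zoneinfo=${ZONEINFO:-/usr/share/zoneinfo}
	if [ ! -f "$zoneinfo/$zone" ]; then
		echo "unknown zone: $zone" >&2
		return 1
	fi
	cp "$zoneinfo/$zone" "$vroot/$name/etc/localtime"
}

# Example (run as root in the root server):
#   set_vserver_tz v1 Europe/Paris
```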

Is it possible to move a vserver from one physical server to another

Yes. In fact, a vserver is fairly hardware independent. You can move it from an IDE + uniprocessor server to a SCSI + multiprocessor server without any reconfiguration. Just copy /etc/vservers/XX.conf and /vservers/XX to the new physical server and start it there. To move /vservers/XX, you may want to use rsync. For example

	rsync -e ssh -avHl /vservers/XX/ new-server:/vservers/XX/

will do. You can use this to have fail-over. If a vserver is kept updated on a regular basis, using either rsync, or even network raid, a vserver may be started on a new machine without any fixes.

Is it possible to run vservers based on different distros ?

Yes, no problem. For now the vserver project is a little redhat/mandrake aware, but some people are using it with other distros. Once a service is running in a vserver, it is talking directly to the kernel. So a debian vserver could be running on a redhat server or the reverse. The only issues are

Is it possible to see processes in other vservers

A vserver can only see its own processes + init (to make pstree cute).

The root server can only see its own processes as well, to make the root server less scary to use. "killall httpd" in the root server will kill httpd in the root server only.

The security context number 1 is reserved. This context can see all processes. The vserver package provides 3 little wrappers to help manage all the processes:

Those wrappers are simply doing

	# The vpstree wrapper
	/usr/sbin/chcontext --ctx 1 pstree $*

Only root in the root server is allowed to "jump" into a specific context.

May I rename a vserver ?

To change the name of a vserver from oldname to newname, do the following:

	mv /vservers/oldname /vservers/newname
	mv /etc/vservers/oldname.conf /etc/vservers/newname.conf
	mv /etc/vservers/oldname.sh /etc/vservers/newname.sh

To avoid problems, you must stop the vserver before doing so. If you want to rename the vserver while it is running, you will have to do the following:

Stopping the vserver first is a better idea :-)
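
The three mv commands above can be grouped in a function that refuses to overwrite an existing vserver. This is only a sketch: the name rename_vserver and the VROOT and VCONF variables are invented for this example, and it assumes the vserver has already been stopped.

```shell
#!/bin/sh
# Hypothetical helper wrapping the three mv commands above.
# VROOT and VCONF default to the FAQ paths; override them to rehearse
# the rename on a scratch directory. Stop the vserver before calling it.
rename_vserver() {
	old=$1
	new=$2
	vroot=${VROOT:-/vservers}
	vconf=${VCONF:-/etc/vservers}
	if [ -e "$vroot/$new" ] || [ -e "$vconf/$new.conf" ]; then
		echo "$new already exists" >&2
		return 1
	fi
	mv "$vroot/$old" "$vroot/$new" || return 1
	mv "$vconf/$old.conf" "$vconf/$new.conf" || return 1
	# The .sh hook script is optional, so rename it only if present
	if [ -f "$vconf/$old.sh" ]; then
		mv "$vconf/$old.sh" "$vconf/$new.sh"
	fi
}

# Example: rename_vserver oldname newname
```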

3.1.1: unification

A unified vserver seems as big as the reference, how come ?

You have created a new vserver using the /usr/sbin/newvserver command. You have selected the "unified mode" check-box. Once created, just to make sure, you run the du command on both the reference vserver and the new one

	du /vservers/ref
	du /vservers/new

The du command produced the same result. Not saving much disk space ?

The du command is not the right tool to test this. One easy way to test is to run df before and after the new vserver creation. This will show the exact amount of disk space allocated to the new vserver. Here is the explanation:

Unification is made using hard links. A hard link is another name pointing to the same data. The entity controlling the mapping of a file on a Unix/Linux file system is called an inode. It contains information about the file location (the blocks making up the file), the access rights and ownership, and a few other flags. The name is not stored in the inode itself. A directory contains a list of names. Each name points to an inode. The inode also holds a reference counter so it knows how many directory entries point to it. From this explanation, you see that a file name points to an inode, and does not relate to any other name pointing to the same inode. All we can tell is how many names are pointing to the same inode. Finding which names point to a single inode involves a complete file system traversal, opening every directory to find a name pointing to the given inode.

Here is a little demonstration:

	cd /tmp
	# We create a dummy file and see which inode number it has
	touch dummy
	ls -i dummy
	# The number printed is the inode number
	# Now we create a link to this file
	ln dummy dummy2
	# Now we check that the reference count is 2 since
	# both dummy and dummy2 point to the same inode
	ls -l dummy2
	# What is the inode of dummy2
	ls -i dummy2
	# The same as dummy.

What is the point ? We have two files, dummy and dummy2, each pointing to the same inode. Which one is the real file ? Either one is the real file. I can delete dummy and dummy2 will continue to exist unchanged. I can delete dummy2 and dummy will continue to exist. If I delete both, then the space allocated will be freed.

Back to our du utility

	du dummy
	du dummy2

We are getting the same result. The command does not care about the reference count. dummy2 is as real as dummy. Applied to a unified vserver, we get the same result on the original and the new one. For the du command, neither is more the owner of the files. This also shows how independent two unified vservers are. They are sharing the same data space, yet they are truly independent. A package may be updated in one and it won't affect the other. The vserver ref may be deleted and this won't affect the new vserver.

Sometimes, it is useful to find how much disk space is used by one vserver alone. The /usr/lib/vserver/vdu utility was written for this purpose. It works like du (a minimal one) except it ignores files with more than one link. Hard links are seldom used inside a vserver, so the result is rather precise. vdu will indeed show that your new vserver is not so big after all. But if you apply it to the ref vserver, you will get the same (small) result.
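
The idea of counting only files with a single link can be tried with standard tools. The sketch below is written in the spirit of vdu and is not its actual implementation; the function names single_link_files and single_link_kb are made up for this example.

```shell
#!/bin/sh
# Sketch in the spirit of vdu, not its actual implementation:
# list regular files under a directory that have exactly one link.
single_link_files() {
	find "$1" -type f -links 1 -print
}

# Sum the disk usage of those files in kilobytes (du -k reports
# 1024-byte units). Prints 0 when no file qualifies.
single_link_kb() {
	find "$1" -type f -links 1 -exec du -k {} + |
		awk '{ s += $1 } END { print s + 0 }'
}

# Example: single_link_kb /vservers/new
```

Files shared with another vserver through hard links have a link count of 2 or more, so they are skipped by both functions.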

How to update 10 unified vservers and keep them unified ?

You have 10 vservers and they are unified. So you are saving a good amount of disk space. Although unified vservers are sharing common files through hard links and special immutability flags, they can be updated independently. Well, this is in fact the only way. There is no magic way to update one package on one vserver (the reference one or not) and have the change inherited by the other vservers. The update operation has to be done 10 times.

The /usr/sbin/vrpm utility has been created to ease those updates. For example, say you have 4 vservers v1 v2 v3 and v4 and 3 packages a.rpm, b.rpm and c.rpm to update. You do:

	vrpm v1 v2 v3 v4 -- a.rpm b.rpm c.rpm

or

	vrpm ALL -- a.rpm b.rpm c.rpm

The last command will apply the updates to all your vservers, one after the other.

Now, after performing these steps, you end up with 4 vservers updated independently. The disk space is not unified any more for those 3 packages. To regain unification, you do:

	/usr/lib/vserver/vunify v1 v2 v3 v4 -- a b c

vunify may be used at any time to re-unify vservers. You may want to run it after you have performed major RPM updates.

Is it possible to move a unified vserver without the reference vserver ?

Yes, the unification (hard linking common files) does not establish a parent-hood relation with the reference server. They just end up sharing a common area on the disk drive (the hard linked files). A reference vserver may be updated without affecting the vservers created from it. Once a vserver is created, unified or not, it is fully independent.

Is it possible to use hard links between vservers or the root server

Yes, hard links are low level and work across chroot(). This is exactly what the vunify command is doing to save disk space. Using the immutable ext2 file attribute, you can share files between virtual servers and be sure none can change them.

In fact, newvserver defaults to creating unified vservers (vservers sharing common files using hard links). Using the new immutable-linkage-invert, vservers are sharing common files and using much less disk space (a common vserver is between 20 and 40 megs), yet they can be updated independently without side effects.

3.2: misc

Execution of commands with wild-cards

I would like to execute a command using the /usr/sbin/vserver front-end, but I would like to see the shell wild-card expanded on the other side (inside the vserver).

If you do

	/usr/sbin/vserver server exec command \*

You end up with \* passed to the command directly, without shell expansion. The /usr/sbin/vserver front-end is preserving the arguments as much as possible. So if you escaped something to prevent shell expansion, it will remain that way.

The trick is to use a shell on the other side (in the vserver). The command is simply rewritten like this:

	/usr/sbin/vserver server exec /bin/sh -c "command *"
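
The same quoting rule can be observed with a plain local shell, no vserver needed. The sketch below works in a scratch directory (the directory and file names are arbitrary) and compares an escaped star with a star handed to an inner /bin/sh -c.

```shell
#!/bin/sh
# Local demonstration of the quoting rule; no vserver is needed.
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt"
cd "$demo" || exit 1

# Escaped star: the command receives a literal * as its argument
literal=$(echo \*)

# Star inside sh -c: the inner shell expands it where it runs
expanded=$(/bin/sh -c "echo *")

echo "escaped:  $literal"
echo "expanded: $expanded"
```

With the vserver front-end, the inner shell is the one running inside the vserver, so the expansion happens against the files of the vserver.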

How does this differ from the BSD jail system call

It differs a little. It is somewhat more flexible because it uses 3 system calls (chroot, set_ipv4root, new_s_context) to achieve the job. So each system call may be used independently.

For example, if you want to limit xinetd service in the root server to a single IP, you can do

	/usr/sbin/chbind --ip eth0 /etc/rc.d/init.d/xinetd restart

The package provides the v_xinetd for this purpose. So to get this going, you need very little reconfiguration. No fiddling in configuration files and so on.

I am unsure how the jail system call compares with the new_s_context() I have implemented, though. The latter is used to isolate the process in a private world where it can't see and interact with other processes in the box, except itself. The new_s_context is not privileged, so a normal user can use this to, for example, set up a personal security box before executing a not-so-trusted game.

Also, the new_s_context() syscall allows the root user in the root server to "enter" a running vserver, unlike the jail syscall (which can't add new processes to a running "jail"). On this side, the implementation is also more flexible. This is very useful, because it allows the root server to monitor the vservers and to start and stop them very easily, in a clean way.

How many vservers may run at once ?

A vserver does not use any resource by itself. There is no "invisible" overhead for each vserver. The overhead comes from the tasks you are running inside the vserver. In general a vserver will run minimally

So this is the overhead. Now each vserver will do something useful. Run apache or run mysql for example. Running a task inside a vserver uses the same resources as running it outside (a vserver).

Memory wise, because of the unification, most tasks will be sharing the text (program code), so this is fairly efficient.

Now you may want to run very specialized vservers, potentially running a single task without cron and syslog. The overhead goes down accordingly.

For sure it also depends on the activity of the services. The real issue is probably there. If you run 50 vservers, each running apache and taking enough hits, you may have performance problems.

Anyway, you will have to try. All I can say is that a vserver does not use resources by itself. It only depends on the apps you are running inside, and they are using the same resources inside or outside a vserver.

PS: If you run cron on a redhat distro, beware of tasks like updatedb. With 10 vservers, they will all wake up at 4 in the morning. The load will go up. You may want to disable this.

What about performance

You can expect the exact same performance in a vserver as in the root server. There is no overhead. Processes running in the vserver are talking directly to the kernel. Only a few system calls (kill for one) have special checks to ensure process isolation.

3.3: starting

Is it possible to execute some tasks when a vserver is started

The vserver utility checks if there is a file /etc/vservers/name.sh when it is operating a vserver called "name". This file is a script and is called in four cases: before starting a vserver, after starting it, before stopping it and after stopping it. The first argument is one of pre-start, post-start, pre-stop and post-stop. The second argument is the name of the vserver. A typical script looks like:

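	#!/bin/sh
	# $1 is the event: pre-start, post-start, pre-stop or post-stop
	# $2 is the name of the vserver
	case "$1" in
	pre-start)
		# Make /home visible inside the vserver before it starts
		mount --bind /home "/vservers/$2/home"
		;;
	post-start)
		;;
	pre-stop)
		;;
	post-stop)
		# Release the bind mount once the vserver is stopped
		umount "/vservers/$2/home"
		;;
	esac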

4: issues

4.1: applications

bind does not work in a vserver (capset failed)

The bind package expects to have the capability CAP_SYS_RESOURCE. It expects this because it may need to increase its ulimit. By default, a vserver does not have this capability. A vserver starts with some ulimit values and can only reduce them, not enlarge them. The idea is to control what a vserver can use.

To fix that, one can give the capability to the vserver running bind. Edit the vserver configuration file (/etc/vservers/*.conf) and modify the S_CAPS line like this

	S_CAPS="CAP_SYS_RESOURCE"
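
If you prefer to script the change, something like the sketch below could do it. The function name set_s_caps is hypothetical, and it assumes the configuration file holds plain shell-style assignments with at most one S_CAPS line. Note that it replaces whatever S_CAPS contained before.

```shell
#!/bin/sh
# Hypothetical helper: set the S_CAPS line of a vserver configuration file.
# Replaces an existing S_CAPS= line, or appends one if none is present.
set_s_caps() {
	conf=$1
	caps=$2
	tmp="$conf.tmp.$$"
	if grep -q '^S_CAPS=' "$conf"; then
		sed "s|^S_CAPS=.*|S_CAPS=\"$caps\"|" "$conf" > "$tmp" || return 1
	else
		{ cat "$conf"; printf 'S_CAPS="%s"\n' "$caps"; } > "$tmp" || return 1
	fi
	# Write back in place so the owner and mode of the file are kept
	cat "$tmp" > "$conf" && rm -f "$tmp"
}

# Example: set_s_caps /etc/vservers/v1.conf CAP_SYS_RESOURCE
```

Restart the vserver afterwards so the new capability is taken into account.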

Using DHCP server in a vserver

Since 2.4.18ctx-9, this is possible, but there is a catch. The set_ipv4root call assigns one IP and one broadcast address to a vserver. UDP services listening on 0.0.0.0 (bind any) in a vserver are in fact listening on the vserver IP and the vserver broadcast address. This is all they will get. This is fine for most services.

Unfortunately, dhcpd receives special broadcasts. Its clients do not know their IP number yet, so they use the special broadcast address 255.255.255.255.

A vserver generally runs with the broadcast address of the network device (the one used to set up the IP alias). This network device has a broadcast address which is never 255.255.255.255. Those special broadcasts are not sent to the vserver. The solution is to set the IPROOTBCAST entry in the vserver configuration file like this

	IPROOTBCAST=255.255.255.255

Restart your vserver and dhcpd will work. There is a catch (at least with 2.4.18ctx-9). If you are using other services in the same vserver that also rely on broadcast for proper operation (samba for one), they won't operate properly.

One solution would be to enhance the semantics a little: a vserver would listen for its IP address, its broadcast address and also for 255.255.255.255. The dhcpd case is probably very specific though.

Btw, we are running dhcpd in a vserver because we are using heartbeat to provide failover for this service as well.

5: security

5.1: misc

Vservers can write to /dev/random, is this a problem ?

I found the following post on linux-kernel

which states:

No, writing to /dev/random does not update the entropy estimate. It does mix data into the pool, but the mixing algorithm is designed so that you can do no harm by mixing any data into the pool --- even nasty data chosen by an attacker. Hence, allowing someone to write into /dev/random is perfectly safe; it can cause no damage, and might improve things. That's why /dev/random should be world-writable. There is a separate ioctl which requires root privs to atomically mix data into the pool and update the entropy estimate. That's the interface which is supposed to be used by trusted daemons which pull data from various hardware devices, and feed them into /dev/random.

So writing is safe. How about ioctls ? Some may indeed influence the entropy pool. But they are already protected by the CAP_SYS_ADMIN capability, so even root in a virtual private server can't use them.

5.2: principles

Is a chroot() environment really unbreakable

Since kernel 2.4.17ctx-6, all known issues with chroot are plugged. Root inside a vserver, even with the CAP_SYS_CHROOT capability, can't escape.

Here are the usual tricks used to escape a chroot environment.

So it seems chroot() is safe. Does anyone have more information about this ?