|Running independent Linux servers inside a single PC is now possible. They offer many advantages, including higher security, flexibility and cost reduction.|
Linux computers are getting faster every day. So we should probably end up with less, more powerful servers. Instead we are seeing more and more servers. While there are many reasons for this trend (more services offered), the major issue is more related to security and administrative concerns.
Is it possible to split a Linux server into virtual ones with as much isolation as possible between each one, looking like real servers, yet sharing some common tasks (monitoring, backup, ups, hardware configuration, ...) ?
We think so ... NEW
Who needs that
The short answer is everybody, or everybody managing a server. Here are some applications:
Just think about all the viruses and worms out there, you end up with a big everybody using a computer needs this. :-) NEW
Non reversible isolation
Unix and Linux have always had the chroot() system call. This call was used to trap a process into a sub-directory. After the system-call, the process is led to believe that the sub-directory is now the root directory. This system call can't be reversed. In fact, the only thing a process can do is trap itself further and further in the file-system (calling chroot() again).
The strategy is to introduce new system calls trapping the processes in other areas within the server. NEW
A virtual server is isolated from the rest of the server in 5 areas:
New system calls
The new system calls, as well as the existing chroot() system call are sharing one common feature: Their effect can't be reversed. Once you have executed one of those system call (chroot, new_s_context, set_ipv4root), you can't get back. This affects the current process and all the child processes. The parent process is not influenced.
Those system calls are not privileged. Any user may issue them. NEW
Limiting super-user: The capabilities system
Once you have created a virtual environment where processes have a limited view of the file-system, can't see processes outside of their world and can only use a single IP number, you still must limit the damages those processes can do. The goal is to run virtual environments and provide some root privileges.
How do you limit those root processes from taking over the system, or even just re-booting it. Enter the capability system. This is not new, but we suspect many people have never heard of it.
In the old Unix/Linux days, user root (user ID 0) could do things other user ID could not. All over the place in the kernel, system calls were denying access to some resources unless the user ID of the process (effective ID in fact) was 0. Plain zero.
The only way a process with user ID 0 could loose some privileges was by changing to another ID. Unfortunately this was an all or nothing deal. Enter the capabilities.
Today, the difference between root and the other users is the capability set. User root has all capabilities and the other users have none. The user ID 0 does not mean anything special anymore. There are around 30 capabilities defined currently. A process may request to loose a capability forever. It won't be able to get it back.
Capabilities allows a root process to diminish its power. This is exactly what we need to create custom super-user. A super-user process in a virtual server would have some privileges such as binding port below 1024, but would not be able to reconfigure the network or reboot the machine. Check the file /usr/include/linux/capability.h to learn which one are available.
Note that the new system calls (new_s_context and set_ipv4root) are not controlled by capabilities. They are by nature irreversible. Once a virtual server is trapped in a chroot/s_context/ipv4root box, it can't escape from the parameters of this trap.
Enhancing the capability system
The Linux capability system, is still a work in progress. At some point, we expect to see capabilities attached to programs, generalizing the setuid concept. A setuid program would become a program with all capability granted.
For now, this is not available. As explained above a process may request to loose capabilities and its child process will be trapped with a smaller capability set.
Well, ..., it does not work that way. Unfortunately, until capabilities could be assigned to program, we still need a way to get back capabilities even in a child process. So the irreversible logic of the capabilities is kind of short circuited in the kernel.
To solve this, we have introduced a new per-process capability ceiling (cap_bset). This one represents the capability set inherited by child process, including setuid root child process. Lowering this ceiling is irreversible for a process and all its child.
This ceiling is handled by the new_s_context system call and the reducecap and chcontext utilities (part of the vserver package).
Using this, we can setup a virtual server environment where root has less capabilities, so can't reconfigure the main server.
Playing with the new system calls
The vserver package provides 3 utilities to make use of the new system calls. We will describe shortly how they work and provide few example. We invite the reader to try those example, so it has a better feel and trust.
After re-booting with a kernel implementing the new system calls, and installing the vserver package, one is ready to do experiment. You do not need to be root to test those new utilities. None of them is setuid either. NEW
Playing with /usr/sbin/chcontext
The /usr/sbin/chcontext utility is used to enter into a new security context. The utility switch the security context and execute a program, specified on the command line. This program is now isolated and can't see the other processes running on the server.
The experiment with this, start two command
windows (xterm), as the same user ID. In each window
execute the following commands:
Using chcontext: first window
Using chcontext: second window
In the first window, you start the xterm command (or any command you like). In the second window you execute chcontext. This starts a new shell. You execute pstree and see very little. You attempt to kill the xterm and you fail. You exit this shell and you are back seeing all processes.
Here is another example. You switch context and
you get a new shell. In this shell you start an xterm.
Then you switch context again and start another sub-shell.
Now the sub-shell is again isolated.
Using chcontext several times
Processes isolated using chcontext are doubly isolated: They can't see the other processes on the server, but the other processes can't see them either. The original security context (0) when you boot is no better than the other: It sees only process in security context 0.
While playing with chcontext, you will notice an exception. The process 1 is visible from every security context. It is visible to please utilities like pstree. But only root processes in security context 0 are allowed to interact with it. NEW
Playing with /usr/sbin/chcontext as root
The new_s_context system call has a special semantic for root processes running in security context 0 and having the CAP_SYS_ADMIN capability: They can switch to any context they want.
Normally, new_s_context allocates a new security context by selecting an unused one. It walks all processes and find an ID (an integer) not currently in use.
But root in security context 0 is allowed to select the context it wants. This allow the main server to control the virtual server. The chcontext utility has the --ctx option to specify the context ID you want.
To help manage several virtual server, given that the security context 0 can't see processes in other security context, it is a good thing root in the main server (security context 0) is allowed to select a specific context. Cool. But we also need a way to have a global picture showing all processes in all security context. The security context 1 is reserved for this. Security context 1 is allowed to see all processes on the server but is not allowed to interact with them (kill them).
This special feature was allocated to security context 1 and not 0 (the default when you boot) to isolate virtual servers from the main. This way, while maintaining services on the main server, you won't kill service in vservers accidentally.
Here is an example showing those concepts:
chcontext as root
The /usr/sbin/vpstree and /usr/sbin/vps commands are supplied by the vserver package. They simply runs ps and pstree in security context 1. NEW
Playing with /usr/sbin/chbind
The chbind utility is used to lock a process
and its children into using a specific IP number.
This applies to services and client connection as
well. Here are few examples. Execute them as root:
Playing with /usr/sbin/reducecap
The reducecap utility is used to lower the capability
ceiling of a process and child process. Even
setuid program won't be able
to grab more capabilities.
Installing a virtual private server copies a linux installation inside a sub-directory. It is a linux inside linux. If you intend to run several vservers on the same box (which you will certainly do :-) ), you will end up using a lot of disk space needlessly: Each vserver is made up hundreds of megabytes of the same stuff. This is a big waste of disk space.
A solution is to use hard links to connect together common files. Using the package information, we can tell which packages are shared between various vservers, which files are configuration files and which are not (binaries, libraries, resource files, ...). Non configuration files may be linked together saving a huge amount of disk space: A 2 GIG rh7.2 installation shrinks to 38megs.
Using hard links is cool, but may be a problem. If one vserver overwrite one file, say /bin/ls, then every vserver will inherit that change. Not fun! The solution is to set the immutable bit on every linked file. A file with such a bit on can't be modified, even by root. The only way to modify it is to turn off the bit first. But within a vserver environment, even root is not allowed to perform this task. So linked file, turned immutable, are now safe: They can be shared between vservers without side effects: Cool!
Well, there is still another side effect. All vservers are now locked with the same files. We are saving a lot of disk space, but we pay a very heavy price for that: Vservers can't evolve independantly.
A solution was found. A new bit call immutable-linkage-invert was added. Combined with the immutable bit, a file may not be modified, but may be unlinked. Unlinking a file in Unix/Linux means disconnecting the name from the data and removing the name from the directory. If the data is still referenced by one or more vservers, it continue to exist on disk. So doing "rm /bin/ls" on a vserver, removes the file /bin/ls for that vserver and that's all. If all vservers perform this task, then /bin/ls (the data) will be forgotten completly and the disk space will be recovered.
Using this trick, a vserver gets back its independance. It becomes possible to update packages by using the unlink/update sequence: Unlink /bin/ls first and then put a new copy in place. Luckily, package manager works this way.
To keep this story short (probably too late :-) ), a unified vserver:
The first goal of this project is to create virtual servers sharing the same machine. A virtual server operate like a normal Linux server. It runs normal services such as telnet, mail servers, web servers, SQL servers. In most cases, the services run using standard configuration: The services are running unaware of the virtual server concept.
Normal system administration is performed with ordinary admin tool. Virtual servers have users account and a root account.
Packages are installed using standard packages (RPMs for example).
There are few exceptions. Some configuration can't be done inside a virtual server. Notably the network configuration and the file-system (mount/umount) can't be performed from a virtual server.
Per user fire-wall
The set_ipv4root() system call may be used to differentiate the various users running on an application server. If you want to setup a fire-wall limiting what each user is doing, you have to assign one IP per user, even if they are running application on the same server. The chbind utility may be used to achieve that. NEW
Secure server/Intrusion detection
While it can be interesting to run several virtual servers in one box, there is one concept potentially more generally useful. Imagine a physical server running a single virtual server. The goal is isolate the main environment from any service, any network. You boot in the main environment, start very few services and then continue in the virtual server. The service in the main environment could be
Fail over servers
One key feature of a virtual server is the independence from the actual hardware. Most hardware issues are irrelevant for the virtual server installation. For example:
The main server acts as a host and takes care of all those details. The virtual server is just a client and ignores all the details. As such, the client can be moved to another physical server will very few manipulations. For example, to move the virtual server v1 from one physical one computer to another, you do
As you see, there is no adjustment to do:
This opens the door to fail over servers. Imagine a backup server having a copy of many virtual servers. It can take over their tasks with a single command. Various options exists for managing this backup server:
Setting a virtual server
To set a virtual server, you need to copy in a sub-directory
a Linux installation. One way to achieve that is
to copy some parts of the the current server by issuing the
command vserver XX build, where XX is
the name of the virtual server (pick one). This basically does
(Well, it does a little more than that, but this give you an
Building a virtual server
This is normally done using the command /usr/sbin/newvserver. This is a text mode/graphical front-end allowing to setup the vserver runtime and configure it. NEW
Basic configuration of the virtual server
A virtual private server has a few settings. They are defined in the file /etc/vservers/XX.conf where XX is the name of the virtual server. This is a simple script like configuration. Here are the various parameters:
Entering the virtual server
It is possible to enter a virtual server context from the main server just by executing /usr/sbin/vserver XX enter (where XX is the virtual server name).
This creates a shell. From there you can execute anything administrative you normally do on a Linux server.
Configuring the services
The virtual server can run pretty much any services. Many pseudo services, such as network configuration are useless (the server is already configured). After building the environment, enter it (without starting the virtual server) using the vserver name enter command. Then using a tool like Linuxconf (control/control service activity) , or ntsysv, browse all service and keep only the needed ones.
So after building the server, you enter it and you select the service you need in that server. Many services such as network, and apmd are either useless or won't run at all in the virtual server. They will fail to start completely. NEW
Starting/Stopping the virtual server
Virtual server with ONBOOT=yes will be started and stopped like any other services of the main server. But you can stop and start a virtual server at any time. Starting a server means that all configured service will be started. Stopping it means that all configured services will be stopped and then all remaining process will be killed.
Oddly, starting a virtual server does not mean much. There is no overhead. No monitoring process or proxy or emulator. Starting a virtual server with 4 services is the same as running those 4 services in the main server, at least performance wise (the service inside a virtual server are locked inside the security context).
The following commands may be used to control a virtual server:
The running command prints if there are any processes running in the virtual server context.
The processes running in a virtual server are invisible from the main server. The opposite is true. This is very important. Managing the main server must not cause problems to the various virtual servers. For example, doing killall httpd will kill all the httpd processes in the current context ( the main server or a virtual one).
Starting/Stopping all the virtual servers
The sysv script /etc/rc.d/init.d/vserver is used to start and stop the virtual server at boot and shutdown time. It may be used at any time to operate all virtual servers. The following commands are supported:
The status command reports the running status of every virtual server. NEW
Restarting a virtual server from inside
A virtual server administrator is not allowed to reboot the machine (the kernel). But it is useful to restart his virtual server from scratch. This allow him to make sure all the services are properly configured to start at boot time.
The /sbin/vreboot and /sbin/vhalt utilities are installed in each virtual server so they can request a restart or stop.
The rebootmgr service must be enabled in the main server.
Executing tasks at vserver start/stop time
You can setup a script called /etc/vservers/XX.sh where XX is the name of the virtual server. This script will be called four time:
You generally perform tasks such as mounting file system (mapping some directory in the vserver root using "mount --bind").
Here is an example where you map the /home directory
as the vserver /home directory.
There are some common problem you may encounter. Here they are.
How real is it ?
The project is new. So far, experiments have shown very little restrictions. Service works the same in a virtual server. Further, performance is the same. And there is a high level of isolation between the various virtual servers and the main server. NEW
There are various tricks one can use to make the virtual servers more secure.
User controlled security box
By combining the capabilities, the s_context, the ipv4root and the AclFS (component of the virtualfs package), we can produce a user level tool allowing controlled access to the user own resources. For example the user may download any program he wants and execute them under control. Whenever the program tries to access something not specified by the user, a popup is presented and the user may choose to terminate the program or allow the access.
We expect to see some wider usage of the virtual servers. As usage grow, we expect to see needs for more control. Here are some ideas.
Per context disk quota
If one installs virtual servers and grant access to less trusted users, he may want to limit the disk space used. Since a virtual server may create new user accounts and run processes with any user ID it wants, the current kernel disk quota is not powerful enough. First, it can't differentiate between user ID 100 in one virtual server and user ID 100 in another one.
Further, the main administrator may want to control disk space allocated to the virtual server on a server per server basis, unrelated to the various user ID in use in those virtual servers.
The kernel has already user and group disk quota. Adding security context disk quota should be easily done.
To differentiate between user IDs in virtual servers, the kernel could coin together the security context and the user id to create a unique ID. The kernel 2.4 now supports 32 user ID, so combining security context and user ID in a single 32 bits number should be acceptable.
The kernel has supports for user limit (memory, processes file handles). With virtual server, we may want to limit the resources used by all processes in the virtual server. The security context would be used as the key here. The following resources could be limited on a security context basis (as opposed to user or process basis)
The scheduler may become security context aware. It could potentially use this to provide some fairness and control priority based on context. Currently the scheduler is process oriented and does not group process together to qualify their priorities. For example, a user running 10 compilations will get more CPU than another user running a single compilation.
Currently, it is possible to raise the nice (lower priority) for all processes in a virtual server. This can't be reversed, so you are setting an upper limit on priority (Just set the S_NICE variable in the vserver configuration file). Note that a virtual server may still start many low priority processes and this can grab significant share of the CPU. A global per security context might be needed to really provide more control and fairness between the various virtual servers.
Done: The sched security context flag group all process in a vserver so their priority is kind of unified. If you have 50 processes running full speed in one vserver, they will take as much CPU resource than a single process in the root server. A vserver can't starve the other... NEW
The current kernel + patch provides a fair level of isolation between the virtual servers. User root can't take over the system: He sees only his processes, has only access to his area of the file system (chroot) and can't reconfigure the kernel. Yet there are some potential problems. They are fixable. As usage grows, we will know if they are real problems. Comments are welcome:
Writing to /dev/random is not limited by any capability. Any root user (virtual included) is allowed to write there. Is this a problem ?
(kernel expert think it is ok) NEW
/dev/pts is a virtual file-system used to allocate pseudo-tty. It presents all the pseudo-tty in use on the server (including all virtual server). User root is allowed to read and write to any pseudo-tty, potentially causing problems on other vservers.
Starting with the ctx-6 patch, /dev/pts is virtualised. Although the file numbers are allocated from a single pool, a vserver only see the pseudo-tty it owns. NEW
Anyone can list the network devices configurations. This may inform a virtual user that another vserver is on the same physical server. By using as much resources as possible in his own vservers, a malicious user could slow down the other server. Modification to the scheduler explained above could stop this.
Starting with the ctx-6 patch, a vserver only see the device corresponding to its IP number. NEW
Using virtual servers may be a cost effective alternative to several independent real servers. You get the administrative independence of independent servers, but share some costs including operation costs.
Other technologies exist offering some of the advantages talked in this document as well as other. Two technologies are available on various hardware platform: Virtual machines and partitioning, NEW
This has been available for mainframes for a while now. You can boot several different OS at once on the same server. This is mainly used to isolate environments. For example, you can install the new version of an OS on the same server, even while the server is running the current version. This allows you to test and do a roll-out gracefully.
The advantages of virtual machines are:
This technology is not directly available on PCs. The Intel x86 architecture does not support visualization natively. Some products nevertheless have appeared and provide this. You can run Linux inside Linux, or this other OS (Which BTW has a logo showing a window flying in pieces, which quite frankly tells everything about it).
The solutions available on PCs carry most of the advantages of the virtual machines found on mainframe, except for performance. You can't run that many virtual Linux's using this technology and expect it to fly. One example of this technology is vmware, which is quite useful, especially if you must run this other OS... vmware may be used to run Linux inside Linux, even test Linux installation while running linux... NEW
Partitioning (domains ?) is a way to split the resources of a large server so you end up with independent servers. For example, you can take a 20 CPUs server and create 3 servers, two with 4 CPUs and one with 12. You can very easily re-assign CPUs to servers in case you need more for a given tasks.
This technology provides full Independence, but much less flexibility. If your 12 CPUs server is not working much, the 4 CPUs one can't borrow some CPUs for 5 minutes. NEW
Limitation of those technologies
Oddly, one disadvantage of those technologies is a side effect of their major advantage: Total Independence. Each virtual server is running its own kernel. Cool. This makes the following tasks more difficult or impossible:
Virtual servers are interesting because they can provide a higher level of security while potentially reducing the administration task. Common operation such as backup, are shared between all servers. Services such as monitoring may be configured once.
A Linux server can run many services at once with a high level of reliability. As servers are evolving, more and more services are added, often unrelated. Unfortunately there are few details here and there, making the server more complex than it is in reality. When one wants to move one service to another server, it is always a little pain: Some user accounts have to be moved and some configuration files. A lot of hand tweaking.
By installing services in separate virtual servers, it becomes much easier to move services around (just by moving a directory although a big one).
Virtual servers may become a preferred way to install common Linux servers. NEW
The ftp site for this project is ftp://ftp.solucorp.qc.ca/pub/vserver . You will find there the following components.
This project is maintained by Jacques Gelinas email@example.com
The vserver package is licensed under the GNU PUBLIC LICENSE.
A FAQ can be found at http://www.solucorp.qc.ca/howto.hc?projet=vserver
The mailing list is archived here.
The mailing list is also archived here.
The change logs for the vserver package are here .
The official copy of this document is found at http://www.solucorp.qc.ca/miscprj/s_context.hc
This document was produced using the TLMP documentation system
Table of content
Back to project page
About tlmpdoc and cookies
Document maintained by Jacques Gélinas (firstname.lastname@example.org)
Last update: Tue Oct 5 10:48:46 2004