Matthew Dillon wrote up an explanation of how performance on systems with a lot of CPU cores has been significantly improved – up to 300%! (He says 200%, but I think he’s treating it as a percentage of a whole rather than percent changed.) Apparently finally getting rid of lock contention is the trick.
Wednesday, November 02, 2011
Tuesday, November 01, 2011
Two years ago I wrote an article presenting some Linux performance improvements. These performance improvements are still valid, but it is time to talk about some new improvements available. As I am using Debian now, I will focus on that distribution, but you should be able to easily implement these things on other distributions too. Some of these improvements are best suited for desktop systems, others for server systems, and some are useful for both.
First of all, it is important that your system is up to date. Update it to Debian testing if you have not done that yet. It will give you, among other things:
- Updated eglibc 2.13, which includes functions optimized for instruction sets SSE2, SSSE3 and SSE4.2 provided by recent processors
- Updated GCC 4.5 (with GCC 4.6 being on the way)
- Updated graphics stack with Xserver 1.10, Mesa 7.10.2 and new X drivers and up to date pixman and cairo libraries, all improving performance
- A recent kernel which brings improvements to process and disk schedulers, better hardware drivers, transparent hugepages (see further), scalability improvements to the EXT4 and XFS file systems and the Virtual File System layer, vhost-net for reduced network latency for KVM virtual machines and more. Debian testing has a 2.6.38 kernel, while 2.6.39 is available in unstable and will migrate to testing in the near future.
- Parts of GNOME 2.32, such as Evolution which has improved start-up performance and important bug fixes (for example support of mailboxes larger than 2GB)
- Iceweasel 4 is available in Debian Experimental, and the upcoming version 5, which brings even more performance improvements, is already available in an external repository.
Transparent hugepages is a feature introduced in Linux 2.6.38 which can improve the performance of applications. The processor has a translation lookaside buffer (TLB), a CPU cache used to speed up the mapping of virtual memory addresses to physical memory addresses. This TLB has a limited number of entries. By transparently combining several small 4 KB pages into larger “hugepages”, each TLB entry covers more memory, so a larger part of an application's memory can be mapped through the TLB. Transparent hugepages can be enabled on the fly, but this only affects applications started after you have enabled the feature. For this reason, it is best to activate it right from the start with a kernel boot parameter. With transparent_hugepage=always, the kernel will use transparent hugepages for any process. If you want to use transparent hugepages only for applications which explicitly indicate that they prefer hugepages, you can use transparent_hugepage=madvise. In Debian, add one of these boot parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run the update-grub command. After the next boot, you can look at the contents of the /sys/kernel/mm/transparent_hugepage/enabled file to verify that it is really enabled.
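To make the verification step concrete, here is a minimal sketch that prints the active transparent hugepage mode. It assumes the mainline sysfs file /sys/kernel/mm/transparent_hugepage/enabled, in which the active choice is marked with square brackets. The function name thp_mode is made up for this example.

```shell
#!/bin/sh
# Print the active transparent hugepage mode.
# The kernel marks the active choice with square brackets,
# for example: "[always] madvise never".
thp_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    if [ -r "$file" ]; then
        # Keep only the text between the brackets.
        sed -n 's/.*\[\(.*\)\].*/\1/p' "$file"
    else
        echo "unavailable"
    fi
}

thp_mode
```

If this prints never or unavailable, the boot parameter was not picked up or the kernel was built without this feature.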
Tuning ondemand cpufreq governor
The ondemand cpufreq governor (which should be used on most systems by default; make sure you have Debian’s cpufrequtils package installed) tends to switch back to lower CPU frequencies a bit too early in some cases, hurting performance. By setting the sampling_down_factor to a value higher than 1, you can prevent it from reducing the clock speed too quickly.
I have added this to my /etc/rc.local script:
if test -f /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor; then
    echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
fi
On server systems, I even use 100 instead of 10.
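Since sampling_down_factor only matters when the ondemand governor is actually in use, it can help to check which governor each CPU runs. The following is a small sketch, assuming the standard sysfs layout with a scaling_governor file per CPU. The function name governor_summary is made up for this example.

```shell
#!/bin/sh
# Count how many CPUs use each cpufreq governor.
# Assumes the layout cpu*/cpufreq/scaling_governor below the base directory.
governor_summary() {
    base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

governor_summary
```

On a machine without cpufreq support this prints nothing.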
VM dirty ratio
By default the VM dirty ratio is set to 20% and the background dirty ratio is set to 10%. This means that when 10% of memory is filled with dirty pages (cached data which has to be flushed to disk), the kernel will start writing out the data to disk in the background, without interrupting processes. If the amount of dirty pages rises to 20%, processes will be forced to write out data to disk and cannot continue other work before they have done so.
By increasing these values, more data will be cached in memory, and data will be written out in larger I/O streams. This can be a good thing on servers with fast storage and lots of memory.
To increase these values, create a file /etc/sysctl.d/dirty_ratio.conf with these contents:
vm.dirty_ratio = 40
vm.dirty_background_ratio = 15
Then run sysctl -p /etc/sysctl.d/dirty_ratio.conf to make these settings take effect immediately.
On desktop systems, the default dirty_ratio of 20 and dirty_background_ratio of 10 should be reasonable. You do not want too high a dirty_ratio on desktop systems, because applications will stall for too long if they have to write out all these dirty pages at once.
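To get a feeling for what these percentages mean on a given machine, the sketch below converts a ratio into megabytes. It is only a rough estimate: it uses total RAM from /proc/meminfo, while the kernel applies the ratio to available (free plus reclaimable) memory, so the real thresholds are somewhat lower. The function name dirty_threshold_mb is made up for this example.

```shell
#!/bin/sh
# Convert a dirty ratio (in percent) into megabytes for a memory size in kB.
# Integer arithmetic is used, so the result is rounded down.
dirty_threshold_mb() {
    mem_kb="$1"
    ratio="$2"
    echo $(( $mem_kb * $ratio / 100 / 1024 ))
}

# Rough estimate based on total RAM (an upper bound, see the note above).
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "background writeback starts near $(dirty_threshold_mb "$total_kb" 10) MB"
    echo "processes are throttled near $(dirty_threshold_mb "$total_kb" 20) MB"
fi
```

For example, with 4 GB of RAM (4194304 kB) a ratio of 20 corresponds to roughly 819 MB of dirty data.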
CFS scheduler tuning
CFS (Completely Fair Scheduler) is the name of the Linux process scheduler. By default it is tuned for desktop workloads. For server systems where throughput is more important than latency, Red Hat’s tuned package proposes these sysctl settings for CFS:
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
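These values are in nanoseconds, so they correspond to 10 ms and 15 ms. One possible way to apply them is through a sysctl.d file, in the same way as the dirty ratio settings. The sketch below assumes a kernel that still exposes these two keys through sysctl, and the file name sched_server.conf and the function name write_cfs_conf are made up for this example.

```shell
#!/bin/sh
# Write the two CFS settings quoted above to a sysctl.d-style file.
# The file name sched_server.conf is a hypothetical choice.
write_cfs_conf() {
    dir="${1:-/etc/sysctl.d}"
    cat > "$dir/sched_server.conf" <<'EOF'
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000
EOF
}

# Usage (as root):
#   write_cfs_conf && sysctl -p /etc/sysctl.d/sched_server.conf
```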
ulatencyd is a daemon which uses cgroups to give hints to the kernel’s process scheduler CFS to improve desktop latency and make applications feel more responsive. It prevents individual applications from hogging the system and slowing down other applications. This is somewhat simpler than the much hyped (but controversial) autogroup kernel patch, but it is also much more extensive, in that ulatencyd knows about different applications and desktops and how to configure the scheduler to improve responsiveness.
On Debian, you install the ulatencyd package and start the ulatencyd init script. The ulatency package contains the ulatency client too, which shows you the cgroups ulatencyd has set up.
In my opinion, ulatencyd is great for desktop systems, but I do not recommend installing it on server systems.
KVM performance improvements
When using Linux guests in KVM virtual machines, it is important to configure the network interface and hard drive as virtio devices in order to get the best performance. KVM also benefits from transparent hugepages: be sure to enable them in both the host and the guest machine.
VHostNet improves network latency and throughput of KVM guests. To enable it, you need to load the vhost_net kernel module in your host system. In Debian you can add vhost_net to /etc/modules to load it automatically when booting the system. Then if you use libvirt to manage your virtual machines, VHostNet will be used automatically when starting virtual machines. If you start qemu-kvm by hand, you will need to add the vhost=on option to the netdev device.
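A quick way to see whether the module is loaded is to look for it in /proc/modules, as in the sketch below. Note that a module built directly into the kernel does not appear in that list. The function name module_loaded and the netdev id net0 in the comment are made up for this example.

```shell
#!/bin/sh
# Return success if the named module appears in a /proc/modules-style listing.
module_loaded() {
    name="$1"
    list="${2:-/proc/modules}"
    [ -r "$list" ] && grep -q "^$name " "$list"
}

if module_loaded vhost_net; then
    echo "vhost_net is loaded"
else
    echo "vhost_net is not loaded; try: modprobe vhost_net"
fi

# Example network arguments when starting qemu-kvm by hand.
# The id "net0" is an arbitrary label chosen for this example:
#   -netdev tap,id=net0,vhost=on -device virtio-net-pci,netdev=net0
```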
Raw devices, disable cache and choose deadline I/O scheduler
For best performance, raw devices are recommended instead of qcow2 or other image files. In libvirt/virt-manager I have defined a storage pool on an LVM volume group, and let virt-install create logical volumes on it containing raw images.
It is recommended to disable I/O caching in KVM because it reduces data copies and bus traffic. In the libvirt XML file for your virtual machine, set the cache='none' attribute on the driver tag of the disk device. You can also use virt-manager to make this change: look for the cache mode under the advanced options for the disk.
Benchmarks seem to indicate that it is best to use the deadline I/O scheduler instead of the default CFQ scheduler. Using deadline in the guest also seems beneficial. To make deadline the default scheduler, edit /etc/default/grub, add elevator=deadline to the GRUB_CMDLINE_LINUX_DEFAULT variable and run update-grub.
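To check which scheduler is active for each block device, you can read the queue/scheduler file in sysfs, where the active entry is shown in square brackets. The following is a small sketch; the function name active_scheduler is made up for this example, and sda in the last comment is only an assumed device name.

```shell
#!/bin/sh
# Print the active I/O scheduler from a queue/scheduler file.
# The active entry is in square brackets, for example: "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

for f in /sys/block/*/queue/scheduler; do
    if [ -r "$f" ]; then
        echo "$f: $(active_scheduler "$f")"
    fi
done

# Switching at runtime without a reboot (as root, sda is an assumption):
#   echo deadline > /sys/block/sda/queue/scheduler
```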
Native AIO offers better performance than the thread-based AIO in KVM. However, it should only be enabled on raw images, because otherwise it can lead to disk corruption in some cases. This problem is supposed to be fixed in recent kernels according to information I got on the #kvm IRC channel, but better be safe than sorry.
To enable this, add the parameter io='native' to the driver tag of the disk in the XML file for the virtual machine in libvirt.
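As a rough check that both disk settings ended up in the guest definition, the sketch below searches a dumped domain XML file for a driver line carrying cache='none' and io='native'. It is a plain text match and not a real XML parse, and it assumes both attributes are on the same line as the driver tag. The function name disk_driver_tuned and the guest name myguest are made up for this example.

```shell
#!/bin/sh
# Check whether a libvirt domain XML file has a disk driver line with both
# cache='none' and io='native', for example:
#   <driver name='qemu' type='raw' cache='none' io='native'/>
# This is a simple text match, not a real XML parse.
disk_driver_tuned() {
    grep "<driver " "$1" | grep "cache='none'" | grep -q "io='native'"
}

# Usage with a hypothetical guest name:
#   virsh dumpxml myguest > /tmp/myguest.xml
#   disk_driver_tuned /tmp/myguest.xml && echo "cache and AIO settings found"
```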
By default, KVM only provides a common limited set of CPU instructions implemented by different CPUs from Intel and AMD. This is needed to permit live migration of a virtual machine to hardware with a CPU which does not implement all instructions available in the original system. If you do not plan on doing that, you can enable all instruction sets of your host CPU in the virtual machines, so that your virtual machine can make use of all advanced features of your CPU (for example SSE3 and others). The easiest way to do this is by using virt-manager. Click on “Copy host configuration” in the Processor – Configuration settings of the virtual machine. The next time you start up the virtual machine, it will have access to all extended instruction sets of your CPU.
Kernel Samepage Merging (KSM) is a kernel feature which merges identical pages in memory. If you are running several virtual machines with the same operating system and applications in them, lots of memory pages will actually be identical. KSM will save memory by merging the identical pages.
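To see whether the extended instruction sets are really visible, you can run a check like the sketch below inside the guest and compare the result with the host. It looks for flag names in /proc/cpuinfo; the three flags tested here are just examples, and the function name has_cpu_flag is made up for this example.

```shell
#!/bin/sh
# Return success if the given flag is listed in a /proc/cpuinfo-style file.
has_cpu_flag() {
    flag="$1"
    info="${2:-/proc/cpuinfo}"
    # Take the first "flags" line, put one word per line, match the whole word.
    grep -m1 '^flags' "$info" 2>/dev/null | tr ' \t' '\n\n' | grep -qx "$flag"
}

for flag in ssse3 sse4_1 sse4_2; do
    if has_cpu_flag "$flag"; then
        echo "$flag: yes"
    else
        echo "$flag: no"
    fi
done
```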
To enable this on Debian, I have put this in my /etc/rc.local script:
echo 1 > /sys/kernel/mm/ksm/run
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs
The last line is optional. It increases the interval between two successive memory scans, so that the CPU is not too busy scanning for duplicate memory pages all the time.
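Once KSM has been running for a while, the pages_sharing counter in /sys/kernel/mm/ksm gives an indication of how much memory is being saved. The sketch below turns that counter into megabytes. The function name ksm_saved_mb is made up for this example, and the result is only an approximation.

```shell
#!/bin/sh
# Estimate the memory saved by KSM in MB from the pages_sharing counter
# and a page size in bytes (4096 is assumed if none is given).
ksm_saved_mb() {
    pages_sharing="$1"
    page_size="${2:-4096}"
    echo $(( $pages_sharing * $page_size / 1024 / 1024 ))
}

stats=/sys/kernel/mm/ksm/pages_sharing
if [ -r "$stats" ]; then
    echo "KSM saves about $(ksm_saved_mb "$(cat "$stats")" "$(getconf PAGESIZE)") MB"
fi
```

For example, 25600 shared pages of 4 KB each correspond to 100 MB.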
If you do not have much RAM available in your system, it is useful to compress part of the data in memory. This can be done by using a zram disk, which is a RAM disk on which all data is transparently compressed. On this zram disk you create a swap area which you give a higher priority than the normal on-disk swap space. Once the available RAM (total RAM – RAM reserved for zram disk) is used, data will be swapped out to the zram disk, which is much faster than swap space on a rotating hard disk. This way, more data fits into the RAM.
On my 1 GB netbook system which runs a full GNOME desktop, I have reserved 512 MB for the zram disk. To do so, I added the following in /etc/rc.local:
modprobe zram
echo $((512*1024*1024)) > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 60 /dev/zram0
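To confirm that the zram swap really gets used before the on-disk swap, you can list the swap areas ordered by priority. The following is a minimal sketch that reads /proc/swaps, where the priority is the fifth column. The function name swap_by_priority is made up for this example.

```shell
#!/bin/sh
# List swap areas from a /proc/swaps-style file, highest priority first.
# Columns in /proc/swaps: Filename Type Size Used Priority.
swap_by_priority() {
    awk 'NR > 1 { print $5, $1 }' "${1:-/proc/swaps}" 2>/dev/null | sort -rn
}

swap_by_priority
```

With the settings above, /dev/zram0 should be listed first with priority 60.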
Of course, a better solution is to add RAM to your system, especially on server systems.