Saturday, January 21, 2012

Problems with Storage Performance in Virtual Environments

There are two aspects of storage performance: latency and throughput. In this post I will write about the former.

Let's say you have reports of poor performance affecting multiple virtual machines. You take a look at the virtual machines in question, but everything seems fine: no significant CPU usage, no swapping or ballooning, and almost no disk activity.

ESX host with occasional IO latency spikes
Assuming you are running vSphere, you check the performance dashboard of one of the ESX hosts running the VMs.
Does the disk latency graph look like the first graph, with occasional spikes way over 30ms?
If so, check the other ESX hosts. Do they show spikes at the same times as the first ESX host?
In this particular case multiple hosts are affected, so we are almost certainly seeing saturation in the shared storage environment.
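
The comparison across hosts can also be done on exported numbers instead of by eye. The following is a minimal sketch, assuming you have latency samples per host keyed by sample time; the host names, sample values and the 30ms threshold are made up for illustration and do not come from a real environment.

```python
# Hypothetical latency samples in milliseconds, keyed by sample time (minutes).
# In practice these values would be exported from the performance charts.
SPIKE_THRESHOLD_MS = 30


def spike_times(samples, threshold_ms=SPIKE_THRESHOLD_MS):
    """Return the set of sample times whose latency exceeds the threshold."""
    return {t for t, latency_ms in samples.items() if latency_ms > threshold_ms}


def shared_spikes(hosts, threshold_ms=SPIKE_THRESHOLD_MS):
    """Return the sorted sample times at which at least two hosts spike."""
    counts = {}
    for samples in hosts.values():
        for t in spike_times(samples, threshold_ms):
            counts[t] = counts.get(t, 0) + 1
    return sorted(t for t, n in counts.items() if n >= 2)


hosts = {
    "esx01": {0: 5, 5: 6, 10: 45, 15: 5, 20: 60},
    "esx02": {0: 4, 5: 5, 10: 52, 15: 6, 20: 7},
}
print(shared_spikes(hosts))  # prints [10]
```

Spikes that show up on several hosts at the same time point at something the hosts share, which is usually the storage.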

Another ESX host probably causing the issue

But before we go on, how much latency will cause issues? The short answer is: it entirely depends on your workload. Let's say you are running a database for a web application on the first ESX host. In addition, assume that the normal latency is 5ms, and that the start page takes 2 seconds to load and requires 5 database queries resulting in 20 synchronous IOs each, 100 IOs in total. With a latency increase of 20ms the web site would require an additional 2 seconds to load (100 IOs times 20ms), doubling the page load time. This might be OK for an internal web application and for a short period of time, but imagine a public web site with the spikes of the first graph. The web page would take over 10 seconds to load.
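
The arithmetic is easy to replay with your own numbers. This is a rough sketch that assumes all IOs are strictly sequential and that nothing else in the page load changes; the function name and parameters are mine, chosen for illustration only.

```python
def page_load_seconds(base_load_s, queries, ios_per_query,
                      base_latency_ms, latency_ms):
    """Estimate the page load time when IO latency rises above its normal value.

    Assumes every IO is synchronous and issued one after another, so each
    IO adds the full latency increase to the page load time.
    """
    total_ios = queries * ios_per_query
    extra_seconds = total_ios * (latency_ms - base_latency_ms) / 1000.0
    return base_load_s + extra_seconds


# Normal latency 5ms, spike to 25ms (an increase of 20ms).
print(page_load_seconds(2.0, 5, 20, 5, 25))  # prints 4.0
# Latency needed to reach a 10 second page load under the same assumptions.
print(page_load_seconds(2.0, 5, 20, 5, 85))  # prints 10.0
```

Under these assumptions, any spike above 85ms pushes the page load time past 10 seconds.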

Now back on topic: how do we solve the issue? The graph from the second ESX host is typical for batch jobs like backups or database imports/exports. Those jobs tend to run as fast as possible, so a faster disk array will only shorten the duration of the side effects but will not eliminate them, as long as the disk array is the slowest part of the equation. So in this case the issue is not two ESX hosts with high latency IOs, but one virtual machine on one host starving all other machines in the cluster. This is known as the noisy neighbor problem.

A manual approach would be to isolate those bursty workloads on separate disks and in a separate datastore. Doing this easily becomes time consuming and cumbersome, especially in large environments. If you have vSphere 4.1 or higher and have, or can afford, Enterprise Plus licenses, there is an easier solution: Storage IO Control, or SIOC for short. It distributes the available IO capacity fairly among all virtual machines as soon as the latency on a datastore passes a configured threshold, thereby preventing a noisy neighbor from severely affecting other virtual machines running from the same datastore.
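
To make the idea of fair distribution more concrete, here is a toy model of share-based allocation. It is not VMware's actual algorithm, just a simplified sketch of the principle: below the threshold nobody is throttled, above it the capacity is split in proportion to shares, and whatever a small consumer does not need goes to the others. The function name, VM names, share values, capacity and threshold are all hypothetical.

```python
def allocate_iops(demand, shares, capacity, latency_ms, threshold_ms=30):
    """Toy model: split datastore IOPS by shares once latency passes a threshold."""
    if latency_ms <= threshold_ms:
        # No congestion detected: every VM gets what it asks for.
        return {vm: float(iops) for vm, iops in demand.items()}

    allocation = {}
    remaining = float(capacity)
    active = set(demand)
    while active:
        total_shares = sum(shares[vm] for vm in active)
        # VMs whose demand fits into their proportional slice are served in full.
        satisfied = {vm for vm in active
                     if demand[vm] <= remaining * shares[vm] / total_shares}
        if not satisfied:
            # Everyone left wants more than their slice: split by shares.
            for vm in active:
                allocation[vm] = remaining * shares[vm] / total_shares
            break
        for vm in satisfied:
            allocation[vm] = float(demand[vm])
            remaining -= demand[vm]
        active -= satisfied
    return allocation


demand = {"backup": 9000, "web": 500, "db": 1500}   # requested IOPS
shares = {"backup": 1000, "web": 1000, "db": 1000}  # equal shares
print(allocate_iops(demand, shares, capacity=5000, latency_ms=45))
```

In this example the web and database VMs keep their full 500 and 1500 IOPS, while the backup VM is limited to the remaining 3000 IOPS instead of crowding out the others.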

hth someone,
/jr
