Saturday, January 21, 2012

Problems with Storage Performance in Virtual Environments

There are two aspects of performance: latency and throughput. In this post I will write about the former.

Let's say you have reports of poor performance affecting multiple virtual machines. You take a look at the virtual machines in question, but everything seems fine: no significant CPU usage, no swapping or ballooning, and almost no disk activity.

ESX host with occasional IO latency spikes
Assuming you are running vSphere you check the performance dashboard of one of the ESX hosts running the VMs.
Does the disk latency graph look like the first graph, with occasional spikes way over 30 ms?
If so, check the other ESX hosts. Do they show similar spikes at the same times as the first ESX host?
In this particular case multiple hosts are affected, so we are almost certainly experiencing saturation in the storage environment.

Another ESX host probably causing the issue

But before we go on, how much latency will cause issues? The short answer is: it entirely depends on your workload. Let's say you are running a database for a web application on the first ESX host. In addition, assume that the normal latency is 5 ms, the start page takes 2 seconds to load and requires 5 database queries resulting in 20 synchronous IOs each. With a latency increase of 20 ms the web site would require an additional 2 seconds to load, doubling the page load time. This might be OK for an internal web application and a short period of time, but imagine a public web site with the spikes of the first graph: the page would take over 10 seconds to load.
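Here is a quick back-of-the-envelope version of that example in Python. The 80 ms spike value is my own assumption to roughly match the peaks in the first graph, not a measured number.

# Back-of-the-envelope page load time under increased storage latency.
# Numbers are the illustrative assumptions from the example above;
# the 80 ms spike is my own guess at the graph's peaks.
queries_per_page = 5        # database queries per page load
ios_per_query = 20          # synchronous IOs per query
baseline_page_load_s = 2.0  # page load time at the normal 5 ms latency

def page_load_time(extra_latency_s):
    # Added time = (total synchronous IOs) * extra latency per IO
    total_ios = queries_per_page * ios_per_query
    return baseline_page_load_s + total_ios * extra_latency_s

print(page_load_time(0.020))  # +20 ms per IO -> 4.0 s, double the load time
print(page_load_time(0.080))  # +80 ms per IO -> 10.0 s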

Now back on topic, how do we solve the issue? The graph from the second ESX host is typical for batch jobs like backups or database imports/exports. Those jobs tend to run as fast as possible, so using a faster disk array will only shorten the duration of the side effects but will not eliminate them, as long as the disk array is the slowest part in the equation. So, in this case, the issue is not having two ESX hosts with high-latency IOs, but having one virtual machine on one host starving all other machines in the cluster. This is known as the noisy neighbor problem.

A manual approach would be to isolate those bursty workloads on separate disks and in a separate datastore. Doing this easily becomes time consuming and cumbersome, especially in large environments. If you are running vSphere 4.1 or higher and have, or can afford, Enterprise Plus licenses, there is an easier solution: Storage IO Control, or SIOC for short. It distributes the available IO capacity fairly among all virtual machines as soon as the latency on a datastore passes a configured threshold, thereby preventing a noisy neighbor from severely affecting other virtual machines running from the same datastore.

hth someone,
/jr

Thursday, January 19, 2012

QoS, vSphere SIOC and Shared Disk Pools

In the last post I wrote about disk latency and how to detect it. Today I want to build on that.

Imagine an urgently called meeting where somebody says one of these:
  • »We have to run this on dedicated hardware, we need more performance«
  • »Customers are complaining about occasional slow response times, this app does not seem to work with insert-your-favorite-hypervisor-here«
Having a déjà vu reading this? Many people still think that critical or heavy workloads are not suited for virtual environments. But why is that so? Probably bad experiences while trying to virtualize something. While I am a huge fan of virtualization, I have to admit that I have had similar experiences over the last years: slow databases, unresponsive user interfaces, even application errors that did not happen on dedicated hardware.
Those experiences are in direct contrast to vendor benchmarks like these, virtualization overhead analyses like this one or a study like this. In most of the performance issues I analyzed, the gap between almost native speed and unusable applications was caused by issues in the storage environment. The reason for that is simple: it is the weakest link in the virtual chain (i.e. the slowest component). In a dedicated setup like a vBlock you can easily detect issues, but how about large shared SAN infrastructures? Before I go into details, some background on VMware and today's storage arrays. If you are running another hypervisor, hang on, most of what I write still applies, but the solution is different.
    
VMware added a pretty neat feature to vSphere 4.1 called Storage IO Control, or SIOC for short. It distributes the available IO capacity fairly in case of congestion (i.e. increased IO latency). I will not go into the gory details, so if you are interested I recommend this paper. In short, SIOC distributes the available IO capacity to the VMs depending on their shares. A VM with 10% of the total shares will get at least 10% of the total capacity. Shares were available before 4.1, but with SIOC the mechanism works across all ESX hosts sharing a datastore (i.e. running in the same cluster). In addition, SIOC does nothing as long as the storage array (or the path to it) does not get saturated, so less prioritized VMs can consume more IOs as long as other VMs do not need them.
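To make the shares mechanism more tangible, here is a minimal sketch in Python of proportional-share distribution under congestion. This is not VMware's actual scheduler, and the VM names, share values and IOPS numbers are made up for illustration.

# Minimal sketch of proportional-share IO distribution during congestion.
# NOT VMware's actual SIOC algorithm, just the basic idea: every VM is
# entitled to at least (its shares / total shares) of the capacity, and
# capacity a VM does not need can be consumed by busier VMs.
def distribute_iops(capacity, vms):
    # vms: dict of name -> (shares, demanded_iops); returns name -> granted IOPS
    total_shares = sum(shares for shares, _ in vms.values())
    granted = {}
    leftover = capacity
    # First pass: everyone gets min(its demand, its fair entitlement).
    for name, (shares, demand) in vms.items():
        entitlement = capacity * shares / total_shares
        granted[name] = min(demand, entitlement)
        leftover -= granted[name]
    # Second pass: hand the unused capacity to VMs that still have demand.
    for name, (shares, demand) in vms.items():
        extra = min(demand - granted[name], leftover)
        granted[name] += extra
        leftover -= extra
    return granted

# A noisy batch VM demanding far more than its fair share of 15,000 IOPS:
vms = {"web-db": (2000, 3000), "batch": (1000, 20000), "idle": (1000, 500)}
print(distribute_iops(15000, vms))
# {'web-db': 3000, 'batch': 11500.0, 'idle': 500}

Note how the database VM still gets its full 3,000 IOPS even though the batch VM demands 20,000, which is exactly the behavior that mitigates the noisy neighbor problem from the previous post.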

Now, to the core part of this post. If you take a look at today's storage arrays you will notice that almost all vendors offer pooling functionality that aggregates multiple RAID sets into one big chunk of storage. NetApp calls this an aggregate; EMC, HDS and HP call it (disk/storage) pooling.
Block-based arrays often stripe provisioned LUNs across all RAID sets in a pool, achieving throughput not possible with a single RAID set. In addition, this approach allows thin provisioning, easy dynamic LUN resizing and overall less work for the storage engineer.

No QoS: per-datastore latency on a shared disk pool without SIOC
Sounds great? Imagine a mid-range array with 200 2.5" 10k disks in a single pool, offering approx. 15,000 IOPS in a standard RAID 5 7+1 setup with a 70/30 read/write ratio. What happens if you provision multiple datastores from this single pool and all VMs combined want to consume more than the offered 15k IOPS?
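The 15,000 IOPS figure can be sanity checked with a rough calculation, sketched below in Python. The roughly 140 IOPS per disk and the RAID 5 write penalty of 4 are rule-of-thumb assumptions, not vendor specifications.

# Rough front-end IOPS estimate for the pool described above.
# ~140 IOPS per 2.5" 10k disk and a RAID 5 write penalty of 4 are
# rule-of-thumb assumptions, not vendor specs.
disks = 200
iops_per_disk = 140
read_ratio, write_ratio = 0.7, 0.3
raid5_write_penalty = 4     # back-end IOs per front-end write in RAID 5

backend_iops = disks * iops_per_disk   # 28,000 back-end IOPS
frontend_iops = backend_iops / (read_ratio + write_ratio * raid5_write_penalty)
print(round(frontend_iops))            # ~14,700, roughly the 15k quoted above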
In a classical setup you cannot guarantee any quality of service, as you can see in the latency graph above. As soon as a single datastore has issues, all other datastores in the pool will also be affected.
Luckily, as you can read in a knowledge base entry from VMware, SIOC can handle this situation as long as two requirements are met. First, all provisioned datastores in the disk pool must be managed by a single vCenter. Second, the disk pool should not be shared with other non-virtual workloads. While vSphere SIOC can detect such workloads, it can only prevent starvation and ensure that the remaining IO capacity is fairly distributed across all virtual machines. SIOC simply can't offer any real quality of service in this setup as long as there is no array/SAN based QoS mechanism in place. But more on that in a later post.

All this sounds reasonable, but you are running another hypervisor or can't afford the pricey Enterprise Plus licenses required for Storage IO Control? Right now you probably have to take the traditional approach and split up the disk pool into as many parts as required (e.g. one pool per service and/or customer) to ensure a decent quality of service.

/jr