Friday, July 19, 2013

Least Effort Architecture and the LEA-Principle

Lots of cables under a desk
Sometimes, when I try to visualize data flows within an IT landscape, I end up with something that bears a surprising similarity to the cabling often found under the desks of IT professionals. In both cases you might end up asking yourself: »How could you end up with something as messy as this?« Before I try to answer this question, let me give you a little tale based on a true story.

How things ended up


PBX exporting phone numbers to a CRM and ERP system
Assume it's the late '90s or early 2000s and you want to automate the creation of a printed corporate phone book. What would you do if you were responsible for the company's ERP system and had all the necessary data except the phone numbers, which are only stored in the company's PBX system?
Back then the natural choice was to write a small script that regularly generates a CSV file from the data stored in the PBX system and another script that imports said file into the ERP system. All this didn't require a lot of effort, and a secretary in the company got more time for other, hopefully more meaningful, tasks. Now assume it is one year later and someone in the sales department needs all key account managers' phone numbers in their CRM software. The easiest thing is to copy the scripts used for the ERP import and adjust them for the task at hand.
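To make this a bit more concrete, here is a rough sketch (in Python, with made-up file and column names) of the kind of glue script I am talking about — obviously not the original:

# Hypothetical sketch of the glue script described above: it reads a CSV export
# from the PBX (employee_id, phone_number) and rewrites it into the column
# layout a fictional ERP import job expects.
import csv

def convert(pbx_export="pbx_phone_numbers.csv", erp_import="erp_import.csv"):
    with open(pbx_export, newline="") as src, open(erp_import, "w", newline="") as dst:
        reader = csv.DictReader(src)                 # columns from the PBX export
        writer = csv.writer(dst)
        writer.writerow(["EMPLOYEE_ID", "PHONE"])    # layout the ERP job expects
        for row in reader:
            writer.writerow([row["employee_id"], row["phone_number"]])

if __name__ == "__main__":
    convert()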

PBX exporting phone numbers to a CRM, ERP, Intranet and Directory
A year or two have passed and printed phone books are totally out. The manager responsible for the shiny new corporate intranet wanted to replace the printed book with an online variant. He asked the manager responsible for the business telephony system, and they agreed to simply re-use the scripts already available, given that this would require the least effort.
In the same year the IT department responsible for the email service replaced the existing mail system with Microsoft Exchange, and as a prerequisite the old Windows NT domain was migrated to Active Directory. Given the tight budget and time constraints, the directory was only used for the mail system and client logins.
A few months after the launch the admin managing the mail system wanted to write a script for automated mail signature generation and needed the phone numbers for that. Therefore a script was written that regularly imports the phone numbers into the directory.

Mobile Phone numbers are exported to the intranet application
Almost a decade passes by; some of the systems have been replaced or renewed, but the data architecture is basically the same. With many employees having a corporate mobile phone and the rise of BYOD, the manager of the intranet is asked to give employees the option to add their private phone number to the corporate phone book. Luckily, the newly established mobile device management software already has those numbers. The only things required are an import script and an opt-in checkbox for the employee within the intranet, so the manager accepts the request.


How things should have ended up


Target Architecture in Theory
Obviously, the realized data architecture in this example is far from anything that would be derived from modern standards. In addition, the question might arise why one particular solution was never implemented, given that it almost suggests itself: best practice would dictate that all applications in need simply retrieve the required numbers from the directory. A simple, clean and extensible solution. It could have been implemented a long time ago, but nobody did it.
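Just to make the target picture concrete: with the numbers maintained in the directory, any consuming application could simply look them up. A minimal sketch using the ldap3 library could look like this (hostname, credentials and base DN are made up; telephoneNumber and mobile are the standard Active Directory attributes):

# Sketch of the target architecture's idea: instead of copying CSV exports
# around, a consuming application queries the directory directly.
from ldap3 import Server, Connection

def phone_numbers(sam_account_name):
    server = Server("dc01.example.com")                       # hypothetical DC
    with Connection(server, user="EXAMPLE\\svc_lookup",
                    password="secret", auto_bind=True) as conn:
        conn.search("dc=example,dc=com",
                    f"(sAMAccountName={sam_account_name})",
                    attributes=["telephoneNumber", "mobile"])
        entry = conn.entries[0]
        return str(entry.telephoneNumber), str(entry.mobile)

print(phone_numbers("jdoe"))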

Architecture management and the LEA-Principle


A lot of data is passed between systems in an inefficient way
Effect of the LEA-Principle
How could something like this happen? The answer is simple: without dedicated architecture management, nobody had a look at the big picture. Whenever a change was needed, the responsible person implemented it as fast as possible and with the least effort required to achieve the goal. Given that many companies apply change in the form of projects, I like to call the resulting architecture management style least effort or project driven architecture (LEA or PDA) management.
So, in summary, the LEA principle dictates that if nobody is responsible for architectural housekeeping, you inevitably end up with something resembling the knotted cables under the desk.


/jr

Monday, July 9, 2012

Performance Tests: Theory and Practice using Little's Law, JMeter and gnuplot

Two graphs showing throughput in request/s and response time in ms with different concurrency
Last week I was asked by a client how many requests per second his new web application could handle in its initial setup. Instead of a plain number I presented him with the graph on the left and said: »It depends«.
Without going too deep into theory, the throughput (responses per second in this case) depends on the number of items in the system (i.e. active TCP connections with a pending response) and the cycle time (i.e. response time), following an operationalized variant of Little's Law. Applied to our current topic this gives us equation 1. Let's say we have an application with an average response time of 100 ms and we use one connection that constantly requests a resource (for instance with ApacheBench: ab -c 1 -n 100 http://example.com/). According to Little's Law a benchmark like this will yield 10 requests per second*. With two connections the result should be 20 requests per second, with four connections 40 requests per second and so on. In short, as long as you do not reach the capacity limit of a system, the throughput doubles with each doubling of connections (i.e. it scales linearly with factor 1).

Equation 1: Throughput [responses/s] = Active Connections / Response Time [s] (operationalized Little's Law applied to web applications)
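For the numbers used above, a few lines of Python make the relationship explicit (valid only below the saturation limit):

# Operationalized Little's Law from equation 1:
# throughput [responses/s] = active connections / response time [s]
def expected_throughput(connections, response_time_s):
    return connections / response_time_s

for connections in (1, 2, 4):
    print(f"{connections} connection(s) at 0.1 s response time -> "
          f"{expected_throughput(connections, 0.1):.0f} requests/s")
# prints 10, 20 and 40 requests/s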

Check-in terminals at the airport
Now to the capacity limit. An illustrative example is a check-in terminal at the airport. Let's assume that the average check-in takes 1 minute and there are 10 terminals available. Clearly, the maximum throughput will be 10 check-ins per minute. But what happens if more than 10 customers arrive at the same time? The simple answer is: they have to wait, and in the end it takes them more than 1 minute to check in. The graph at the bottom shows this example in the same way as the previous one. As you can see, the average time required for a check-in increases linearly as soon as the capacity limit is reached, while the throughput remains constant from that point on. On one hand the graph looks quite similar to the first one from the web application; on the other hand there is one key difference, but let me first define a key term: without going into the details of queueing theory, the inflection point of the function is also the saturation limit of the system (reached at 10 arrivals per minute in this case).
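A toy model of this example (Python, batch arrivals; only the 10 terminals and 1 minute per check-in are taken from above, the arrival counts are made up) reproduces the shape of the graph:

# n customers arrive at the same time, 10 terminals, 1 minute per check-in.
# Throughput is capped at 10 per minute and the average time to complete a
# check-in grows once the cap is exceeded.
TERMINALS = 10
SERVICE_MIN = 1.0

def average_checkin_time(arrivals):
    # customer i (0-based) is served in batch i // TERMINALS and is done after
    # (batch + 1) * SERVICE_MIN minutes, waiting time included
    return sum((i // TERMINALS + 1) * SERVICE_MIN for i in range(arrivals)) / arrivals

for n in (5, 10, 20, 40):
    throughput = min(n, TERMINALS) / SERVICE_MIN
    print(f"{n:3d} arrivals -> throughput {throughput:4.0f}/min, "
          f"avg check-in time {average_checkin_time(n):.1f} min")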


Throughput and Response Time in Theory

Going back to the first graph from the web app, we see that the saturation limit is only reached once (at 4 active connections); all subsequent results are below it. The reason for that is simple. Reaching the saturation limit of a web application means that at least one element in your architecture (e.g. the CPU of the web server or the disks of the database) has reached its saturation limit. As more and more tasks/processes/threads compete over the limited resources, the operating system usually tries to give each of them a fair share, which in turn requires more and more context switches. Generally, the inefficiency incurred by time spent on context switches instead of actual work increases with the load on the system. Therefore you want to make sure that your application runs below the saturation point (i.e. at maximum efficiency and throughput). Unfortunately, you cannot always predict the actual traffic, which is why you also want to know how your application behaves beyond the saturation limit. In practice you need to know at which point an intolerable response time is reached and, just as important, at which point your application stops working at all (e.g. crashes due to lack of memory).

From Theory to Practice


I prepared a small example that shows how a performance test reflecting the theory above could be conducted. A "Hello World" PHP application was deployed to Amazon's PaaS service Elastic Beanstalk using a single micro instance (no auto scaling). I used Apache JMeter for the test. The test plan fetches a single URL for 30 seconds and records the results. The test was repeated with increasing concurrency levels beyond the saturation limit until the application either stopped working altogether or reached a maximum response time of 10 seconds. In order to simulate a new release deployment I ran an additional test, replacing the Hello World one-liner with a small app that prints one of the first 10 Fibonacci numbers at random (recursive implementation). As you can see in the results, this simple change has quite an impact on throughput and response time with increasing concurrency. With such results you have the chance to either optimize the application or increase the capacity before deployment.
Moreover, the throughput graph is color coded by error rate. Given the shown results, you should not run this demo application with more than 128 active connections. If you have an application delivery controller (ADC) that is capable of multiplexing backend connections (ELB is not), you should make sure that the configured maximum of backend connections lies below 128 and as close as possible to the saturation limit. This way the ADC handles the queueing and the backend can continue to operate close to the measured peak throughput**.
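If you just want to get a feeling for this kind of stepped test without setting up JMeter, a rough sketch along the following lines will do. It drives ApacheBench instead of JMeter, the URL and step list are placeholders, and it is not the test plan used for the graphs in this post:

# Stepped load test in the spirit of the test plan described above: run ab
# against a single URL at power-of-two concurrency levels for a fixed duration
# and record throughput and mean response time per step.
import csv
import re
import subprocess

URL = "http://example.com/"        # replace with the system under test
DURATION_S = 30

def run_step(concurrency):
    out = subprocess.run(["ab", "-c", str(concurrency), "-t", str(DURATION_S), URL],
                         capture_output=True, text=True).stdout
    throughput = float(re.search(r"Requests per second:\s+([\d.]+)", out).group(1))
    latency_ms = float(re.search(r"Time per request:\s+([\d.]+)", out).group(1))
    return throughput, latency_ms

with open("summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["concurrency", "throughput_rps", "mean_response_ms"])
    for c in (1, 2, 4, 8, 16, 32, 64, 128, 256):
        rps, ms = run_step(c)
        writer.writerow([c, rps, ms])
        print(f"c={c:4d}  {rps:8.1f} req/s  {ms:8.1f} ms")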

Occasionally freezing micro instance during load test
If you try to reproduce this example you will notice that you get different results on repeated tests. The reason for this is the use of micro instances. The point at which the CPU is throttled varies quite a bit, and therefore the results vary too. During the 30 second test there are periods where the instance seems to freeze for a few seconds. More details about this behavior can be found here. Moreover, with higher concurrency the likelihood of failed health checks from the ELB increases. I had two tests where the ELB marked the instance as unhealthy and delivered 503 error pages with no instance left. These error pages were delivered faster than the actual "Hello World" responses. This is why you shouldn't trust results with errors and why those errors are color coded in the result graph.

Actual Practice


JMeter Summary Report
The graphs shown in this blog post were created with gnuplot using this script. It expects CSV files generated from the JMeter summary report as input. Those can be provided via the command line using gnuplot -e "datafile = 'summary.csv';" or gnuplot -e "datafile = 'summary1.csv'; datafile_cmp = 'summary_old.csv';" for a graph comparing two results. The labels default to the provided file names; they can be customized by setting the datafile_title and datafile_cmp_title variables.
The corresponding JMeter test file uses a single thread group for each concurrency step. I decided to use powers of two for this, but you might want to add intermediate steps at higher concurrencies. Note that this test only requests a single resource (the home page). An actual application will probably consist of more URLs, requiring modifications to this test plan. My preferred setup is to use the CSV Data Set Config element to provide a representative set of resources to test. If you run a test using this example, make sure to adjust the variables in the test plan (especially the target of the performance test). After the test is done, save the result using the "Save Table Data" button in the summary report. Expect that this test will render your web application unresponsive, so do not run it against a production environment.


/jr


* Due to the nature of the test we can safely assume flow balance, which means that the responses per second are equal to the requests per second.
** This requires similar response times across all resources; a few long-running scripts could block the backend connections.

Friday, June 15, 2012

Converged Infrastructure: An Overview

Today I'd like to share my collected overview of converged infrastructure solutions. For those of you unfamiliar with the topic: converged infrastructure can best be summarized as servers, network switches, storage and virtualization pre-engineered and ready to be deployed in your data center. Right now many IT departments are building their own virtualization platforms by selecting the components themselves. Those of you who have already done it know that the task is pretty straightforward, but in the end there are a lot of details to sort out. It is a lot more than just selecting the desired servers, switches and storage and cross-checking the compatibility matrices of the involved vendors.

With the emerging converged infrastructure market the question arises whether you want to start buying pre-built solutions or continue with the old build-your-own approach. I won't try to answer this question here given that others have it well covered.

It all started with VCE's Vblock platform in 2010, but today you have a wide variety of other choices from HP, IBM, HDS and Dell, to name a few. Besides the differences in hardware there are key aspects not easily taken from the white papers, which is why I compiled the listing at the end of this post. In my experience the management solution integrating the different components into one seamless whole is the most important differentiator, followed by the deployment model (ready-to-buy product vs. reference design) and the support model. Right now the hardware seems to play the least important role as long as the solution can satisfy the capacity, availability and performance requirements.

Please let me know if I got something wrong or missed something.


/jr

  • VCE Vblock

  • Cisco / NetApp FlexPod

  • Dell vStart

  • Hitachi (HDS) Unified Compute Platform

  • IBM PureFlex System

  • HP CloudSystem Matrix

  • HP Virtual System

Tuesday, February 7, 2012

Chaos Monkey vs. SPOF Monster


»The best way to avoid failure is to fail constantly«
Just a few days ago I heard about Chaos Monkey for the first time. Engineers at Netflix learned the hard way that things in a public cloud (AWS in their case) will fail. This is why any application architecture for a cloud scenario must handle failure.
Actually, this isn't anything new. You have to do the very same thing in a classic data center scenario; just the way to achieve fault tolerance is usually a bit different. In the cloud you try to scale horizontally as much as possible, and due to the nature of how the scaling works (e.g. auto scaling web servers in a shared-nothing architecture) a single failing component can be easily and automatically replaced. The most common non-cloud approach to fight failure is to duel it with duality. In other words: dual servers, power supplies, network links, switches, HBAs, SAN links, fabrics, controllers and storage arrays, up to entire data centers deployed twice or, in larger setups, with n+1 (n>1) redundancy.

The SPOF Monster is often introduced during changes to existing infrastructure




So, be it in a cloud or not, in theory your architecture is resilient and fault tolerant, but how about the real world? If you were absolutely sure, you could let someone walk into one of your data centers and unplug an arbitrary device without anybody noticing the incident.
Feeling uncomfortable with that thought in mind? Because we have a tendency to avoid failure, there is almost no chance to spot weaknesses in the design until something bad happens. In reality there is always something that has been overlooked, and the lurking SPOF Monster is waiting for it, striking hard when you are not prepared.

This is why Netflix runs a script that randomly terminates stuff within their infrastructure. This script is called Chaos Monkey. The fact that they are running within a cloud environment makes it easy to kill something automatically; no need to walk into a data center and unplug something.
At first this approach might sound horrifying, but think about it: if your architecture is indeed fault tolerant, why not prove it by constantly challenging it in a controlled way? That way the SPOF Monster has no chance to grow and will be revealed immediately.
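Just to illustrate how little code the concept needs, here is a deliberately naive sketch using today's boto3 SDK. The tag names are made up and this is of course not Netflix's actual Chaos Monkey — and only point something like this at infrastructure that is supposed to survive it:

# Pick one random running EC2 instance carrying an opt-in tag and terminate it.
import random
import boto3

ec2 = boto3.client("ec2")

def random_instance(tag_key="chaos", tag_value="opt-in"):
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    return random.choice(instances) if instances else None

if __name__ == "__main__":
    victim = random_instance()
    if victim:
        print(f"terminating {victim}")
        ec2.terminate_instances(InstanceIds=[victim])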

/jr



Saturday, January 21, 2012

Problems with Storage Performance in Virtual Environments

There are two aspects of performance: latency and throughput. In this post I will write about the former.

Let's say you have reports of poor performance affecting multiple virtual machines. You take a look at the virtual machines in question, but everything seems fine: no significant CPU usage, no swapping or ballooning and almost no disk activity.

ESX host with occasional IO latency spikes
Assuming you are running vSphere, you check the performance dashboard of one of the ESX hosts running the VMs. Does the disk latency graph look like the first graph, with occasional spikes way over 30 ms? If so, check other ESX hosts: do they show similar spikes at the same times as the first host? In this particular case multiple hosts are affected, so we are almost certainly experiencing saturation in the storage environment.

Another ESX host probably causing the issue

But before we go on, how much latency will actually cause issues? The short answer is: it entirely depends on your workload. Let's say you are running a database for a web application on the first ESX host. In addition, assume that the normal latency is 5 ms, the start page takes 2 seconds to load and requires 5 database queries resulting in 20 synchronous IOs each. With a latency increase of 20 ms the web site would require an additional 2 seconds to load, doubling the page load time. This might be OK for an internal web application and for a short period of time, but imagine a public web site with the spikes from the first graph: the page would take over 10 seconds to load.
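The arithmetic behind these numbers as a small sketch (the 90 ms value in the last line is just an assumed spike height to illustrate the 10 second case):

# 5 queries per page and 20 synchronous IOs per query mean 100 synchronous IOs
# per page load, so every extra millisecond of IO latency adds 100 ms.
QUERIES_PER_PAGE = 5
SYNC_IOS_PER_QUERY = 20
IOS_PER_PAGE = QUERIES_PER_PAGE * SYNC_IOS_PER_QUERY      # 100 synchronous IOs

def page_load_time(base_load_s, baseline_latency_ms, actual_latency_ms):
    extra_ms = (actual_latency_ms - baseline_latency_ms) * IOS_PER_PAGE
    return base_load_s + extra_ms / 1000.0

print(page_load_time(2.0, 5, 5))     # 2.0 s at the normal 5 ms latency
print(page_load_time(2.0, 5, 25))    # 4.0 s with a 20 ms latency increase
print(page_load_time(2.0, 5, 90))    # 10.5 s during an assumed 90 ms spike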

Now back on topic: how do we solve the issue? The graph from the second ESX host is typical for batch jobs like backups or database imports/exports. Those jobs tend to run as fast as possible, so using a faster disk array will only shorten the duration of the side effects but will not eliminate them as long as the disk array is the slowest part in the equation. So in this case the issue is not having two ESX hosts with high-latency IOs, but having one virtual machine on one host starving all other machines in the cluster. This is known as the noisy neighbor problem.

A manual approach would be to isolate those bursty workloads on separate disks and in a separate datastore. Doing this easily becomes time consuming and cumbersome, especially in large environments. If you are on vSphere 4.1 or higher and have, or can afford, Enterprise Plus licenses, there is an easier solution: Storage IO Control, or SIOC for short. It distributes the available IO capacity fairly among all virtual machines as soon as the latency on a datastore passes a configured threshold, thereby preventing a noisy neighbor from severely affecting other virtual machines running from the same datastore.

hth someone,
/jr

Thursday, January 19, 2012

QoS, vSphere SIOC and Shared Disk Pools

In the last post I wrote about disk latency and how to detect it. Today I want to build on that.

Imagine an urgently called meeting where somebody says one of these:
• »We have to run this on dedicated hardware, we need more performance«
• »Customers are complaining about occasional slow response times, this app does not seem to work with insert-your-favorite-hypervisor-here«
Having a déjà vu reading this? Many people still think that critical or heavy workloads are not suited for virtual environments. But why is that so? Probably bad experiences while trying to virtualize something. While I am a huge fan of virtualization, I have to admit that I have had similar experiences over the last years: slow databases, unresponsive user interfaces, even application errors that did not occur on dedicated hardware.
Those experiences stand in direct contrast to vendor benchmarks like these, virtualization overhead analyses like this one or a study like this. In most of the performance issues I analyzed, the gap between almost-native speed and unusable applications was caused by issues in the storage environment. The reason for that is simple: the storage is the weakest link in the virtual chain (i.e. the slowest component). In a dedicated setup like a Vblock you can easily detect issues, but how about large shared SAN infrastructures? Before I go into the details, some background about VMware and today's storage arrays. If you are running another hypervisor, hang on; most of what I write still applies, but the solution is different.
        
VMware added a pretty neat feature to vSphere 4.1 called Storage IO Control, or SIOC for short. It distributes the available IO capacity fairly in case of congestion (i.e. increased IO latency). I won't go into the gory details, so if you are interested I recommend this paper. In short, SIOC distributes the available IO capacity to the VMs according to their shares: a VM with 10% of the total shares will get at least 10% of the total capacity. Shares were available before 4.1, but with SIOC the mechanism works across all ESX hosts sharing a datastore (i.e. running in the same cluster). In addition, SIOC does nothing as long as the storage array (or the path to it) is not saturated, so less prioritized VMs can consume more IOs as long as other VMs do not need them.
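As a quick illustration of the proportional-share idea (VM names, share values and the total capacity are made up for this sketch):

# During congestion each VM is entitled to a slice of the datastore's IO
# capacity proportional to its shares; without congestion, unused capacity
# can be consumed by anyone.
def entitled_iops(shares, total_capacity_iops):
    total_shares = sum(shares.values())
    return {vm: total_capacity_iops * s / total_shares for vm, s in shares.items()}

shares = {"db-vm": 2000, "web-vm": 1000, "batch-vm": 1000}
print(entitled_iops(shares, total_capacity_iops=15000))
# {'db-vm': 7500.0, 'web-vm': 3750.0, 'batch-vm': 3750.0}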

Now, to the core part of this post. If you take a look at today's storage arrays you will notice that almost all vendors offer pooling functionality that aggregates multiple RAID sets into one big chunk of storage. NetApp calls this an aggregate; EMC, HDS and HP call it (disk/storage) pooling.
Block-based arrays often stripe provisioned LUNs across all RAID sets in a pool, achieving throughput not possible with a single RAID set. In addition this approach allows thin provisioning, easy dynamic LUN resizing and overall less work for the storage engineer.

No QoS: Per-datastore latency on a shared disk pool without SIOC
Sounds great? Imagine a mid-range array with 200 2.5" 10k disks in a single pool, offering approx. 15,000 IOPS in a standard RAID 5 7+1 setup with a 70/30 read/write ratio (a quick back-of-the-envelope check of that number follows below). What happens if you provision multiple datastores from this single pool and all VMs combined want to consume more than the offered 15k IOPS?
In a classical setup you cannot guarantee any quality of service, as you can see in the latency graph: as soon as a single datastore has issues, all other datastores in the pool are affected as well.
Luckily, as you can read in a knowledge base entry from VMware, SIOC can handle this situation as long as two requirements are met. First, all datastores provisioned from the disk pool must be managed by a single vCenter. Secondly, the disk pool should not be shared with non-virtual workloads. While vSphere SIOC can detect such workloads, it can only prevent starvation and ensure that the remaining IO capacity is fairly distributed across all virtual machines. SIOC simply can't offer any real quality of service in this setup as long as there is no array/SAN based QoS mechanism in place. But more on that in a later post.
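And here is the promised back-of-the-envelope check of the 15,000 IOPS figure, using the usual rules of thumb (~140 IOPS per 10k disk, RAID 5 write penalty of 4 — rough assumptions, not vendor numbers):

# 200 backend disks, RAID 5 write penalty of 4, 70/30 read/write ratio
DISKS = 200
IOPS_PER_10K_DISK = 140
READ_RATIO, WRITE_RATIO = 0.7, 0.3
RAID5_WRITE_PENALTY = 4

backend_iops = DISKS * IOPS_PER_10K_DISK
frontend_iops = backend_iops / (READ_RATIO + WRITE_RATIO * RAID5_WRITE_PENALTY)
print(f"{frontend_iops:.0f} front-end IOPS")   # roughly 14,700, i.e. ~15k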

All this sounds reasonable, but you are running another hypervisor or can't afford the pricey Enterprise Plus licenses required for Storage IO Control? Right now you probably have to take the traditional approach and split up the disk pool into as many parts as required (e.g. one pool per service and/or customer) to ensure a decent quality of service.

/jr