Cloud – No fairy tale for performance-sensitive applications

Lately, nearly every organization that builds software products or services has been striving to develop new offerings for the cloud or to move existing ones there.

Start-ups and small organizations see it as an obvious choice, or rather the only viable option, since it helps them save the upfront set-up costs of infrastructure and maintenance, and reduce the time-to-market significantly. Larger organizations with well-established products/services see it as an opportunity to offer greater flexibility and cost efficiency to their customers, hence making their products more competitive and future-safe. It is a win-win proposition for all stakeholders – OEMs, vendors, service providers and customers.

Even though it is considered a panacea and the future of computing, the cloud does not come without its share of problems. It is not possible to discuss all of them in a single post, so I’ll confine myself to one of the foremost challenges that architects face while designing software and planning cloud deployments: software performance. The three main aspects of software performance in cloud deployments are:

  • Forecasting and ensuring performance and capacity
  • SLA monitoring
  • Troubleshooting and isolating infrastructure issues

While performance and capacity may not be a significant concern for some software products, they decide the fate of many others, especially in telecom and other time-critical domains.

Performance Prediction

Traditional carrier-grade software is often packaged with engineered hardware underneath it. Performance-targeted lab tests provide fair assurance of the software's performance and its limits on any given hardware. In most cases, the hardware resources (compute, network, I/O, memory, and storage) are tuned to optimize the software's performance. However, this does not hold true for cloud deployments, especially when the software product claims compatibility with multiple third-party cloud products available in the market.

With software built for the cloud, the software manufacturer loses control of and visibility into the underlying hardware and infrastructure, unless it is a self-managed private cloud. Hence, vendors with cloud-hosted solutions often find it difficult to offer a minimum guaranteed performance (for example, transactions per second or number of concurrent users supported) while still meeting the SLA.

One could argue that cloud computing provides inherent flexibility to scale the infrastructure, both vertically and horizontally, and that this should mitigate any performance concerns. This, however, does not hold, for a couple of reasons:

  • Software, if not appropriately designed to be scalable, does not always perform better with additional hardware resources
  • Scaling up the infrastructure increases the cost, which might not always be acceptable
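The first point can be illustrated with Amdahl's law, which is my choice of illustration and not something the argument above depends on. It gives an upper bound on speedup when only part of the work can run in parallel. The sketch below is a minimal example; the 80% parallel fraction is an assumed figure, not a measurement of any real product.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 80% of the work parallelizes, 20% stays serial.
    for workers in (1, 2, 4, 8, 16, 64):
        print(f"{workers:>3} workers -> {amdahl_speedup(0.80, workers):.2f}x")
```

Under that assumption the speedup can never reach 5x, however many instances are added, while the infrastructure bill keeps growing with every instance.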

Performance Monitoring

Monitoring the performance of cloud-hosted software is not as straightforward as it is for traditional solutions. Legacy performance management systems are not adequately equipped to provide a holistic view of the cloud environment.

The hypervisor adds an overhead to every hardware request. This is typically minimal, but needs to be tracked on a continuous basis to detect any anomalies in the early stages.

When the hypervisor delays a guest's access to the physical CPU, the wait adds latency. This delay, known as steal time or ready time, can seriously degrade performance. It affects everything the guest does (I/O, memory, network, CPU and so on) and cannot be detected through standard utilization metrics. A performance monitoring solution for a cloud-based service should be capable of dynamically identifying the VMs in which the application/service is currently running, and should take into account the CPU ready time (VMware) or CPU steal time (Xen, KVM) as reported by the hypervisor.
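On a Linux guest, steal time is also visible from inside the VM: the aggregate "cpu" line of /proc/stat carries it as the eighth counter. The sketch below is a minimal example that assumes a Linux guest; the function names and the sampling interval are my own. It computes the share of CPU time stolen between two samples.

```python
import time


def parse_cpu_line(line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    # Counter order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only the first 8.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    stolen = deltas[7] if len(deltas) > 7 else 0  # very old kernels lack the field
    return 100.0 * stolen / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    sample_a = read_cpu_line()
    time.sleep(5)  # assumed sampling interval
    sample_b = read_cpu_line()
    print(f"steal: {steal_percent(sample_a, sample_b):.1f}%")
```

A guest can show moderate CPU utilization and still be starved; tracking this value alongside utilization is one way to spot that.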

CPU utilization and time measurements at the VM/Guest level could be misleading at times due to VM suspension and timekeeping problems. Method-level performance analysis is a problem because a method might appear to be more expensive (due to VM suspension or catch-up) than it actually is. The timekeeping problem is nicely explained in a whitepaper available at Virtual Machine.

Hardware is shared and finite. Whenever dealing with shared resources, as in a cloud infrastructure, performance fluctuations in one workload can affect the hardware's other workloads. This is known as the problem of Noisy Neighbors and can be detrimental to time-critical services. In case of noisy neighbors, analyzing the application in the impacted VM for performance issues fails to identify the root cause because it actually lies in another application running on a VM sharing the same infrastructure resources.

Cloud infrastructure is extremely complex, and the weakest link in the chain defines the overall capacity and quality. Performance metrics broadly fall into one of the following categories:

  • Timing: Response time, processing time, throughput rate
  • Resource utilization: Degree to which the resources are used
  • Capacity: Maximum limits of the system, for example the transport bandwidth, the throughput of transactions, the size of database, and the maximum number of subscriber records

The user’s view of timing and capacity, and the system’s view of resource utilization must be recorded and examined.
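As a rough illustration of recording both views together, the sketch below summarizes one measurement window: response-time percentiles and throughput from the user's side, and CPU utilization from the system's side. All names, the nearest-rank percentile method, and the sample figures are assumptions made for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(response_times_ms, window_seconds, cpu_busy_seconds, cpu_count):
    """Combine user-facing timing with system-side utilization for one window."""
    return {
        # User's view: how long requests took and how many were served.
        "p50_ms": percentile(response_times_ms, 50),
        "p95_ms": percentile(response_times_ms, 95),
        "throughput_tps": len(response_times_ms) / window_seconds,
        # System's view: share of the available CPU time that was used.
        "cpu_utilization_pct": 100.0 * cpu_busy_seconds / (window_seconds * cpu_count),
    }


if __name__ == "__main__":
    # Hypothetical samples: ten requests observed over a two-second window.
    times = [12, 15, 11, 250, 14, 13, 16, 12, 18, 17]
    print(summarize(times, window_seconds=2, cpu_busy_seconds=3, cpu_count=4))
```

In this made-up window the median looks healthy while the 95th percentile does not, and utilization is modest. That combination is a hint to look beyond the guest's own resource usage.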

Performance Troubleshooting

This is how one typically troubleshoots performance issues:

  • Identify the specific scenario (if any) that is causing the performance bottleneck.
  • Verify the environmental constraints, such as CPU or memory exhaustion.
  • Isolate the problem to a specific component, method, or service call within the application.
  • Determine if the root cause is algorithmic, CPU-centric, or caused by external bottlenecks, such as I/O, network, or locks (synchronization).

In a cloud environment, one also needs to identify the problematic tier or tiers. The problem might also lie between two tiers (that is, in network latency), as elaborated below:

  • If inter-tier time is rising, it can be due to over-utilization of the underlying network infrastructure.
  • If a single problematic tier is identified, it could be due to excessive VM suspensions. This can be verified by analyzing the steal time.
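The two rules above can be sketched as a small triage function. This is only an illustration: the TierSample type, the growth factor of 1.5 and the 10% steal threshold are hypothetical values chosen for the example and would need tuning against a real baseline.

```python
from dataclasses import dataclass


@dataclass
class TierSample:
    name: str
    baseline_ms: float  # typical processing time inside the tier
    current_ms: float   # processing time observed now
    steal_pct: float    # CPU steal (or ready) time over the same window


def triage(inter_tier_baseline_ms, inter_tier_current_ms, tiers,
           growth_factor=1.5, steal_threshold_pct=10.0):
    """Return a list of findings based on inter-tier time and per-tier steal time."""
    findings = []
    # Rule 1: rising inter-tier time points at the network between tiers.
    if inter_tier_current_ms > growth_factor * inter_tier_baseline_ms:
        findings.append("network: inter-tier time is rising, check network utilization")
    # Rule 2: a slow tier with high steal time points at VM suspension.
    for tier in tiers:
        if tier.current_ms > growth_factor * tier.baseline_ms:
            if tier.steal_pct >= steal_threshold_pct:
                findings.append(f"{tier.name}: slow with high steal time, suspect VM suspension")
            else:
                findings.append(f"{tier.name}: slow with low steal time, profile the application")
    return findings


if __name__ == "__main__":
    tiers = [TierSample("web", 20, 22, 1.0), TierSample("db", 30, 90, 18.0)]
    for finding in triage(2.0, 2.1, tiers):
        print(finding)
```

A slow tier with low steal time falls back to the ordinary application-level steps listed earlier.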

Many cloud vendors provide a management/monitoring interface that collects and reports a robust set of resource utilization statistics. These interfaces give a good view of individual components but fail to provide a complete picture from the system and network perspective. The intra-cloud network topology is not exposed to users, yet it plays a vital role in shaping the performance of applications that span multiple hosts or nodes in the cloud.

Cloud features such as automatic provisioning and orchestration, infrastructure management and auto-scaling hide the underlying hardware and the associated VM assignments, but applications still use the actual physical resources. So while we may be blissfully ignorant of these assignments, when things go wrong we need to be able to track down the problem.

What’s the solution?

Software manufacturers that have settled on or are considering the cloud for hosting their software face a substantial need for simulation and monitoring tools that can assess and monitor a given cloud infrastructure. Such tools must facilitate reliable forecasting of software performance, monitoring of SLAs, and isolation of any performance/capacity issue to either the infrastructure or the hosted software. I will discuss frameworks/tools and the relevant metrics in my next post. Stay tuned!
