Reducing Cloud Operating Costs

By Tim Pruitt / on 22 Mar, 2021

Optimizing Cloud Spend

Do your cloud servers scale on demand? What about recovering from errors and faults? Is that an automated process?

If you run load-balanced server farms or another distributed processing design and have not taken advantage of the automation and recovery features of cloud platforms, there is a real opportunity to cut operating costs or prevent revenue loss.

Autoscaling Cloud Servers for Cost Savings

You may design strictly for peak traffic and the highest anticipated resource demands. That is a great idea in a traditional data center model with legacy systems that cannot scale and deploy dynamically. In modern cloud design, scaling up or down based on workload is a very efficient operations strategy. Scaling down in periods of low demand can yield significant cost savings.

Maybe you run an online retail site. Your sales are primarily in North America, with peak traffic occurring after the workday across all time zones. In the early morning hours, you have very little traffic. During peak you need 20 servers to handle the workload, but at 1 am you typically need 1 or 2. A strategy of always having 20 servers available to accommodate peak workloads is perfectly valid and is typically how companies begin operating in the cloud. But after a few months, you notice that most of your servers are idle in the early morning hours.
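To make the scenario concrete, here is a rough sketch of the arithmetic. The hourly demand profile and the price per instance-hour are made-up assumptions for illustration, not real pricing or real traffic data; only the 20-server peak and the 1 or 2 servers at 1 am come from the scenario above.

```python
# Hypothetical servers needed for each hour of the day (index 0 = midnight).
# These numbers are illustrative only; substitute your own historical data.
SERVERS_NEEDED_BY_HOUR = [
    2, 1, 1, 1, 2, 3, 5, 8, 10, 12, 12, 12,
    12, 12, 14, 16, 18, 20, 20, 20, 18, 12, 6, 3,
]
PEAK_SERVERS = 20          # peak fleet size from the retail scenario
HOURLY_PRICE_USD = 0.10    # assumed price per instance-hour


def daily_cost(servers_by_hour, hourly_price):
    """Cost of running the given number of servers in each hour."""
    return sum(servers_by_hour) * hourly_price


def compare_fixed_vs_scaled(servers_by_hour, peak_servers, hourly_price):
    """Return (fixed_cost, scaled_cost, fraction_saved) for one day."""
    fixed = daily_cost([peak_servers] * len(servers_by_hour), hourly_price)
    scaled = daily_cost(servers_by_hour, hourly_price)
    return fixed, scaled, 1 - scaled / fixed


if __name__ == "__main__":
    fixed, scaled, saved = compare_fixed_vs_scaled(
        SERVERS_NEEDED_BY_HOUR, PEAK_SERVERS, HOURLY_PRICE_USD
    )
    print(f"fixed fleet: ${fixed:.2f}/day")
    print(f"autoscaled:  ${scaled:.2f}/day")
    print(f"saved:       {saved:.0%}")
```

With this invented profile, the scaled fleet consumes 240 instance-hours per day against 480 for the fixed fleet. Your own ratio will depend entirely on how spiky your traffic is.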

You pay for the compute you run, whether or not it is busy. If you have idle instances in production, you are paying for them. Setting thresholds and other criteria to terminate instances in periods of low demand can significantly reduce costs. This will take some expertise and analysis of historical trends. If that data is not available, consider capturing a few metrics such as CPU load, connections, memory utilization, and idle time.
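One simple way to turn a metric into a scaling decision is a proportional rule: size the fleet so that average CPU lands near a target. The sketch below assumes CPU is the only signal and that the target and fleet bounds are values you choose; the function name and parameters are hypothetical, and a real policy would usually add cooldowns and combine several metrics.

```python
import math


def desired_capacity(current_servers, avg_cpu_pct, target_cpu_pct,
                     min_servers, max_servers):
    """Proportional sizing: scale the fleet so average CPU moves toward the target.

    All thresholds here are assumptions to tune against your own history.
    """
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    # If 20 servers sit at 5% CPU and the target is 50%, about 2 servers suffice.
    raw = math.ceil(current_servers * avg_cpu_pct / target_cpu_pct)
    # Never drop below the floor or exceed the ceiling you have budgeted for.
    return max(min_servers, min(max_servers, raw))


if __name__ == "__main__":
    # Quiet early-morning hours: 20 servers mostly idle.
    print(desired_capacity(20, avg_cpu_pct=5.0, target_cpu_pct=50.0,
                           min_servers=2, max_servers=20))
    # Evening ramp: 10 servers running hot.
    print(desired_capacity(10, avg_cpu_pct=90.0, target_cpu_pct=50.0,
                           min_servers=2, max_servers=20))
```

The minimum keeps a safety margin during quiet hours, and the maximum caps spend if a metric misbehaves.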

Scaling down is a great cost-savings effort, but scaling up can also help you preserve revenue. Knowing the demand trend is critical for a scale-up policy. If you attempt to scale up too late, you may see errors caused by resource exhaustion for the duration of the scaling operation. You will want to account for that lead time and begin the scaling operation before you need the resources, not when you need them. That will take some finessing: you do not want to scale up unnecessarily, but you also need to avoid scaling up too late.
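The lead-time idea can be sketched as a look-ahead over a demand forecast: at each time slot, provision for the highest demand expected between now and the end of the lead time. The forecast values and the lead time below are hypothetical, and the sketch assumes the forecast repeats daily, so the look-ahead wraps around to the start of the list.

```python
def capacity_schedule(forecast, lead_slots):
    """For each slot, return the capacity to request so it is ready in time.

    forecast   -- servers expected to be needed in each slot (hypothetical data)
    lead_slots -- how many slots a scale-up takes to complete (an assumption)

    Looking ahead means scale-ups start early, while scale-downs only
    happen once demand has actually dropped.
    """
    n = len(forecast)
    schedule = []
    for t in range(n):
        # Highest demand from now through the end of the lead time,
        # wrapping around because the daily pattern is assumed to repeat.
        window = [forecast[(t + k) % n] for k in range(lead_slots + 1)]
        schedule.append(max(window))
    return schedule


if __name__ == "__main__":
    forecast = [2, 2, 10, 20, 5]   # illustrative demand per slot
    print(capacity_schedule(forecast, lead_slots=1))
```

A longer lead time gives more safety but more idle capacity, which is the trade-off to tune against your own scaling times.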

Health and Error Checking to Auto-remediate Problem Instances

In the same online retail scenario, suppose that during high-demand periods you have an application that tends to max out CPU utilization when certain error conditions are met. When 2 of your instances have maxed CPU, the application group produces a 10% error rate. This could be driving customers away from what they perceive to be a “junky” site.

If you happen to have decent monitoring set up, your NOC will see the errors and react. They are alerted, log into the console, and either restart those instances or terminate them and spin up new ones. Depending on human factors, that process could allow several hours to elapse. Is your NOC 24x7? Does it take 20 minutes to reach someone?

A better solution is to automate remediation based on those conditions. Through health checks or logging/monitoring, you can identify instances with greater than x% CPU utilization, remove them from the load balancing group, add new instances, and terminate the bad instances, all in a matter of minutes.
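The remediation flow can be sketched as two steps: decide which instances are unhealthy, then produce an ordered plan of actions. Everything here is hypothetical: the instance IDs, the 95% threshold, the three-sample rule, and the action names are placeholders, and a real implementation would call your cloud provider's own APIs in place of building a list.

```python
def find_unhealthy(cpu_samples, threshold_pct, consecutive):
    """Return IDs whose most recent `consecutive` CPU samples all exceed the threshold.

    Requiring several samples in a row avoids replacing an instance
    over a single short spike.
    """
    unhealthy = []
    for instance_id, samples in cpu_samples.items():
        recent = samples[-consecutive:]
        if len(recent) == consecutive and all(s > threshold_pct for s in recent):
            unhealthy.append(instance_id)
    return unhealthy


def plan_remediation(unhealthy_ids):
    """Build the ordered actions: pull from the load balancer, replace, terminate."""
    plan = []
    for instance_id in unhealthy_ids:
        plan.append(("deregister", instance_id))          # stop sending traffic
        plan.append(("launch_replacement", instance_id))  # restore capacity
        plan.append(("terminate", instance_id))           # stop paying for it
    return plan


if __name__ == "__main__":
    # Hypothetical CPU readings (percent), oldest first.
    samples = {
        "web-1": [40, 55, 60],
        "web-2": [97, 99, 98],
        "web-3": [99, 30, 99],
    }
    bad = find_unhealthy(samples, threshold_pct=95, consecutive=3)
    for action, instance_id in plan_remediation(bad):
        print(action, instance_id)
```

Deregistering first stops customers from hitting the bad instance while its replacement comes up.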

We have touched on just the first steps of automating operations and making them efficient with respect to administrative overhead and consumer demand. For a business that is new to cloud infrastructure, there will be a lot of trial and error or historical analysis ahead. Without the manpower and expertise, attempts to reduce costs and increase efficiency may cause some headaches, especially at the executive level, where a failure to execute may be seen as a cause of revenue loss. It is not always cost effective to outsource operations, but outside help often provides a huge benefit in getting your IT team headed in the right direction.

In future posts, we will discuss automating deployments for high availability and failover, as well as designing resource allocation for geographic performance. Subscribe to our blog to stay up to date!