Azure costs – how we reduced our bill by more than 50%
We’ve been happily reliant on Microsoft's Azure platform since we launched ShareIn. It provides a flexible set of services that lets us focus on building our products without worrying about our hosting environment, and it’s enabled us to scale rapidly over the years. We quickly added new services as we needed them and moved on to the next project. But as we’ve grown, so has the cost, and I was mindful that we weren’t always using the services in the most efficient way.
We undertook a wholesale review of our Azure footprint and by the end of this process we had managed to reduce our monthly Azure bill by an amazing 50%.
How did we do it?
Our starting point was the Azure Cost Analysis tool, available in the Azure web portal. It gives you a very detailed breakdown of your spend. By filtering daily spend by resource group name (the default) you can view the cost of each resource per day.
Immediately we can see where big chunks of our spend are going each day and it naturally suggests where to start looking for cost optimisation. Database and App Service costs were our biggest outgoings accounting for over half our daily spend.
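Cost Analysis can also export daily usage as CSV, which makes it easy to rank resource groups by spend outside the portal. A minimal sketch of that ranking step, assuming the export has `ResourceGroup` and `CostInBillingCurrency` columns (column names vary by export schema, so adjust to match yours):

```python
import csv
from collections import defaultdict

def spend_by_resource_group(csv_path):
    """Total exported daily costs per resource group, largest first.

    Assumes 'ResourceGroup' and 'CostInBillingCurrency' columns exist;
    rename these to match your own cost export's schema.
    """
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["ResourceGroup"]] += float(row["CostInBillingCurrency"])
    # Biggest spenders first -- these are the first candidates for review.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The sorted output gives the same "where to start looking" view described above, but in a form you can diff week to week.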
At any given time we’re running production, UAT, dev and test databases for multiple separate applications. New databases are added in sets when we build out new products, and over the years, as we’ve built ShareInInvest, ShareInPay and the ISA API, we’ve added batches of new database instances.
By default Azure SQL server instances have audit logs turned on at the server level, so every new database instance also generates database audit logging activity. This is important for our production systems, but we didn’t need it on our test, dev and UAT environments. So we turned off server-level auditing and then enabled it at the database instance level as required. After turning it off, we deleted the historic storage account logs so we wouldn’t keep paying for that storage.
Next – elastic pools. This is perhaps where the single biggest saving came from. Elastic pools allow you to share resources across databases instead of configuring dedicated resources for each one. Our non-production systems do not require much ongoing resource: they have spikes when running tests, but otherwise lie dormant. By including these environments in a single elastic pool we were able to stop paying for resources we weren’t using.
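The shape of the saving is easy to see with some back-of-the-envelope arithmetic. The figures below are made-up placeholders, not Azure's actual rates – the point is that many mostly-idle databases sharing one pool beats each of them paying for dedicated capacity:

```python
# Illustrative only: hypothetical prices, not real Azure pricing.
DEDICATED_DB_PER_MONTH = 15.0   # assumed cost of one dedicated dev/test database
POOL_PER_MONTH = 75.0           # assumed cost of one small elastic pool
NUM_NON_PROD_DBS = 12           # dev, test and UAT databases across products

dedicated_total = NUM_NON_PROD_DBS * DEDICATED_DB_PER_MONTH
pooled_total = POOL_PER_MONTH
saving = dedicated_total - pooled_total
print(f"dedicated: {dedicated_total:.0f}, pooled: {pooled_total:.0f}, saving: {saving:.0f}")
```

The more sporadically-used databases you fold into the pool, the bigger the gap grows, since the pool cost stays roughly flat while the dedicated cost scales per database.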
SSL & IP
Our ShareInInvest SaaS platform has many domains resolving to a single app instance. We enforce HTTPS on all connections, and in Azure you can host SSL certificates and bind them to subdomains using the SNI protocol for free – great! But if you want to map a root domain then you’ll need to bind the SSL certificate to a dedicated IP address. These IP addresses must be rented from Azure, with a maximum of 10 allowed per App Service plan. With a growing number of domains these costs really started to add up. Given our strong preference for Azure tools we looked for solutions in Azure, but we couldn’t avoid this expensive setup using Azure alone. In this case Azure wasn’t the best tool. We discovered Cloudflare, which issues free SSL certificates for your domains and allows you to manage subdomain and root domain SSL bindings (it actually binds automatically and issues you 15-year SSL certificates). So we swapped all the IP bindings over to free Cloudflare bindings and retired all the rented IP addresses. This cut our App Service costs in half, and it means they won’t grow as we add new domains.
Redis cache has made our sites faster. I love it. But when we first made use of it we weren’t sure of the most appropriate size for our setup and bought a larger tier than we needed. In addition we didn’t manage connections well, relying on the connections available with the pricing tier rather than managing them in the application.
We addressed this by introducing connection pooling at the application level, thereby reducing the connections needed in the service instance. We then downgraded the tiers we used to better fit our actual usage. Downgrading an Azure Redis cache instance restarts the cache and interrupts service delivery, so to avoid this we set up new instances, changed the address pointers in our applications, and then deleted the old cache instances. This cut our Redis cache service costs in half.
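The application-level fix amounts to sharing one client per process rather than opening a fresh connection per request. A minimal sketch of that pattern, with a hypothetical `RedisClient` standing in for a real client library (in practice you'd use your client's built-in pooling, such as a shared connection multiplexer):

```python
import threading

class RedisClient:
    """Hypothetical stand-in for a real Redis client object;
    your actual client library replaces this class."""
    def __init__(self, url):
        self.url = url

_client = None
_lock = threading.Lock()

def get_client(url="redis://cache.example:6379"):
    """Return one shared client per process.

    Every caller reuses the same underlying connection instead of
    opening its own, which keeps the connection count well inside
    the limit of a smaller cache tier.
    """
    global _client
    if _client is None:
        with _lock:
            if _client is None:   # double-checked locking
                _client = RedisClient(url)
    return _client
```

With requests funneled through one shared client, the connection count no longer scales with traffic, which is what made the smaller tier viable.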
As we built out our solution we added virtual machines for specific functions, such as SFTP servers or running dedicated exes for single processing jobs we couldn’t run in App Services. These instances grew quickly in an ad-hoc fashion, and there was duplication. After review we were able to consolidate these various services to reduce the number of VMs we needed, thereby reducing our costs further.
Remove services we no longer needed
Azure is rich with services. We’re often trialing new features to see if they will improve our operations – this could be in testing, analytics, or new product configuration. We’d built up a collection of these test projects that didn’t ever make it to production. Time for a clear out. Again the Cost Analysis tool is excellent for detailing where your money is going. Drilling down into the smaller spend section we found services that could be immediately deleted. We’re now in the habit of monitoring services and pruning any that don’t make it to production.
If you’re not careful Azure will add additional services you might not need when you are creating new services. One that caught us out is Application Insights. Application Insights is great for monitoring application behaviour in production, and we do make use of it. But we also run many basic web site instances that have no need for this level of monitoring. Application Insights is bundled with a new App Service by default, so we were running it for applications that just didn’t need it. We deleted those services and saved a little more.
With a multi-tenant setup our build times tended to increase slightly every time we added a new client, as they would invariably have some bespoke views added to their theme. We got to a point where this was operationally prohibitively slow. Not only was it slow, it was costing money in hosted build instances: every minute a build was running was a minute another build sat in the queue. With a growing development team and multiple product pipelines, long build queues were not tolerable and were sapping morale!
We bought more hosted build agents to alleviate the bottleneck, but this is an expensive long-term solution and only likely to increase in cost over time.
The first step we took was to reduce the build time by introducing parallel compilation of views using this package https://github.com/StackExchange/StackExchange.Precompilation. This brought build times down by around 50%. After that we hit a wall that no amount of further code optimisation seemed to get us past. We started to look at hardware.
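The gain comes from the fact that views compile independently of each other, so they can be farmed out to a worker pool instead of compiled one after another – which is what StackExchange.Precompilation does for Razor views in C#. A language-neutral sketch of the idea, with a hypothetical `compile_view` standing in for the real per-view compile step:

```python
from concurrent.futures import ThreadPoolExecutor

def compile_view(view_name):
    """Hypothetical per-view compile step; in the real pipeline this
    is the Razor view compilation the package parallelises."""
    return f"{view_name}.compiled"

def compile_all(views, workers=4):
    # Each view is an independent unit of work, so a worker pool can
    # process several at once; wall-clock time shrinks roughly with
    # the number of workers (up to the number of cores).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_view, views))
```

With enough cores, total build time approaches the duration of the slowest view plus overhead, rather than the sum of all views – consistent with the roughly 50% reduction we saw.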
Azure lets you run pipelines on your own machines. We did some benchmarking of builds using developer machines and saw an immediate improvement in build times. So we went all in and bought dedicated build machines to run all our pipelines. There’s an upfront cost here, but hardware is much cheaper than it once was. It meant we stopped paying for hosted agents, which were adding to our monthly Azure bill. It also meant faster build times, which was good for development team productivity and team happiness!
There is no single silver bullet for reducing costs. We reviewed our entire setup and were able to find many small improvements, which combined added up to significant savings. In some cases we replaced services with better options at lower cost. But overall the savings were achieved by examining each of the services we use in detail and reconsidering its setup, given Azure's available configurations and our true requirements. It’s a process that will continue.