Computing has become a ubiquitous and fundamental part of our everyday life. But it was not always this way; the role of computing has steadily increased in both its availability and importance until today, where the management of practically every complex system has been deferred to a computer of some kind or other.
Managing these computers has been a story as old as their development. This post seeks to explore the history as a lens through which to reflect on the present, discuss some of the challenges involved in the shifting responsibility of infrastructure and make a conjecture about what the future holds.
What is infrastructure even?
Before diving too deeply into the history of infrastructure, it’s worth defining what infrastructure is (at least, in the context of this post).
Broadly speaking, I use the following definitions:
An application is the thing I actually care about working; the part I modify on a day to day basis. Things such as Magento, WordPress, Slack, MySQL, Redis etc.
Infrastructure is the other part of the equation: the things that are necessary to keep the application working. Things such as EC2, load balancers, HDDs, RAM, RDS and ElastiCache.
These are not elegant definitions, but rather just serve as a heuristic. Observant readers will note that RDS is hosted MySQL; the lines between “what is infrastructure” and “what is an application” are hard to draw.
In the beginning
In truth, it’s difficult for me to imagine the beginning. For as long as I have been administering Linux systems, there have been configuration management tools. However, I can reflect on the personal journey I have had, and the transformation that I have been part of with the organisations I have worked with.
When I started developing and deploying software, we had “pets”. From the (now famous) “pets vs cattle” blog by Noah Slater at Engine Yard:
Pets are given loving names, like pussinboots (or in my case, Tortoise, Achilles, etc.). They are unique, lovingly hand-raised, and cared for. You scale them up by making them bigger. And when they get sick, you nurse them back to health.
We had dev machines, testing machines and prod machines. Rigid discipline was kept, and only a few special souls known as “sysadmins” were allowed to touch these machines. Deployments were a two-person job, scheduled days in advance and announced with warnings.
Often the management of these machines was simply dubbed “too hard”, and a third-party provider was used. These providers offered a simple mechanism to deploy code, but limited ability to control or debug it. There were two teams of people:
- “operations” who managed the services that were deployed to, and
- “developers” who built the applications and deployed them to these systems
These two groups were often at loggerheads, refusing to accommodate the needs of their counterparts due to the great cost to their own projects.
Getting good at managing lots of machines
The nature of operations is that a large number of its tasks are inherently repetitive. Regardless of whether the application is Magento, WordPress or even an IRC client, the machine still needs Linux, several core system packages, access configured, logs stored and exported, and monitoring set up.
As early as 1993, open source tools started being developed to manage many machines. Most notably, they did so by defining a domain specific language (DSL) that declared a server specification, rather than a procedural set of steps to execute. The software would then reconcile the server state with the specification, automatically determining the steps to bring the machine into the expected state.
This dramatically reduced the overhead of managing machines. Rather than attempting to specify how each specific machine needed to be set up, a user could write a definition and defer the management to the software. This also allowed the software to correct any deficiencies that inevitably appear over time.
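To make the idea concrete, here is a minimal sketch in Python of reconciling a machine against a declared specification. It is not how any particular tool works: the `ServerState` type and the `plan` and `apply` functions are hypothetical names invented for illustration, and the “machine” is just an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class ServerState:
    """Hypothetical, simplified view of a machine: packages and config files."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)  # path -> contents


def plan(desired: ServerState, actual: ServerState) -> list:
    """Work out the steps needed to move `actual` towards `desired`."""
    steps = []
    for pkg in sorted(desired.packages - actual.packages):
        steps.append(("install", pkg))
    for path, contents in sorted(desired.files.items()):
        if actual.files.get(path) != contents:
            steps.append(("write", path))
    return steps


def apply(steps: list, desired: ServerState, actual: ServerState) -> None:
    """Execute the planned steps against the (in-memory) machine."""
    for action, target in steps:
        if action == "install":
            actual.packages.add(target)
        elif action == "write":
            actual.files[target] = desired.files[target]


# The specification says what should be true, not how to get there.
desired = ServerState(packages={"nginx", "rsyslog"}, files={"/etc/motd": "managed\n"})
# The machine has drifted: a package is missing and a file was edited by hand.
actual = ServerState(packages={"nginx"}, files={"/etc/motd": "hand edited\n"})

first_run = plan(desired, actual)
apply(first_run, desired, actual)
second_run = plan(desired, actual)  # nothing left to do once the state matches

print(first_run)
print(second_run)
```

The second plan is empty, which is the property that makes it safe to run the same specification against a machine again and again.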
However, and in my view most importantly, the development of the DSL provided a mechanism to communicate the (expected) state to others, and store it in version control. Others are then able to submit their own changes to the server specification, and have that applied to the server. Server management is now democratised and can evolve with the input of many developers over time. Additionally, as the specification is democratised there is no longer the need for a “provider” and a “consumer” of the service – developers can simply bundle their application as part of the server configuration.
Additionally, the configuration of the server can be *reapplied*. Servers can be continuously managed, reset every day at a given time. The beginning of continuous deployment.
This is possibly the most important advance in the management of infrastructure.
Infrastructure as a service
Even with this new DSL to describe the state of machines, there is the work of assembling, shipping, racking and some initial setup of the machines. This required large investments up front, and was tricky to modify over time as each modification inevitably meant either “we need more money” or “we spent too much money” — both uncomfortable positions to be in.
In 2006 Amazon (now Amazon Web Services) released EC2. Now basically famous and responsible for a large chunk of the web services used by consumers today, EC2 represented the automation of the previously horribly expensive process of buying a physical machine. Instead, machines can be rented for short periods, do only the work required, and stop.
With the risks of purchasing machines basically eliminated, human effort became the largest investment in infrastructure management. Automation, as mentioned earlier, became crucial to reduce these costs as companies expanded into the cloud, until the de facto standard became renting such machines from one of a few providers. Additionally, the DSLs used to describe server specifications were extended to include the services used by cloud systems.
With the line between operations and developers now entirely blurred, infrastructure itself becomes programmable, including its creation, provisioning and destruction. This process can also work extremely quickly and is easy to iterate on. Developers may create 5, 10, or 100 different machines, just as they do with software builds, iterating over a design to create something that is “production worthy”.
This, at least in my experience, has spelled the end of hosted application services. Once developers can reach parity so easily and cheaply by building these systems themselves, without sacrificing the visibility and customisability of fully customised systems, hosted services lose their appeal. The ability to create a new build is simply too valuable to settle for the stock offering of another provider, which is optimised for compatibility rather than for the bespoke use cases we build for now.
Not everything is awesome
Sitewards finds itself adopting both this infrastructure DSL management and the reconciliation based approach that will be shortly discussed. However, in doing so we have faced a number of issues that are worth addressing before endorsing shifting towards a greater control of infrastructure.
While cheaper, infrastructure management is not free. In particular, it requires either developing or hiring people who have experience in both programming and system administration. Such skills are extremely expensive to develop, and systems that have been developed by one or a small number of people are useless if those people leave the company.
Accordingly, shifting to infrastructure as code means embracing the idea that at least a fair chunk of our development team has a good working knowledge of the systems in which we operate, with at least 2 specialists who understand the stack intimately.
We’re getting there, but in truth we’re not there just yet. Like all things, this takes time.
Unfortunately, infrastructure as code does not remove all the overhead of managing machines. In particular, machines tend to accumulate a kind of “crud” over time. Obviously, rigid discipline can help reduce this, but somehow it still occurs. Things such as:
- Configuration files that did not get cleaned up
- The artifacts of “emergency debugging”; 0 length files in the case of a full disk, or misconfigured dpkg in the case of failed automatic updates
- Improperly cleaned up migrations
Managing state is tricky. So, to get around this problem, we don’t. Specifically, there becomes a separation between “stateful” things and “stateless” things. Consider a web node for example; the databases are removed from the node, the files are stored in blob storage and logs are exported. Accordingly, there’s nothing special about this machine – it could be harmlessly replaced.
Immutable infrastructure takes advantage of this by replacing that node *with every specification change*. In this way it is impossible to build up cruft – the machine is refreshed periodically. This approach, documented by Kief Morris, separates the build and deploy steps of entire machines; machines are first taken as a “known good” blank slate, configured, and then stored. When required, they are deployed – application included.
This approach allows placing the entire machine under the control of a developer in a reasonably risk free way. There is little state left, and changes can be deployed in stages and tested automatically, before being deployed to the production environment.
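As a rough illustration of the build-then-deploy split, the following Python sketch models a machine image as a frozen value and a deployment as wholesale replacement of every node. The `Image`, `build` and `deploy` names, the base image name and the configuration keys are all assumptions made up for this example; real tooling builds actual disk images rather than small objects.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """A built machine image (hypothetical). Frozen: it cannot change after the build."""
    base: str
    config: tuple       # sorted (key, value) pairs
    app_version: str

    @property
    def image_id(self) -> str:
        # Identify the image by its contents, so any change yields a new id.
        raw = repr((self.base, self.config, self.app_version)).encode()
        return hashlib.sha256(raw).hexdigest()[:12]


def build(base: str, config: dict, app_version: str) -> Image:
    """Build step: start from a known-good base, configure it, store the result."""
    return Image(base, tuple(sorted(config.items())), app_version)


def deploy(fleet: list, image: Image) -> list:
    """Deploy step: replace every node with a fresh one started from `image`."""
    return [image.image_id for _ in fleet]


v1 = build("debian-9", {"workers": "4"}, "1.0.0")
fleet = deploy(["", ""], v1)          # two nodes, both started from v1

v2 = build("debian-9", {"workers": "8"}, "1.0.0")   # a specification change
fleet = deploy(fleet, v2)             # nodes are replaced, never edited in place
```

Because an image cannot be modified once built, the only way to change a running node is to build a new image and replace the node, which is what keeps cruft from accumulating.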
(Cheaper) immutable infrastructure
Immutable infrastructure is excellent and provides a wonderful model of managing software. However, building, packaging and storing machines is super expensive. Each machine may be several gigabytes, and must be run in its own allocation. Even when virtualised, these allocations are not cheap, and starting and health checking machines can take several minutes.
At this point in our history, Google had been building and deploying software to thousands of physical machines for several years. Unable to allocate a single machine to a single service and unwilling to use hardware virtualisation for each process, they contributed patches to the Linux kernel that, like the FreeBSD jail before them, allow some process isolation. Termed “cgroups”, and combined with the kernel’s “namespaces”, these fundamental building blocks were later used by Docker to build an abstraction called a “container”.
A “container” is a package that contains a specification for how a given service is supposed to run. Broadly, it consists of the following components:
- A root filesystem into which the process will be chrooted
- A promise that the application will not see any other applications, network connections, users or other machine metadata
- A promise that the application will only use a certain amount of machine resources
A container can be thought of as an extremely lightweight virtualised machine which can be built and tested with easily accessible tooling, shipped over the network, stored cheaply, and deployed and started extremely quickly. While containers do not provide a perfect level of isolation, they are largely “good enough” given some careful practices; certainly good enough given how simple immutable infrastructure is made.
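The three components listed above can be pictured as a small data structure. This is only a sketch: `ContainerSpec` and `exceeds_limit` are hypothetical names, the path is illustrative, and real container runtimes use their own, much richer formats. The namespace names are the Linux ones (`pid`, `net`, `user` and `uts` hide other processes, network connections, users and the hostname respectively).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSpec:
    """Hypothetical description of a container, mirroring the three components."""
    rootfs: str               # root filesystem the process is confined to
    namespaces: frozenset     # what the process must not see of the host
    memory_limit_bytes: int   # the resource promise, enforced by cgroups on Linux


def exceeds_limit(spec: ContainerSpec, memory_used_bytes: int) -> bool:
    """True when the process has broken its resource promise."""
    return memory_used_bytes > spec.memory_limit_bytes


web = ContainerSpec(
    rootfs="/var/lib/containers/web/rootfs",   # illustrative path only
    namespaces=frozenset({"pid", "net", "user", "uts"}),
    memory_limit_bytes=256 * 1024 * 1024,
)

print(exceeds_limit(web, 100 * 1024 * 1024))
print(exceeds_limit(web, 300 * 1024 * 1024))
```

A process that goes over its limit is typically stopped by the host, which is why a leaking application ends up being restarted rather than starving its neighbours.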
Turtles all the way down
In the case of immutable machines, there was no clear definition of “what is a machine” nor “what is immutable, what is not”. Rather, the pattern would be implemented in the primitives supplied by the cloud providers. These new immutable containers all worked on the basis of Linux machines — accordingly, they were not bound to any single cloud provider but simply to a Linux machine.
Because such Linux machines were readily available, a series of tools quickly came about that controlled the management of these containers *completely independently of any cloud provider*. For the first time, a consistent design could be implemented on Google Cloud, AWS, “Bare Metal” or another cloud provider.
Broadly I think all of these tools work the same way, but I have the most experience with the tool “Kubernetes”. All of them take N computers and treat them as one large, logical computer. Kubernetes works approximately as follows:
1. A specification is supplied to the Kubernetes API. It stores it.
2. A component called the ‘controller’ checks that specification against the current state of the infrastructure. It sees that something needs to change and marks what needs to change. Then, repeat.
3. Independently of the controller, each computer runs an agent called the “Kubelet”. The Kubelet sees that it has been assigned a container by the controller, and runs it. Then, repeat.
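The three steps above can be sketched as a toy simulation in Python. This is an approximation for illustration and not the Kubernetes API or its data model: the `Cluster` type, the function names, the node names and the `example/wiki:1.0` image are all made up, and the real components run continuously rather than being called by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for the state behind the API; not the real data model."""
    desired: dict = field(default_factory=dict)      # container name -> image (step 1)
    assignments: dict = field(default_factory=dict)  # container name -> node it should run on
    running: dict = field(default_factory=dict)      # live node name -> containers running there


def controller_step(cluster: Cluster) -> None:
    """Step 2: compare the specification with reality and mark what must change."""
    if not cluster.running:
        return  # no machines available, so nothing can be assigned
    for name in cluster.desired:
        if cluster.assignments.get(name) not in cluster.running:
            # Unassigned, or assigned to a node that no longer exists:
            # pick the live node with the fewest assignments.
            def load(node):
                return sum(1 for n in cluster.assignments.values() if n == node)
            cluster.assignments[name] = min(sorted(cluster.running), key=load)
    for name in list(cluster.assignments):
        if name not in cluster.desired:
            del cluster.assignments[name]  # removed from the specification


def kubelet_step(cluster: Cluster, node: str) -> None:
    """Step 3: the agent on one machine runs exactly what it was assigned."""
    cluster.running[node] = {n for n, where in cluster.assignments.items() if where == node}


def reconcile(cluster: Cluster) -> None:
    """One pass of both loops; the real components repeat this forever."""
    controller_step(cluster)
    for node in list(cluster.running):
        kubelet_step(cluster, node)


cluster = Cluster(
    desired={"wiki": "example/wiki:1.0"},
    running={"node-a": set(), "node-b": set()},
)
reconcile(cluster)
first_node = cluster.assignments["wiki"]

del cluster.running[first_node]   # simulate that machine disappearing
reconcile(cluster)
second_node = cluster.assignments["wiki"]

print(first_node, second_node)
```

Nothing in the specification changed between the two passes; the second pass simply notices that reality no longer matches and moves the container to a machine that still exists.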
This is interesting in a couple of ways:
Nothing is sacred
When deploying things to Kubernetes, you abdicate control to the cluster entirely. It decides where to run things and where they should be. Because of this, it can also change its mind about where things should be when things stop working as they should.
To provide an example, we use the monitoring service Prometheus to track a number of different systems. Its health is critical; it is required for all downstream checking. It might surprise you to learn then that it has, as of the time of writing, been shifted between machines 13 times as it exceeds its resource limits. Our most critical system, routinely shifted about.
Of course, we have monitoring on our monitoring. Perhaps most excitingly, *we do not notice these shifts*. In fact, I killed it in writing this post out of idle curiosity – 10 seconds later, it was back up again. And therein lies the beauty of these systems; they not only apply the specification, but actively check to see if it still applies, and restore the application if it does not. The infrastructure is self-healing.
All applications break. If for no other reason, something will eventually leak memory and fill its allocation. Deferring management to the tooling means I do not get woken up when it goes wrong.
My own, personal cloud
Not only does the system reconcile the specification, but *how* it reconciles the specification can be modified over time, and vary based on environment. The cluster can be extended to add new services such as automatic TLS certificate generation or new methods of service discovery implemented that provide visibility into how applications are communicating. These changes are completely transparent to the application; rather, the cluster understands that there is a new way these things should be implemented, and modifies the environment accordingly.
Such configuration allows a dramatic reduction in the amount of work required to introduce new systems that monitor the health of applications, harden applications, isolate network access, or perform other system administration tasks for which it is invariably difficult to get budget allocated.
Still not all awesome here either
Practically speaking, the adoption of these reconciliation-type systems has been slow here at Sitewards. In my mind, the problem is twofold:
1. They require an entirely new language to describe the application, and shift the concept of how such an application should be constructed, and
2. They require 100% buy in. It is difficult to reconcile this with tools such as Ansible; there is simply no place for Ansible in this ecosystem.
These are both extremely painful to overcome. I am personally of the opinion that it is a worthwhile change, but I have much more experience personally than we have as an organisation. We are slowly getting there, but in truth I’m not sure whether we will ever buy into this approach 100%. I have high hopes.
Packaging apps *including* infrastructure
The aforementioned reconciliation approach is certainly, in my mind, superior to the management of an expected state of machines. However, it is certainly not the most valuable aspect of this approach, and in particular Kubernetes.
Because the control plane was implemented entirely without any cloud provider’s help, we can take software from AWS and run it on Google Cloud. Or from Google Cloud and run it in our own data centre. Or from our own data centre and run it back in Google Cloud. There is even some work to shift the work *between* cloud providers based on rates.
However, the particularly exciting part is that users are collaborating to package and distribute this software, completely free of charge. To express how cool this is, I will explain how to install a functional copy of Mediawiki:
$ helm install stable/mediawiki
That’s it! For zero work (with the exception of purchasing a hosted cluster), I have deployed an application. At the time of writing, there are 113 different applications packaged this way. Further, packages can be combined – H/A Redis (a very difficult application to host), MySQL and Varnish can all be included in our custom Magento installation; a large chunk of difficult hosting work essentially for free.
As a developer, I can now package applications faster, more reliably and more interoperably than ever before.
But I’m a developer; I don’t want to deal with infrastructure
The role of a developer is a fluid one. Traditionally, it has been enough to simply develop the application and throw it “over the wall” to be run “somewhere”. This is both a good and a bad thing; it’s good in the sense that a developer’s job is simpler, and has a limited responsibility. It’s bad in the sense that *no one* picks up this responsibility, and as a result applications reach production systems and fail in a number of mysterious ways.
A general tendency to reliable abstractions
No developer can be 100% versed in the application stack. “Full stack” developers are like “Jack of all trades”; having a reasonable knowledge of all layers, but not being able to solve complex problems at any single layer.
However, I think there is a general tendency to accept the abstractions that seem to be stable. At one time, using Linux required a level of black magic known only to few, and handed down carefully from generation to generation. Now, it is stable and in such wide use I think it powers most of the internet.
AWS started out with its fair share of critiques, but is now among the most stable platforms in the world – certainly far superior to managing the technology in the “old way”.
I see the next major abstraction likely to be around Kubernetes, and that model of managing infrastructure. Of course, these things are only obvious in retrospect, but I would personally sink (and have sunk) my cash into this technology. All technology starts off flaky, becomes accepted, then boring. Then new technology built on top of it comes out.
For developers, this likely means that there will be those who specialise in the delivery of software, rather than just its day to day development. At Sitewards, we are starting to split these teams around continuous delivery – some of us help more with the delivery to testing / production, some of us use that delivery system to deliver software. Practically, this has made some of our jobs simpler; now, there is no debugging of live environments – by and large, they JustWork(tm).
Is this a reversal to Ye Olde days of “sysadmin vs dev”? I don’t think so. The difference is that each project 100% owns its infrastructure; it’s simply that some of us consult on the more bespoke aspects of pushing code to production.
So what does this mean for our day to day?
Concretely, it has become easier to reliably release software. Additionally, we are now developing more visibility into our applications in production, making it quicker and easier to debug problems. Lastly, we are able to influence the environment completely, creating optimisations in our delivery process.
Practically this means that we spend less time doing stuff that we get bored of, and more time trying to find interesting problems to solve in application development. For our clients it means a more responsive development team, able to more quickly and easily showcase the client’s wishes for feedback; personally, my goal here is new ticket to delivery in production in 4 hours (for routine work).
Each time that a shift in the software delivery paradigm has happened, there have been companies who can take advantage of it to better match the market. Even if simply to iterate through design faster, or deploy a smaller chunk of a feature to users sooner, software delivery influences a surprising amount of our day to day lives. Making this a smoother, more predictable and faster experience will positively affect us all.
I have worked now for 3 – 4 different software companies, each of which had their own policies and procedures around their infrastructure management. Given this, the above is the summary of my experience, rather than a record of any particular organisation’s development.
- History of software configuration management. (2017, October 18). Retrieved November 29, 2017, from https://en.wikipedia.org/wiki/History_of_software_configuration_management
- Slater, N. (2014, February 26). Pets vs. Cattle. Retrieved November 29, 2017, from https://www.engineyard.com/blog/pets-vs-cattle
- Chef (software). (2017, November 22). Retrieved November 29, 2017, from https://en.wikipedia.org/wiki/Chef_(software)
- CFEngine. (2017, October 26). Retrieved November 29, 2017, from https://en.wikipedia.org/wiki/CFEngine
- Announcing Amazon Elastic Compute Cloud (Amazon EC2) – beta. (2006, August 24). Retrieved November 29, 2017, from https://aws.amazon.com/de/about-aws/whats-new/2006/08/24/announcing-amazon-elastic-compute-cloud-amazon-ec2---beta/
- One third of internet users visit an Amazon Web Services cloud site daily | TheINQUIRER. (2012, April 19). Retrieved November 29, 2017, from https://www.theinquirer.net/inquirer/news/2169057/amazon-web-services-accounts-web-site-visits-daily
- Darrow, Barb. “Amazon Cloud Stands to Rake in $16 Billion This Year.” Fortune, 27 July 2017, fortune.com/2017/07/27/amazon-cloud-sales-q2/.
- Morris, Kief. “Bliki: ImmutableServer.” Martinfowler.com, 13 June 2013, martinfowler.com/bliki/ImmutableServer.html.
- Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, John Wilkes. (2016, May 01). Borg, Omega, and Kubernetes. Retrieved November 29, 2017, from https://cacm.acm.org/magazines/2016/5/201605-borg-omega-and-kubernetes/fulltext
- Docker overview. (2017, November 28). Retrieved November 29, 2017, from https://docs.docker.com/engine/docker-overview/
- Docker Containers vs Virtual Machines (VMs) | NetApp | Blog. (2017, May 25). Retrieved November 29, 2017, from https://blog.netapp.com/blogs/containers-vs-vms/
- Is it possible to escalate privileges and escaping from a Docker container? (n.d.). Retrieved November 29, 2017, from https://security.stackexchange.com/questions/152978/is-it-possible-to-escalate-privileges-and-escaping-from-a-docker-container
- Kernel.org. (2017). cgroups.txt. [online] Available at: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt [Accessed 29 Nov. 2017].
- Man7.org. (2017). namespaces(7) – Linux manual page. [online] Available at: http://man7.org/linux/man-pages/man7/namespaces.7.html [Accessed 29 Nov. 2017].
- GitHub. (2017). kubernetes/charts. [online] Available at: https://github.com/kubernetes/charts/tree/master/stable [Accessed 29 Nov. 2017].
- CNET. (2017). Amazon S3: For now at least, sometimes you have to reboot the cloud. [online] Available at: https://www.cnet.com/news/amazon-s3-for-now-at-least-sometimes-you-have-to-reboot-the-cloud/ [Accessed 29 Nov. 2017].
- P. (2017, November 06). PalmStoneGames/kube-cert-manager. Retrieved November 30, 2017, from https://github.com/PalmStoneGames/kube-cert-manager
- I. (2017, November 30). Istio/istio. Retrieved November 30, 2017, from https://github.com/istio/istio