Monday, January 05, 2015

Shifting IT to scale

While some IT groups are fine-tuning their practices, many others are still on the journey to achieve greater efficiency at scale. Not all can make this shift (typically due to size and budget), and that may be fine for them, but there are the unfortunate few who have not yet realized the need to shift their IT practices. Integration, deployment, and operations have been going through a huge transition, in large part due to business needs to better support more customers globally and to provide better availability and response times, all at low cost (how else can you maximize profits?). A simple phrase for this is “cloud enablement”. Yeah, that catch-all phrase that hopefully you’ve come to realize you need to be a part of, or risk falling behind. To achieve this, a fully automated deployment pipeline is a necessary component, which requires a few things be put in place, namely:
  • Thorough application monitoring
  • A collaborative culture
  • Developer virtual machines
  • One-click integrations
  • Continuous integration
To support many daily deployments, the development process should revolve around making many small, continuous changes while keeping risk to a minimum. To be comfortable deploying at any time, you'll need to adopt a range of tools and practices.


Making Developers Comfortable
One of the best ways to ensure a developer is comfortable making any deployment is to give each developer their own full production stack. Every developer should have their own virtual machine (there are free options such as VirtualBox, my preference, as well as KVM and Xen), configured with a configuration management tool such as Puppet, Chef, Ansible, Salt or whatever you use internally, using the same configuration as production. Automating the whole provisioning process is another necessity, since this not only optimizes productivity by minimizing creation times but also eliminates the human error that comes from typing multiple commands, which may vary by environment and platform.

On the continuous integration front, having a tool which allows developers to test their changes without having to commit to the production code repository helps keep production clean, and thus deployable, while allowing quick and reliable testing. With the rising popularity of container technologies (looking at you, Docker), one can spin up on-demand, isolated and parallelized containers to conduct separate tests. The deployment process then becomes a simple one-click process between environments. A/B or other zero-downtime production testing further adds to the comfort level, not just for developers but for the business in general.
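As a rough illustration of the container idea, here is a minimal Python sketch that runs several test suites side by side, each in its own throwaway container. It assumes the Docker CLI is installed and that a test image already exists; the image name, suite names and helper functions are all hypothetical and not taken from any particular CI tool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(image, test_cmd):
    """Build a `docker run` command that starts a throwaway container."""
    # --rm removes the container once the test command exits.
    return ["docker", "run", "--rm", image] + list(test_cmd)


def run_in_container(image, test_cmd):
    """Run one test suite in its own container and return its exit code."""
    return subprocess.run(build_command(image, test_cmd)).returncode


def run_suites(image, suites, runner=run_in_container, max_workers=4):
    """Run each suite in a separate container, in parallel.

    `suites` maps a suite name to the command to run inside the container.
    Returns a dict mapping suite name to exit code (0 means the suite passed).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(runner, image, cmd)
            for name, cmd in suites.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Calling `run_suites("myapp-test:latest", {"unit": ["pytest", "tests/unit"], "api": ["pytest", "tests/api"]})` would start one container per suite (the image name and paths are placeholders). The `runner` parameter is there so the parallel logic can be exercised without Docker.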

Just as important to continuous delivery is monitoring. KPIs (key performance indicators) should be well known and graphed (I like graphing everything that can be graphed; being a visual person, I find a graph tells the story much quicker and more effectively). Most monitoring solutions now provide anomaly pattern detection, which can be quite useful compared with eyeballing some numbers or even a graph. For logs, you’re typically better off with a hybrid approach where each application/service sends logs to two locations: its own log aggregator service, which provides short-term storage and insight into its local activities without any external dependencies; and a centralized log aggregator service, which provides longer-term storage with end-to-end insight across all services and clients.
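The two-destination logging idea can be sketched with Python's standard `logging` module. This is only an illustration under some assumptions of mine: a size-capped rotating file stands in for the local short-term store, and syslog over UDP stands in for the central aggregator. The function name, file path, host and port are hypothetical.

```python
import logging
import logging.handlers


def build_logger(service, local_path, central_host, central_port=514):
    """Return a logger that writes every record to two destinations."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"
    )

    # Local, short-term store: a size-capped rotating file, so the service
    # keeps insight into its own activity with no external dependencies.
    local = logging.handlers.RotatingFileHandler(
        local_path, maxBytes=1_000_000, backupCount=3
    )
    local.setFormatter(formatter)
    logger.addHandler(local)

    # Central, longer-term store: syslog over UDP (the handler's default
    # transport) to a central aggregator.
    central = logging.handlers.SysLogHandler(
        address=(central_host, central_port)
    )
    central.setFormatter(formatter)
    logger.addHandler(central)

    return logger
```

A service would call something like `build_logger("checkout", "/var/log/checkout.log", "logs.example.internal")` once at startup and log as usual. Because UDP is fire-and-forget, an unreachable central aggregator does not block the service, and the local file still captures the records.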


Achieving a Collaborative Culture
Any highly ambitious, or for that matter successful, endeavor requires a high degree of collaboration, ongoing collaboration to be exact. Most high performers by their very nature are social and want to talk, to share, and to collaborate. One only needs to look at the success of Twitter, Facebook and other social media platforms to see this truth. Enabling that collaboration with the appropriate solution and practices is the only seed required to make this desire grow and succeed. A highly collaborative communication style, which I like, is based on IRC, with chat rooms or channels for various specific purposes. For example, each team can have its own room/channel for private communication, another room/channel for a specific service/application, and yet another for general discussion or perhaps a “war room” (such as #warroom for outage-related conversations to coordinate an investigation and discuss countermeasures and resolution monitoring). Many such solutions are available in the market, offering a full breadth of features such as email and ticket integration, video conferencing, whiteboarding and so on.

Part of a collaborative culture also involves doing a post-mortem, lessons learned or root cause analysis following an incident. I really like the idea of making this blameless as, I think, this gets things done more effectively. Typically everyone already knows who (or which team) is at fault, and assigning blame in a public manner only serves to decrease morale and job satisfaction, and gets in the way of actually learning what happened and how. Finger pointing is never productive in my experience.


A final word on on-call
I don’t like being on call, as in I don’t like being called at 3am out of my sleep or having to sit by the phone. I’m sure no one actually does, but it is a necessary process for operations, support and developers. Being on-call not only makes you want to have things working so you don’t get called, but also ensures you stay in touch with the day-to-day issues that are faced. This is especially important when introducing new features or improving existing processes. I like going with a rotation schedule of one week every four weeks. I think this is quite typical and agreeable to most.
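For what it's worth, a one-week-in-four rotation is easy to compute. The sketch below assumes a team of four taking week-long shifts in a fixed order from a chosen start date; the function name and the names on the roster are made up for illustration.

```python
from datetime import date


def on_call(engineers, start, day):
    """Return who is on call on `day`.

    Shifts last one week each and cycle through `engineers` in order,
    beginning on `start`. With four people, each is on call one week in four.
    """
    if day < start:
        raise ValueError("day is before the rotation start")
    weeks_elapsed = (day - start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


team = ["ana", "ben", "chi", "dev"]   # hypothetical roster
rotation_start = date(2015, 1, 5)     # a Monday

print(on_call(team, rotation_start, date(2015, 1, 12)))  # second week: ben
print(on_call(team, rotation_start, date(2015, 2, 2)))   # fifth week: ana again
```

Growing or shrinking the team changes the cadence automatically, since the rotation length is just the size of the list.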