October 23, 2011

DevOps in the Cloud Explained


This post was written by @martinjlogan, based on a presentation given by George Reese (@GeorgeReese) at Camp DevOps 2011.

Time to throw some more buzzwords at you. Nothing makes people's eyes roll back more quickly than saying DevOps and Cloud in the same sentence. This post is all about Cloud and DevOps.

The theory of DevOps is predicated on the idea that all elements of a technology infrastructure can be controlled through code. Without cloud that can’t be entirely true: someone still has to rack the servers in the data center, and so on. In pure cloud operations we reach a sort of nirvana where everything relating to technology is controllable purely through code.

A key consequence of DevOps plus Cloud, following from the statements above, is that everything becomes repeatable. At some level that is the point of DevOps: you take the repeatable elements of Dev and apply them to Ops. Starting a server becomes a repeatable, testable process.

Scalability is another thing you get with DevOps + Cloud. DevOpsCloud lets you increase the server-to-admin ratio. Provisioning a server is no longer a long set of steps stored in a black binder somewhere; it is kept as actual software. This reduces errors tremendously.

DevOps + Cloud is self healing. What I mean by that is that when your entire infrastructure is governed by code, that code can become aware of anything that is going wrong within its domain. For example, you can detect VM failures and automagically bring up a replacement VM, and you know that the replacement is going to work the way you designed it. No more 3:00 AM pager alerts for many of your VM failures, because this is all handled automatically. Some organizations even go so far as to have “chaos monkeys” whose job is to wreak havoc deliberately, just to prove that the system really is self healing.
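To make the self healing idea concrete, here is a minimal sketch of a reconciliation loop. It assumes a hypothetical in-memory `FakeCloud` that stands in for a real provider API; none of these names come from the talk. On each pass, the loop discards failed VMs and launches replacements from the same image until the healthy count matches the desired count.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical stand-in for a cloud provider API (not a real SDK)."""
    vms: dict = field(default_factory=dict)  # vm_id -> "running" | "failed"
    _ids: itertools.count = field(default_factory=itertools.count)

    def launch(self, image: str) -> str:
        vm_id = f"{image}-{next(self._ids)}"
        self.vms[vm_id] = "running"
        return vm_id

    def terminate(self, vm_id: str) -> None:
        self.vms.pop(vm_id, None)


def reconcile(cloud: FakeCloud, image: str, desired: int) -> list:
    """One pass of a self healing loop: drop failed VMs, launch replacements."""
    for vm_id, state in list(cloud.vms.items()):
        if state == "failed":
            cloud.terminate(vm_id)  # throw the broken VM away instead of repairing it
    healthy = [v for v, s in cloud.vms.items() if s == "running"]
    # Replacements come from the same image, so they are built exactly as designed.
    return [cloud.launch(image) for _ in range(desired - len(healthy))]
```

A scheduler would call `reconcile` every few seconds. Because the function compares desired state with observed state rather than reacting to individual alerts, running it repeatedly is harmless.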

Continuous integration and deployment. This means that your environments are never static. There is no fixed production environment with change management processes governing how code gets into it. The walls between dev and ops fall down. When you fully go DevOps and Cloud, you run CI and deliver code continuously, not just at the application level but at the infrastructure level too.

CloudDevOpsNoSQLCoolness

DevOps needs these buzzwords because DevOps only succeeds insofar as you are able to manage things through code. Anywhere you drop back to manual human steps, you move away from the value proposition that DevOps represents. What cloud and NoSQL bring into the mix is the ability, or at least the ease, to treat every element of your infrastructure as a programmable component that can be managed through DevOps processes.

I will start with NoSQL. I am not saying that you can’t automate RDBMS systems, just that NoSQL systems are quite a bit easier to automate because of their properties. This will make more sense in a bit. Let’s talk about the CAP theorem, an overarching theorem about how a distributed system trades off consistency, availability and partition tolerance.

Consistency – how consistent is your data for all readers in a system. Can one reader see a different value than another at times – for how long?

Availability – how tolerant is the entire system of a failure of a single node

Partition tolerance – the system continues to operate despite message loss between partitions.

Now, all that was an oversimplification, but let’s just go with it for now. The CAP theorem says that you can’t have more than 2 of the aforementioned properties at once. Relational DBs are essentially focused on consistency. It is very hard [impossible] to build a relational system with nodes in Chicago and London that are both perfectly available and consistent. These distributed setups present some of the largest failure scenarios we see in traditional systems, and those failures are exactly the kind we would want a self healing system to handle automatically. Due to the properties of RDBMS systems, as we will see, this can be very difficult.

Deploying a relational DB is fairly easy and can be automated well enough. What gets hard is scaling your reads based on autoscaling metrics. Let’s say my reads are spread across my slaves and I want the number of slaves to go up and down with demand. The problem is that each new slave has to pull a ton of data before it can meet the consistency requirements of the system. That is not very efficient.
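Here is a rough illustration of the scaling policy described above. It is a hypothetical target-tracking rule: the per-replica capacity and the min/max bounds are assumptions, not figures from the talk. The rule turns observed read load into a desired slave count. A second helper shows the catch the paragraph points to: a new relational slave cannot serve reads until it has copied the dataset.

```python
import math


def desired_replicas(reads_per_sec: float, capacity_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Target-tracking rule: enough read replicas for the load, kept within bounds."""
    needed = math.ceil(reads_per_sec / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def warmup_seconds(dataset_gb: float, copy_rate_gb_per_sec: float) -> float:
    """Rough time before a new slave is useful: it must copy the whole dataset first."""
    return dataset_gb / copy_rate_gb_per_sec
```

The arithmetic is trivial. The operational problem is that `warmup_seconds` grows with the data, so the new capacity arrives long after the demand spike that triggered it.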

NoSQL systems go for something called eventual consistency. Most data that you deal with day to day can be eventually consistent. If, for example, I update my Facebook status, delete it, and then add another, it is OK if some people saw the first update and others only saw the second. Lots of data is like this, and if you can rely on eventual consistency, things become a lot easier.
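To make the status-update example concrete, here is a minimal last-writer-wins register. This is one common way, though not the only one, for eventually consistent stores to reconcile replicas. The class and its gossip-style `merge` are illustrative assumptions, not any particular NoSQL product's API, and deletes (which would need tombstones) are left out. Two replicas can briefly disagree, but once they exchange state they converge on the same value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LWWRegister:
    """Last-writer-wins register; each replica keeps (timestamp, node_id, value)."""
    node_id: str
    state: Optional[Tuple[int, str, str]] = None

    def write(self, value: str, timestamp: int) -> None:
        self._apply((timestamp, self.node_id, value))

    def _apply(self, candidate: Tuple[int, str, str]) -> None:
        # The highest (timestamp, node_id) wins; node_id breaks ties deterministically.
        if self.state is None or candidate[:2] > self.state[:2]:
            self.state = candidate

    def merge(self, other: "LWWRegister") -> None:
        """Anti-entropy step: adopt the other replica's state if it is newer."""
        if other.state is not None:
            self._apply(other.state)

    def read(self) -> Optional[str]:
        return None if self.state is None else self.state[2]
```

A reader of replica `b` may see the older status until the next `merge`. That window is exactly the inconsistency the Facebook example says is acceptable.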

NoSQL systems are designed to deal with node failures. On the SQL side, if the master fails, your application can’t do any writes until you bring up a new master or promote a slave. Many NoSQL systems have peer-to-peer relationships between their nodes. This lack of differentiation makes the system much more resistant to failure and much easier to reason about. As you can imagine, automating the recovery of these NoSQL systems is far easier than with an RDBMS. At the end of the day, tools that are easy to reason about and simple to automate are exactly the kinds of tools we should prefer in CloudDevOps environments.

Cloud

Cloud, for the purposes of this discussion, is sort of the pinnacle of SOA in that it makes everything controllable through an API. If it has no API, it is not Cloud. If you buy into this, then you agree that everything that is Cloud is ultimately programmable.

Virtualization is the foundation of Cloud, but virtualization by itself is not Cloud. It certainly enables many of the things we talk about when we talk Cloud, but it is not necessarily sufficient to be a cloud. Google App Engine is a cloud that does not incorporate virtualization. One of the reasons virtualization is great is that you can automate the procurement of new boxes.

The Cloud platform has 2 key components that turn virtualization into Cloud. One of them is locational transparency. When you go into a vSphere console, you are very aware of where each VM sits. With a cloud platform you essentially stop caring about that. Not having to care where things live means that topology changes become much easier to handle, whether you are scaling or recovering from failures. Programming languages like Erlang have made heavy use of this property for years and have proven it highly effective. Let’s talk now about configuration management.
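Here is a minimal sketch of locational transparency. It assumes a hypothetical in-process registry; real platforms use DNS, load balancers, or a service discovery system. Clients ask for a service by name and never hard-code a host, so moving or replacing an instance only touches the registry.

```python
import random
from collections import defaultdict


class Registry:
    """Hypothetical name -> endpoints registry; callers never see placement."""

    def __init__(self) -> None:
        self._endpoints = defaultdict(set)

    def register(self, service: str, endpoint: str) -> None:
        self._endpoints[service].add(endpoint)

    def deregister(self, service: str, endpoint: str) -> None:
        self._endpoints[service].discard(endpoint)

    def resolve(self, service: str) -> str:
        """Pick any live endpoint; callers only ever know the service name."""
        candidates = sorted(self._endpoints[service])
        if not candidates:
            raise LookupError(f"no live endpoints for {service!r}")
        return random.choice(candidates)
```

If a database node fails and comes back somewhere else, only `deregister`/`register` change. Client code that calls `resolve("db")` stays the same.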

Configuration management is one of the most fundamental elements enabling DevOps in the cloud. It lets you provision VMs automatically through virtualization, each with just enough OS, and then have configuration management assign each one a distinct purpose within the cloud. The CM system turns the lightly provisioned VM into the type of server it is intended to be. Orchestration is also a key part of management.
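Here is a minimal sketch of that idea. It assumes hypothetical role names, paths, and a fake in-memory "host"; this is not Chef's or Puppet's API. Each role declares the packages and files a server of that type needs. Applying a role is idempotent, so running it a second time on the same VM changes nothing.

```python
from dataclasses import dataclass, field

# Hypothetical role definitions: the desired state for each server type.
ROLES = {
    "web": {"packages": {"nginx"}, "files": {"/etc/nginx/site.conf": "listen 80;"}},
    "db": {"packages": {"postgresql"}, "files": {"/etc/pg/tuning.conf": "shared_buffers=1GB"}},
}


@dataclass
class Host:
    """Stand-in for a 'just enough OS' VM fresh from provisioning."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)


def converge(host: Host, role: str) -> list:
    """Bring the host to the role's desired state; return the changes made."""
    spec = ROLES[role]
    changes = []
    for pkg in sorted(spec["packages"] - host.packages):
        host.packages.add(pkg)
        changes.append(f"install {pkg}")
    for path, content in spec["files"].items():
        if host.files.get(path) != content:
            host.files[path] = content
            changes.append(f"write {path}")
    return changes
```

Calling `converge(Host(), "web")` reports an install and a file write. Calling it again on the same host returns an empty list, which is what makes repeated runs safe.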

Orchestration is the application-level view of the infrastructure: knowing which apps need to communicate and interact, and in what ways. Armed with this knowledge, the orchestration system can provision and manage nodes consistently with it. This part of the system also does policy enforcement to please all the governance folks. Moving to an even higher level of abstraction and opinionatedness, we will talk about Platform Clouds (PaaS) in the next section.
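Orchestration can be sketched as a dependency graph over application components. The example below assumes a hypothetical topology and one made-up policy rule. It uses Python's standard `graphlib` module (3.9+) to produce a start order in which every dependency comes up before the apps that talk to it, and it rejects plans that violate the policy.

```python
from graphlib import TopologicalSorter

# Hypothetical application-level view: component -> components it talks to.
TOPOLOGY = {
    "web": {"api"},
    "api": {"db", "queue"},
    "worker": {"queue", "db"},
    "db": set(),
    "queue": set(),
}

# Hypothetical governance rule: which components may reach the database.
DB_CLIENTS_ALLOWED = {"api", "worker"}


def provisioning_plan(topology: dict) -> list:
    """Return a start order in which dependencies always precede dependents."""
    for component, deps in topology.items():
        if "db" in deps and component not in DB_CLIENTS_ALLOWED:
            raise PermissionError(f"policy: {component} may not talk to db")
    return list(TopologicalSorter(topology).static_order())
```

`TopologicalSorter` treats each value set as a node's predecessors, so `db` and `queue` always come out before `api`, and `api` before `web`.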

Platform Clouds (PaaS)

Platform clouds are a bit different. They are not VM focused; instead they focus on providing resources and containers for automated, scalable applications. Examples of the resources a PaaS system provides include database platforms of both RDBMS and NoSQL types. The platform lets you spin up instances of these resources on demand through an API; Amazon SimpleDB is an example of this. Messaging components, which manage communication between components, are another example of the services these systems provide. These platforms, and the resources provisioned within them, are managed by the cloud orchestration and management systems. They are also highly effective at providing containers for user-created resources.

The real power here is that you can package up your application, say a WAR file or a Ruby on Rails app, and just hand it off to the cloud, which has specific containers for encapsulating your code and managing things like security, fault tolerance, and scalability. You don’t have to think about it.

One caveat to be aware of, though, is vendor lock-in. If you rely heavily on a platform's services and resources, moving off it can be very difficult and require a lot of refactoring.

Cloud and DevOps

Cloud offers some fantastic properties to go with your DevOps: the ability to apply all the advances software development has made in repeatability and testability to your infrastructure, the ability to scale up in real time as needed (autoscaling), and, among other things, the power of self healing systems. DevOps is better with Cloud.

Info

Twitter: @GeorgeReese
Email: george.reese at enstratus dot com

October 22, 2011

Overcoming Organizational Hurdles

By Seth Thomson and Chris Read @cread given at Camp DevOps 2011

This post was live blogged by @martinjlogan so expect errors.

This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.

DRW needed to adjust. The problem was that we were not exposing people to problems up front. Everyone was only exposed to their local problems and only optimized locally. We looked, and continue to look, at DevOps as our tool to change this.

Cultural lessons

[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]

The first thing you need to do if you are introducing DevOps to your org is define what DevOps is to you. Gartner has an interesting definition; I am not sure it reflects our opinions, but at least they are trying to figure it out. At DRW we use the words “agile operations” and DevOps interchangeably. We are integrating IT operations with agile and lean principles: fast iterative work, embedding people on teams, and moving people as close as possible to the value they deliver. DevOps is not a job, it is a way of working. People in embedded positions can use these practices just as easily as folks in shared teams.

The next thing you need to do is focus on the problem you are trying to solve. This sounds obvious but is not all that simple. Here is an example. Last year our high frequency trading folks complained that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading “The Goal”, a book I highly recommend; it is a really good read. In it the author talks about the theory of constraints and applying lean principles to repeatable processes. We applied a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck because I had to approve all server purchases. It turned out that approval took me only 2 hours. The real problem lay elsewhere. Value stream mapping let us see where our real bottlenecks were, so we could focus on them and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 to 12 days.
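At its core, value stream mapping records, for each step of a process, the time spent actually working versus the time spent waiting, and then looks at where the total goes. Here is a minimal sketch of that bookkeeping. The `Step` fields are assumptions for illustration, and no DRW figures are encoded here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float  # time someone is actively working on the request
    wait_hours: float  # time the request sits in a queue before this step


def analyze(steps: list) -> dict:
    """Summarize a value stream: lead time, flow efficiency, and the bottleneck."""
    lead = sum(s.work_hours + s.wait_hours for s in steps)
    work = sum(s.work_hours for s in steps)
    bottleneck = max(steps, key=lambda s: s.work_hours + s.wait_hours)
    return {
        "lead_time_days": lead / 24,
        "flow_efficiency": work / lead,
        "bottleneck": bottleneck.name,
    }
```

Separating work from wait is what exposes a case like the approval step: an approval that takes 2 hours of actual work is not the bottleneck when other steps spend days in queues.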

The third cultural lesson, and an important one, is to keep your specialists. One of the worst things that can happen is that you introduce a lot of generalist operators, and then the network team, for example, says “wow, you totally devalued me” and quits. That way you lose a lot of expertise that turns out to be quite useful. Keep your specialists in the center. You want to surface the tough problems to the specialists and leverage them to solve those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We set out to distribute Unix system management to reduce the workload of the Unix team itself, which brought people all across the org a bit closer to what was going on in that domain. What actually happened is that the Unix team was hit harder than ever: as we got people closer to the problem, demand that we had not previously seen or been able to notice increased quite a bit. This is a good problem to have, because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.

If you look at a traditional org, these specialist teams often spend time justifying their own existence. They invent their own projects and do things no one needs. These days at DRW we have long shopping lists of deep Unix work that we actually need. The Unix specialists are now constantly working on key, useful features. We are always looking for more expert Unix admins.

The last lesson, a painful one, is that “people have to buy in”. The CIO can’t just walk in and say you have to start doing DevOps. You can’t force it. We made a mistake recently, learned from it, and turned it into a success. A few months ago we were looking at source control usage; among other things, the infrastructure teams were not using it enough for my taste. I said we need to get these guys pairing with a software engineer, and I forced it. It went like this: the person leading the pairing was not teaching their partner but was focused on solving the problem of the moment, and the person being paired with had not bought in to pairing in the first place. People resented the whole arrangement.

We took a hard retrospective look at this and, practicing iterative agile management, changed course. I worked with Dan North, who came from a software engineering background and also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people, and that was a huge help. Dan sat with folks on the networking team and got buy-in from them; he got them invested in the changes we wanted to make. The head of the networking team is now learning Python and using version control, and the network team is standing up self service applications that add huge value for the rest of the organization and make us much more efficient.

Some lessons learned from the technology

Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons, or at least lessons stemming from technical issues. What follows are a few examples that reinforced some of the cultural things we have done. The first is the story of the lost packet. This happened within the first month or 2 after I joined. We had an exchange sending out market data, through a few hops, to a server that every now and again lost market data. We knew this because we could see gaps in the sequence numbers.
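Gaps like these are easy to detect mechanically. Here is a minimal sketch. It assumes each market data message carries a monotonically increasing integer sequence number; real exchange feed formats vary. The function reports the missing ranges so that a feed handler could raise an alert or request a retransmission.

```python
from typing import Iterable, List, Tuple


def find_gaps(seq_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) ranges in a stream of sequence numbers."""
    gaps = []
    expected = None
    for seq in seq_numbers:
        if expected is not None and seq > expected:
            gaps.append((expected, seq - 1))
        # Duplicates and late, out-of-order messages never move `expected`
        # backwards, and they do not retract a gap that was already reported.
        if expected is None or seq >= expected:
            expected = seq + 1
    return gaps
```

For example, `find_gaps([1, 2, 3, 6, 7, 9])` reports `[(4, 5), (8, 8)]`. Detecting the gap is the easy part; as the story shows, finding where the packets went is not.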

The first thing we did was check whether the exchange was actually mis-sequencing the data. Nope, that was not the problem. So the dev team went down to check the server itself. The Unix team looked at the machine, the IP stack, the interfaces, etc., and declared the machine fine. Next the network guys jumped in and saw that everything was fine there. The server, however, was still missing data. So we looked at the routers. Guess what: everything looked fine. This is where I [Chris Read] got involved. This is what you call the call center conundrum: people focus on small parts of the infrastructure, and with the knowledge they have, things look fine. Luckily, in previous lives I have been a network admin and a Unix admin. I dug in and saw that the whole network up to the machine was built with high availability pairs. I went through these pairs one by one. The first ones looked good; I kept going and finally got down to one little pair at the bottom where one of the machines had a different config. A single-line problem, and fixing it solved the issue. It was only through having a holistic view of the system, and the trust of the org to get onto all of these machines, that I was able to find the problem.
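The one-line difference was found by hand, but comparing the members of an HA pair is easy to automate. Here is a minimal sketch using Python's standard `difflib`. It assumes you can already pull each device's running config as text; how you fetch it depends on the vendor and is not shown.

```python
import difflib


def pair_drift(name_a: str, config_a: str, name_b: str, config_b: str) -> list:
    """Return unified-diff lines showing where two HA peers' configs differ."""
    return list(difflib.unified_diff(
        config_a.splitlines(), config_b.splitlines(),
        fromfile=name_a, tofile=name_b, lineterm="",
    ))
```

An empty result means the peers match. Running this against every pair on a schedule would turn a single divergent line into an alert, instead of a search across several teams.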

The next story is called “monitoring giants”. This also happened quite early in my time at DRW, and it taught me a very interesting lesson. I had been in London for 6 weeks, and lots of folks were talking about monitoring: we needed more of it. I set up a basic Zenoss install and other such things. I came to Chicago with the goal of showing the folks here how I had done monitoring, meaning to inspire them. I showed them and was met with a fairly negative response. The guys perceived my work as a challenge to their domain, and my whole point in putting it together was lost. I learned the lesson of working with folks early on and being careful about how you present things. It was also a lesson about change. Only in the last couple of months have I learned how difficult change can be for a lot of people, and you have to take this into account when pushing change. Another part of this lesson is that you need to make your intentions obvious: over-communicate.

We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thompson.

The next lesson is about DNS. This one was quite surprising to me; it is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps, DNS requests per second ramped up sharply, but we were not monitoring it. All of a sudden people started noticing latency and asking “hey, why is the Internet slow?”. The network people looked at all kinds of things, and then the problem seemed to solve itself, so we let it go. Then, a few weeks later: outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers had written a monitoring script that did a DNS lookup in a tight loop. We have since added all this to our monitoring suite. Because the Windows team had been taught about network monitoring and log file analysis, because they had been exposed, they were able to catch and fix this problem themselves.
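Here is a minimal sketch of a check that would have flagged the runaway script. It assumes a hypothetical query log with one `<unix_timestamp> <client_ip> <name>` entry per line (real DNS server log formats differ) and an arbitrary alert threshold. It computes each client's peak queries per second and reports the clients above the threshold.

```python
from collections import Counter
from typing import Dict, Iterable


def noisy_clients(log_lines: Iterable[str], threshold_qps: int) -> Dict[str, int]:
    """Map each client whose peak queries-per-second exceeds the threshold to that peak."""
    per_second = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        ts, client = int(float(parts[0])), parts[1]
        per_second[(client, ts)] += 1
    peaks: Dict[str, int] = {}
    for (client, _), count in per_second.items():
        peaks[client] = max(peaks.get(client, 0), count)
    return {c: p for c, p in peaks.items() if p > threshold_qps}
```

Bucketing by whole seconds keeps the check cheap. A host doing a lookup in a tight loop stands out immediately against normal clients that issue a handful of queries per second.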

Quick summary of the lessons

Communication is very key. You must spend time with the people you are asking to change the way they are working.

Get buy-in, don’t push. As soon as you push something onto someone, they are going to push back. Something will break, someone will get hurt. You need to develop a pull: people must pull change from you; they must want it.

Keep iterating. Keep getting better and make room for failure. If people are afraid of mistakes, they won’t iterate.

Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.

Question: Can you talk a little bit more about buy-in?
Answer: One of the most important things about getting buy-in is to prove your changes out for people. Try things on a smaller scale (prototypes of a process or a technology), get a success, and hold it up as an example of why it should be scaled out further.

October 22, 2011

Groupon: Clean and Simple DevOps with Roller

By Zack Steinkamp from Groupon @thenobot given at Camp DevOps 2011

This was live blogged by @martinjlogan so please forgive any errors and typos.

The way we do things in production is not always the right way to do things. Coming to a conference like Camp DevOps and listening to folks like Jez Humble is kind of like going to church and re-upping your faith in what’s right!

Handcrafted; great for a lot of things. Furniture, clothes, and shoes. The imperfections give a thing character. Handcrafted however has no place in the datacenter. Services are like appliances. Imagine that you run a laundromat. Would you rather have a dozen different machines that all need to be repaired in different ways by different people or would you rather have one industrial strength uniform design for each unit?

In Groupon’s infancy, to get started quickly, we outsourced all operations. We have since reached a scale where the expertise of those we outsourced to is not sufficient for our current needs, so we have brought operations in house. Given this, we needed a way to manage our infrastructure efficiently and with minimal errors under constant change.

In Sept 2010 we had about 100 servers in one datacenter. Many of them were handcrafted. That was ok, though, because someone else worked on them. Today we have over 1000 servers in 6 locations. As the service has grown, we have felt the pain of a shaky foundation under our platform. That is the driver behind this project, Roller. Roller really embodies the DevOps mindset.

The DevOps mindset is typified by folks who love developing software and are also interested in Linux kernels and such, and vice versa. I am one such person. At Groupon I do work for many different areas. I started my career at Yahoo in 1999, co-founded a company called Dippidy, and then left to work for Symantec. Each time I have worn a different hat. Enough about me, though; let’s dig into Roller.

I won’t be giving a philosophical talk but will instead get into the nuts and bolts of Roller [I will summarize this in this live blog - see the slides for more details]. If you have any preconceived notions about how host configuration and management should be done, please try to forget them, as this project is quite different. The project is on track to be open sourced by Groupon sometime in the first half of next year.

So, what does Roller do? It installs software. Really, what is a server? It is a computer with some software on it, and Roller installs that software. It lets you version your servers in a super clean way and allows for perfect consistency across your data center. It handles everything from basic system utilities like strace to application deployment; they are all the same, just files on a disk.

You are probably asking yourself why this guy is up here reinventing the wheel. We already have Chef and Puppet, so why bother? Well, we wanted this to be very lightweight. Some existing solutions require message queues, relational DBs, and strange languages that are not already on the system. We also needed to deal with platform specific differences; we have 4 different varieties of Linux. The big thing, though, is that we wanted a system that was dead simple and auditable. A lot of the systems out there give you tons of power: inheritance hierarchies like webserver -> apache webserver -> some config of that server, etc. That looks great from a programmer's-brain perspective, but in production this complexity can cause unwanted side effects and problems. We wanted to build a system where every change is gated on a source code repo commit, so that any change to the system goes through git or some other VCS.

There are 4 parts to Roller.

1. The configuration repository.
2. Config server
3. Packages
4. “roll” program.

The configuration repository is a collection of YAML files. The config server sits in each data center. It is a web server with no database: a Ruby on Rails app that provides views of the data stored in the config repository. Packages are precompiled chunks of software, for instance an Apache package or some other application. A package is just a tarball. Packages are stored on and distributed from the config server. Config servers in different datacenters use S3 to distribute packages, so we put a package on one config server and it is worldwide in about a minute. Finally, we have “roll”. This is what we execute on a host, perhaps a blank machine, to turn it into a specific appliance.

Configuration Repository

This contains simple files with information about datacenters. It also contains host classes, which are basically configurations of particular host types. Host classes are just like defining a class in Ruby or some other language that supports classes. The config repo is basically a tree of fixed depth, 2.5 levels and no deeper. The leaf nodes are the host files, contained in the host directory within the config repo; they define configuration at the host level. A host file names the host class and version for that particular host; the host class itself does not contain a version.
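Here is a minimal sketch of how the two levels might combine, with the YAML already parsed into Python dicts. The field names (`hostclass`, `version`, `packages`, `settings`), the package names, and the precedence rule are assumptions for illustration, not Roller's actual schema. Because the tree has a fixed depth, resolving a host takes one lookup and one merge, not a walk up an inheritance chain.

```python
from typing import Dict, Tuple

# Hypothetical parsed host classes, keyed by (name, version). The hostclass
# file itself carries no version; the key stands in for however a version is
# mapped to a revision of that file.
HOSTCLASSES: Dict[Tuple[str, str], dict] = {
    ("web", "12"): {"packages": ["apache-2.2.21", "strace-4.5"], "settings": {"port": 80}},
}

# Hypothetical parsed host file: names a hostclass and version, plus overrides.
HOSTS = {
    "web01.dc1": {"hostclass": "web", "version": "12", "settings": {"port": 8080}},
}


def resolve(hostname: str) -> dict:
    """One lookup plus one shallow merge: host settings override hostclass settings."""
    host = HOSTS[hostname]
    klass = HOSTCLASSES[(host["hostclass"], host["version"])]
    return {
        "packages": list(klass["packages"]),
        "settings": {**klass["settings"], **host.get("settings", {})},
    }
```

The shallow, one-level merge is a deliberate choice in this sketch. It keeps the answer to "why does this host have this setting?" to exactly two files, which is the auditability argument made above.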

Config Server

We have spoken a lot about these YAML files; they are the whole world for Roller. To make use of them we need the config server, a Rails app that gives us views of the config repo data. We can see the YAML config, see which hosts are using a particular bit of config, and diff configs to see what changed. A nice thing about this is that you can just run curl commands to investigate the system.

I can also use curl to investigate host classes. The config server just pulls these things out of a git repo and sends them back. This creates a nice HTTP bridge into our running system, which has a lot of value, as we will see with roll.
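The config server itself is a Rails app, but its core idea can be sketched with Python's standard library: a read-only HTTP view over files checked out of a git repository. This is an analogy, not Groupon's code, and the directory layout and file names are hypothetical. `SimpleHTTPRequestHandler` only answers GET and HEAD, which keeps the view read-only.

```python
import functools
import http.server
import threading


def serve_config_repo(repo_dir: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve a checked-out config repo read-only over HTTP from a background thread."""
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=repo_dir)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With the server running, something like `curl http://127.0.0.1:<port>/hosts/web01.yml` returns the raw YAML. That is the same kind of fetch the talk describes doing with curl against the real config server.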

Roll

Groupon Roller’s roll program executes on a host. It runs HTTP fetches, just like you would with curl, to get the host and host class YAML files from the config server. It then downloads any packages it does not already have and prepares a new /usr/local directory candidate. It generates configs, stops services, moves the new /usr/local into place, and then starts the services. Basically, each time it nukes /usr/local and starts from a fresh base state; Roller essentially owns /usr/local. This is kind of a nuclear solution. We are not quite re-imaging the whole host, but it is still fairly brutal.
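Here is a sketch of the staging and swap step. It assumes hypothetical `stop_services`/`start_services` callables and uses a plain directory to stand in for /usr/local. The fetching, package download, and config generation steps are omitted, and the `files` dict stands in for their output. Because the new tree is fully built beside the live one, services only need to be down for the renames.

```python
import os
import shutil
import tempfile
from typing import Callable, Dict


def roll(target: str, files: Dict[str, str],
         stop_services: Callable[[], None],
         start_services: Callable[[], None]) -> None:
    """Build a fresh candidate tree next to `target`, then swap it into place."""
    parent = os.path.dirname(os.path.abspath(target))
    # os.rename only works within one filesystem, so stage next to the target.
    candidate = tempfile.mkdtemp(prefix=".candidate-", dir=parent)
    # Stand-in for unpacking package tarballs and generating configs.
    for rel_path, content in files.items():
        path = os.path.join(candidate, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fh:
            fh.write(content)

    old = target + ".old"
    shutil.rmtree(old, ignore_errors=True)  # clear leftovers from an earlier run
    stop_services()
    try:
        if os.path.exists(target):
            os.rename(target, old)  # keep the previous tree until the swap succeeds
        os.rename(candidate, target)
    except OSError:
        if os.path.exists(old) and not os.path.exists(target):
            os.rename(old, target)  # roll back to the previous tree
        raise
    finally:
        start_services()
    # Discard the old tree: nothing from the previous state carries over.
    shutil.rmtree(old, ignore_errors=True)
```

Files from the previous run that are not in the new candidate simply disappear. That is the "nuke and start from a new base state" behavior described above.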

This whole cycle typically takes 10 to 30 seconds, and the actual services are normally down for just a few seconds.

Foreman

This is a Roller package, and it is in every hostclass. When installed on a host, it adds a cron entry, and every x minutes it trues up the host's users with the config repository via the config server. This is how we do basic user management. You can also use it to manage your own profile and user directory, to get your .profile, emacs config, or whatever you want onto all the hosts you have access to.
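The true-up step is essentially a set reconciliation. Here is a minimal sketch. It assumes the desired users come from the config repository and the current users from the host, both passed in as plain data mapping usernames to SSH public keys (a hypothetical schema). The commands that actually create, remove, or update accounts are left out.

```python
from typing import Dict, Set


def true_up(desired: Dict[str, str], current: Dict[str, str]) -> Dict[str, Set[str]]:
    """Plan user changes; both dicts map username -> SSH public key.

    `current` should list only the accounts this tool manages, never system users.
    """
    want, have = set(desired), set(current)
    return {
        "add": want - have,
        "remove": have - want,
        "update_key": {u for u in want & have if desired[u] != current[u]},
    }
```

Because the plan is computed from scratch on every cron run, a host that missed a run, or was edited by hand, converges again on the next one.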

Wrapping it up

zsteinkamp@groupon.com
on twitter at @thenobot
steinkamp.us/campdevops_notes.pdf is where you can get notes on the presentation.

September 20, 2011

Running Heroku on Heroku

This is a live summary taken from this talk given at StrangeLoop.

Today Noah Zoschke (@nzoschke) will cover running Heroku on Heroku. For those who are not familiar, Heroku is a cloud application platform as a service. It used to be a Ruby application platform, but it has now been opened up to many other languages. Heroku was all about getting rid of the need for servers, or at least the need for you to maintain servers. This talk is about bootstrapping and self hosting, and all the benefits for the dev and operations cycles that come along with them, not to mention the benefits for your business.

The word bootstrapping has come to mean a self-sustaining process that proceeds without external help. The term has many applications: socioeconomics, business, statistics, linguistics (how a small child goes from no spoken ability to having it), biology (we all start as just a few cells, and our cells then figure things out), and of course computers (booting up is bootstrapping). We have a computer that is off, and we need to get the system from that off state into a fully running state that is viable for work. Bootstrapping also has a very specific meaning for compilers: if a compiler is written in the language that it itself compiles, then it is bootstrapped. For illustration, we will talk about the compiler example for a minute before getting into what this could mean for services.

Self building, or bootstrapping, is something that almost all languages and compilers strive for. Bootstrapping is an excellent test for any compiler. It lets you work on your compiler in a higher-level language, and it provides a really great consistency check of the compiler itself. A compiler that can compile itself is also good because it reduces the overall footprint of the tools needed to work on it. There is of course a chicken-and-egg problem, and there are a number of strategies for handling it:

Build a compiler/interpreter for X in language Y
Use an earlier version of the compiler
Hand compile

Let’s change terminology quickly. A “self hosting” program is one that produces new versions of itself. This applies to compilers, as we illustrated, but it applies equally well to kernels, programming languages, and revision control systems (git, for example, is maintained in git). There are more, such as text editors: vim is developed with vim, and so on. So, the question is

“Is this an applicable metaphor for services and the cloud?”

In services and the cloud we see the same properties and benefits as with compilers. At a simple level, Heroku hosts www.heroku.com on Heroku. That is not very surprising; it would probably be more surprising if you found out Heroku was run on Slicehost or something like that (it is not). There are a number of motivations, though, for taking self hosting further than this: dogfooding, efficiency, and separation of concerns. Heroku used to be one large Ruby app, and any time a developer screwed something up, he could crash the whole system. All kinds of hoops were jumped through to prevent this. The ultimate solution ended up being self hosting. Features used to be added to that large Ruby app; now most features, like Heroku cron, are built as applications that run on Heroku itself rather than in its codebase, where they could cause problems.

Now let's take this even further, to something more heroic. Heroku has a whole separate database cloud service. It is large and a fairly big deal, and the whole thing runs on Heroku itself. Can we keep going and take this even a step further?

Heroku Cloud Architecture


The question is, what else can we self host? Take the compile part of the architecture and run it on Heroku dynos, so that new Heroku dynos are compiled by a compile application running on top of Heroku itself. We do not want a platform that is just for Sinatra apps or Rails apps, etc.; we want a generic computing platform. Running Heroku's own applications on Heroku helps us prove that we have one, or move there if we do not already.

Other motivations are effortless scaling, a decreased surface area for the architecture, and build/compile symmetry. We want our build servers to look just like our runtime servers; the motivation here is obvious, and running compile on Heroku itself really gets us there. The other, and most important, motivation is to be able to focus on these secure ephemeral containers, the dynos, and make them as secure and well factored as possible. If our business depends on these containers from top to bottom, we will be forced to make them as sound as possible.

Martin Logan (@martinjlogan). Also, if this kind of cloudy stuff floats your boat, you should check out the Camp DevOps Conference in Chicago this October.

September 19, 2011

Glu-ing the Last Mile by Ken Sipe.

This post was blogged in real time by @martinjlogan at Strange Loop 2011. Please forgive any errors.

I [Ken Sipe] spent the last year focused on continuous delivery, which is why I am so interested in this product. We will start this talk off with a commercial. You have of course heard of Puppet, and you might have heard of Chef. Now we have glu. I would actually have liked to call this talk Huffing Glu. So where does glu fit in? We need to start with the Agile Manifesto, particularly the principle that our highest priority is to satisfy the customer through early and continuous delivery. We need to not only develop good software but also be able to deploy valuable software.

How long does it take you to get one line of code into production? If you had something significant to push into prod, how long would it take you to push that code into production? What does your production night look like? Are you ordering pizza for everyone to handle the midnight-to-3am call? Why do we do this? Because we have not automated. We are engineers and we automate things, but we have not even automated our own backyard. Even with simple rules, though, things can be complex. “Just push this single war out to production.” Well, even really simple things can get really complex in the real world. Anyone who can think can learn to move a pawn, but being a great chess player requires navigating a complex world.

When you look at most companies, there are lots of scripts and people running procedures. When you look at LinkedIn, they deploy to thousands of servers every day. Glu is model based, and over the last few years I have become totally sold on starting from models. Gradle is model based and Puppet is model based; Chef is not. Puppet seems to be loved by Ops and Chef by developers. I am definitely on the dev side, and I really love glu. Glu is fairly new; it came out in 2009. Outbrain also uses glu and, unlike LinkedIn, which always keeps a human step in its otherwise highly automated deployments, pushes code into production in a completely automated fashion.

statistics on the current usage of the glu project

Before glu we had manual deployments. In a past life I used to automate production plants, and the workers felt I was taking their jobs away. I was like, I don’t know, I am young and just doing my job; I am sure there is something else for you to do, right? The interesting thing is that Ops people often feel the same way about DevOps, but there is definitely a lot more for ops folks to do than running tedious processes at 3am.

Glu – Big Picture. Glu starts with a declarative model and computes the actions to be taken. Glu has three major components: agents, the orchestration engine, and ZooKeeper. ZooKeeper is not built by the glu project. All the glu components can be used separately, but in this presentation we will focus on using them all together. There are three concepts to focus on: the static model, scripts, and the live model generated as a combination of the previous two.

the model for glu deployment

ZooKeeper is a distributed coordination service for distributed applications. It is used in glu to maintain the state of the system. Each node in your system needs at least one agent. Putting more than one agent on a node is possible but does not make much sense. The idea is that each node is managed by an agent, and that agent is unique within a given fabric. Clearly deployment tools have to be written in a dynamic language ;-) We use Groovy with glu. Agents, at the end of the day, are glu script engines. There is a Groovy API, a command line, and a REST API for dealing with glu agents, so you have your pick. The heart of glu, though, is really the orchestration engine itself.

Using its orchestration tracker, the orchestration engine listens to events coming out of ZooKeeper. These events, represented in JSON, reflect the current state of the system; together they make up the live model.
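To make the idea of a live model built from events concrete, here is a minimal sketch that folds a stream of JSON events into a dictionary keyed by agent and mount point. The event shape (the `agent`, `mountPoint`, and `state` fields) and the function name `build_live_model` are hypothetical and chosen for illustration; glu's real ZooKeeper payloads are not reproduced here.

```python
import json

# Hypothetical live model: (agent, mount point) -> remaining properties.
LiveModel = dict[tuple[str, str], dict[str, str]]


def build_live_model(raw_events: list[str]) -> LiveModel:
    """Fold JSON events (assumed shape, not glu's real one) into a live model."""
    live: LiveModel = {}
    for raw in raw_events:
        event = json.loads(raw)
        key = (event["agent"], event["mountPoint"])
        # A later event for the same entry replaces the earlier one, so the
        # model always holds the most recent state each agent reported.
        live[key] = {k: v for k, v in event.items() if k not in ("agent", "mountPoint")}
    return live
```

Replaying events in order and keeping only the latest one per entry is the simplest way to reconstruct "current state"; a real system would also need to handle entries that disappear.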

The static model describes where to deploy something and how. The delta service in the orchestration engine compares the static model against the live model and computes the delta between them. That delta then becomes visible to the operator through the orchestration visualizer. Green in this visualization means that you have established exactly the situation you asked for in your static model.
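The delta computation described above can be sketched as a comparison of two dictionaries. This is an illustration of the idea, not glu's delta service: the model shape (entries keyed by a hypothetical `(agent, mount_point)` pair) and the names `Delta` and `compute_delta` are assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical model shape, NOT glu's schema: (agent, mount point) -> properties.
Key = tuple[str, str]
Model = dict[Key, dict[str, str]]


@dataclass
class Delta:
    missing: list[Key]     # in the static model but not live: needs deploying
    unexpected: list[Key]  # live but not in the static model: needs undeploying
    mismatched: list[Key]  # in both, but properties differ: needs redeploying
    ok: list[Key]          # identical in both: shown as "green"

    def is_green(self) -> bool:
        return not (self.missing or self.unexpected or self.mismatched)


def compute_delta(static: Model, live: Model) -> Delta:
    """Compare what we asked for (static) with what is running (live)."""
    missing = sorted(k for k in static if k not in live)
    unexpected = sorted(k for k in live if k not in static)
    common = sorted(k for k in static if k in live)
    mismatched = [k for k in common if static[k] != live[k]]
    ok = [k for k in common if static[k] == live[k]]
    return Delta(missing, unexpected, mismatched, ok)
```

Because the comparison sees every entry for every agent at once, it can report deltas across nodes, which is the server-side property the talk contrasts with Puppet.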

the glu dashboard

Along with the delta, a deployment plan gets created. How do we fix the red in the visualization? How do we get to the state we described in our static model? A deployment plan is created; usually there is a serial plan and a parallel plan. Each has its advantages and disadvantages: the parallel plan is faster, but it can potentially sacrifice consistency.
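The serial versus parallel trade-off can be sketched by treating a plan as an ordered list of phases, where steps inside one phase may run concurrently. The `Step` tuple, `build_plan`, and `run_plan` are hypothetical names for this illustration; they do not mirror glu's actual plan API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical step: (agent, action), e.g. ("agent-1", "redeploy /search").
Step = tuple[str, str]
# A plan is an ordered list of phases; steps within a phase may run together.
Plan = list[list[Step]]


def build_plan(steps: list[Step], parallel: bool) -> Plan:
    if not steps:
        return []
    if parallel:
        # One phase: every node changes at once (fast, but the fleet is
        # briefly in a mix of old and new states).
        return [list(steps)]
    # One step per phase: one node at a time (slow, but consistent).
    return [[step] for step in steps]


def run_plan(plan: Plan, execute: Callable[[Step], str]) -> list[str]:
    results: list[str] = []
    for phase in plan:
        # Phases run in order; steps in the same phase run concurrently.
        with ThreadPoolExecutor(max_workers=len(phase)) as pool:
            results.extend(pool.map(execute, phase))
    return results
```

Both plans perform the same steps; they differ only in how many phases there are, which is exactly where the speed versus consistency trade-off comes from.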

Glu scripts provide the instructions. There are six states: install, configure, start, stop, unconfigure, and uninstall. These are mapped out in a Groovy script, where each of them has a closure associated with it. Glu scripts come with a bunch of nice variables and services already defined for you: a log, the init parameters, and full access to the shell and to system environment variables. Again, it is the glu agent that handles and manages these scripts; it is basically a compute server for Groovy scripts.
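One way to picture a glu script is as a small state machine in which each of the six names above is an action that moves the script between states, and the agent runs whichever actions are needed to reach a target state. The sketch below assumes the state names `NONE`, `installed`, `stopped`, and `running`; those names, `TRANSITIONS`, `actions_to_reach`, and the `WebAppScript` class are illustrative assumptions written in Python, not glu's Groovy API.

```python
from collections import deque

# Assumed lifecycle: each action moves the script from one state to another.
TRANSITIONS: dict[str, dict[str, str]] = {
    "NONE": {"install": "installed"},
    "installed": {"configure": "stopped", "uninstall": "NONE"},
    "stopped": {"start": "running", "unconfigure": "installed"},
    "running": {"stop": "stopped"},
}


def actions_to_reach(current: str, target: str) -> list[str]:
    """Breadth-first search for the shortest sequence of lifecycle actions."""
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for action, nxt in TRANSITIONS[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    raise ValueError(f"cannot reach {target!r} from {current!r}")


class WebAppScript:
    """Hypothetical script: one method per action, like the closures in a glu script."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def install(self) -> None:
        self.log.append("install")  # e.g. fetch and unpack the artifact

    def configure(self) -> None:
        self.log.append("configure")  # e.g. write config from init parameters

    def start(self) -> None:
        self.log.append("start")  # e.g. launch the process

    def stop(self) -> None:
        self.log.append("stop")  # e.g. shut the process down

    def unconfigure(self) -> None:
        self.log.append("unconfigure")  # e.g. remove generated config

    def uninstall(self) -> None:
        self.log.append("uninstall")  # e.g. delete the unpacked artifact

    def run(self, current: str, target: str) -> str:
        # Invoke each required action in order, as an agent would.
        for action in actions_to_reach(current, target):
            getattr(self, action)()
        return target
```

For example, `WebAppScript().run("NONE", "running")` calls install, configure, and start in that order.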

useful things present by default in glu scripts

In order to test glu scripts we use the gluscriptbase test. Tests are nice and easy to run from within any build system like Gradle (or Maven if you feel the need for pain).

Glu is very focused on security. You can hook it into LDAP, and everything is recorded in an audit log.

How does glu compare with Puppet? Both are model based as well as somewhat declarative; those are the similarities. Puppet is Ruby and glu is Groovy. The big difference, though, is that in glu delta computations are handled on the server side, so you can see deltas across nodes. In the Puppet world, deltas are computed at the agent/node level; in glu it is the orchestration engine and ZooKeeper that keep track of all of this. There are advantages and disadvantages to each approach. Puppet also has better infrastructure support. If you really want to, you can even run Puppet from glu, though to me that is nuts.

Finding glu can be a bit hard. Google seems to find it now in many cases. The easiest thing to do is go to Github and search there. This will probably change over the short term though as glu becomes more popular. Here is the Github url: https://github.com/linkedin/glu

Also, take a look at the upcoming Camp DevOps Conference, it’s gonna be totally sweet!

August 29, 2011

Announcement: Camp DevOps 2011 Conference in Chicago

Camp DevOps Conference Chicago Logo

Camp DevOps (campdevops.com) is a hands-on, technology-focused DevOps conference taking place in Chicago on the 22nd and 23rd of October. DevOps owes a lot to the Agile movement; no one would deny it. Agile, however, is less about technology and much more about process. As a result, there is a lot out there on the process and cultural underpinnings of DevOps but markedly less about the technology that is indisputably a large part of what DevOps is. Camp DevOps focuses much of its attention on that technology, but it does not ignore process either. The conference is aimed at everyone, and the discounted early bird ticket costs $100 per head. The text below is from campdevops.com:

Get fully up to speed on all things DevOps tech. Take a look at our speakers list if you want to see who is going to be teaching you. Hosted at ITA in downtown Chicago, the Camp DevOps Conference is hands on! Our sessions are all 2+ hours long to get you deep into each topic. We are a technically focused event centered on leaving attendees with the knowledge they need to pioneer DevOps in their own organizations. This conference is aimed at managers, ops personnel and developers looking to take building and operating software to a more efficient place.

July 9, 2011

Deployment Automation with glu

This video features Yan Pujante speaking at Chicago DevOps about glu, the deployment automation tool he invented, subsequently open sourced, and now manages. It is garnering significant interest within the DevOps mind space. The talk was quite good, got some great reviews from those who attended the meeting, and features a live demo that showcases a lot of the power glu brings to the table. Read below for more on glu itself.

One thing we would like to mention is that this video was made possible by Carl Karsten, a professional who does a ton of work for the Python community and other technical organizations. If you really liked the video and want to support its filming, and the filming of other DevOps-related talks, you may donate at the pledgie page for the video, though you are certainly under no obligation to.

glu is an open source deployment automation platform. glu was originally created at LinkedIn and has been successfully used for orchestrating the deployment and management of the complex LinkedIn infrastructure for over a year. Since its open source release, glu has been gaining a lot of traction in the devops community. In this tech talk, you will learn from the author of glu about the novel approach glu takes to solve the deployment problem (state delta computation, ZooKeeper, REST, etc…). You will also understand why glu is not just a tool but an actual platform on top of which you can customize and/or build your own deployment infrastructure. The talk will also feature a live demo of glu!

Quick update 7/11/2011 – slides available here: http://slidesha.re/nTf0Zo

June 7, 2011

Orbitz IDEAS Video: Teyo Tyree on Model Driven Management with Puppet

posted by @martinjlogan

Teyo Tyree, one of the founders of Puppet Labs, talks about model driven configuration management with Puppet. To be honest, I was really impressed by Teyo and the whole Puppet team, and I really appreciate their rigorous sysadmin culture. They seem to be very focused on the practical issues at hand and less interested in keeping up with the latest marketing buzzword of the day.

Teyo Tyree on: Model Driven Management with Puppet from Orbitz IDEAS on Vimeo.

In this video you will learn how Puppet works and what drives its architecture. You will get an understanding of how the model driven approach factors into Puppet. You will also learn how to leverage it to extend Puppet configuration management and integrate it with other systems.

April 21, 2011

Highway to the Availability Zone

by @mattokeefe

With apologies to Kenny Loggins’ Danger Zone (lyrics)…


Revvin’ up your VM
Listen to her disk roar
EBS under tension
Beggin’ you to touch and go

Highway to the Availability Zone
Right into the Danger Zone

Headin’ into Cloud
Spreadin’ out her apps tonight
She got you jumpin’ off the deck
And shovin’ into oversubscription

Highway to the Availability Zone
I’ll take you
Right into the Danger Zone

AWS will never say hello to you
Until you get it on the scale of Netflix
You’ll never know what you can do
Until you deploy across three Zones

Out along the edge
Always where I burn to be
The further on the edge
Higher Akamai profitability

Highway to the Availability Zone
Gonna take you
Right into the Danger Zone
Highway to the Availability Zone

Seriously though… concerning today’s AWS EC2 outage: nothing to see here, move along. Many enterprises run the same risk with internal IT if they are not redundant across active/active data centers and do not have proven, regularly tested failover capability.

The good news with the Cloud is that everything has an API, so you can automate your Disaster Recovery / Business Continuity process. Regularly snapshot your EBS volumes if on EC2, and recover from S3 (designed to provide 99.999999999% durability) in another Zone/Region. Tools like Cfengine, Puppet and Chef can help you recreate your entire infrastructure from source control in minutes rather than hours.

Also consider cloud management solutions such as enStratus, RightScale et al to abstract cloud provider details and provide multi-cloud redundancy, including your own private internal clouds if you choose to create one or more. Or, roll your own solution using jclouds or something similar.

Remember, *you* own your availability.

March 22, 2011

Exclusive DevOps.com Interview with @DEVOPS_BORAT

great pic of @devops_borat
by @mattokeefe

@DEVOPS_BORAT has exploded onto the DevOps scene lately, via Twitter. He won Best Cloud Philosopher and Best Cloud Tweet at the Cloudy Awards 2011. DevOps.com is pleased to share with you an exclusive interview:

@mattokeefe: Congrats on winning multiple Cloudies! Were you able to attend the award ceremony?

@DEVOPS_BORAT: In Kazakhstan we have old saying “If you can lean, you can clean”. Is why I not attend conference but prefer deploy infrastructure on all possible cloud provider.

@mattokeefe: Totally understandable. So where do you work and what is your role?

@DEVOPS_BORAT: I work in small startup in Almaty Kazakhstan. Is part of many startup launch by incubator company with name which is translate as Bird. In Kazakh language is spinoff know as Dropping, so our startup is Bird Dropping #53 finance by venture oil capitalist. We are specialize in social networking in the cloud with emphasis on Human To Android relationship.

In day to day job I have title of Senior Manager of Operation. I manage of myself and of Azamat who has title of Junior Manager of Operation and he only manage himself. We are sufficient manpower for deal with all devops issue in the cloud because we have everything automated. Also we use NoSQL which make it very easy scale, I can not able say at infinite but for practical purpose is infinite.

@mattokeefe: Did you start your career in development or operations? When did you first hear about DevOps?

@DEVOPS_BORAT: Word devops is start with dev then ops. I start career in development of C++ (as small detail, in Kazakh language is pronounce ++C which is more correct). I am happen of agree with Joel Overflow that programmer need learn C/C++ first so they understand pointer. If you not experience null pointer segfault is like you not experience sexytime!

I hear of DevOps in past 2 or 3 years, is coincide with downfall of Agile and rise of cloud and also of Twitter. First reason DevOps was create is because is easier type #devops than #developer #sysadmin but correct name is in actual OpsDev.

For practice DevOps I recommend first follow cloud expert and devops expert on Twitter. Next step is automate bulls shit out of everything.

@mattokeefe: I know what you mean about null pointer segfaults. I’ve seen Java log files full of NullPointerExceptions. When I showed the developers, they said “Oh, those are harmless. You can ignore them.” But they never went away, and I worried that Ops wouldn’t detect a real problem later. Is this something that DevOps can fix?

@DEVOPS_BORAT: In startup if we hear such comment from developer we immediately put them on pager for 1 month. Next time is 2 month and so on. Problem can not be able fix by DevOps. Only way to fix is not use Java in first place. All DevOps rock star are use Scala or Clojure.

@mattokeefe: Wow, you really are hardcore. So tell me, what DevOps tools do you use, and what do you find missing? Are you following the devops-toolchain project?

@DEVOPS_BORAT: Between my personally and Azamat we are try very hard for use all available DevOps tool. Nothing is perfect though so we end up roll our own tool. We tentative call tool Swiss Army Electric Saw, is good for monitor, alert, visual metric, queue, deploy, continuous integration and continuous delivery. Tool is base on node.js so it eliminate disk I/O. We also try hard eliminate network traffic by only allow 56k bandwidth for legacy customer.

I read page of devops-toolchain, I can not able comprehend with limited English. Is philosophic dissertation yes? Is very good if somebody is able get PhD out of it.

@mattokeefe: You guys are very ambitious with tooling. It sounds like you could use more help. If you could hire just one more person for your team, would you choose a developer interested in learning Operations, or an Ops guy looking to learn how to code?

@DEVOPS_BORAT: We are always search for mythical centaur creature 1/2 dev and 1/2 ops. We have business idea of launch RoR Web site for dating of dev and ops. We are hope for ROI in approximate 20 year.

@mattokeefe: Awesome. Are you looking for investors? The DevOps market seems to be heating up, despite Damon Edwards talking about shark jumping and Some Forrester Guy asking for NoOps. What do you think of these remarks?

@DEVOPS_BORAT: We have lot of interest from oil and natural gas baron in Russia. Not need VC dollar from U S and A. As matter of fact region of Almaty Kazakhstan is know as Silicon Camel Hump of Central Asia.

I read blog post of Forrester guy. Content is Noop. In my opinion DevOps is just sign of what is for come. Is going be follow by DevQaOps, then DevUxQaOps, DevUxQaSecOps and in final is pinnacle of Internet Jedi Samurai Jason Calacanis.

@mattokeefe: Well I am blown away by the wisdom that you have imparted so far! I can’t wait to share this with our readers, so I will publish part one of this interview ASAP. Let’s see what sort of questions our readers raise in the comments section, yes? Then I hope that we can have a follow up interview.
Now, just to wrap up part one, how do you celebrate successful DevOps achievements… erm… “happy sexytime” in your country? Do you enjoy pizza and beer, or some other custom?

@DEVOPS_BORAT: In startup we have quiet celebration usual. Is involve 2-3 keg of vodka. Is loud first 30 minute then very quiet. Azamat is last to stand. In past our ancestor use of shoot ibex after celebration. In startup we continue tradition by terminate random node in cloud after party, is same thrill of feeling.

@mattokeefe: Oh no, WordPress is down! I hope they didn’t sustain collateral damage from your Chaos Monkey!

“We’re experiencing some problems on WordPress.com and we are in read-only mode at the moment. We’re working hard on restoring full service as soon as possible, but you won’t be able to create or make changes to your site currently.”

Thanks for the interview!