Shell scripting with AWS Lambda: The Function

In the previous article, I detailed how AWS Lambda can be used as a scripting control tool for other AWS services. The fact that it is focused on running individual functions, includes the AWS SDK by default, and only accrues costs while running makes it a natural fit for administrative scripting. In this article, I detail a Lambda function that performs the cleanup itself: stopping EC2 instances that are no longer in use.

Lambda functions are single JavaScript Node.js functions that are called by the Lambda engine. They take two parameters that provide information about the event that triggered the function call and the context the function is running under. It is important that these functions run as stateless services that do not depend on the underlying compute infrastructure. In addition, it is helpful to keep the function free of excessive setup code and dependencies to minimize the overhead of running functions.

The Code


var aws = require('aws-sdk');
var async = require('async');
var moment = require('moment');
var ec2 = new aws.EC2({apiVersion: '2014-10-01'});

The AWS SDK is available to all Lambda functions, so we import it and configure it for use with EC2 in this example. You can also include any JavaScript library that you would use with Node. I have included both the Async module and the Moment.js library for time handling.

Core Logic

var defaultTimeToLive = moment.duration(4, 'hours');

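function shouldStop(instance) {
  var timeToLive = moment.duration(defaultTimeToLive.asMilliseconds());
  var permanent = false;

  // A return inside forEach only exits the callback, so record what the tags say
  (instance.Tags || []).forEach(function (tag) {
    if (tag.Key === 'permanent') {
      permanent = true;
    } else if (tag.Key === 'ttl-hours') {
      // Tag values are strings, so convert before building the duration
      timeToLive = moment.duration(Number(tag.Value), 'hours');
    }
  });

  if (permanent) {
    return false;
  }

  var upTime = new Date().getTime() - instance.LaunchTime.getTime();

  if (upTime < timeToLive.asMilliseconds()) {
    var remaining = moment.duration(timeToLive.asMilliseconds() - upTime);
    console.log("Instance (" + instance.InstanceId + ") has " + remaining.humanize() + " remaining.");
    return false;
  }
  return true;
}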

I use the AWS tagging mechanism to drive the decision of whether an EC2 instance should be stopped. If the instance is tagged as 'permanent' or carries a specific 'ttl-hours' tag, then the function knows that it should be kept alive and for how long. If neither tag was added, we want to stop the instance after a default time period. It might be helpful to externalize this default to an AWS configuration store such as SimpleDB, but I leave that as an exercise for the reader. Finally, it is helpful to log the amount of time each instance has left on its TTL.
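
To make the tag rules concrete, here is a dependency-free sketch of the same decision, kept separate from the function above. It uses plain millisecond arithmetic instead of Moment.js so it can be run anywhere. The `decide` helper, its `now` parameter, and the sample instance objects are illustrative assumptions, not part of the original function.

```javascript
// Hypothetical, dependency-free restatement of the tagging rules.
const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_TTL_HOURS = 4;

// Returns 'keep' or 'stop' for one instance description.
// `now` is passed in (an assumption, for testability) instead of reading the clock.
function decide(instance, now) {
  let ttlHours = DEFAULT_TTL_HOURS;
  for (const tag of instance.Tags || []) {
    if (tag.Key === 'permanent') {
      return 'keep';                 // permanent instances are never stopped
    }
    if (tag.Key === 'ttl-hours') {
      ttlHours = Number(tag.Value);  // tag values arrive as strings
    }
  }
  const upTime = now.getTime() - instance.LaunchTime.getTime();
  return upTime < ttlHours * HOUR_MS ? 'keep' : 'stop';
}

// Sample data shaped like describeInstances output (all values invented).
const now = new Date('2015-01-01T12:00:00Z');
const launchedSixHoursAgo = new Date('2015-01-01T06:00:00Z');

const untagged = { InstanceId: 'i-aaa', LaunchTime: launchedSixHoursAgo, Tags: [] };
const extended = { InstanceId: 'i-bbb', LaunchTime: launchedSixHoursAgo,
                   Tags: [{ Key: 'ttl-hours', Value: '8' }] };
const permanent = { InstanceId: 'i-ccc', LaunchTime: launchedSixHoursAgo,
                    Tags: [{ Key: 'permanent', Value: 'true' }] };

console.log(decide(untagged, now));   // stop: past the 4-hour default
console.log(decide(extended, now));   // keep: 6 hours is under the 8-hour tag
console.log(decide(permanent, now));  // keep
```

Passing the current time in as an argument keeps the rule easy to exercise with fixed dates; the Lambda function itself can simply use the real clock.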

Searching the instances

exports.handler = function (event, context) {
  async.waterfall([
    function fetchEC2Instances(next) {
      var ec2Params = {
        Filters: [
          {Name: 'instance-state-name', Values: ['running']}
        ]
      };
      // Pass any error straight to waterfall so the later steps are skipped
      ec2.describeInstances(ec2Params, function (err, data) {
        next(err, data);
      });
    },
    function filterInstances(data, next) {
      var stopList = [];

      data.Reservations.forEach(function (res) {
        res.Instances.forEach(function (instance) {
          if (shouldStop(instance)) {
            stopList.push(instance.InstanceId);
          }
        });
      });
      next(null, stopList);
    },
    function stopInstances(stopList, next) {
      if (stopList.length > 0) {
        ec2.stopInstances({InstanceIds: stopList}, function (err, data) {
          if (err) {
            next(err);
          } else {
            next();
          }
        });
      } else {
        console.log("No instances need to be stopped");
        next();
      }
    }
  ],
  function (err) {
    if (err) {
      console.error('Failed to clean EC2 instances: ', err);
    } else {
      console.log('Successfully cleaned all unused EC2 instances.');
    }
    // Tell Lambda that the invocation has finished
    context.done(err);
  });
};

This should look familiar to anyone who has done JavaScript AWS SDK work. We use the Async library to query for running instances. We then run the returned instance data through our helper function as a filter. Finally, we take all of the identified instances and stop them.

This code works well for a moderate number of running instances.  If you need to handle thousands of instances in your organization, you will need to adjust the fetch and stop processes to handle AWS SDK paging.  
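
One way to handle paging on the fetch side is to keep calling `describeInstances` with the `NextToken` from the previous response until no token is returned. The sketch below assumes that token-based behavior and takes the describe call as a parameter so it can be tried without AWS credentials. `fetchAllReservations` and the fake `fakeDescribe` service are hypothetical names used only for illustration.

```javascript
// Collects every reservation across pages.
// `describe(params, callback)` is expected to behave like ec2.describeInstances:
// it responds with { Reservations: [...], NextToken: '...' } and omits
// NextToken on the last page.
function fetchAllReservations(describe, params, done) {
  const reservations = [];

  function fetchPage(token) {
    const pageParams = Object.assign({}, params);
    if (token) {
      pageParams.NextToken = token;
    }
    describe(pageParams, function (err, data) {
      if (err) {
        return done(err);
      }
      Array.prototype.push.apply(reservations, data.Reservations);
      if (data.NextToken) {
        fetchPage(data.NextToken);   // more pages remain
      } else {
        done(null, reservations);    // last page reached
      }
    });
  }

  fetchPage(null);
}

// A fake two-page service standing in for EC2 (illustration only).
const pages = {
  first: { Reservations: [{ ReservationId: 'r-1' }], NextToken: 'page-2' },
  'page-2': { Reservations: [{ ReservationId: 'r-2' }] }
};
function fakeDescribe(params, callback) {
  callback(null, pages[params.NextToken || 'first']);
}

let allReservations = [];
fetchAllReservations(fakeDescribe, {}, function (err, result) {
  allReservations = result;
  console.log(result.length + ' reservations fetched');
});
```

With the real SDK, `ec2.describeInstances.bind(ec2)` could be passed as `describe`, and the collected reservations could then be fed to the filter step unchanged.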

You can find this code in our Github repository here:

Next Steps

The final piece of the puzzle for our Lambda scripting is deployment and scheduling.  In my final article on this, I will cover both how to deploy a Lambda function and the current, kludgy, method for job scheduling using EC2 autoscaling.

Tiered Testing of Microservices

There is a false challenge in testing a microservice. The application does not exist in isolation. It collaborates with other services in an interdependent web. How can one test a single strand of a web?

But test dependency management is not a new challenge. Using a microservice architecture increases the scale of the problem, and this forces a development team to address integration explicitly and strategically.

Common Terminology

Before discussing a testing strategy for microservices, we need a simple model with explicitly defined layers. Examples are given for RESTful implementations, but this model could be adapted for any transport format.

Figure 1: microservice structure

Resources handle incoming requests. They validate request format, delegate to services, and then package responses. All handling of the transport format for incoming requests is managed in resources. For a RESTful service, this would include deserialization of requests, authentication, serialization of responses, and mapping exceptions to HTTP status codes.

Services handle business logic for the application. They may collaborate with other services, adapters, or repositories to retrieve needed data to fulfill a request or to execute commands. Services only consume and produce domain objects. They do not interact with DTOs from the persistence layer or transport layer objects – requests and responses in a RESTful service, for example.

Adapters handle outgoing requests to external services. They marshal requests, unmarshal responses, and map them to domain objects that can be used by services. They are usually only called by services. All handling of the transport format for outgoing requests is managed in adapters.

Repositories handle transactions with the persistence layer (generally databases) in much the same way that adapters handle interactions with external services. All handling of persistent dependencies is managed in this layer.

A lightweight microservice might combine one or more of the above layers in a single component, but separation of concerns will make unit testing much simpler.
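
As a rough illustration of how the four layers might divide responsibilities, here is a small JavaScript sketch. Every name in it (`createRepository`, `createAdapter`, `createGreetingService`, `createResource`, and the request and response shapes) is hypothetical; it shows one possible arrangement, not a prescribed structure.

```javascript
// Repository: owns persistence (an in-memory Map stands in for a database).
function createRepository() {
  const rows = new Map([['42', { id: '42', name: 'Ada' }]]);
  return { findUser: function (id) { return rows.get(id) || null; } };
}

// Adapter: owns outgoing calls to an external service and maps the
// response to a domain value (the remote call is stubbed here).
function createAdapter() {
  return {
    fetchLocale: function (userId) {
      const remoteResponse = { user: userId, lang: 'en' }; // pretend payload
      return remoteResponse.lang;
    }
  };
}

// Service: business logic only; consumes and produces domain objects.
function createGreetingService(repository, adapter) {
  return {
    greet: function (userId) {
      const user = repository.findUser(userId);
      if (!user) {
        throw new Error('unknown user');
      }
      const locale = adapter.fetchLocale(userId);
      return { text: (locale === 'en' ? 'Hello, ' : 'Hola, ') + user.name };
    }
  };
}

// Resource: validates the request, delegates, and maps results and
// exceptions to transport-level responses (HTTP status codes here).
function createResource(service) {
  return {
    handle: function (request) {
      if (!request.params || !request.params.id) {
        return { status: 400, body: { error: 'id is required' } };
      }
      try {
        return { status: 200, body: service.greet(request.params.id) };
      } catch (e) {
        return { status: 404, body: { error: e.message } };
      }
    }
  };
}

const resource = createResource(
  createGreetingService(createRepository(), createAdapter()));

console.log(resource.handle({ params: { id: '42' } }));
console.log(resource.handle({ params: { id: '7' } }));
console.log(resource.handle({ params: {} }));
```

Because the service receives its repository and adapter as arguments, each collaborator can be swapped for a mock or an in-memory version in the lower test tiers.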

Planning for Speed and Endurance

A test strategy in general should prevent unwelcome surprises in production. We want to get as much valuable quality-related information as we can (coverage), in realistic conditions (verisimilitude), as fast as we can (speed), and with as little bother as possible (simplicity).

Every test method has trade-offs. Unit tests provide fast results for many scenarios and are usually built into the build process: they have good coverage, speed, and simplicity, but they aren’t very realistic. Manual user testing has the most verisimilitude and can be very simple to execute, but it has very poor speed and coverage.

Tiered Testing Strategy

To balance these trade-offs, we use a tiered testing strategy. Tests at the bottom of the pyramid are generally fast, numerous, and executed frequently, while tests at the top of the pyramid are generally slow, few in number, and executed less frequently. This article focuses on how these tiers are applied for microservices.

Unit Testing

Unit tests cover individual components. In a microservice, unit tests are most useful in the service layer, where they can verify business logic under controlled circumstances against conditions provided by mock collaborators. They are also useful in resources, repositories, and adapters for testing exceptional conditions – service failures, marshaling errors, etc.

Figure 2: Unit Testing Coverage

To get the most value from unit tests, they need to be executed frequently – every build should run the tests, and a failed test should fail the build. This is configured on a continuous integration server (e.g., Jenkins, TeamCity, or Bamboo) that constantly monitors for changes in the code.

Service Testing

Service testing encompasses all tests of the microservice as a whole, in isolation. Service testing is also often called “functional testing”, but this can be confusing since most tiers described here are technically functional. The purpose of service tests is to verify that the components that do not require external dependencies integrate correctly. To enable testing in isolation, we typically use mock components in place of the adapters and in-memory data sources for the repositories, configured under a separate profile. Tests are executed using the same technology that incoming requests would use (HTTP for a RESTful microservice, for example).

Figure 3: Service Testing Coverage

A team could avoid using mock implementations of adapters at this tier by testing against mock external services with recorded responses. This is more realistic, but in practice it adds a great deal of complexity – recorded responses must be maintained for each service and updated for all collaborators whenever a service changes. It also requires deploying these mock collaborators alongside the system under test during automated service testing, which adds complexity to the build process. It’s easier to rely on a quick, robust system integration testing process with automated deployments to reduce the lag between these two tiers.

Service tests can also be run as part of the build process using most build tools, ensuring that the application not only compiles but can also be deployed in an in-memory container without issue.

System Integration Testing

System integration tests verify how the microservice behaves in a functionally realistic environment – real databases, collaborators, load-balancers, etc. For the sake of simplicity, these are often also end-to-end tests – rather than writing a suite of system integration tests for each microservice, we develop a suite for the entire ecosystem. In this tier, we are focused on testing configuration and integration using “normal” user flows.

Figure 4: System Integration Testing Coverage

This test suite is also functionally critical because it is the first realistic test of the adapter/repository layer, since we rely on mocks or embedded databases in the lower layers. Because integration with other microservices is so critical, it’s important that this testing process be streamlined as much as possible. This is where an automated release, deployment, and testing process provides tremendous advantages.

User Acceptance Testing

System integration tests verify that the entire web of microservices behaves correctly when used in the fashion the development team assumes it will be used (against explicit requirements). User acceptance testing replaces assumptions with actual user behavior. Ideally, users are given a set of goals to accomplish and a few scenarios to test rather than explicit scripts.

Because user acceptance tests depend on real people, this tier is generally not automated (though it is possible, with crowd-sourcing). As a result, it can happen informally as part of sprint demos, formally only for major releases, or through live A/B testing with actual users.

Non-functional Testing

Non-functional testing is a catchall term for tests that verify non-functional quality aspects: security, stability, and performance. While these tests are generally executed less frequently in a comprehensive manner, a sound goal is to try to infect the lower tiers with these aspects as well. For example, security can also be tested functionally (logging in with an invalid password, for example), but at some point it also needs to be tested as an end in itself (through security audits, penetration testing, port scanning, etc). As another example, performance testing can provide valuable information even during automated functional tests by setting thresholds for how long individual method calls may take, or during user acceptance testing by soliciting feedback on how the system responds to requests, but it also needs to be tested more rigorously against the system as a whole under realistic production load.

Ideally, these tests would be scheduled to run automatically following successful system integration testing, but this can be challenging if production-like environments are not always available or third-party dependencies are shared.

Summation

The goal of the testing strategy, remember, is to be as fast, complete, realistic, and simple as possible. Each tier of testing adds complexity to the development process. Complexity is a hidden cost that must be justified, and not just to project stakeholders – your future self will need to maintain these tests indefinitely.

This strategy can serve as a model for organizing your own tiered strategy for testing, modified as necessary for your context. If you’ve found new and interesting solutions to the problems discussed in this article, let me know at


For the past year, my friend Lester Jackson has been volunteering at Manson High School in Central Washington by remotely teaching Computer Science through a Microsoft Youth Spark program named TEALS.

Lester has always been passionate about improving computer literacy, especially in underrepresented communities. He and several other volunteers work with an experienced high school teacher, coming in before work 1 to 2 days a week to teach CS at their assigned high school.

Why does Lester do it?

According to a 2013 study by, 90% of US high schools do not teach computer science. With software engineers in high demand in the private sector, schools often cannot find instructors with a computer science background, and struggle to compete with the compensation packages offered in industry. Even more staggering are the following statistics:

• Less than 2.4% of college students graduate with a degree in computer science, and the numbers have dropped since the last decade

• Exposure to CS leads to some of the best paying jobs in the world, but 75% of our population is underrepresented

• In 2012, fewer than 3,000 African American and Hispanic students took the high school A.P. computer science exam

• While 57% of bachelor’s degrees are earned by women, just 12% of computer science degrees are awarded to women

• In 25 of 50 US states, computer science doesn’t count towards high school graduation math or science requirements

Source:

The program needs more volunteers for next year. Here is how you can get involved:


Shell scripting with AWS Lambda

One of the newest pieces of the AWS toolkit is the Lambda compute engine.  Lambda provides an ability to deploy small bits of code that run independently as functions.  AWS only charges for the time that these snippets are running based on the resources requested to run the code.  This allows for extremely granular use of compute resources.

Previously, billing for general-purpose compute power was only in increments of one hour using EC2.  This was true even for managed aspects of EC2: Elastic Beanstalk, RDS, Elastic Load Balancing, etc.  This is not to say that there were no services that charged under a different model.  Many other non-compute specific services such as S3, SQS, or Kinesis are based on a per usage model.  The 100 millisecond-pricing model introduced with Lambda provides something that feels a great deal like per usage pricing for general compute.

Since Lambda functions are small and charged only when in use, Lambda encourages a model where development items can be deployed as many single functions rather than as more monolithic single-server applications. This is the Unix command line philosophy applied to the cloud. It encourages developers to focus on purpose-built tools and the interaction between components, just as when you build a shell script expecting it to interact with other shell scripts to achieve a larger task.

This parallel with shell scripting is also interesting with regard to administration of your AWS cloud infrastructure as a whole. The Lambda Node functions contain the JavaScript AWS API by default. Building functions to perform cloud maintenance, provisioning, or other scripting is now easily performed within the cloud itself. I realize that you could always deploy an EC2 instance that contained scripts for this purpose in any language needed, but this is a very heavyweight approach for a simple activity that is likely to only run for a few minutes. You incur not only the cost of the instance but all of the development work to get the instance set up and ready to run scripts. Lambda does this all for you.

It is better to produce a series of scripts to manage specific aspects of your AWS infrastructure. This is how shell scripts are used on Linux systems to manage aspects of the running machine. A Lambda script can perform a single activity or, when needed, several can be chained together to perform a series of actions. Lambda also comes with a couple of different external triggers: S3 and SQS. This allows your Lambda scripts to respond to actions occurring with other AWS or external applications that interface with these tools.

This is all supported by the fact that Lambda is secured using IAM roles.  This reduces the use of AWS credentials and allows you to pinpoint the AWS products, or even specific instances within the product, that the script has permission to access.  This security helps minimize the chance that a buggy script causes any issues outside of its allowed domain.  It is again similar to shell scripting permissions to reduce possible script exposure.

All of this points to Lambda as a great tool for managing AWS infrastructure in addition to other compute tasks that you may want to use it for.  In the next article in this series, I’ll  use Lambda to stop existing EC2 instances that are not in use. Stay tuned.

Beyond Code Coverage - Mutation Testing

Code coverage is a metric most of us track and manage. The basic idea is that when you run the tests, you indicate which lines and branches are executed. Then, at the end, you come up with a percentage of lines that were executed during a test run. If a line of code was never executed, that means it was never tested. A typical report might look something like this:

Code Coverage Results

You can see the line coverage and branch coverage in that graphic.

Typically, we consider 80% or above to be acceptable coverage, and 95% or above to be "great" coverage.

Lies, Damn Lies, and Statistics

However, coverage numbers are not always enough. Here's an example class and test case, let's see where the problem is:

So here we have a calculator, and we're testing the add method. We've tested that if we add two zeroes together, we get zero as a result. A standard code coverage would call this 100% covered. So where's the problem?

Codifying Behavior

What happens if somebody accidentally changed the addition to subtraction in the original method? Something like this:

The test case will still pass, and we will still have 100% coverage, but we can tell that the result is just wrong.

But, it can get more insidious than this. How about this example:

The lines in the constructor count towards the code coverage, despite us never validating them. What happens if somebody goes and changes the radix to 2? The tests will still pass, and the app will be completely wrong!

While additional tests can fix these concerns, how do we know if we have comprehensive coverage?


Imagine that we took the code above and did a couple things:

  • Changed all the constants to MAX_INT -- including radix
  • Changed addition to subtraction

Now, these should be breaking changes. But what happens if you make those changes, and your tests still pass? That means you're not testing it well enough! But if your tests fail or throw an error, then you are testing it.

This is the basic idea behind mutation testing. By inspecting the bytecode, we can apply transformations to the code, and re-run the test suite. We can then count the mutations that produced breaking changes, and the mutations that did not produce breaking changes -- we will call those "killed" and "survived".
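
The killed/survived bookkeeping can be illustrated without any bytecode tooling. The sketch below hand-writes a few mutants of one function and runs a tiny test suite against each. Real tools such as pitest generate the mutants automatically; all names here (`isAdult`, `mutants`, `runMutationTests`) are hypothetical.

```javascript
// Code under test: is the customer old enough?
function isAdult(age) {
  return age >= 18;
}

// Hand-written mutants, each a small change to the original.
const mutants = {
  'boundary: >= to >': function (age) { return age > 18; },
  'negate: >= to <': function (age) { return age < 18; },
  'constant: 18 to 21': function (age) { return age >= 21; }
};

// A test suite is a list of checks that take the implementation to test.
const weakSuite = [
  function (impl) { return impl(30) === true; },
  function (impl) { return impl(5) === false; }
];
const strongSuite = weakSuite.concat([
  function (impl) { return impl(18) === true; }   // boundary case
]);

// A mutant is killed when at least one check fails against it.
function runMutationTests(suite) {
  const report = { killed: [], survived: [] };
  Object.keys(mutants).forEach(function (name) {
    const allPass = suite.every(function (check) { return check(mutants[name]); });
    (allPass ? report.survived : report.killed).push(name);
  });
  return report;
}

const weakReport = runMutationTests(weakSuite);
const strongReport = runMutationTests(strongSuite);
console.log('weak suite survivors:', weakReport.survived);
console.log('strong suite survivors:', strongReport.survived);
```

The weak suite lets the boundary and constant mutants survive; adding a single check at the boundary value kills all three.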

Your goal, then, will be to maximize the "killed" stat, and minimize the "survived" stat. We have a Maven Site Docs Example, and I have wired in the pitest library, which provides the mutation framework. The output looks a little something like this:

pitest output 1

Those red lines are places where the mutations survived -- and the green ones are where mutations were killed. Highlighting or clicking the number on the left shows this portion:

pitest mutations applied

Mutations Available

Each language and framework is unique, but here are some examples of before and after:

Of course, this isn't limited to just equality mutations. We can also do stuff like this:

In this case, we are removing a method call that does not return a value. If your code still passes after this, why does that method call exist?
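
To show what removing a side-effect-only call looks like in practice, here is a hand-written JavaScript sketch of a before and after pair, along with one test that misses the change and one that catches it. The `withdraw` function and both tests are hypothetical, and the mutation is applied by hand here instead of by a tool.

```javascript
// Before: audit logging is a call made only for its side effect.
function withdraw(account, amount, auditLog) {
  if (amount <= account.balance) {
    account.balance -= amount;
    auditLog.push('withdrew ' + amount);
    return true;
  }
  return false;
}

// After the mutation: the side-effect call has been removed.
function withdrawMutated(account, amount, auditLog) {
  if (amount <= account.balance) {
    account.balance -= amount;
    return true;
  }
  return false;
}

// This test checks only the return value and balance, so the mutant survives.
function balanceTest(impl) {
  const account = { balance: 100 };
  return impl(account, 40, []) === true && account.balance === 60;
}

// This test also checks the audit trail, so the mutant is killed.
function auditTest(impl) {
  const log = [];
  impl({ balance: 100 }, 40, log);
  return log.length === 1;
}

console.log(balanceTest(withdraw), balanceTest(withdrawMutated)); // true true
console.log(auditTest(withdraw), auditTest(withdrawMutated));     // true false
```

If only the first test existed, the surviving mutant would be a hint that nothing verifies the audit entry at all.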

You should check out the full list of mutations available in pitest.


This method of testing is obviously more comprehensive, but it comes at a cost: time. Making all of these mutations and running the test suite can take significantly longer. Each mutation requires a run of the whole test suite, to make sure it was killed. Obviously the larger your code base, the more mutations are needed, and the more test runs are needed. You can tweak things like the number of threads used to run the suite, the mutations available, and the classes targeted. Apart from the thread count, each of these settings reduces the overall comprehensiveness of testing.

Now you know another tool for quality. You can see how we wired it up in our Maven Site Docs example, and how we integrated it into the maven lifecycle.

Good Code Reviews

Good code reviews are almost an art form. It takes a lot of discipline and, frankly, a lot of time to do a good review. Pull requests (in the style of GitHub or Stash) have made the mechanics of a code review significantly easier, but that doesn't mean you should slack on the review process itself. There are several good practices that you should follow, and several bad practices you should avoid at all costs!

Bad Code Review Practices

Let's start with some of the bad practices. Let's be fair, most of us want to know if we're doing something horrible, before learning to do something good. So, here's a few key things to avoid:

Reviewing Too Much at Once

This is probably the easiest trap to fall into. As a developer, you want to put a cohesive work product together for review. However, if the change is too large, most reviewers will just skim through it, looking for obvious defects. If the change is small, it's easier to spend just a couple minutes looking through it for issues.

Cisco, with a company called SmartBear, did a review of code reviews. They spent 10 months looking at the relationship between size, length, and defect rate of code reviews. What they found looks like this graph:

Code Review Results

This shows that above ~400 lines of code, there were practically no meaningful defects filed. As the lines of code dropped, significantly better feedback was provided. Drop the size of your review!

Reviewing too Quickly

If you're trying to blaze through the pull requests, you're also likely to do a poor job. That same study that compared defect density to lines of code also looked at how much code you review per hour. This is what they found:

Review Rate

Notice how the defect rate drops the more LOC/hr you are trying to accomplish. This shows that ~500 LOC/hr is the most you can review and still be effective. Combining this with the previous graph, an ideal situation might be 2 reviews each around 250 LOC in one hour. And that's a full hour, not 10 minutes per review. Set a timer, because you're probably not spending enough time on them!

Reviewing Code Style

A code review is not the time to nit-pick on things like brace placement, indent style, or other style-specific issues. If style is a critical part of your code base, there are all sorts of automated tools to take care of it for you. For Java, you have CheckStyle, for JavaScript you have JSHint, and for Ruby, you have rubocop. If reviewers are nitpicking on style, they're ignoring the more difficult review portions.

Fear-Driven Development

This one can be difficult. A lot of people don't want to give tough feedback to colleagues in the hope that the feedback they receive is similarly dampened. It helps nobody to sugar-coat the truth. Some developers need more feedback than others. It's easy for some people to get their feelings hurt, or feel like they're under a personal attack if their review goes poorly. Some developers fear giving feedback to more senior developers for fear of looking dumb or being laughed at. Some developers are not "people person" material, and the reviewer may want to avoid getting on their bad side. They may fear being labeled "not a team player." They may fear delaying the release because of a bug, and get pressure from somebody (usually project owner or scrum master) to "just sign off on it so we can make our release."

This really boils down to professionalism. It may take longer for some people, but you have to internalize that a code review is not a reflection on you as a person, nor your ability as a programmer. Every programmer makes mistakes. Separating your ego from the review process is a critical skill that developers must learn and develop over time. In addition, even senior developers need feedback. They're not perfect, either, and sometimes they will miss really silly issues. You can't be afraid to give or receive feedback, or your product, and therefore your business, will suffer.

Good Code Review Practices

Ok, now that you know all the stuff not to do, let's focus on what you should do to get the best bang for your buck. We already know an ideal timeframe and LOC target based on the above data, but there are some practices to make it even less painful:

Make it a Part of Your Process

Code reviews done after a commit to the mainline are unlikely to get feedback that will change the code. Code reviews done before a commit stall the developer from working on future progress. How do you balance these? Pull requests!

Pull Request

Interruptions are the bane of a developer who needs focus to complete their work at the velocity managers want. By making code reviews asynchronous, you allow developers to keep working. By developing in a story-specific branch, you can have both source control and before-commit semantics on your mainline.

Note that having this built into tools like GitHub and Stash is a compelling enough case to get your legacy projects off of SVN and migrate them to Git.

Don't Compromise

Given that you should be OK receiving and giving feedback, what happens when the author of the code doesn't agree with your feedback? As soon as a review is done, the negotiation begins. You will each make a case, and the worst thing to do is compromise. You're each agreeing to not get what you think is right, in order to move on to something else. That's a sure-fire way to drop your code quality. There are 3 ways to resolve this conflict:

  • Reviewer retracts comments -- If the reviewer talks with the author and realizes his comments were in error or are not actually issues, he should retract them. Then there's no conflict.
  • Reviewer stands their ground -- This may mean you want the author to go back to the drawing board. Or maybe, you are arguing about a known issue. For example, you noticed a potential SQL injection issue, and refuse to let it through the gates. If you both are standing your ground...
  • Bubble up to manager/architect -- If your project has an architect, you can bring the issue to her and let her decide the correct approach. These situations are usually more agreeable to the reviewer than the reviewed, as the architect usually doesn't want to compromise quality.

Stick with these three paths of resolution, and your quality should stay high. In addition, the negotiation may actually be a learning opportunity for both parties. One of you may decide to retract an objection when you learn additional information. Code reviews are not an attacker/defender relationship; they are a collaborative peer/peer relationship. Treat them like that.

Hire Like You Mean it

This one should be relatively straightforward: don't hire people that bristle at code reviews. In any process, there are some people who simply don't want to follow it. In Agile, people complained about commitments and sizing. In waterfall, they complained about the overly-long design time. In XP, they complained about having another person sit so close. Even if you lock them all in individual offices, they will complain about the lack of communication.

This can be fixed in hiring. If you ask a candidate about code reviews and they scoff or talk down about them while you're trying to promote reviews on your team, that should be a no-hire. They will find a home somewhere else that doesn't do code reviews -- which is probably a majority of companies. If you care about your product, hire like you mean it. Walk the walk.

Minimize Ownership

If you're programming for a company, that company owns the code you write. However, many developers treat the code they write as their property. They'll throw a fit if somebody else tries to mess around in "their" code. These people are likely to cause the same issues as described above. A developer who doesn't feel that sense of ownership is more likely to take feedback objectively. If a developer doesn't want to do code reviews because "they wouldn't understand it" or "only I know what it does," that's a huge red flag for your team, and a huge red flag for your product. What happens when that guy leaves for greener pastures? Who's going to take over the hairball of code that he didn't let anyone poke at?

Prepare for Code Reviews

As the author of a piece of code, you should try to make sure you get the most meaningful feedback possible. That same Cisco study found another interesting correlation:

Code Review Prep

This shows that if you leave 0 comments, you're likely to get a significantly higher defect rate. Tools like Stash and GitHub allow you to add comments to a code review. This sort of prep might help alleviate some concerns right off the bat, or they might be used to justify a design decision that is unclear from just looking at the code. Much in the same way a negotiation is a learning opportunity for both parties, prep comments are a way to make sure you understand what your code is doing, and can explain it to other people.

Code Review Checklist

This one seems super simple, but it's remarkably effective. Most people will make the same few mistakes consistently, despite trying not to. It's helpful to keep a checklist so you don't forget to check your blind spots. Some common checklist items might be:

  • Unit Tests -- Make sure unit tests exist for any new code. For bug fixes, see if there's a test that verifies the bug fix.
  • Method Arguments -- Make sure the method arguments make sense. Do they have boolean arguments? Do they have 20-argument functions? Remember, you're trying to keep your quality high!
  • Null Checks -- This probably doesn't need to be said, but you should inspect all method params and return values to see if null references are handled correctly. In Java, you can use the @NonNull annotation to get compiler help on this.
  • Resource Cleanup -- If the program is modifying or creating files, make sure the files are closed gracefully. If they are opening a transaction, make sure it gets committed or rolled back. If they're opening a stream, make sure it gets closed or consumed before moving on.
  • Security Concerns -- This is becoming a bigger issue in our industry. You should check to make sure your front-end code properly escapes user input, avoiding XSS attacks. Validate your inputs. Make sure you're hashing and salting passwords. Make sure you handle PII safely, and try to avoid collecting more than you need. If you're using a C-like language, make sure to check your buffers for potential overflows, or use a known safe library to consume your buffers.

Mix Up the Reviewers

You want to avoid having the same author's code consistently reviewed by one or two reviewers. Ideally, you would cycle through reviewers, making sure everybody gets a chance. This prevents any one reviewer from going code-blind to defects in another author's code, and makes sure that all developers have some basic knowledge about other parts of the system.


A good code review requires effort and time, but the benefits are real. There's a famous book called "Code Complete," that cites the following:

... software testing alone has limited effectiveness -- the average defect detection rate is only 25 percent for unit testing, 35 percent for function testing, and 45 percent for integration testing. In contrast, the average effectiveness of design and code inspections are 55 and 60 percent. Case studies of review results have been impressive:

  • In a software-maintenance organization, 55 percent of one-line maintenance changes were in error before code reviews were introduced. After reviews were introduced, only 2 percent of the changes were in error. When all changes were considered, 95 percent were correct the first time after reviews were introduced. Before reviews were introduced, under 20 percent were correct the first time.
  • In a group of 11 programs developed by the same group of people, the first 5 were developed without reviews. The remaining 6 were developed with reviews. After all the programs were released to production, the first 5 had an average of 4.5 errors per 100 lines of code. The 6 that had been inspected had an average of only 0.82 errors per 100. Reviews cut the errors by over 80 percent.
  • The Aetna Insurance Company found 82 percent of the errors in a program by using inspections and was able to decrease its development resources by 20 percent.
  • IBM's 500,000 line Orbit project used 11 levels of inspections. It was delivered early and had only about 1 percent of the errors that would normally be expected.
  • A study of an organization at AT&T with more than 200 people reported a 14 percent increase in productivity and a 90 percent decrease in defects after the organization introduced reviews.
  • Jet Propulsion Laboratories estimates that it saves about $25,000 per inspection by finding and fixing defects at an early stage.

Code reviews are another tool in the belt when you're focusing on quality. Throw in automated testing and automated deployments, and you're about ready to start continuously delivering your software.

Intro to MongoDB

This article might come as a surprise to some -- an Intro to Mongo article in 2015? Yes, MongoDB had its heyday several years ago, but as consultants who see a lot of different companies, one thing is clear to us: there are a lot of Mongo databases out there, but not enough people know even the basics of how to use them. In addition, a lot of nay-sayers denounced the tool, pointing at its lack of schema as proof that it's a tool for amateurs and others who are too lazy to learn SQL.

So What's Changed?

Every piece of technology goes through an adoption curve. The curve looks a lot like this:

Technology Adoption

We all remember the MongoDB of that Early Adopter phase, right around the Peak of Inflated Expectations. The ride down to the Trough of Disillusionment was swift for Mongo, and most people considered it a dead technology at that point. However, Mongo is now cruising along the Slope of Enlightenment. People now understand the strengths and weaknesses of Mongo better. People aren't trying to use it as a schema-free SQL. They're using it for its intended purpose: document storage.

Before we get too far into the guts of MongoDB, let's remember what its strengths are, compared to a traditional database:

  • Storing full documents atomically at one location
  • Automatically scale horizontally with additional nodes
  • Highly available service, and replicated data
  • Store files across multiple nodes for highly available data
  • Aggregation and Map-Reduce operations.

Your traditional SQL database is limited to one machine. While it can replicate to a number of others, the total size of the database cannot exceed the size of one machine. It's a scale-up architecture. With MongoDB, the assumption is that the data will end up consuming more than one machine worth of resources. It's a scale-out architecture.


ACID compliance is what makes modern RDBMSs so powerful. It allows us to write to and read from the database, and get some guarantees on the consistency of the data. When you hear developers talking about "transactions," they mean a way to atomically manipulate a set of data without side effects.

The CAP Theorem says that for any distributed system, you can only have 2 of the following:

  • Consistency - All nodes see the same data at the same time
  • Availability - Every request gets a response about success or failure
  • Partition-tolerance - System continues to operate despite arbitrary message loss or failure of some part of the system.

The traditional RDBMS has chosen CA. If some part of the DB is unavailable, the whole thing becomes unavailable. MongoDB has chosen AP. Consistency can be achieved in this system eventually, given enough time. That's why it's called eventual consistency. What this means, however, is that if one node in the cluster gets a write for document X, another node in the cluster may return stale data for some period of time until consistency is achieved.
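To make the stale-read window concrete, here is a minimal sketch of eventual consistency in plain JavaScript. It is a toy model under stated assumptions, not how MongoDB replicates: the ToyCluster class and its write, read, and replicate methods are hypothetical names invented for this illustration, and replication only happens when you ask for it.

```javascript
// Toy model of eventual consistency. ToyCluster is a hypothetical class,
// not a MongoDB API: a write lands on one node and is copied to the
// other nodes only when replicate() is called.
class ToyCluster {
  constructor(nodeCount) {
    this.nodes = Array.from({ length: nodeCount }, () => new Map());
    this.pending = []; // writes not yet copied to every node
  }

  // Apply the write to a single node and queue it for the others.
  write(nodeIndex, key, value) {
    this.nodes[nodeIndex].set(key, value);
    this.pending.push({ from: nodeIndex, key, value });
  }

  // Return whatever this node currently holds, which may be stale.
  read(nodeIndex, key) {
    return this.nodes[nodeIndex].get(key);
  }

  // Copy all queued writes, in order, to the remaining nodes.
  replicate() {
    for (const { from, key, value } of this.pending) {
      this.nodes.forEach((node, i) => {
        if (i !== from) node.set(key, value);
      });
    }
    this.pending = [];
  }
}

const cluster = new ToyCluster(2);
cluster.write(0, 'X', 'v1');
cluster.replicate();
cluster.write(0, 'X', 'v2');              // node 0 has v2, node 1 does not yet
const staleRead = cluster.read(1, 'X');   // node 1 still answers with the old value
cluster.replicate();
const freshRead = cluster.read(1, 'X');   // after replication, node 1 has caught up
console.log(staleRead, freshRead);
```

Node 1 keeps answering requests the whole time (availability), but for a while its answer is out of date. That window is the price of choosing AP.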

This does not mean that one system is necessarily better than the other. It's a design tradeoff you must make, no different than choosing your UI framework or backend programming language. For a lot of applications, it's not necessary to have perfect consistency:

  • Sensor Data Collection -- loosely structured, high volumes, high variety. Think biometric sensors, "internet of things," etc.
  • Ad Targeting -- latency guarantees are far more important than consistency guarantees.
  • High-Frequency trading -- again, latency is far more important than consistency, as long as it's mostly consistent.
  • Survey Site -- Custom surveys in custom data documents. Other people's answers have no bearing on yours.
  • Call Records -- These are fixed documents that may capture any number of variables
  • Caching -- This is one of the most compelling cases. Data that is difficult or expensive to collect can be stored in a document for later use. The absence of a record, however, is OK.

There are a variety of cases where perfect consistency is desired, such as financial transaction data. Again, this is a design decision of your project.

Playing with MongoDB

One of the coolest things MongoDB has done is create a demo site. You can try MongoDB in your browser, without dealing with an installation!

For the examples here, we're going to pretend that we're collecting biometric sensor data. Let's assume you have some sort of futuristic device that can measure everything about your body. It's to support more tailored advertising, of course.

Baby Steps

In MongoDB, you organize your data at two levels. First, there is the database, which is just a collection of collections. Second, there is the collection, which is a group of documents. In order to use a database, just issue a simple command:

$ use test
$ db

OK, we're in the test database. Collections are created implicitly the first time you insert data into them. In this example, we're going to store the data in a time-series fashion. That is, each document we insert represents the data collected since the last document:

$ first = { ts: 10000, heart: { beats: 14 }, eyes: { left: { blinks: 3 }, right: { blinks: 3 }} }
{
    "ts" : 10000,
    "heart" : {
        "beats" : 14
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
$ db.metrics.insert(first)
WriteResult({ "nInserted" : 1 })

We've created our first document! Let's get it back out:

$ db.metrics.find()
{
    "_id" : ObjectId("54d54bf51cdcaf3824ff7059"),
    "heart" : {
        "beats" : 14
    },
    "eyes" : {
        "right" : {
            "blinks" : 3
        },
        "left" : {
            "blinks" : 3
        }
    },
    "ts" : 10000
}

The _id Field

The find() operation shows an interesting thing -- the _id field. It's the unique identifier for that record. You can set or otherwise customize this id by passing it in:

${ _id: "hello_world", hello: "world"})
WriteResult({ "nInserted" : 1 })
{ "_id" : "hello_world", "hello" : "world" }

One of the interesting things about the Mongo shell is that it's JavaScript. You can actually write something like this:

$ for ( var i = 0; i < 10; i++ ) {{ hello: "world "+i }) }
{ "_id" : "hello_world", "hello" : "world" }
{ "_id" : ObjectId("54d54ecc40694710ae63aefb"), "hello" : "world 0" }
{ "_id" : ObjectId("54d54ecf40694710ae63aefd"), "hello" : "world 1" }
{ "_id" : ObjectId("54d54ed11cdcaf3824ff7081"), "hello" : "world 2" }
{ "_id" : ObjectId("54d54ed41cdcaf3824ff7083"), "hello" : "world 3" }
{ "_id" : ObjectId("54d54ed71cdcaf3824ff7085"), "hello" : "world 4" }
{ "_id" : ObjectId("54d54edc40694710ae63af00"), "hello" : "world 5" }
{ "_id" : ObjectId("54d54edf40694710ae63af04"), "hello" : "world 6" }
{ "_id" : ObjectId("54d54ee41cdcaf3824ff7087"), "hello" : "world 7" }
{ "_id" : ObjectId("54d54ee740694710ae63af06"), "hello" : "world 8" }
{ "_id" : ObjectId("54d54eea1cdcaf3824ff7089"), "hello" : "world 9" }

Back to the Metrics

So, let's generate some data for our biometric sensors:

$ function getRand(start, end) {
...     return Math.floor(start + (Math.random() * (end - start + 1)));
... }
$ for ( var i = 0; i < 1000; i++ ) {
...     var rec = {
...         ts: 10000 + (i * 5),
...         heart: {
...             beats: 6 + getRand(0, 15)
...         },
...         eyes: {
...             left: {
...                 blinks: 3 + (i % 3)
...             },
...             right: {
...                 blinks: 3 + (i % 3)
...             }
...         }
...     };
...     db.metrics.insert(rec);
... }
$ db.metrics.find().limit(3)
{ "_id" : ObjectId("54d5530e3ac6e2a4782b088e"), "ts" : 14845, "heart" : { "beats" : 20 }, "eyes" : { "left" : { "blinks" : 3 }, "right" : { "blinks" : 3 } } }
{ "_id" : ObjectId("54d5530e3ac6e2a4782b088f"), "ts" : 14850, "heart" : { "beats" : 8 }, "eyes" : { "left" : { "blinks" : 4 }, "right" : { "blinks" : 4 } } }
{ "_id" : ObjectId("54d5530e3ac6e2a4782b0890"), "ts" : 14855, "heart" : { "beats" : 20 }, "eyes" : { "left" : { "blinks" : 5 }, "right" : { "blinks" : 5 } } }

Notice that we also added a limit() to the find() query. It works as you would expect. Now, let's say I want to look up a specific record, such as the one at ts = 12485:

$ db.metrics.find({ts: 12485}).pretty();
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b06b6"),
    "ts" : 12485,
    "heart" : {
        "beats" : 7
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

Notice another neat trick here. Above, the JSON data was all on one line, and a bit difficult to read. Here, we have appended the pretty() method, and it pretty-prints the JSON. You can do the same thing with nested fields:

$ db.metrics.find({ heart: { beats: 19 }}).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b089b"),
    "ts" : 14910,
    "heart" : {
        "beats" : 19
    },
    "eyes" : {
        "left" : {
            "blinks" : 4
        },
        "right" : {
            "blinks" : 4
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b089f"),
    "ts" : 14930,
    "heart" : {
        "beats" : 19
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

So, that's the basics of finding records based on exact values. Because MongoDB is schema-free, however, you are allowed to do queries that make no sense:

$ db.metrics.find({ weight: 200 })

We found no results because no records have a weight field. This is how you can have documents with different schemas, but still be able to find relevant records.

Now, let's try something more interesting. Let's try to find the records with more than 9 beats:

$ db.metrics.find({ heart: { beats: {$gt: 9} } })
$ db.metrics.find({ "heart.beats": {$gt: 9} }).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b088e"),
    "ts" : 14845,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b0890"),
    "ts" : 14855,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

Here we have our first bit of Mongo weirdness. The first query returns nothing, because a nested query document like { heart: { beats: {$gt: 9} } } is treated as an exact match against the entire embedded document rather than as a comparison on beats. So, if you want to apply a modifier like $gt to a nested value, you must use dotted notation, as in the second query. We can also search between two values:

$ db.metrics.find({ "heart.beats": {$gte: 18, $lt: 21} }).limit(2).pretty()
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b088e"),
    "ts" : 14845,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 3
        },
        "right" : {
            "blinks" : 3
        }
    }
}
{
    "_id" : ObjectId("54d5530e3ac6e2a4782b0890"),
    "ts" : 14855,
    "heart" : {
        "beats" : 20
    },
    "eyes" : {
        "left" : {
            "blinks" : 5
        },
        "right" : {
            "blinks" : 5
        }
    }
}

In addition, we can use something called projections to indicate which fields should be sent back:

$ db.metrics.find({ ts: 14845, heart: { beats: 20 }}, { _id: 0, "heart.beats": 1})
{ "heart" : { "beats" : 20 } }
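If the dotted notation in these queries feels like magic, it can help to model it outside the database. The following is a rough sketch in plain JavaScript of how a dotted path and the comparison modifiers could be evaluated against a nested document. It is a toy matcher, not MongoDB's actual implementation, and getPath and matches are hypothetical helper names made up for this example.

```javascript
// Toy model of MongoDB-style matching. This is NOT MongoDB's implementation;
// getPath and matches are hypothetical helpers for illustration only.

// Resolve a dotted path such as "heart.beats" against a nested document.
function getPath(doc, path) {
  return path.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// Supports plain equality on a path, plus $gt / $gte / $lt / $lte.
function matches(doc, query) {
  return Object.entries(query).every(([path, condition]) => {
    const actual = getPath(doc, path);
    if (condition === null || typeof condition !== 'object') {
      return actual === condition; // plain equality, e.g. { ts: 14845 }
    }
    return Object.entries(condition).every(([op, bound]) => {
      switch (op) {
        case '$gt': return actual > bound;
        case '$gte': return actual >= bound;
        case '$lt': return actual < bound;
        case '$lte': return actual <= bound;
        default: throw new Error('unsupported operator: ' + op);
      }
    });
  });
}

// Three sample documents shaped like the sensor records in this article.
const docs = [
  { ts: 14845, heart: { beats: 20 } },
  { ts: 14850, heart: { beats: 8 } },
  { ts: 14855, heart: { beats: 20 } },
];

const fastBeats = docs.filter((d) => matches(d, { 'heart.beats': { $gt: 9 } }));
const inRange = docs.filter((d) => matches(d, { 'heart.beats': { $gte: 18, $lt: 21 } }));
const noWeight = docs.filter((d) => matches(d, { weight: 200 }));
console.log(fastBeats.map((d) => d.ts), inRange.length, noWeight.length);
```

A missing field simply resolves to undefined, which never satisfies a comparison. That is why the weight query earlier returns nothing instead of raising an error.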

Using the Data

Ok, we can insert and find/manipulate data. Let's start trying to use it. Let's figure out our average heartrate. How would we do this? We need to count the number of total beats, and divide it by the number of minutes. This is where MongoDB starts getting fun. We are using a part of the tool they refer to as the Aggregation Pipeline framework. It looks a bit like this:

$ db.metrics.aggregate([{ $group: { _id: "answer", sumOfBeats: { $sum: "$heart.beats" } } }])
{ "_id" : "answer", "sumOfBeats" : 13587 }

The aggregation framework has some wonkiness in syntax, but it becomes easier over time. First, we are calling .aggregate(), which takes an array of aggregation stages. For our first stage, we're just applying a grouping operation. The object after $group describes the exact format that will be returned. We must specify an _id field manually, since every group must have one. Then, we're telling MongoDB to sum across all of "heart.beats". We have to put a dollar sign in front of it so that MongoDB reads the string as a field path to look up in each document rather than as a literal string. Now, let's get the time boundaries in there:

$ db.metrics.aggregate([{ $group: { _id: "answer", sumOfBeats: { $sum: "$heart.beats" }, startTime: { $min: "$ts" }, endTime: { $max: "$ts"} } }])
{ "_id" : "answer", "sumOfBeats" : 13587, "startTime" : 10000, "endTime" : 14995 }

So along with $sum, we can also do $min and $max (much like their SQL equivalents). Now, how can we actually get BPM? Let's add another stage to the pipeline. We're going to start formatting our query to make it easier to read:

$ db.metrics.aggregate(
...     [
...         {
...             $group: {
...                 _id: "answer",
...                 sumOfBeats: { $sum: "$heart.beats" },
...                 startTime: { $min: "$ts" },
...                 endTime: { $max: "$ts"}
...             }
...         }, {
...             $project: {
...                 sumOfBeats: 1,
...                 elapsedTime: { $subtract: ["$endTime", "$startTime"]}
...             }
...         }
... ])
{ "_id" : "answer", "sumOfBeats" : 13587, "elapsedTime" : 4995 }

So now we have time and sum of beats. Let's go one step further!

$ db.metrics.aggregate(
...     [
...         {
...             $group: {
...                 _id: "answer",
...                 sumOfBeats: { $sum: "$heart.beats" },
...                 startTime: { $min: "$ts" },
...                 endTime: { $max: "$ts"}
...             }
...         }, {
...             $project: {
...                 bpm: {
...                     $divide: [
...                         "$sumOfBeats",
...                         {
...                             $divide: [
...                                 {$subtract: ["$endTime", "$startTime"]},
...                                 60
...                             ]
...                         }
...                     ]
...                 }
...             }
...         }
... ])
{ "_id" : "answer", "bpm" : 163.2072072072072 }

Now, we have our BPM. Yes, it's extremely high, but we also had a bunch of entries with >19 beats in 5 seconds. Maybe this person was running?
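If the pipeline syntax is hard to follow, it may help to see the same two stages written as ordinary JavaScript over an in-memory array. This is only a sketch of the arithmetic, assuming documents shaped like our sensor records. The groupStage and projectStage functions are hypothetical names, not part of MongoDB.

```javascript
// Plain-JavaScript equivalent of the $group + $project pipeline above.
// groupStage and projectStage are hypothetical names for illustration.

// Mirrors $group: one output document holding $sum, $min and $max.
function groupStage(docs) {
  return docs.reduce(
    (acc, doc) => ({
      _id: 'answer',
      sumOfBeats: acc.sumOfBeats + doc.heart.beats,
      startTime: Math.min(acc.startTime, doc.ts),
      endTime: Math.max(acc.endTime, doc.ts),
    }),
    { _id: 'answer', sumOfBeats: 0, startTime: Infinity, endTime: -Infinity }
  );
}

// Mirrors $project: beats divided by elapsed minutes (ts is in seconds).
function projectStage(group) {
  const elapsedMinutes = (group.endTime - group.startTime) / 60;
  return { _id: group._id, bpm: group.sumOfBeats / elapsedMinutes };
}

// A tiny made-up sample: 30 beats over 60 seconds.
const sample = [
  { ts: 0, heart: { beats: 10 } },
  { ts: 30, heart: { beats: 12 } },
  { ts: 60, heart: { beats: 8 } },
];
const sampleResult = projectStage(groupStage(sample));
console.log(sampleResult); // { _id: 'answer', bpm: 30 }

// Feeding in the totals reported by the shell above gives the same BPM.
const articleResult = projectStage({
  _id: 'answer', sumOfBeats: 13587, startTime: 10000, endTime: 14995,
});
console.log(articleResult.bpm);
```

Each stage takes the previous stage's output as its input, which is the whole idea behind the pipeline.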


That's the basics of using MongoDB. It's a lightning-fast storage engine for full documents. You can do almost every SQL operation you are already familiar with, and even more in a map-reduce inspired pipeline framework.

While the hype cycle may be dead with MongoDB, it's still a valuable tool to have in your belt. It offers a nearly effortless way to store documents (or objects), and analyze them.

In addition, Mongo scales much more easily than SQL. It's nearly trivial to expand your document database to a dozen machines. Mongo focuses on speed, flexibility, and scalability, while not dealing with stuff like transactions and relational semantics. This is a design tradeoff you might make for your application.

Intro to Neo4j

Introduction to Neo4j

In the world of NoSQL, scale is not the only performance characteristic that matters. Sometimes, the discoverability of data -- finding a needle in a haystack -- is more valuable.

Neo4j is a graph database. Its purpose is to store and manage data based on graph theory. Some problem domains lend themselves particularly well to graph space: social connections, collaborative filtering, and data discovery, along with many others.

In order to show the basics of graphs, we're going to use the ever-popular idea of movie ratings. Instead of data being scraped from IMDb, we are using a tool called MovieTweetings. This tool scrapes Twitter for well-defined tweets that indicate a movie rating. The author of the tool publishes a new dataset almost every day, so the data is significantly fresher than any other source we've found. Before we get started in that, however, let's get some background.

What is a Graph?

If you have a Computer Science background, you'll likely remember your classes on graphs. Graphs have nodes and edges. Edges indicate a connection between nodes. A graphical representation of a graph might look like this:

Graph Example

Note that the edges (lines) have numbers on them. This is often referred to as the "cost" of making a transition. Not every graph needs to have costs, but it's particularly valuable when you're trying to optimize a path through the graph. There's a very famous problem called the Travelling Salesman Problem. It's an NP-hard problem, meaning no known algorithm solves it efficiently as the graph grows. You can also consider driving directions on maps. Imagine intersections and addresses are the nodes, and the streets are edges. If you know the distance from one intersection node to the next, you could create fast driving directions. An example of this might look like so:

Graph Example 2.

Now, just looking at it, what is the best path from Second & Oak to Third & Market? You can't go on Market Alley, because it's a one-way street! Now, imagine this graph covers every single street in the country, with precise distance values. Then, add in traffic data for a more precise cost. Then, add in some intersections not allowing certain turns, construction issues, and a visualizer, and you're well on your way to competing with Google Maps!
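To see how costs drive the choice of route, here is a small sketch of Dijkstra's shortest-path algorithm in JavaScript. I'm bringing in that algorithm myself; it isn't something the mapping example prescribes. The intersection names echo the example above, but the street layout and the costs are made up for illustration and are not taken from the figure. One-way streets are modelled by listing an edge in one direction only.

```javascript
// Dijkstra's algorithm over a small directed graph with costs.
// The streets and costs below are hypothetical, not read from the figure.
function shortestPath(graph, start, goal) {
  const dist = { [start]: 0 }; // best known cost to each discovered node
  const prev = {};             // previous hop on the best known route
  const visited = new Set();

  while (true) {
    // Pick the unvisited node with the smallest known cost.
    let current = null;
    for (const node of Object.keys(dist)) {
      if (!visited.has(node) && (current === null || dist[node] < dist[current])) {
        current = node;
      }
    }
    if (current === null) return null; // nothing left to explore: unreachable
    if (current === goal) break;
    visited.add(current);

    // Relax every outgoing edge of the current node.
    for (const [next, cost] of Object.entries(graph[current] || {})) {
      const candidate = dist[current] + cost;
      if (!(next in dist) || candidate < dist[next]) {
        dist[next] = candidate;
        prev[next] = current;
      }
    }
  }

  // Walk the prev links backwards to rebuild the route.
  const path = [goal];
  while (path[0] !== start) path.unshift(prev[path[0]]);
  return { cost: dist[goal], path };
}

// Each key maps to its reachable neighbours and the cost to get there.
const streets = {
  'Second & Oak': { 'Second & Market': 4, 'Third & Oak': 2 },
  'Third & Oak': { 'Third & Market': 5 },
  'Second & Market': { 'Third & Market': 1 },
  'Third & Market': {},
};

const route = shortestPath(streets, 'Second & Oak', 'Third & Market');
console.log(route);
```

With these made-up costs, the route through Second & Market wins even though its first leg is the more expensive one. That is exactly the kind of answer that is hard to eyeball on a bigger map.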

Graphs in SQL

It's entirely possible to represent a graph in SQL. This has been done multiple times, and there are many versions of it. The classic is the adjacency list, where you have a "parent" column:

Adjacency List Example.

Note that to find the nodes that are 1 hop from A, it takes the query at the bottom of the graphic. You have to self-join on the table, and use the parent column to relate them correctly. Now, how would we find the path from A to I? We'd have to do a 4-way join:

SELECT node1.*, node2.*, node3.*, node4.*
  FROM nodes AS node1
  LEFT JOIN nodes AS node2 ON node1.title=node2.parent
  LEFT JOIN nodes AS node3 ON node2.title=node3.parent
  LEFT JOIN nodes AS node4 ON node3.title=node4.parent
  WHERE node1.title='A' AND node4.title='I';

Let's say you have a social media site, that has 2 million members, and a bunch of friendship links. In order to find out the friends-of-friends list, it would require a 3-way join. Databases really, really don't like it when you do too many self-joins. In addition, there is very little discovery. You can use the self joins to see if there is a 3-hop path from A to I, but is there a 4-hop path? A 2-hop path? What about a path from D to I? These variable length routes from one node to the next are particularly insidious for SQL. You are forced to write your own Graph Traversal algorithm, making sure not to get caught in loops for a fully-connected graph!
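For a sense of what "write your own Graph Traversal algorithm" involves, here is a minimal breadth-first search in JavaScript over a parent/child edge list like the adjacency list table. The edges are a hypothetical example (they include a loop from D back to A), and hopsBetween is a name invented for this sketch.

```javascript
// Breadth-first search over (parent, child) pairs, as you might load them
// from an adjacency list table. hopsBetween is a hypothetical helper.
function hopsBetween(edges, from, to) {
  // Build parent -> [children] lookups from the edge list.
  const adjacency = new Map();
  for (const [parent, child] of edges) {
    if (!adjacency.has(parent)) adjacency.set(parent, []);
    adjacency.get(parent).push(child);
  }

  const visited = new Set([from]); // guards against getting caught in loops
  let frontier = [from];
  let hops = 0;

  while (frontier.length > 0) {
    if (frontier.includes(to)) return hops;
    const next = [];
    for (const node of frontier) {
      for (const neighbor of adjacency.get(node) || []) {
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
    hops += 1;
  }
  return -1; // no path of any length
}

// Made-up edges, including a cycle A -> B -> D -> A.
const edges = [['A', 'B'], ['A', 'C'], ['B', 'D'], ['D', 'A'], ['D', 'I']];

const hopsAtoI = hopsBetween(edges, 'A', 'I');
const hopsDtoI = hopsBetween(edges, 'D', 'I');
const hopsCtoI = hopsBetween(edges, 'C', 'I');
console.log(hopsAtoI, hopsDtoI, hopsCtoI);
```

The same function answers the 2-hop, 3-hop, and 4-hop questions without a different join for each length, and the visited set is what keeps the cycle from turning into an infinite loop.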

Graph Database

Enter the graph database. Its job is to store data and provide interfaces that make dealing with graphs easier. Neo4j is a particularly good and well-supported option (with a free version), so we'll be using it for this article. Neo4j includes a graph language called Cypher, and we'll show you how to use that as well.

Graph databases are generally optimized for graph operations. That is, finding a path between two nodes, the lowest cost transition between nodes, or finding related groups of data, like friend circles within a larger social network.

In addition, graphs are great at something called inferred data. Let's take mapping again as an example. Let's say you have a coffee shop, and it has an edge connection to Seattle, as a "located_at" relationship. If you then link Seattle to Washington, and Washington to the USA, then you have inferred facts about the coffee shop being located in Seattle, in Washington, and the USA. You can also do this with Taxonomies. You might remember these from school: kingdom, phylum, class, order, family, genus, and species. Here's a pretty picture I found online:

Animal Taxonomies

In this case, you can infer that the Housecat is of the phylum Chordata. Chordates are apparently animals that have a spinal column and a tail of some sort. Notice that Primates (which is where humans would be) also fall under Chordata. You know how some people have something called a Vestigial Tail? That's what keeps us connected to this family tree.
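The coffee shop example of inferred data can be sketched in a few lines of JavaScript. This is only an illustration of following "located_at" edges transitively. The locatedAt map and the inferredLocations function are hypothetical names, and a graph database does this kind of walk for you.

```javascript
// Each key is "located_at" its value. Only the direct facts are stored.
const locatedAt = {
  'Coffee Shop': 'Seattle',
  Seattle: 'Washington',
  Washington: 'USA',
};

// Follow located_at edges until they run out, collecting every place
// the starting node is (directly or indirectly) located in.
function inferredLocations(edges, start) {
  const result = [];
  const seen = new Set([start]); // stop if the data accidentally loops
  let current = edges[start];
  while (current !== undefined && !seen.has(current)) {
    result.push(current);
    seen.add(current);
    current = edges[current];
  }
  return result;
}

const shopLocations = inferredLocations(locatedAt, 'Coffee Shop');
console.log(shopLocations); // [ 'Seattle', 'Washington', 'USA' ]
```

We only ever stored three facts, but the walk gives us "the coffee shop is in the USA" for free.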

Data and Labels

Before we get started on the cool parts, let's add a few more concepts. A node is a pretty useless construct without meaning. So, in Neo4j, you can give a node properties -- key/value pairs. You can also give a relationship properties. These let you add data like timestamps, cost estimations, geographic coordinates, and more to your data.

Given that the data is key/value pairs, you can see the immediate need to group related nodes together. That is, for our example, grouping all of the movie nodes together, or all of the user nodes. So when I talk about labels, you can sort of think of them like SQL tables, where the properties are more like columns.

Graph Example

Now we get to play with the graph. If you have a fresh neo4j instance, you'll have a screen that looks like this:

Neo4j First Screen

You'll notice they have a movie graph example as well. Theirs connects actors and directors to movies, while ours is focused on ratings.

Cypher and Graph Basics

To create a node, we will be using their Cypher language. There is a language tour on the screenshot above, but I think if you have some SQL background you can follow along:

Neo4j Stupid Movie

First, I created a movie with the CREATE keyword, and gave it a collection of properties. I also named the node 'movie' -- but that only applies for this one query. The part after that, ':Movie', means that neo4j should apply the movie label.

Next, I ran a MATCH query. I again named it 'movie', which comes in handier here. I then compare the name to the node I want (almost exactly like SQL), and then return the movie.

This is very simple, as there are no relationships. Let's create another movie and a relationship:

Neo4j Matrix

Read these from the bottom-up. First, I created an actor (notice it created the label), then queried using an alternate MATCH syntax. We have both of the nodes. Then, I match them both again, but this time create a relationship between them. Finally, I use the cypher MATCH again, but query for any relationships with the type "acted_in", where the movie is "The Matrix," and we get our actor back. Easy, right?

Our Dataset

In the data set I linked to above, there are 3 files we're using:

  • users.dat - This is a list of internal user ids, related to twitter ids
  • movies.dat - This is a list of movie IDs, titles, year of release, and genres.
  • ratings.dat - This is a list that contains user id and movie id tuples, along with a score, correlating a rating to a person and a movie.

In order to start using these effectively, we need to bring them into Neo4j. I'll leave out the gory details, but I basically just wrote a Ruby script to convert the 3 files to a file that has Cypher commands.

After that, you can import it to your neo4j app:

bin/neo4j-shell -file /path/to/cypherscript.txt

Now we can start with some fun stuff. Let's look up "The Matrix" again, this time returning the average rating:

Neo4j Matrix 2

If you check the IMDb page, the movie has an average rating of 8.7, so the ratings seem fairly representative. This is pretty easy with SQL, so let's try something a little more complex. Let's find out, if somebody rated The Matrix 8 or higher, what other movies they're likely to like. The screen shots started clipping the query, so I've included it:

MATCH (n:Movie {title: "The Matrix"})<-[r:Rated]-(u)-[or:Rated]->(m:Movie)
WHERE r.val >= 8 AND or.val >= 8
RETURN m.title, count(r) ORDER BY count(r) DESC LIMIT 10

The Matrix Recommendations

Seems like a pretty good list of movies! However, if you're trying to find movies that are like The Matrix, you probably want to restrict the genres as well. So, what genres are applied to The Matrix?

MATCH (n:Movie {title: "The Matrix"})-[:has_genre]->(g:Genre) RETURN g;

The Matrix Genres

Ok, let's find other top-rated movies that have Action, Adventure, and Sci-Fi genres:

MATCH ()-[r:Rated]->(n:Movie)-[:has_genre]->(g:Genre)
WHERE g.name = "Action" OR g.name = "Adventure" OR g.name = "Sci-Fi"
WITH avg(r.val) as rating, n.title as title, count(distinct g.name) as genres
WHERE genres = 3
RETURN title, rating

Other Movies With Genres

The "WITH" clause allows us another chance to transform results, so we can do additional filtering. In this case, we're counting the genres and then only returning results that have a count of three. This guarantees only results that contain all three genres -- but those movies can contain other genres. You'll notice there are some weird results in there, and if you search them on IMDb, you'll notice the ratings don't match up either. For a lot of these results, there are only 1 or 2 ratings. So, let's add an additional clause for at least 1000 ratings:

MATCH ()-[r:Rated]->(n:Movie)-[:has_genre]->(g:Genre)
WHERE g.name = "Action" OR g.name = "Adventure" OR g.name = "Sci-Fi"
WITH avg(r.val) as rating, n.title as title, count(r) as numratings, count(distinct g.name) as genres
WHERE genres = 3 and numratings >= 1000
RETURN title, numratings, rating

Better Results

These results are looking much more like we'd expect. Again, we show that you can also count the relationships. Because we put no restrictions on the other end of the Rated relationship, any user can match, which means all ratings for everyone.

So that's the basics of how you'd use a graph database. They're great for exploring hierarchical data, as well as finding relationships between items.

Let's take this one step further before we wrap up. We showed how to get data for one movie, but what if we want to start at a user? Let's pick one of the users who reviewed The Matrix and has some amount of reviews:

MATCH (u:User)-[:Rated]->(m:Movie {title: "The Matrix"})
MATCH (u)-[r:Rated]->(m2:Movie)
RETURN u.id, count(r) as numratings

Potential Users

Here, we find all users who have rated The Matrix, then for each user, we execute another query to see how many ratings they have. We're going to take a look at user 9520. Based on what they've liked, what other movies should they see? Let's just say 6 or above is "Liking" a movie. So, how should we do it? We take this user, find all movies they've rated 6 or higher, find out who else has rated the movie 6 or higher, then find all movies each of those people have rated 6 or higher that aren't also rated by our original user, and figure out the weight based on the number of paths to that movie. How's that for a gnarly query? As a homework exercise, you should figure out what the SQL query for this would be. It's not pretty.

MATCH (u:User {id: "9520"})-[r:Rated]->(m:Movie)<-[r2:Rated]-(u2:User)
WHERE r.val >= 6 AND r2.val >= 6
WITH u2, collect(m) AS m
MATCH (u2)-[r3:Rated]->(m3:Movie)
WHERE r3.val >= 6 AND NOT m3 IN m
RETURN count(*) AS weight, m3.title AS movie
ORDER BY weight DESC, movie

Recommendations For User

And there we have a recommendation engine in 9 lines of code.
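To appreciate what the graph database is doing for us, here is a rough in-memory version of the same idea in JavaScript. It follows the prose description above rather than translating the Cypher line by line: it counts one path per shared liked movie per other user, and it skips everything our user has already rated. The recommend function and the tiny ratings list are hypothetical and are not part of the MovieTweetings data.

```javascript
// In-memory sketch of the recommendation walk described above.
// recommend and the sample ratings are hypothetical illustrations.
function recommend(ratings, userId, minScore = 6) {
  const liked = ratings.filter((r) => r.val >= minScore);
  // Everything our user has rated at all (these are never recommended).
  const rated = new Set(
    ratings.filter((r) => r.user === userId).map((r) => r.movie)
  );
  // The subset our user actually liked.
  const likedByUser = new Set(
    liked.filter((r) => r.user === userId).map((r) => r.movie)
  );

  const weights = new Map();
  for (const shared of liked) {
    // Keep only other users' likes on movies our user also liked.
    if (shared.user === userId || !likedByUser.has(shared.movie)) continue;
    // Each such overlap is one path out to everything else that user liked.
    for (const other of liked) {
      if (other.user !== shared.user || rated.has(other.movie)) continue;
      weights.set(other.movie, (weights.get(other.movie) || 0) + 1);
    }
  }

  // Highest weight first, then alphabetical by title.
  return [...weights.entries()]
    .map(([movie, weight]) => ({ movie, weight }))
    .sort((a, b) => b.weight - a.weight || a.movie.localeCompare(b.movie));
}

// Made-up ratings: u1 is the user we want recommendations for.
const ratings = [
  { user: 'u1', movie: 'A', val: 8 },
  { user: 'u1', movie: 'B', val: 7 },
  { user: 'u1', movie: 'C', val: 3 },
  { user: 'u2', movie: 'A', val: 9 },
  { user: 'u2', movie: 'B', val: 6 },
  { user: 'u2', movie: 'D', val: 8 },
  { user: 'u2', movie: 'E', val: 7 },
  { user: 'u3', movie: 'A', val: 7 },
  { user: 'u3', movie: 'D', val: 9 },
  { user: 'u3', movie: 'C', val: 9 },
  { user: 'u4', movie: 'F', val: 10 },
];

const picks = recommend(ratings, 'u1');
console.log(picks);
```

Movie D comes out on top because three separate paths lead to it. Movie F never shows up, because u4 shares no liked movie with u1. On a real dataset the nested loops here get slow quickly, which is the part the graph database handles for you.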

The Myth of Developer Productivity

How's that for a click-bait title?

If there is a holy grail (or white whale) of the technology industry, especially from a management standpoint, it's the measurement of developer productivity. In fact, there is a very common phrase, "you can't plan if you can't measure." Measurement works so well in many other industries that involve humans -- building construction, manufacturing, road work. We are able to get rather accurate estimates for both cost and completion date, so why not software?

If you're a manager, you're going to read a lot of discouraging information here. However, if you make it to the end, I promise we'll give you tools and tips to gain efficiencies. All is not lost.

Why Measure?

We as developers love to play along with this. So much of what we work with is data-driven feedback. We can analyze with profiling, complexity, conversion rates, funnel metrics, heat maps, eye-tracking, a/b testing, fractional factorial multivariate analysis, etc. All of these things give us data upon which we can prioritize future efforts. It only makes sense that we should be able to measure ourselves.

Measuring Developers

Measuring and managing developer productivity, however, has consistently eluded us. So many of the tools we use are designed to increase developer productivity: XP, TDD, Agile, Scrum, etc. There were academic papers analyzing software project failures/overruns in the 80s. This isn't a new phenomenon by any means. We also famously hear of IT failures in the news, such as:

These are just a few cases. There are likely dozens or hundreds of errors on this scale every year, and likely hundreds to thousands of projects in the <= $1m range. A lot of this is due to a lack of good testing. We at Dev9 have frequently espoused the benefits of automated testing, and it has real benefits.

However, quite a few others are caused by planning and estimation that missed the mark. There are estimates that say IT organizations will spend over $1t per year on their IT initiatives. Notice it's trillion, not billion. A trillion dollars. Given this extremely high cost, anybody who found a way to reliably gain efficiencies of even 1% would save ten billion ($10,000,000,000) dollars. That's a lot of zeroes.

The 10x Developer

There is a theory floating around, and largely backed up by data, that the best developers among us are 10x more efficient than the worst ones. Given that developer salaries do not reflect this order-of-magnitude difference (Who is the last senior dev you knew who made $800k/yr?), it's obviously a bargain for companies if they can find one of the 10x developers and hire them at a rate comparable to a 1x or 2x person. These studies even gave birth to analysis that showed, "...[T]he top 20 percent of the people produced about 50 percent of the output (Augustine 1979)." If you were a manager looking to cut costs, you'd want to get rid of the 80% who produced only 50% of the output, and hire only the kind of people who are in that top 20%.

High Performers

However, that quote I gave you is not the full quote. It actually is, "This degree of variation isn't unique to software. A study by Norm Augustine found that in a variety of professions--writing, football, invention, police work, and other occupations--the top 20 percent of the people produced about 50 percent of the output, whether the output is touchdowns, patents, solved cases, or software (Augustine 1979)."

This problem is not a software-specific problem. Any field that requires human decision-making is subject to variation. Some people are going to be naturally talented in the field. Some have the perfect personality for the job. Some people are voracious readers, others never try to learn after school. Some consistently push their bounds, while others are content to be competent. Some people's brains just work differently. Some people's bodies just work differently. It doesn't take a genius to see that some football/soccer/hockey players are dramatically better than others, even though they all train the same amount of time. Why would software development be any different? Why should it?

Traditional Measures

Before we continue onward, let's look at some of the ways the industry has tried to quantify development activities, and why they fall short for measuring productivity. The tl;dr of this section is that any metric you come up with to measure developers will be gamed.

Hours Worked

This is one of the most obvious ones: butt-in-seat time. If you worked 10 hours instead of 8 hours, you should get 125% of the work done. That's just math. Time and time again, you'll see studies proving that this just does not work for anyone. In fact, running hot on hours is a great way to decrease productivity.

We keep seeing proof that working more than 40 hours a week leads to a drop in productivity, even for assembly line workers. Yet, this pervasive attitude of 8-6 being a minimum workday continues to chug along.

I was once on a team where the managers were so addicted to tracking hours as a measure of productivity that we started putting meetings, lunches, and bathroom breaks on the board every sprint. Otherwise, we were accused of not working hard enough because our hours didn't exactly add up to 40 or more. This absolutely destroyed the morale of the team. "Don't forget to put your hours in" causes me to involuntarily twitch.

Source Lines of Code (SLOC)

Lines of code. What a perfect measure. Even if they think different and whatnot, we can just track lines of code, and use that to extrapolate.

There are so many problems with this metric that it is actively harmful to use it to judge developers:

  • Developers can just add extra lines of code to pad their numbers
  • A 200-line solution may be faster or more performant than a 1000-line solution to a problem
  • Sometimes the solution is to delete code
  • 5000 lines of buggy code is worse than 1000 lines of bug-free code.
  • Developers copy-paste code instead of refactoring, leading to massive technical debt and poor design, as well as significantly increased bug probability.

This is an interesting metric to track in aggregate to get a sense of the size and complexity of the system, but not useful at an individual level.

Bugs Closed

This one is so crazy, Dilbert has a comic on it:


If you do this, you're the pointy-haired boss from Dilbert.

Function Points

Function points found a small following out in the world. You've probably never heard of them. They're practically impossible for a lay-person to digest. If you want to try to measure function points for your project, then give this article a read and figure out how to automate it in your project.

Go ahead, try it. I dare you.

Defect Rate

The idea of this one is to measure the number of defects each developer produces. This does seem reasonable, and you should probably track it, but here's why it's a bad measure of productivity:

  • It favors bug fixes over feature development.
  • It discourages developers from tackling larger projects. Would you rather try the "Add a form field to this existing page" project, or the "Implement a real-time log analysis system from scratch" project?
  • Not all bugs are created equal:
    • Bug 1: When somebody uses the "back" button, a bug deletes all customer data on the production website.
    • Bug 2: Form fields are not left-aligned
    • Bug 3: If a customer enters dates that span 2 leap years, the duration calculation is off by 1 second.
  • People often mistake features for bugs. Missing requirements are not a bug, but may be filed as such.
  • There may be multiple bug reports related to 1 bug.
  • Developers will never touch anybody else's code, and will get very aggressive about protecting their code.

Defect rates are interesting, but they're not enough to give you an idea of productivity.

Accuracy of Estimation

Estimation, my least favorite activity. I have no problem taking a swing at how long something will take. However, at every single company I've ever worked for, estimates become commitments. If you say "this will take about 3 days," you get in trouble if it takes longer than 3 days. On the other hand, if you finish ahead of schedule, you get praised. This encourages developers to estimate given an absolute worst-case scenario. Like, "neutrino streams from solar flares corrupting random bits on our satellite stream that somehow passed checksum validation but is still corrupted and we wrote that to our hard drive" kind of worst-case scenarios.

Other reasons this metric is a problem:

  • If you estimate in "ideal hours," distractions may turn that 8-hour task into 3 days.
  • Developers can be overly and inconsistently optimistic with their estimations.
  • The scope was not adequately defined, or not defined at all.
  • The customer was asking for something that is impossible, which could only have been discovered at coding time.

There is one more reason, bigger than those four combined. Look for the section "Developer Productivity is a Myth."

Story Points

Story points -- we thought we had found the holy grail. Story points were explained as a measure of effort and risk. If we have consistent story points, and figure out how many story points each developer finishes per sprint, then we can extrapolate developer performance. Let's see what happens:

  1. If they finished less than they did last sprint, they're chastised. They are again reminded that they committed, no matter what. Even if you had to help a prod issue, or were in a car accident, or got sick -- you committed. So developers start sandbagging to avoid this.
  2. If they finished exactly right, the managers will think the developers finished early and were sitting idle, or were padding their estimates. This leads to frustration and resentment. Alternatively, a perfect finish might be seen as a state where, if everybody worked a few more hours, we'd see more output.
  3. If they finish with more points than they took on, managers will accuse the developers of sandbagging. Then they are told that they must accept more points next sprint, to take this into account. That, or you have a "level-setting meeting" where everybody re-agrees what the points represent. This leads to frustration and resentment, not to mention the drop in productivity related to figuring out the new point system.

If a manager asks for doubled productivity, that's easy: double the story-point estimate.

Story points also aren't consistent between developers. Even if everybody agrees that it's a 3-point story, based purely on effort and risk, the wall-time delivery will be different depending on who picks it up. One developer who is intimately familiar with that code may be able to finish in 2-3 hours, while a new junior developer may struggle for 1-2 days. This is proof that we've decoupled productivity from points, and why it's a bad metric.

On the official Scrum forums, practitioners always have to explain why story points are not a measure of productivity. The Scrum Alliance even has a whitepaper called The Deadly Disease of Focus Factors, and here is the opening statement of the document:

To check your organizational health, answer these two questions:

1) Do you estimate work in “ideal” hours?

2) Do you follow up on your estimates, comparing it to how many “real” hours work it actually took to get something done?

If so, you may be in big trouble. You are exhibiting symptoms of the lethal disease of the “focus factor”. This is how the illness progresses:

Speed of development will keep dropping together with quality. Predictability will suffer. Unexpected last moment problems and delays in projects are common. Morale will deteriorate. People will do as they are told, but little more. The best people will quit. If anything gets released it is meager, boring and not meeting customer expectations. As changes in the business environment accelerate, the organization will be having trouble keeping up. Competitors will take away the market and eventually the end is unavoidable.

So even the people who invented the concept tell you explicitly not to use story points as a measure of developer productivity. So stop it.

Developer Productivity is a Myth

"You can't plan if you can't measure." This is an idea still taught in business school, it's a mantra of many managers, and it's wrong in this context. It assumes everything a developer does is objectively and consistently measurable. As we've shown above, there still doesn't exist a reliable, objective metric of developer productivity. I posit that this problem is unsolved, and will likely remain unsolved.

Just in case you think I'm spouting nonsense, just remember: the smartest minds of Microsoft, Amazon, IBM, Intel, Wall Street, the Bay Area, Seattle, New York, and London still haven't found that magical metric. It is, therefore, a rather safe assumption that the average company also hasn't found it. If you believe you have proven me (or them) wrong, go ahead and publish it. You'll be a wealthy rockstar of the programming universe. People will write books about your life and your brilliance.

We all know that some people are better than others. Developers can identify which developers are better, but there is not a number or ranking system we can come up with, objectively based on output, that consistently and reliably ranks developers. Let's explore why.

A Developer's Job

Most people don't understand what developers do. We clicky-clack on electronic typewriters while drinking Mountain Dew and eating Doritos in the dark, and make the magic blinky boxes show cute cat pictures.

OK, it's not the 90s anymore. Most people really do understand the basics of operating a computer. If you're under 40, there's a good chance your grandparents use Facebook.

So what do we do? Code is the output, but it's not really what we do. If we were just transcribing something into code, that's basically data entry. We're knowledge workers. We take inexact problems and create exact solutions. Imagine if managers were capable of exactly specifying the system they want built. They would have to explain it so finely-grained that it would be programming. That's what we do. We are people who exactly detail how a system works. Our code is the be-all, end-all specification for what the software does. We are people that write specifications, digest knowledge, and solve problems.

Most people are incapable of breaking a problem down to the level required for computer code to solve it. This isn't to say that they can't learn, but it's a skill you must nurture. Imagine a parent (P) trying to teach a kid (K) how to make a grilled cheese sandwich:

K: How do you make a grilled cheese sandwich?

P: You make a cheese sandwich, then fry it in a pan until it's done.

K: What's cheese?

P: It's a food made from milk.

K: How do they make cheese?

P: Well, they take milk, and they add rennet, then they add flavorings, and maybe age it.

K: What's rennet?

P: It's an enzyme that makes the milk solid.

K: How does it do that?

P: It is a protease enzyme that curdles the casein in milk.

K: How does a nucleophilic residue perform a nucleophilic attack to covalently link the protease to the substrate protein before releasing the first half of the product?

P: Because I said so.

Imagine the plethora of questions they can keep asking: How do you tell if it's done? What does done mean? How many minutes? What's a minute? Why is a second a second and not something else? How brown is too brown? What kind of bread do you use? How do you make bread? What is bread yeast? What's butter? What's a pan? How do you make a pan? What's a stove? Why does a stove get hot? How does a stove get hot? What happens if you don't have cheese? What happens if you don't have bread? Can you use a microwave? Can you cook it outside if it's really hot? Can you use other cheeses?

So when somebody in the business asks, "can you tell me how many people visited our site yesterday and clicked on the newsletter signup?", it sounds like a simple request. You just take all the people, find the ones who clicked the thing, and count it. But, let's take a dev perspective. How do we identify visitors? Is IP good enough? Do we support IPv6? Do we want to use cookies? Is our cookie policy legally compliant in Europe? Do we have to worry about COPPA? Do we want to de-dupe visitors? How do we track that people clicked on a link? What's the implication of click-stream tracking? Will our infrastructure support that? How important is accuracy? If we lose one click record, does that matter?

This is what developers do. For every line of code we write, we are answering all of these questions in excruciating detail.

When you hear developers talk about "abstraction," we are basically answering the "How does electricity get turned into heat?" question for anybody who asks. Then we're answering the "how does a protease enzyme curdle casein?" question. Then we're answering the "how does heat turn bread brown?" question. One of the questions we literally answer is, "How do you turn 1s and 0s into text?" Well, what about character encodings or code pages or multi-byte entities or byte-order markers or little-endianness... you get what I'm saying. A computer is a dumb machine. It can't read our minds, and has no context.

A good developer is able to take a high-level problem, see the best way to break it down, and create the correct levels of abstraction, all while keeping the code readable and maintainable for other developers. This also explains why some people are 10x performers, and some people get so frustrated with programming that they give up. Some people have cultivated, or have a natural talent for, thinking at this extreme level of detail. Some people can intuit things that others will never discover -- even if they had all the time in the world. This is the nature of knowledge work.


This one is likely to be more controversial, but the crux of this issue is that developers are treated like blue-collar workers. Because so many of our beloved processes come from the world of manufacturing, it's very easy to see why developers would be thought of like assembly line folks. That's why managers try to get consistent productivity. The idea is that if they can just find a way to measure developers, then developers will truly be interchangeable cogs: software would never be late again, it would always be on budget, and it would be exactly what we want. All of the theory they learned about manufacturing and assembly lines in business school would then apply to this field.

This attitude led to the massive amounts of off-shore outsourcing, just like manufacturing. These days, we know that offshore development is very difficult to get right, the end product often contains a lot of bugs, and is often of very poor quality. Many companies are bringing off-shore projects back in house due to these issues -- or using local consulting firms like Dev9.

What About Those Builders?

So what makes building and road work so predictable, when we can't get it right for development? The answer is relatively simple: we're not doing the same job. The labor in those fields has very little input on the decision-making processes. As we explained above, what a developer does all day is make thousands of tiny decisions. By the time these construction projects break ground, the decisions are made, the plans are already in place, there are very exact specifications, and there is little room for ambiguity or disagreement. In addition, the skills required aren't as widely variable. One person can use a pneumatic nailer about as well as any other. One person can operate a dump truck about as well as any other. And even if somebody were a 10x better paver than another, the time needed to cure is a near-constant factor. In addition, the tools and techniques are not as rapidly moving. The basics of foundations, jack studs, jamb studs, nogging, top plates, mudding, and taping really haven't changed. Governments and building codes will dictate many of the decisions, like how far apart studs are center-to-center, or how many electrical outlets go on a wall.

Rather than comparing developers to builders, ask who makes all the decisions on a building project: city planners, building code authors, architects, and engineers. They do it all while dealing with a highly bureaucratic permit system, and localities that have different rules. They make tons of decisions.


Let's do another thought experiment. If developers were truly thought of as professionals, let's see how other professions compare.


Ask a doctor what their job is. Is it talking to people? Is it writing prescriptions? Maybe it's taking inexact problems from imperfect people with imperfect information, then trying to diagnose and fix or ameliorate problems, within the constraints of cost, time, side effects, and a million other things. Sound familiar?

So how do you measure the productivity of doctors? Given their high cost, obviously the field should be rabid for productivity optimization, right? Doctors have something called RBRVU, or "Resource-Based Relative Value Units." From that article:

[...] if your organization is measuring physician productivity based on how many patients a doctor sees per day, it needs to take many relativities into consideration. If you compare a primary care physician with a small practice to an ED physician, you are unlikely to see a day when the PCP sees more patients than the bustling ED physician – but is that really a fair and accurate measure of productivity? However, within your organization, if you stack doctors up against those in like-practice, thinking that you can judge productivity on numbers alone, you run into the trap of complexity of care – even within the same speciality, practices may be saddled with patients in varying degrees of medical complexity – and even that will change over time within the same patient!

This seems rather familiar.


Ok, let's try lawyers. Is their job reading briefs? Is it writing them? Is it consulting with people? Or is it doing all of that, while interpreting imperfect laws with imperfect information based on second- and third-hand reports of a situation, while absorbing all of the decisions of the past?

We all are pretty familiar with the traditional method of measuring productivity of lawyers: their billable hour counts. Even there, people are discounting that metric. The only goal of billable hours is higher partner profits. From that article:

The relevant output for an attorney shouldn't be total hours spent on tasks, but rather useful work product that meets client needs. Total elapsed time without regard to the quality of the result reveals nothing about a worker's value. More hours devoted to a task can often lead to the opposite of true productivity. Common sense says that the fourteenth hour of work can't be as valuable as the sixth. Fatigue compromises effectiveness. That's why the Department of Transportation imposes rest periods after interstate truckers' prolonged stints behind the wheel. Logic should dictate that absurdly high billable hours result in compensation penalties.

Hey, there's something interesting. "Useful work product that meets client needs." How does Scrum define success? Value delivered to the business. It says nothing of how you determine that value. There are too many factors. It may even be impossible to directly correlate revenue to features. Therefore, the only measure of success in Scrum is that the product owner is happy.


So those two fields, often considered where the best and brightest go, have found that hours and other obvious metrics aren't useful to measure productivity. So, why aren't developers treated the same way? Why do we keep being excluded from the "Professional" list?

I'm not suggesting any solution here. I just don't have one. However, it helps explain things like calling developers resources. From that article:

Does George Steinbrenner schedule a “short stop resource” or does he get Derek Jeter?

Do the Yankees want homerun hitting A-Rod or a mere “3rd baseman resource”?

Did the Chicago Bulls staff a “shooting guard resource” or did they need Michael Jordan?

Did Apple do well when it had a CEO “resource” or did they achieve the incredible after Steve Jobs came back to lead the company?

Thoughtworkers and creative types are no different. Software engineers are simultaneously creative and logical, and there is an order of magnitude difference between the best and worst programmers (go read Peopleware if you don’t believe this). Because of this difference, estimates have to change based on the “resource,” which means we’re not interchangeable cogs after all.

You Promised Me Tools!

So let's assume that measuring -- or more importantly, optimizing -- productivity is nearly impossible. How do you keep your team happy and still satisfy the business need for efficient use of capital? Well, what do these other professionals do? Instead of trying to directly measure productivity, they measure anything that impedes productivity.

Measuring Impediments

This is an easy one. Every time something impedes progress, make a note of what it is, and how long it took to resolve. This is especially good to do for any external dependencies. Any time the work leaves the direct, in-progress control of the developer, track who it goes to, and how long they have it.

You can then use this information to talk with the external groups. For example, if the IT folks are taking 2 weeks to turn around a virtual machine, that's a discussion the Dev manager can have with the IT manager. If you have a policy of mandatory code reviews, then track that time. Maybe people are letting those sit around for 3 days, and the manager can set priorities. Maybe there are competing priorities. Either way, the dev manager can show THEIR boss why work items are taking longer than they need to.

Time Before Delivery

This is another interesting metric. Track how long it takes from the point the business requests a work item, to when it's available for use in production. Over time, this metric will stabilize. If the units of work are somewhat consistently sized, predictability will be gained.

Time In Progress

This one tracks the total amount of wall time taken from when work starts on an item, to when it's delivered. Again, if the units of work are similarly sized, predictability will be gained here.

Time In Phase

This one tracks the wall time in each phase. Remember how I told you to track external organizations? You should be tracking every phase. The design phase, the dev phase, the QA phase, the code review phase, even the deployment phase. By having every phase tracked, you can identify the slower phases, and see if there is any room for optimization.
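To make the three timing metrics above concrete, here is a minimal sketch in Node. It assumes each work item carries a history of the phases it entered and when. That record shape, the phase names, and the function names are all hypothetical, made up for illustration; your tracker will export something different.

```javascript
// Hypothetical history for one work item: the phase it entered and when (ISO 8601, UTC).
const sampleHistory = [
  { phase: 'Backlog',     enteredAt: '2015-03-02T09:00:00Z' },
  { phase: 'Dev',         enteredAt: '2015-03-04T09:00:00Z' },
  { phase: 'Code Review', enteredAt: '2015-03-05T09:00:00Z' },
  { phase: 'QA',          enteredAt: '2015-03-08T09:00:00Z' },
  { phase: 'Done',        enteredAt: '2015-03-09T09:00:00Z' }
];

const MS_PER_HOUR = 60 * 60 * 1000;

// Wall-clock hours between two ISO timestamps.
function hoursBetween(start, end) {
  return (Date.parse(end) - Date.parse(start)) / MS_PER_HOUR;
}

// Time In Phase: hours spent in each phase.
// The final phase has no exit time, so it gets no entry.
function hoursInPhase(history) {
  const totals = {};
  for (let i = 0; i < history.length - 1; i++) {
    const phase = history[i].phase;
    const hours = hoursBetween(history[i].enteredAt, history[i + 1].enteredAt);
    totals[phase] = (totals[phase] || 0) + hours; // an item may revisit a phase
  }
  return totals;
}

// Time Before Delivery: from the request (first entry) to delivery (last entry).
function timeBeforeDelivery(history) {
  return hoursBetween(history[0].enteredAt, history[history.length - 1].enteredAt);
}

// Time In Progress: from the first entry into the phase where work starts, to delivery.
function timeInProgress(history, startPhase) {
  const start = history.find(function (entry) {
    return entry.phase === startPhase;
  });
  if (!start) return null; // work never started
  return hoursBetween(start.enteredAt, history[history.length - 1].enteredAt);
}

console.log(hoursInPhase(sampleHistory));          // { Backlog: 48, Dev: 24, 'Code Review': 72, QA: 24 }
console.log(timeBeforeDelivery(sampleHistory));    // 168
console.log(timeInProgress(sampleHistory, 'Dev')); // 120
```

In this made-up item, code review is the slowest phase at 72 hours, three times as long as the development itself. That is the kind of thing a manager can actually act on.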

Flow Control

Just like working more than 40 hours leads to less productivity, so does working on too much at once. There's a rule of optimization that a process can only go as fast as its slowest stage. The way to get more done is to remove bottlenecks.

If the QA team is only able to test 4 stories weekly, but developers are finishing 10 stories per week, then only 4 features per week are going to be released. Speeding up the developers will have no effect on the number of features delivered per week. You have to get the QA team to get more throughput. If the managers didn't know the QA team was the bottleneck before, it's impossible to ignore the pile of work that's growing in their phase.

To this end, it makes sense that instead of developers taking a bunch of items on at once, they should focus on one item, and drive it to completion. In addition, there should be some limit of total features being worked on at one time. Work that's being done beyond what the QA team can handle is wasted work. If your developers can help resolve the roadblock in the QA queue, that's going to deliver more value to the customer than working on features. And if we forgot, value is the true output we're trying to deliver.
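The bottleneck arithmetic is simple enough to sketch. This is only an illustration using the Dev and QA numbers from the example above; the `perWeek` field and the function names are hypothetical.

```javascript
// Delivered throughput of a pipeline is capped by its slowest stage.
// Expects a non-empty array of { name, perWeek } in pipeline order.
function findBottleneck(stages) {
  return stages.reduce(function (slowest, stage) {
    return stage.perWeek < slowest.perWeek ? stage : slowest;
  });
}

// Stories piling up each week in front of a stage that receives more than it can finish.
function weeklyQueueGrowth(arrivingPerWeek, capacityPerWeek) {
  return Math.max(0, arrivingPerWeek - capacityPerWeek);
}

const pipeline = [
  { name: 'Dev', perWeek: 10 },
  { name: 'QA',  perWeek: 4 }
];

const bottleneck = findBottleneck(pipeline);
console.log(bottleneck.name + ' limits delivery to ' + bottleneck.perWeek + ' stories per week');
console.log('The QA queue grows by ' + weeklyQueueGrowth(10, 4) + ' stories per week');

// Doubling developer output changes nothing downstream: delivery is still capped at 4.
const fasterDevs = [
  { name: 'Dev', perWeek: 20 },
  { name: 'QA',  perWeek: 4 }
];
console.log(findBottleneck(fasterDevs).perWeek);
```

With faster developers, the only number that changes is the size of the pile in front of QA, which now grows by 16 stories a week instead of 6.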

Wait a Minute...

If you think this all sounds a little familiar, it should. It's the basics of Kanban. It again comes from the manufacturing world, but the focus is on a continuous delivery of value to the customer, with a minimum of wasted work.

We have plenty of articles on the Dev9 blog about Kanban, so I won't go into too much detail here. The basics of Kanban:

  1. Map your value stream. This means separate stages for any handoff point. This also should include any external factors that might impede progress. Then you track the time a story spends in each phase, as described above.
  2. Define the start and end points for the Kanban system. Some teams find it valuable to have To-Do, Doing, and Done. Some teams have Backlog, Design, Dev, Code Review, QA, Release, and Done. It's up to you. Anywhere there's a political or team boundary is a perfect place to have a new phase.
  3. Limit WIP (Work In Progress). As we explained above, increasing productivity of developers without clearing the downstream bottleneck results in wasted work, and no additional value delivered to the customer. The team should agree on WIP limits, and situations which might allow for breaking those limits.
  4. Service Classes. We know that some production issues will have to take priority. You can have different classes of service (e.g. "standard", "expedite", "fixed delivery date").
  5. Adjust Empirically. Given the data you're tracking above, you can find bottlenecks and inefficiencies, and work to resolve them.
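As a rough sketch of how WIP limits and service classes from the list above might interact, here is a tiny in-memory board. Everything in it is hypothetical: the column names, the limit of 2, and especially the rule that an "expedite" item may break a limit, which is one possible team agreement rather than anything Kanban prescribes.

```javascript
// Build a board from a map of column name -> WIP limit (use Infinity for "no limit").
function createBoard(limits) {
  const columns = {};
  Object.keys(limits).forEach(function (name) {
    columns[name] = [];
  });
  return { limits: limits, columns: columns };
}

// A column accepts an item while it is under its limit.
// Assumption: this team agreed that "expedite" items may break the limit.
function canEnter(board, column, item) {
  if (item.serviceClass === 'expedite') return true;
  return board.columns[column].length < board.limits[column];
}

// Move an item between columns; pass null as `from` for a brand new item.
// Returns false, and changes nothing, if the target column is full.
function moveItem(board, item, from, to) {
  if (!canEnter(board, to, item)) return false;
  if (from !== null) {
    const index = board.columns[from].indexOf(item);
    if (index !== -1) board.columns[from].splice(index, 1);
  }
  board.columns[to].push(item);
  return true;
}

const board = createBoard({ 'To-Do': Infinity, 'Doing': 2, 'Done': Infinity });
const stories = [
  { id: 'A', serviceClass: 'standard' },
  { id: 'B', serviceClass: 'standard' },
  { id: 'C', serviceClass: 'standard' }
];
stories.forEach(function (story) {
  moveItem(board, story, null, 'To-Do');
});

console.log(moveItem(board, stories[0], 'To-Do', 'Doing')); // true
console.log(moveItem(board, stories[1], 'To-Do', 'Doing')); // true
console.log(moveItem(board, stories[2], 'To-Do', 'Doing')); // false: Doing is at its limit of 2

const outage = { id: 'P1', serviceClass: 'expedite' };
console.log(moveItem(board, outage, null, 'Doing'));        // true: expedite breaks the limit
```

The refused move is the point. When "Doing" is full, the useful thing for a developer to do is help finish something already in progress, not start story C.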

This is the current best solution we've found. Instead of trying to directly measure programmer productivity, which we showed above is practically impossible, focus on measuring anything that impedes their progress, or the progress of delivering value to the customer.


Finally, a little note for you, which is often the antithesis to empirical measurement: trust your gut. Even though you can't just put numbers on it, most developers find it easy to spot good and bad developers. There's just something telling you that they're better. It could be the way they talk about their technology, the thought they put into an answer, or the answer itself. Most developers would sacrifice project and pay to work with a former favorite co-worker. Managers, if you have a developer you like and trust, then trust their input on their coworkers.

In addition, even though they may not be developers, managers often already know who their best and worst performers are. There's usually one or two standout people, even in a team of already-amazing people. If you have all of your developers stack rank each other, it's likely the top performers and the worst performers would be quite consistent. This doesn't fix the issue of finding or hiring developers. The troubles of interviewing could be the subject of an article even longer than this one.

Running Continuous Integration on a Shoestring with Docker and Fig

By: Jason Marshall

One of the things I love about Continuous Delivery (CD) is the "Show, don't Tell" aspect of the process. While we can often convince a customer or coworker what's the 'right thing to do', some people are harder to sell, and nothing beats a demonstration.

The downside of Continuous Delivery is that, on the face of it, we use a lot of hardware. To someone who doesn't understand the system, it looks like multiple copies of multiple servers all doing nominally the same thing. Cloud services are great for proving out the system due to the low monthly outlay, but not all organizations allow them. Maybe it's a billing issue, or concern about your source getting stolen, or in an older company it may be a longstanding IT policy. If a manager believes in the system, they may be willing to stick their neck out and get paperwork signed or policies changed. But how do you get them on board in the first place? This chicken and egg problem has been bothering me for a while now, and Docker helps a lot with this situation.

Jenkins in a Box

The thing I wanted to know was: "could I get a CI server and all of its dependencies into a set of Docker containers?" It turns out not only is the answer 'yes', but most of the work has already been done for us. You just have to wire the right things together.

Why start here?

The Big Ask for hardware starts with the build environment.

Continuous Delivery didn't always exist as a term. Before that it was just a concept. You start with a repeatable build. You automate compiling the code. You automate testing the code. You set up a build server so you know if it's safe to pull down trunk/master in the morning. You start enforcing clean builds of trunk/master. You automate packaging the code. Then you automate archiving the packages. One day you wake up and realize you have a self service system where QA can pull new versions onto their test systems and from there it's a short leap to capturing configuration and doing the same thing in staging and production.

But halfway through this process, you needed to do UI testing. For web apps that means Selenium. PhantomJS is a good starting point, but there are many things that only break on Firefox, or Chrome. Running a browser in a VM without a video card takes some special knowledge that not everybody has. And when the tests break you can't always reproduce them locally. Sooner or later you need to watch the build server run the tests to get a clue why things aren't working. Nothing substitutes for pixels. Saucelabs can solve this for you but we're trying to start small.

The Plan

Most of what you need is out there; we just have to stitch it together. The Jenkins team maintains Docker images. SeleniumHQ has their own as well, which can run Firefox and Chrome in a headless environment. They also have 'debug' builds that support VNC connections, which we'll be using. What we need is a Fig script to connect them to each other, and the Jenkins slaves need our development toolchain.

We need:

  1. A Jenkins instance
  2. A Selenium Grid (hub) to dole out browsers
  3. Selenium 'nodes' which can run browsers
  4. A Jenkins slave that can see the Selenium Grid
  5. SSH Certs on the slave so that Jenkins can talk to it


Rather than modifying the Jenkins image, I opted to build a custom Jenkins slave. Personally, I prefer not to run slaves on the Jenkins box. First, the hardware budget for the two is very different. Slaves are IO, memory, and CPU bound. The filesystem can be deleted between builds with few repercussions. The Jenkins server is a different beast. It needs to be backed up, it uses a lot of disk space for artifacts (build statistics and test reports, even if you store your binaries in a system of record), and it needs some bandwidth. There are many ways for a bad build to take out the entire server, and I would rather not even have to worry about it.

Also, it's probable that you already have a Jenkins server, and it's easy enough to tweak this demo code to work with your existing server without impacting your current operations.

Fig to the rescue

Fig is a great Docker tool for wiring up a bunch of services to each other. Since I know a lot of people who like to poke at the build environment, I opted to write a Fig file where all of the ports are wired to fixed port numbers on the host operating system.
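As a rough illustration of that kind of wiring (this is not the Fig file from the demo itself), the shell snippet below writes a minimal fig.yml with every service mapped to a fixed host port. The image names (`jenkins`, `selenium/hub`, `selenium/node-firefox-debug`, `selenium/node-chrome-debug`), the `./slave` build directory, the service names, and the host port numbers are all assumptions picked for the sketch; swap in whatever you actually use.

```shell
#!/bin/sh
# Sketch only: writes a Fig file that wires Jenkins, a build slave, a
# Selenium hub, and two debug (VNC-enabled) browser nodes together.
# Image names, the ./slave directory, and host ports are assumptions.
FIG_FILE="${FIG_FILE:-fig.yml}"

# Unquoted heredoc so ${HOME} expands to an absolute host path.
cat > "$FIG_FILE" <<EOF
jenkins:
  image: jenkins
  ports:
    - "8080:8080"
  volumes:
    - ${HOME}/jenkins_home:/var/jenkins_home
  links:
    - slave

slave:
  build: ./slave
  ports:
    - "2222:22"
  links:
    - hub

hub:
  image: selenium/hub
  ports:
    - "4444:4444"

firefox:
  image: selenium/node-firefox-debug
  ports:
    - "5950:5900"
  links:
    - hub

chrome:
  image: selenium/node-chrome-debug
  ports:
    - "5951:5900"
  links:
    - hub
EOF

echo "Wrote $FIG_FILE"
```

Fixed host ports make it easy to point a browser or a VNC client at a known address, at the cost of only being able to run one copy of each service per host.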

You'll need to install Fig, of course (it's not part of the Docker install, or at least not yet). You'll also need to create a ~/jenkins_home directory, which will contain all of the configuration for Jenkins, generate an SSH key for Jenkins, and copy its public half into authorized_keys for the slave. Then you can just type in two magic little words:

fig up

And after a few minutes of downloading and building images, you'll have a Jenkins environment running in a box.
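The one-time preparation that has to happen before the first `fig up` might look roughly like the sketch below. Only the ~/jenkins_home directory comes from the steps above; the key location inside it, the `./slave` directory, and the variable names are assumptions for illustration, so adjust them to match where your slave image expects its authorized_keys file.

```shell
#!/bin/sh
# Sketch of the one-time preparation; paths other than ~/jenkins_home are
# assumptions, not taken from the demo itself.
JENKINS_HOME_DIR="${JENKINS_HOME_DIR:-$HOME/jenkins_home}"
KEY_FILE="$JENKINS_HOME_DIR/.ssh/id_rsa"
SLAVE_DIR="${SLAVE_DIR:-./slave}"   # hypothetical build directory for the slave image

# Directory that will hold all of the Jenkins configuration.
mkdir -p "$JENKINS_HOME_DIR/.ssh" "$SLAVE_DIR"
chmod 700 "$JENKINS_HOME_DIR/.ssh"

# Generate a passphrase-less key pair for Jenkins, unless one already exists.
if [ ! -f "$KEY_FILE" ]; then
  ssh-keygen -q -t rsa -b 4096 -N "" -C "jenkins" -f "$KEY_FILE"
fi

# Authorize the public key on the slave (skip it if it is already listed).
touch "$SLAVE_DIR/authorized_keys"
if ! grep -qxF "$(cat "$KEY_FILE.pub")" "$SLAVE_DIR/authorized_keys"; then
  cat "$KEY_FILE.pub" >> "$SLAVE_DIR/authorized_keys"
fi
```

The existence checks make the script safe to run more than once: it will not overwrite an existing key or append the same public key twice.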

You'll have the following running (substitute if you're running boot2docker)

  1. Jenkins on
  2. A Jenkins slave listening for SSH connections on
  3. A virtual desktop running Firefox tests listening on
  4. A virtual desktop running Chrome tests listening on
  5. Selenium hub listening on port 4444 (behaving similarly to selenium-standalone)

Further Improvements

If that's not already cool enough for you, there are some more steps I'll leave as an exercise for the reader.

Go smaller: Single node

On small projects, it's not uncommon to run the integration tests sequentially, with a single browser open at a time, to avoid concurrent modification issues that result in false build failures.

I did an experiment where I took the SeleniumHQ Chrome debug image, dropped Firefox on it as well, and changed the configuration to offer both browsers. I run this version in compact.yml instead of the two nodes run in the normal example. This means only one copy of X11 and Xvfb is running, and you only need one VNC session to see everything. The trouble with this is ongoing maintenance. I've done my best to create the minimum configuration possible, but it's always possible that a new SeleniumHQ release won't be compatible. For this reason I'd say this should only be used for Phase 1 of a project, and eliminating this custom image should be a priority.

fig --file=compact.yml build
fig --file=compact.yml up

This version of the system peaked at a little under 4 GB of RAM. With developer-grade machines frequently having 16 GB of RAM or more, this becomes something you could actually run on someone's desktop for a while. Or you could split it and run it on two machines.

Go bigger: Parallel tests

One of the big reasons people run Selenium Grid is to run tests in parallel. One cool thing you can do with Fig is tell it "I want you to run 4 copies of this image" by using the fig scale command, and it will spool them up. The tradeoff is that at present it doesn't have a way to deal with fixed port numbers (there's no support for port ranges), so you have to take out the port mappings (e.g., "5950:5900" becomes "5900"). The consequence is that every time you restart Fig, the ports tend to change. But watching a parallel test run over VNC would be challenging to say the least, in which case you might opt to not run VNC at all. In that case you can save some resources by using the non-debug images.
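The port-mapping change can be done by hand, but a small helper makes it repeatable. The sketch below assumes mappings are written as double-quoted "host:container" strings, like the "5950:5900" example; the function name `strip_host_port` is made up for illustration.

```shell
#!/bin/sh
# Sketch: drop the fixed host port from mappings to a given container port,
# e.g. "5950:5900" -> "5900", so scaled copies do not collide on the host.
# Assumes mappings are written as double-quoted "host:container" strings.
strip_host_port() {
  container_port="$1"
  shift
  # Reads the named file(s), or standard input when none are given.
  sed -E "s/\"[0-9]+:${container_port}\"/\"${container_port}\"/" "$@"
}

# Example: the VNC mapping loses its host port, other mappings are untouched.
printf '%s\n' '    - "5950:5900"' '    - "4444:4444"' | strip_host_port 5900
```

With the host ports removed, something like `fig scale chrome=4` (where `chrome` is a hypothetical service name) starts the extra copies, and `docker ps` shows which host port each container was given.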

Examples & Further Reading