Wednesday, April 8, 2015

More Graceful Deployments with CodeDeploy

UPDATE: As user /u/Andy_Troutman on Reddit kindly pointed out to me, the AWS team has already considered functionality very similar to this. They even wrote a similar script (long before I wrote this one). If you want the more "official" version, here it is: https://github.com/awslabs/aws-codedeploy-samples/tree/master/load-balancing/elb

If you haven't used AWS CodeDeploy before, it's a new service aimed at automating deployments across a fleet of EC2 instances. It works using an agent on the instance that polls AWS for new changes to application code. When a change is detected (you or a CI tool triggered a deployment), the instance downloads the new code and runs a series of steps you define in a YAML file. These steps can include installing dependencies, validating that the service is running, and pretty much anything you can fit in a script. CodeDeploy can be configured to deploy the code to all instances at once, one at a time, or in percentage groups.

Ideally, a new deployment revision should not cause any loss of traffic. However, once the application is installed (the code is unzipped and copied to the correct location), most services must be restarted for the changes to take effect. Personally, I use a simple Node.js process running with "forever" for 90% of my projects. When CodeDeploy finishes installing the code, I have to run "forever stop" and "forever start" for the changes to be applied. This takes about 500 to 4000 milliseconds depending on how large the application is and whether it has to make database connections or perform other startup procedures. During this time, traffic is obviously rejected and the load balancer returns a "503 - Backend Service at Capacity" error to the client.

Although ELBs do have health check options (and theoretically, you could set it to have a 1 second ping), the application restart still causes a instantaneous cut-off of all connections, followed by a failure of the health check. Until the check fails, the ELB is still sending traffic, which could amount to hundreds of connections for a highly trafficked service.

The solution to this issue is to tell the load balancer to stop sending the instance traffic and then wait for the existing connections to drain before restarting the application. Honestly, CodeDeploy could easily implement a simple option for "remove instance from the load balancer when deploying," but I've decided to recreate the effect using the start, stop, and validate scripts used by the agent.

Overview

In the next several steps, I'm going to script out the process of: 1) De-registering the instance from the load balancer 2) performing the application restart, and 3) Re-registering it once the health check passes.

Instance Preparation

Your instances must have the AWS command line tools installed. This can be done on the AMI (recommended) or during the bootstrap process. Additionally, you could even add it as a "BeforeInstall" hook (just know that it will run the install command before every deployment until it's removed).

Additionally, the instances are going to need to be able to make the necessary API calls to AWS to register and deregister themselves from the load balancer. I've allowed this using IAM roles (you are using IAM roles, right?) in CloudFormation below:

{
"Effect" : "Allow",
"Action" : [
"elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
"elasticloadbalancing:RegisterInstancesWithLoadBalancer"
],
"Resource" : [
{
"Fn::Join": [
"",
[
"arn:aws:elasticloadbalancing:",
{ "Ref" : "AWS::Region" },
":",
{ "Ref":"AWS::AccountId" },
":loadbalancer/your-elb-name"
]
]
}
]
}

You can also modify the instance's IAM role directly from the console and use the policy generator to give it the same permissions.

ELB Preparation

You should also enable connection draining on your ELB and set the time to whatever is appropriate for your application (if you're just serving webpages, 30 seconds is probably fine; if users are uploading files to your service, you may want to increase it).

CodeDeploy Files

Now that your instances have the correct permissions, you can include the code in your scripts to gracefully remove them from the ELB before running the application restart. Your scripts may differ considerably, but I have the following appspec.yml file:

version: 0.0
os: linux
files:
  - source: /
    destination: /path/to/install/location
hooks:
  AfterInstall:
    - location: deployment/stop.sh
      runas: user
  ApplicationStart:
    - location: deployment/start.sh
      runas: user
  ValidateService:
    - location: deployment/validate.sh
      runas: user

When a deployment is triggered, CodeDeploy runs the "ApplicationStop" script, downloads your artifact, runs the "BeforeInstall" script, copies the files to the correct location, runs the "AfterInstall" script, then the "ApplicationStart" script, and then finally the "ValidateService" script. As you can see, they are not all required, and I have not made use of every one.

Once the artifact is downloaded and unzipped, the "AfterInstall" script is run, which I've configured to remove the instance from the ELB, wait for the connections to drain, then stop my application:

#!/bin/bash

# Get the instance-id from AWS
INSTANCEID=$(curl http://169.254.169.254/latest/meta-data/instance-id)

# Remove the instance from the load balancer
aws elb deregister-instances-from-load-balancer --load-balancer-name elb-name --instances $INSTANCEID --region us-east-1

# Let connections drain for 30 seconds (replace with your drain time)
sleep 30

# Now stop the server
forever stop /path/to/process.js

At this point, the instance is successfully removed from the ELB, connections have been drained, and you can do whatever is needed to restart your app without worrying about loosing requests. My start.sh script restarts the server:

#!/bin/bash
forever start path/to/process.js

Finally, you should add validation to ensure your app is actually running before you re-attach the instance to the ELB. I've done this in the validate.sh script:

#!/bin/bash

# Wait for however long the service takes to be responsive
sleep 10;

res=`curl -s -I localhost/ping | grep HTTP/1.1 | awk {'print $2'}`
echo $res;

if [ $res -eq 200 ]
then
# Get the instance-id from AWS
INSTANCEID=$(curl http://169.254.169.254/latest/meta-data/instance-id)

# Add the instance back to the ELB
aws elb register-instances-with-load-balancer --load-balancer-name elb-name --instances $INSTANCEID --region us-east-1

# Wait for the instance to be detected by the ELB (set this to the health check interval)
sleep 10
exit 0;
else
exit 1;
fi

If everything is successful, CodeDeploy will complete this step and move to the next instance (assuming you're deploying one at a time). If not, the deployment will fail but the instance will remain removed from the ELB. You can either re-trigger a deployment with a fix or rollback to a previously working one.

Additional Thoughts

Depending on the size of your application, this may not fully replace proper A-B stack deployments that includes switching DNS. If you only have a few servers, taking one offline will increase the load on the others substantially. Finally, these steps will increase the time of your deployments by 30 seconds to a few minutes per server. If you have 100 servers, consider using the "percentage at a time" deployment method, but balance this will the increased load on the remaining servers.