Knowing whether your website is up or down is important. Having recently expanded to running a few websites, instead of just this blog, I wanted a way of having confidence everything was working! Nothing complicated - just an HTTP request and confirmation that a 2xx response is received.

Previously, my experience had been setting up monitoring on a larger scale - decent-sized VMs running Icinga2, checking tens to hundreds of sites/services. This time I was faced with the challenge of monitoring a small number of websites at the minimum possible cost.

I looked at the "site ping" tooling market (is that a market? I'm not sure, there's a few options...) and everything was a bit more expensive than I'd hoped - around £8 per month to check the sites seemed excessive. Doing it myself seemed like it had to be cheaper!

Balancing Stability and Cost

With only a few site checks to do, I wanted to have as little additional cost as possible, whilst providing peace of mind that my own, and the other websites were working. This blog and my other sites are run on a simple Rancher cluster, so a monitoring container was an option, but I quickly identified some problems with this approach:

  • Sacrifice capacity for visibility - Setting up a container that was only needed once every n minutes to check the sites were up seemed a bit overkill; it'd be sat there using resources (admittedly not a great deal, but still, it's a waste!)
  • Don't monitor what you're monitoring from the same box - It's generally not a good idea to monitor something from the same box it's on. Container death/code issues would have been detected, but if the Rancher cluster dies, or the VM itself, there'd be no alert. Not so good.

Based on this I wrote off running a container in the same cluster as the sites themselves. Standing up a separate VM was an option, and cheaper than the commercially available options, but it was still more than I wanted to pay.

100% of the cost for 1% of the time

What got me thinking was the fact that for 4 minutes, 57 seconds in every 5 minutes, the server would be doing nothing. I don't want to have to pay for it to do nothing, but there's no point standing up/tearing down VMs every 5 minutes...


AWS Lambda

So I had a look at AWS Lambda, part of what's termed "Serverless" (or "Jeff") in the industry at large - I'm not going to get into why I think it's a bad name; plenty of people have covered the topic.

In this case, Lambda seemed like a really good fit: I don't need the servers all the time, Lambda is available on AWS's free tier, and it supports NodeJS 4.3. It also gave me a location outside of Digital Ocean, and the potential to ping my sites from multiple locations (e.g. Ireland, Frankfurt, Singapore). I also found this blog post which confirmed I was on the right track, but lacked code examples.

Getting Started

As I was feeling my way around, I started off using the AWS console for everything. I had an account already as I use Route53 to manage my Rancher DNS, but if you don't have one, you'll need a credit card to get set up.

Before we begin

I'll focus on the Lambda setup below. If you want to set up lambda-overwatch, you'll need the following SNS topics and roles, so you may want to create these first before delving into the code.

Lambda Location

Once you have your account and the prerequisites, head straight over to the Lambda section in the Compute area. Select "Create function", and it'll take you to a wizard to create your first function.

Function No.1: Make Request

The first function we're going to need will actually make our request. The Lambda Create Function wizard will first ask if we want to use a boilerplate function. In this case, just skip straight past this stage (bottom of the screen).

The Trigger

Next it will ask for our trigger. We want it to execute every n minutes, so we need a "Cloudwatch Events - Schedule" trigger. Once the trigger type is selected you can set the name/description and the all-important expression. The expression can either be a rate expression or a CRON specification. In my case, rate(5 minutes) was all I needed.
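For reference, the same 5-minute schedule can be written either way. AWS cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), and one of the two day fields must be a `?`:

```
rate(5 minutes)
cron(0/5 * * * ? *)
```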

Create scheduled trigger

The Code

The final screen then requests the details of the lambda function itself. You can select to enter the code inline, upload a zip file, or retrieve the zip file from S3 storage. To start off with I just pasted the code into the editor window - that gets tedious pretty quickly, but to get up and running it's fine!

The code I wrote is very simple: it makes an HTTP request to a URL specified in the incoming event. Once the response comes back, we parse it and emit an event to an SNS topic. We also handle timeouts, again emitting an event on to an SNS topic for other functions to process.

This level of composability with Lambda is actually really nice, and I could quickly see different ways in which small dedicated functions could be arranged, with SNS/SQS gluing them together.

To use my code, you'll need to replace %RESULT_SNS_TOPIC_ARN% with your SNS ARN - you can find the code hosted on GitHub.

Setting the URL

I mentioned above that the incoming event will include the URL, but by default the Cloudwatch Scheduled Event trigger just includes a bunch of metadata and some scheduling details - unsurprisingly, nothing about my URLs. Fortunately we can override the event a trigger supplies to Lambda with some arbitrary JSON, e.g.: { "url": "" }.

All you need to do is go to the "Triggers" tab, select the trigger we created earlier and in the top right click "Actions -> Edit".

Configure Trigger

On the right you should see a Configure input section as above; select "Constant (JSON text)" and you can override the entire event JSON as required. In theory we could add additional properties, allowing for dynamic timeouts and the like (be careful of the default max execution time on a Lambda function though).
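For example, the constant JSON could carry extra properties alongside the URL - the timeout property here is hypothetical, and the function would need to be extended to read it:

```
{
  "url": "https://www.example.com",
  "timeout": 10000
}
```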

Once the target is configured correctly, you should be able to set it going, but we won't hear anything to begin with...

Simple Notification

We check an HTTP endpoint every 5 minutes, but we don't do anything with the result! We could stick ever more logic into make-request, so that if the check fails it publishes directly to the failure SNS topic, but I preferred to separate the two.

Lambda has a maximum execution time of 5 minutes, and a default cut-off on functions of 3 seconds. These fairly aggressive timeouts put you against the clock - but I actually found that they made me think about keeping each function doing as little as possible.

Should this make a request AND decide what to do? Better to get the execution of the request done, publish it somewhere safe, like an SNS topic, and then let another dedicated function pick it up.

Function No.2: Process Result

We create our function the same as the last one, but this time the trigger is going to be the SNS topic we published to last time. Fortunately this is just a matter of selecting SNS Topic from the trigger list and entering its ARN (if using the "Before we begin" naming - resultComplete).

SNS Trigger

Once that's done, we paste in our process result function code. This time we're making decisions based on the status and publishing to ANOTHER SNS topic on completion/failure. Why separate ones? Your intended behaviour may be different, but I want to get an email when a site is down, and I'd like to see in Slack whenever a check has been carried out.
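The core of such a function is just unwrapping the SNS envelope and routing. SNS really does deliver the payload JSON-encoded in Records[].Sns.Message, but the routing function and the "siteFailure" topic name below are assumptions for illustration (the doc only names completedCheck and resultComplete):

```javascript
// SNS invokes Lambda with the payload JSON-encoded in Records[].Sns.Message
function parseSnsMessage(event) {
  return JSON.parse(event.Records[0].Sns.Message);
}

// Sketch of the routing decision (names hypothetical): every check goes to
// the Slack/"completed" topic; failures also go to the email topic.
function routeResult(result) {
  const topics = ["completedCheck"];
  if (!result.healthy) {
    topics.push("siteFailure"); // hypothetical failure topic name
  }
  return topics;
}
```

The real function would then call sns.publish() once per topic returned.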

Once we have that code in place, we can subscribe to our SNS via email and get a lovely formatted email when a site check fails:

Site Fail Email

Clearly it's not going to win any design awards, but it does the job! If a prettier email is required, we could always hand it off to another function that formats it and uses SES to send the email.

Fancy Slack Notification

An email when something goes wrong is great, but it's nice to have the reassurance that everything is working OK (especially when I was starting off and there were many more issues with the Lambda functions - did the site fail, or did the Lambda function exit with an error?).

Slack Result

Function No.3: Format for Slack Webhooks

Very similar to the previous one, except we're connecting to the "completedCheck" SNS topic. The code this time makes an HTTP request to a Slack Webhook endpoint. You can configure this through your team's apps/integrations; just take the URL it provides and replace %SLACK_WEBHOOK_PATH% with it.
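The formatting step itself can be tiny. Slack incoming webhooks accept a JSON body with a "text" property; the function name and message wording below are assumptions, not the repo's actual code:

```javascript
// Sketch: turn a check result into a Slack incoming-webhook payload.
function formatSlackMessage(result) {
  const status = result.healthy ? "is up" : "is DOWN";
  return {
    text: result.url + " " + status + " (status: " + result.statusCode + ")"
  };
}
```

The handler then POSTs JSON.stringify of this payload to the webhook URL via https.request, much as make-request does for the site checks.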

Once sorted you should see your check results appear in the Slack channel you selected when creating the integration.

Getting Complicated

After I built the prototype, I realised there were so many places I could take this.

  • Automated Build/Test/Deploy of the functions
  • Save reports to S3 buckets (including responses and headers)
  • Build a front end to expose the last results as a status page
  • Utilise a NodeJS selenium driver to attempt to run user journeys in Lambda
  • Other site-based checks, such as checking asset sizes and the like.
  • Complete AWS Configuration - take out the manual steps above, possibly using the serverless framework.

I'm halfway through the Automated Build/Test/Deploy stage at the moment, so I'll probably finish that first and follow up with a blog post on how I built the workflow! This makes managing the functions much easier, especially handling the replacement of SNS ARNs and the Slack webhook address.

Is it cheap?

One of the key points of using Lambda was the potentially cheap cost - instead of starting up a VM in EC2 and leaving it running all the time, was it actually cheaper?

There's a dead handy online calculator I used to estimate the cost, and it came out at a whole $0.68. Quite a bit cheaper than £8 then. As it stands Lambda is also in AWS's free tier, so no charge for a year either. Pretty cheap for some peace of mind!

Lambda cost estimate
(Assuming 3x checks every 5 minutes for 31 days a month, with an execution time of 12s per check, which is extremely pessimistic - the actual cost should be a great deal cheaper, even ignoring the free tier)
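As a sanity check, the estimate can be reproduced by hand from those assumptions, using the Lambda prices at the time of writing ($0.20 per million requests, $0.00001667 per GB-second, 128MB minimum memory size):

```javascript
// Back-of-the-envelope version of the calculator's estimate:
// 3 checks every 5 minutes, 31 days, 12s per run at 128MB.
const invocations = 3 * (60 / 5) * 24 * 31;        // 26,784 requests/month
const gbSeconds = invocations * 12 * (128 / 1024); // 40,176 GB-seconds
const requestCost = invocations * (0.20 / 1e6);    // per-request charge
const computeCost = gbSeconds * 0.00001667;        // per GB-second charge
const total = requestCost + computeCost;           // ≈ $0.68, before free tier
```

The free tier (1M requests and 400,000 GB-seconds a month) would actually swallow all of this, which is why the real bill is $0.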

You can find all my source code for the project so far on GitHub:

Feel free to fork it for your own purposes, it doesn't include the AWS setup for now though.

[Image credit: Wikipedia, Creative Commons]