Frontpage of ProductHunt and a bit of drama

Yesterday was a great day for Typely. We got featured on the front page of ProductHunt, and I think we could have made it to number one or two if it weren't for a bug that took our server down for about 5 hours.

[Image: ProductHunt badge]

Typely was featured at about 4 p.m. EST and I made sure to babysit it until late into the night, when I called it a day. I knew the traffic would only grow at that hour because the US visitors were just starting to pour in, but I was too tired to keep staring at numbers.

Nothing had gone wrong for so many days and hours…how unlucky would I have to be for Typely to crash right now? Nah, it's fine: it's tested, compiled, statically typed, with continuous integration, continuous delivery and all that jazz…I'm fine! And I was, until around 2 a.m., when a visitor decided to send me an email complaining that the CTRL + Space shortcut does not work on Mac OS. As his email address, he entered: ffff@wwdc. That was it!

UnexpectedResponseError URL=https://api.mailgun.net Error: {
  "message": "'from' parameter is not a valid address. please check documentation"
}
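In hindsight, the application-side fix is simple: treat the mail API as something that can fail on bad input. A minimal sketch in Python; the function names and error type here are hypothetical stand-ins, not Typely's actual code:

```python
# Sketch: don't let a rejected email address become an unhandled
# exception. All names here are illustrative, not Typely's real code.

class MailAPIError(Exception):
    """Raised when the mail provider rejects a request (e.g. HTTP 400)."""

def send_via_mailgun(from_addr: str, body: str) -> None:
    # Stand-in for the real API call; Mailgun answers with a 400 for a
    # malformed 'from' parameter such as "ffff@wwdc".
    if "." not in from_addr.split("@")[-1]:
        raise MailAPIError("'from' parameter is not a valid address")

def send_feedback(from_addr: str, body: str) -> bool:
    # The crucial part: catch the provider error instead of letting it
    # propagate and kill the process.
    try:
        send_via_mailgun(from_addr, body)
        return True
    except MailAPIError as err:
        print(f"dropping feedback mail from {from_addr!r}: {err}")
        return False

print(send_feedback("ffff@wwdc", "CTRL + Space does not work"))  # False
print(send_feedback("user@example.com", "all good"))             # True
```

One try/except at the boundary to the third-party API would have turned a 5-hour outage into a single dropped email.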

Headshot! I had forgotten to add any sort of health checking to this app. How immature. I focused on so many new features and design implementations, but I failed to make sure that my server could at least stay alive. It's probably something I inherited from working with Elixir and its "let it crash" mentality.

All of my other projects run on Kubernetes, constantly monitored and restarted if anything happens. With Typely I had to go a different route due to its high CPU usage and relatively low memory footprint. Such a cluster on GCE would cost almost 5 times what I am spending now, for little to no gain (except the high availability part, of course). With our services being completely free, what we spend monthly matters, quite a lot.

Our stack consists of four Docker containers kept alive by docker-compose: one for the core "engine", one for the blog, another for generating PDF reports, and the last one doing some NLP business in Python.
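Roughly, the compose file looks like this; the service names and images are my guesses for illustration, not the real file:

```yaml
# Hypothetical sketch of the four-service setup.
version: "2.1"
services:
  engine:      # the core proofreading engine
    image: typely/engine
    restart: always
  blog:
    image: typely/blog
    restart: always
  reports:     # PDF report generation
    image: typely/reports
    restart: always
  nlp:         # the Python NLP service
    image: typely/nlp
    restart: always
```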

I knew Docker had recently introduced a HEALTHCHECK instruction that can go directly into the Dockerfile, and I thought it would be an easy fix.

HEALTHCHECK --interval=2s --timeout=3s --retries=10 CMD curl -f http://localhost:3000/_healthz || exit 1

I added the instruction, built the project and started testing. I wrote the endpoint to respond with a 500 status code in order to see what happens after 10 failed probes. Nothing happened! It turns out the health check merely updates a status flag, visible with docker inspect (for example --format '{{.State.Health.Status}}'). No action is taken, and there's nothing you can do about it without external help. What a complete waste of time.

Disappointed, I started looking for ideas from other people who had hit the same issue. I found a container, willfarrell/autoheal, that monitors every container with a healthcheck and restarts the unhealthy ones. It needs the Docker socket mounted as a volume, but that's no surprise: it can't restart other containers without access to the main Docker daemon.

  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
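What autoheal does is conceptually tiny: poll each container's health status and restart whatever reports unhealthy. A pure-logic sketch of that selection step, with no real Docker calls and an illustrative data shape, not Docker's actual API:

```python
# Conceptual sketch of the autoheal loop's core decision: given each
# container's health status (as `docker inspect` would report it),
# pick the ones to restart. The dict shape is illustrative.
def containers_to_restart(statuses: dict[str, str]) -> list[str]:
    # Docker health states are "starting", "healthy" and "unhealthy";
    # only "unhealthy" warrants a restart.
    return [name for name, state in statuses.items()
            if state == "unhealthy"]

statuses = {"engine": "unhealthy", "blog": "healthy",
            "reports": "starting", "nlp": "healthy"}
print(containers_to_restart(statuses))  # ['engine']
```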

Tried, tested and pushed. The little drones got busy: compiled, tested again and shipped it into production.

These things happen all the time; I'm just a bit mad it happened while we were featured. I feel somewhat hamstrung without Kubernetes. It is, without a doubt, my favourite tool for orchestration, and I've worked with all of them: Rancher, Marathon, DC/OS…you name it.