Providing HTTPS on custom domains in a SaaS app

In my free time, I run a software as a service app called MCJukebox. Users of the service each control a single page. Advanced users are able to make this page accessible on their own domain by pointing a DNS record to our servers. This has worked well for a number of years now, but a significant limitation in the feature up until recently has been the lack of HTTPS.

Adding HTTPS support has always been important to me, but admittedly, this only became a top priority with Chrome announcing that they will be marking all HTTP sites as "Not Secure" from July 23, 2018. We don't collect login or credit card information, but we need a certificate to ensure the integrity of our responses. Without one, an ISP or man in the middle attacker could inject code into our page as it makes its way back to the user.

The solution that I eventually settled on starts with Caddy - an HTTP/2 web server with automatic HTTPS. The tagline may make it sound like this was the perfect solution out of the box, but I had a few challenges to address first.

Challenge #1: Let's Encrypt Rate Limits

The default certificate authority used by Caddy is Let's Encrypt. To prevent abuse, many rate limits are in place. The majority of these are at the domain level, meaning that in the worst case scenario, I will be unable to issue a certificate for a particular domain but no other users will be affected. However, one limit in particular is a major concern:

There is a Failed Validation limit of 5 failures per account, per hostname, per hour.

If Let's Encrypt fails to verify domain ownership five times for a given hostname, that hostname is blocked from making additional requests for the next hour. Users are very likely to make mistakes with their DNS setup, and so to avoid them reaching this limit, I only want to pass requests to Let's Encrypt that I expect to succeed.

I was able to implement this using Caddy's ask directive. This is set to a URL, and is treated in the following way:

The URL will be queried via GET and should return a 200 status code if the domain form value from the query string is allowed to be given a certificate.

I created a /internal/caddy/ask route as part of our existing API, and this performs several checks returning the 200 status code exclusively in the case where all succeed. The checks are:

A user has requested that their page be hosted on this domain.
When we perform a DNS query on the domain, it is setup to correctly point to our servers so that we can complete the HTTP challenge.
Let's Encrypt have not refused to issue a certificate for this domain in the past. This is a fairly agressive restriction, but I can allow additional attempts for a domain if a user requests that.

Challenge #2: Caddy Rate Limits

To be fair to Let's Encrypt's ACME server (ACME is the protocol used for requesting certificates), Caddy has several built in rate limits for their "On Demand" certificate feature:

Certificates are never requested in parallel
After ten certificates have been succesfully obtained, they can only be requested at a rate of one every ten minutes
If a certificate request fails for a given name, no challenges on this name will be attempted for the next five minutes

To adjust these to be more suitable to my situation, I needed to learn Go, the programming language that Caddy is written in. I can still only claim a basic working knowledge of Go, but nevertheless, enough to make the changes neccessary.

For those interested, the function where I made changes is located here.

On consideration, I decided to leave the first and third rate limits in place. I may revisit that decision in the future, but I feel it makes sense with new domains added fairly infrequently at the moment. While too many limits could keep users waiting for longer than neccessary, there is value to Caddy providing protection against a mistake in the implementation of my ask endpoint.

The second rate limit, where certificates are only generated once every ten minutes after the first ten, was too restrictive for the service that I want to offer. At the moment, I have reduced the timeout to a single minute, but as I become confident in my solution, I may choose to remove it altogether. The fact that certificates are never requested in parallel introduces a delay already and that is likely to be sufficient alone.

Edit: Matt Holt, who created Caddy, corrected me on Twitter. The last two limits don't apply in the case where an `ask` subdirective is present (see Tweet).

Challenge #3: Database Integration

In order to monitor the status of certificate generation, it was neccessary to link Caddy to the MongoDB instance that I'm running. This could've been done by directly changing the source of Caddy, but to reduce complexity as much as possible, I decided to investigate the server's built in plugin system.

The Event Hook plugin type, which executes a function when Caddy emits certain events, was perfect for my use case. These are the two events that I used:

OnDemandCertObtainedEvent, called when Caddy succesfully obtains an on demand certificate. This passes through the respective hostname.
OnDemandCertFailureEvent, called when Caddy fails to obtain an on demand certificate. This passes through the respective hostname along with the associated error.

These events don't actually exist in Caddy, so I implemented them as part of my fork here. I plan on maintaining this for the time being, and eventually opening a pull request to contribute these upstream.

I think that now is a good place to quickly mention Caddy's licencing agreement. The builds available on their website are only free for non commercial use, but because of the Apache 2.0 licence that is used to share the project on GitHub, fewer restrictions apply when Caddy is compiled from source. These details may have changed, and my description here may not be perfect, so please do read their licenses page here if you plan on making use of Caddy.

Moving back to database integration, my plugin is a simple one written in just two Go files. The first provides an interface with Caddy and listens for the relevant events, and the second establishes a connection to MongoDB and saves events in a certificates collection. Originally, I was actually saving certificate information in the record corresponding to the page where that domain was being used. However, this caused problems I didn't anticipate - if a user removed a domain and another then registered it, Let's Encrypt would remember previous requests for that hostname but my database wouldn't. I needed to use a new collection to store persistent information about the certificates Caddy had, providing independance from where they might currently be in use.

Challenge #4: UI/UX

Above, you can see how I chose to present the new HTTPS feature to users. I had users create a CNAME pointing to my instance of Caddy, and following this, they are able to link the domain to their page. I wanted to make sure that the users were explicitly consenting to certificate generation, which is why I explain this in both the description above the input, and directly through the text on the button.

The HTTPS section below updates automatically, by running a function repeatedly which polls my API. The API determines the status based on specifics of my implementation. For example, the pending state is really shown when a domain is setup but there is no certificate information in the database. Caddy won't touch the database until a certificate has either been generated, or there has been an error, and so we can't rely on anything else. The API also returns an expiry date if a certificate is currently in use, and this is calculated by adding 90 days to the time when Caddy recorded receiving the certificate. This could perhaps be improved but I haven't yet updated my fork to read any information from the certificate.

As a final note about the UX implementation, due to the nature of on demand certificates, Caddy will only look to ask for these when a request is received for that hostname. To work around this, I send a simple HTTP request to the domain from the client when the new domain is saved in the panel. To prevent CORS related errors, I set the "no-cors" request mode. This makes the request opaque, meaning we can't see the result, but that is fine in this situation.

Summary

Overall, this has been an incredibly exciting project to work on and a true learning experience for me. SaaS apps are usually run by larger companies meaning that I couldn't find a post like this when I was starting out, and I hope that what I've written will help a hobbyist somewhere down the line.

If you have any questions, either from a higher level or related to my specific implementation, I would be happy to chat in the comments below. Alternatively, you can always reach me as @oliverdunk_ on Twitter.

06 Jul 2018