As a web entrepreneur, you HAVE to make sure that your product is working properly and making you money at all times, even while you sleep. In this article we'll look at the strategies and tools at our disposal that help us deduce, with a reasonable degree of confidence, that a website is up and working as it should.
“Uptime” is the word we commonly use to define the amount of time in which an online service is working and accessible to its users. Conversely, “downtime” defines the amount of time in which an online service is not working. The important keyword to highlight here is working. It's important to highlight this because for many companies, measuring the uptime of a website means merely measuring the time in which the website is online. And this is exactly what traditional uptime monitoring tools do! They work by checking the response of a particular website URL every few minutes, and if the response is what's expected, then we assume that the website is online and working. But, is it 🤨?
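To make this concrete, here is a minimal sketch of what a traditional uptime check does, assuming the simple convention that the expected HTTP status code means "up"; the URL and status code are placeholders to adapt to your own site:

```python
from urllib.request import urlopen

def check_url(url: str, expected_status: int = 200, timeout: float = 10.0) -> bool:
    """Return True if the URL responds with the expected status code."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except OSError:
        # Connection refused, DNS failure, timeout, HTTP errors, etc.
        # all count as "down" for the purposes of this check.
        return False
```

A monitoring tool essentially runs something like `check_url("https://example.com/")` every few minutes and raises an alert when it returns `False` - which, as discussed above, only tells you the site is online, not that it is working.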
Well, the real question is: what percentage of your website's functionality can you assume is working if a particular check succeeds? Of course, this percentage will vary a lot depending on the particular website or application in question, and also on the quality and number of checks that are run. For example, using your favourite uptime monitoring tool to verify the return value of an ad-hoc health-check endpoint that runs some basic checks on the server is way better than having the same monitoring tool retrieve the homepage of your website. Similarly, using an automated browser to check the most important user flows of your website (e.g. your sign-up process) is a step ahead compared to just invoking an HTTP endpoint.
With this assumption in mind, we can make some educated guesses about how good the different website monitoring techniques are at deducing whether a website is really working or not. The chart below shows an example. The proportions and numbers will change based on the factors discussed above, but it's nonetheless a good visual representation of the situation:
It's important to stress that we always use the words deduce or assume when it comes to evaluating the efficacy of a monitoring technique with respect to the amount of coverage it provides. This is because it's often close to impossible to have a setup where every single feature of a website is checked for malfunctions, at least not while monitoring, when speed is essential. It is possible to get close to 100% of actual test coverage during the development phase, when we have all the time we need to run the required automated scripts; but once the code reaches production, it's a whole different story. All we can do there is estimate, and try to infer that the other parts of a website are working based on a few meaningful checks.
But beware! These assumptions can be formulated with acceptable accuracy only if we are already applying an adequate testing strategy during the development of the website or application. Without this foundation, deducing whether a website is working from the few checks that run in monitoring mode is going to be difficult. That's why it's very important to implement a proper testing strategy before your product reaches production. Nowadays you can use a user-friendly solution like Frontend Robot or other similar tools, so there are no more excuses for not testing your website.
So, what’s the best way to check if a website is working? Let’s try to answer this question in the next section.
As often happens in the world of software development, the best approach involves multiple tiers, each with its own balance of speed and coverage. The ideal setup would be a couple of very fast HTTP checks running at a very high frequency (e.g. every 1-5 minutes), then a handful of meaningful user flow tests running at medium-high frequency (e.g. every 5-10 minutes), and finally a slightly more comprehensive set of automated browser tests running less frequently (e.g. every 10-20 minutes). This way, we get alerted quickly if some serious malfunction is affecting the entire website, and at the same time we will eventually be alerted if there is a problem limited to specific parts of the website.
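To sketch how such a tiered schedule might fit together, here is an illustrative example; the tier names and intervals are made up for the purpose of illustration, within the ranges suggested above:

```python
# Illustrative monitoring tiers; names and intervals are examples,
# not a prescribed configuration.
TIERS = {
    "http_health_checks": {"interval_minutes": 1},        # fast, shallow
    "core_user_flows": {"interval_minutes": 5},           # medium coverage
    "extended_browser_tests": {"interval_minutes": 15},   # slow, deep
}

def due_checks(minute: int) -> list[str]:
    """Return the tiers whose checks are due at a given minute of the hour."""
    return [name for name, cfg in TIERS.items()
            if minute % cfg["interval_minutes"] == 0]
```

For example, at minute 7 only the fast HTTP checks run, while at minute 15 the HTTP checks and the extended browser tests both run.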
This particular setup can be represented as a pyramid, where the vertical axis represents the frequency of a check and the horizontal axis represents the coverage of that particular check.
The pyramid we've just seen has a lot of similarities with the testing pyramid, but it must not be confused with it. In our monitoring pyramid the slowest-running tests are at the bottom, while the fastest tests are at the top. In the testing pyramid it's actually the other way around: unit tests - which are the fastest - are at the bottom, while end-to-end tests - which are slower - are at the top. What the two pyramids have in common is that the broader, more summarizing tests sit at the top while the more detailed tests, those that go deeper into a particular feature or function, sit at the bottom.
It's also interesting to note that the checks at the top of the pyramid are the ones that provide the highest relative amount of (estimated) feature coverage. This follows from our assumption that the number of new bugs introduced during the development phase is low, which can be guaranteed only with an adequate testing strategy applied throughout the various development cycles.
So, the answer to the question of what's the best way to check if a website is working is not as straightforward as recommending a single solution. The right answer involves a combination of different techniques, organized in such a way as to obtain almost immediate alerting for critical issues and a fast response to issues affecting only portions of the product. The ultimate goal is to provide users with an uninterrupted service, or at least to minimize interruptions by responding quickly to malfunctions. In fact, if a user reports a problem to you before your monitoring tool does, it means that your monitoring strategy needs to be adjusted.
To put what we have discussed so far into practice, let’s now analyze a sample testing and monitoring structure to continuously check if a website is working.
At the top of our pyramid we have a set of typical HTTP health-check endpoints. We can decide to create one endpoint for each subsystem of our application, or aggregate different checks within a single endpoint. But no matter how we structure our HTTP checks, they have to run fast (as in a few hundred milliseconds at most). You can monitor these endpoints with a traditional uptime monitoring tool, such as UptimeRobot, but there are hundreds of similar solutions to choose from. OK, but what are the typical tests performed within those health-check endpoints? The following list gives you an idea:
But bear in mind that each application is different and may require specific tests.
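Whatever specific tests you choose, the aggregation logic of a health-check endpoint tends to follow the same pattern: run each check, never let one broken check crash the endpoint, and report an overall status. Here is a minimal, framework-agnostic sketch in Python; the disk-space check is a placeholder example, and the function names are mine, not a standard API:

```python
import json
import shutil

def check_disk_space(min_free_bytes: int = 1 << 30) -> bool:
    """Example check: at least 1 GiB of free disk space."""
    return shutil.disk_usage("/").free >= min_free_bytes

def health_check(checks: dict) -> tuple[int, str]:
    """Run each named check; return an HTTP status code and a JSON body.

    Each check is a zero-argument callable returning True on success;
    exceptions are treated as failures so one broken check cannot
    crash the whole endpoint.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"ok": status == 200, "checks": results})
```

You would then expose this from your web framework of choice at a URL such as `/health`, so the uptime monitor can poll it and treat anything other than a 200 response as a failure.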
In the middle of our monitoring pyramid we have a set of end-to-end tests targeting the most important user flows of the application. This is also called smoke testing. We are talking about core processes such as these:
But in this tier we can also have much simpler tests that a mere HTTP request check simply can’t perform, such as:
As we can see, all these tests require that the website be (automatically) tested by a real browser, one capable of executing, rendering and verifying frontend code and data. This is the part that the checks at the top of the pyramid will miss, and that's why we need this second tier of tests in the middle of the pyramid. I would consider this set of tests as important as the health-check endpoints, and they must not be missing from a proper monitoring setup. These tests are usually run every 5-10 minutes, so, all combined, they should take at most 1-2 minutes to run to be effective. You can use a tool like Frontend Robot to create and run these user flow tests, but there are also other options, as discussed in another article.
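As an illustration of what such a user-flow check might look like when scripted by hand, here is a sketch using Playwright's Python API. Everything application-specific - the URL, the form selectors, the expected confirmation text - is hypothetical and would need to be adapted to your own sign-up flow:

```python
def check_signup_flow(base_url: str) -> None:
    """Sketch of a sign-up user-flow check driven by a real browser.

    Requires Playwright (pip install playwright; playwright install chromium).
    The import lives inside the function so this module can still be loaded
    in environments where Playwright is not installed.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(f"{base_url}/signup")            # hypothetical URL
        page.fill("#email", "monitor+probe@example.com")
        page.fill("#password", "a-throwaway-password")
        page.click("button[type=submit]")
        # Raises a timeout error if the confirmation never renders,
        # which is exactly how a monitoring check should fail.
        page.wait_for_selector("text=Welcome")     # hypothetical text
        browser.close()
```

A tool like Frontend Robot records and runs equivalent flows for you; the point of the sketch is simply that the check exercises the real rendered frontend, not just an HTTP response.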
At the bottom of our pyramid we have a set of end-to-end tests, which again use real browsers to verify that the functionality of the website works as expected. The tests performed at this tier are almost identical to those performed in the middle tier. The only difference is that they have more time to run and can therefore go deeper into the features of the product, covering secondary functionality as well. In reality this tier can be considered optional if the website is already well tested before reaching production, but I'd consider it essential anyway if the product is large or contains hundreds of features that would otherwise not be covered by the shallower tests in the higher tiers of the pyramid. Frontend Robot and other similar tools are perfectly suited for implementing this type of testing as well.
Throughout this article we've mentioned many times the importance of testing the application before it reaches production. This is indeed a factor that should not be neglected, as it allows us to run lighter and faster tests in production, with the advantage of getting notified earlier if something is wrong. Frontend Robot can scale up to hundreds or thousands of tests, so it's a great tool for creating and running regression tests during your development cycles as well.
In this article we've discussed what it means to check if a website is working. We've highlighted the differences from uptime monitoring, and defined a tiered structure - the monitoring pyramid - to guide us in defining a monitoring strategy that can help us infer, with a good degree of confidence, whether a website is actually working. In this regard, the role of automated frontend testing is of the utmost importance, as it allows us to replicate the actions of a real user and reach parts of the application that a health-check endpoint cannot. Tools such as Frontend Robot are the perfect companions to more traditional uptime monitoring solutions, and can also be leveraged to simplify the creation of end-to-end tests to run during the development cycles of the application.
Get started with your improved monitoring strategy now: create your first test with Frontend Robot in 5 minutes.