Put simply, our definition of observability for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre. You must be able to comparatively debug that bizarre or novel state across all dimensions of system state data, and combinations of dimensions, in an ad-hoc manner, without being required to define or predict those debugging needs in advance. If you can understand that bizarre or novel state without shipping new code, then you have observability.
Observability Engineering, Majors et al., 2022
We offer small business owners a platform they can use to fulfill their dreams. Each day we work on innovative and exciting tasks to continuously improve our offering. To be fast and innovative, we also need the ability to recognize whether our platform is working as expected and if not, we need to provide Jimdo teams with the right information to triage, mitigate and resolve any issues.
At the beginning of 2021, we revisited our capabilities to assess the Jimdo platform. Due to the way our systems have evolved, we decided to introduce an additional source of information.
This article will explain how we approached changing our observability offering and will also provide some insights and tips we wish we had known at the outset, in the hope that it will speed up your own journey.
Different Approaches
The Jimdo platform consists of multiple systems which communicate with each other to fulfill a user's tasks. This is in contrast to the approach of building a system as a single unit.
The unit approach, also called a monolith, has its advantages and drawbacks, as does the separation approach we took: microservices. This is an essential decision for your organization. The approach you choose will influence not only your design, but also your processes (for example, communication). Each organization must evaluate which approach best suits its individual circumstances.
We also had to evaluate which information was required in order to assess the system from an external point of view. I will use the term ‘signals’ for this going forward.
In a monolith approach, it is possible that only log signals (which are recorded events) are necessary. A common pattern for logs includes a level, timestamp and message property.
{"level":"info","msg":"Something of note happened","time":"2021-12-24T15:00:00+01:00"}
{"level":"warning","msg":"Something happened which needs attention","time":"2021-12-24T15:10:00+01:00"}
{"level":"error","msg":"Something failed!","time":"2021-12-24T16:00:00+01:00"}
This is potentially enough information for a simple monolithic system; in our case, however, we needed more to classify issues.
The most complex problems occur when systems interact. Logs allow teams to see information, but only in isolation. What we needed was an overview that correlated log entries with a request traveling through all parts of the system.
This kind of signal is called a trace. I like to think of tracing as transforming your systems into informants, which tell you the whole story.
Instead of having logs in isolation in a particular section, we added context to the logs. As a request travels through different parts of the system, the services involved recognize that they are part of a unique request and add their information to the overall context. This mechanism is called context propagation, in case you'd like to dig a little deeper.
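To make this a little more concrete, here is a minimal sketch of what starting a span and keeping it as the active context looks like with the OpenTelemetry API on Kotlin/JVM. The service and function names are purely illustrative, not taken from our codebase:

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.Span
import io.opentelemetry.api.trace.Tracer

// Obtain a tracer from the globally registered OpenTelemetry instance.
val tracer: Tracer = GlobalOpenTelemetry.getTracer("checkout-service") // illustrative name

fun handleCheckout() {
    // Start a span that describes this unit of work.
    val span: Span = tracer.spanBuilder("handle-checkout").startSpan()
    val scope = span.makeCurrent() // make the span the active context for this thread
    try {
        // Any instrumented HTTP or messaging client called here picks up the active
        // context and injects it into outgoing requests, so downstream services can
        // attach their own spans to the same trace.
        callPaymentService()
    } finally {
        scope.close()
        span.end()
    }
}

fun callPaymentService() { /* illustrative instrumented outbound call */ }
```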
Based on this new trace signal, we can display the interaction between services as a waterfall diagram.
The trace signal can be enriched with arbitrary meta information. Most people start using trace signals to look for latency problems, but the opportunities are endless: you can add operating system information to a trace to see whether a specific version is affected, or add account IDs to help identify customers affected by degraded services, for example.
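As a small, hedged example of what such enrichment can look like, the OpenTelemetry API lets you attach attributes to the currently active span; the attribute keys below are illustrative rather than an established schema:

```kotlin
import io.opentelemetry.api.trace.Span

// Enrich the currently active span with extra dimensions.
// The attribute keys are illustrative, not a fixed naming convention.
fun annotateCurrentSpan(osVersion: String, accountId: String) {
    val span = Span.current()
    span.setAttribute("os.version", osVersion) // e.g. check whether a specific version is affected
    span.setAttribute("account.id", accountId) // e.g. find customers hit by a degraded service
}
```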
How to choose tools and vendors
Confident that traces would hugely improve our observability, our first step was to seek out and choose tools and potential vendors. This was an overwhelming task.
Some vendors provided amazing insight into our systems, but this was only possible by introducing their proprietary libraries and agents.
Experience had shown us that a mixed ecosystem of libraries for different signals (e.g., separate log and metric libraries) was already the norm. We were concerned about introducing yet another (proprietary) library for the following reasons:
- Learning something entirely new is challenging and always involves a learning curve
- Vendor lock-in would be inevitable due to the proprietary libraries
- Switching to a new vendor because of changing requirements would mean many additional changes to our services, making any future transition particularly difficult and tempting us to stay put
While researching this topic we came across the OpenTelemetry project. This is what we now use here at Jimdo. Let’s take a closer look at this and why the associated advantages convinced us it was a safe bet.
OpenTelemetry
OpenTelemetry (Otel) is a project under the umbrella of the Cloud Native Computing Foundation. It is an open source observability framework providing APIs, tools, and SDKs for various languages and ecosystems. The idea is to unify formats and provide a single toolkit for all signals. For us at Jimdo, this resolved our issue of a mixed ecosystem.
The separation of concerns between generating, emitting, and visualizing signals was an advantage that really appealed to us. Generalizing the creation of signals and being able to use any vendor that can work with this data gave us the flexibility to adjust to new challenges without migrating to new libraries. As a result, vendors must concentrate on competing at the end of the observability pipeline (for example, on data correlation and visualization).
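To illustrate this separation, here is a minimal sketch of how the SDK can be wired up on the JVM: the instrumentation code only ever talks to the OpenTelemetry API, while the exporter is the single place that decides where the data goes. The collector endpoint is an assumption for the sake of the example:

```kotlin
import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor

fun initTracing(): OpenTelemetry {
    // The exporter is the only vendor-facing piece of the setup.
    val exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://otel-collector:4317") // assumed collector address
        .build()

    val tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build()

    // Register globally so instrumentation can obtain tracers via the API.
    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .buildAndRegisterGlobal()
}
```

Swapping the destination later means touching only this wiring (or, as described below, the collector), not the instrumentation spread across your services.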
Providing an open source alternative to proprietary libraries also makes it possible for open source library authors to provide signals straight out of the box in OpenTelemetry format.
Before jumping into our strategic approach, it is important to draw attention to the Otel Collector. This was, and is, an essential part of our infrastructure and our solution as you will read. I highly recommend familiarizing yourself with its functionality.
The following sections outline the strategy we employed to streamline the introduction and adoption of an improved observability offering at Jimdo.
How can I get started with a new signal?
The general idea is to disturb teams as little as possible in their daily tasks while introducing a new signal.
The OpenTelemetry Collector is a tool for receiving, processing, and exporting signals. We were able to use this to our advantage and let application engineers send signals, while the collector took care of processing the signal data (to remove PII, for example) and sending it to the chosen vendor.
This separated the task nicely:
- Application engineers could stay focused on the inner workings of their systems
- Operation engineers could enrich signals with information from their area of expertise (e.g. resource usage)
As an added bonus, we are able to use the collector to switch vendors without interrupting teams with any configuration changes. There are drawbacks to this solution, however, and we will take a look at these a little later on.
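To give you an idea of what this looks like in practice, here is a simplified collector configuration sketch, not our production setup; the attribute key and the vendor endpoint are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # Drop attributes that may contain PII before anything leaves our infrastructure.
  attributes/scrub-pii:
    actions:
      - key: user.email          # hypothetical attribute key
        action: delete
  batch:

exporters:
  otlp/vendor:
    endpoint: vendor.example.com:4317   # placeholder vendor endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub-pii, batch]
      exporters: [otlp/vendor]
```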
Tip #1: use the collector as a strategic component in your adoption phase.
Instead of pointing teams to the OpenTelemetry project or writing guidelines from the start, we looked for an interesting use case to showcase trace signals and OpenTelemetry.
Don’t choose a complicated use case initially because the benefit of traces is only visible once everything is connected. So if you have a use case which involves 20 services, all of these need to send the signal in order to generate the complete trace.
I would also suggest choosing a use case that is crystal clear for everyone. This builds confidence in the generated data (more on this later).
Tip #2: choose a simple use case to show the value of the signal as soon as possible.
After we had chosen a use case, the operations team offered support to team leaders and implemented the first solution using OpenTelemetry. After a successful launch, we used this, and future projects, as reference points for other teams and new Jimdo employees.
Tip #3: take full advantage of experienced teams and get them to help others on future projects. Use the first project as a reference point for future teams.
As the OpenTelemetry documentation is written in a generic fashion, we finished up this phase by drafting Jimdo-specific instructions on observability for everyone to use as a source of information.
Evaluating vendors - how to choose the best vendor for your business
As previously mentioned, we decided to use the OpenTelemetry collector as a strategic component. Signals sent from systems were gathered, processed and exported.
A huge amount of thought went into which vendor to use. With the collector sitting centrally, sending the same signals to different vendors was effortless, as long as they provided an OpenTelemetry-compatible endpoint. This allowed us to compare vendors directly.
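Exporting to a second vendor is then mostly a matter of adding another exporter to the pipeline. A simplified sketch with placeholder endpoints:

```yaml
exporters:
  otlp/vendor-a:
    endpoint: vendor-a.example.com:4317   # placeholder endpoints
  otlp/vendor-b:
    endpoint: vendor-b.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/vendor-a, otlp/vendor-b]   # the same traces fan out to both vendors
```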
This approach means you can evaluate the differences between vendors and provide important information for company decisions. Always bear in mind that you’re adding extra stress to your teams’ daily tasks when you introduce new tools. If you can find a platform they are already familiar with, adoption can be streamlined.
Tip #4: use the collector as gatekeeper and exporter of signals.
Introducing the collector as gatekeeper carries the risk of creating a bottleneck: if the collector is unreachable, no signals can be sent to vendor systems. We were well aware of this potential issue and put safeguards in place to prevent such scenarios. For us, the benefits outweighed the concerns.
How can we increase confidence in data?
After enabling teams to send their trace signals to our collector and ultimately to our selected vendor, the next phase was gaining confidence in the data. After all, once a team loses trust in data, it loses all value.
This is a highly individual task; I suggest sitting down with your team, looking at specific traces, and evaluating whether everyone is comfortable with the information. A simple use case will help you greatly in this respect. Analyze the generated graphs and validate them against your mental model.
Tip #5: a simple use case garners trust in data.
Honestly speaking, we did face technical challenges when we adopted OpenTelemetry, especially in the context of Kotlin's Coroutines. Our team had to think long and hard about context propagation in order to fix the issues.
Unfortunately, the documentation and implementation of certain OpenTelemetry libraries is not as thorough as you would wish for. Brace yourself.
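To give you a flavor of the coroutine issue, the snippet below shows the general technique of carrying the OpenTelemetry context into a coroutine via the opentelemetry-extension-kotlin artifact. It is a sketch of the problem space rather than the exact fix we shipped, and the names are illustrative:

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.Span
import io.opentelemetry.context.Context
import io.opentelemetry.extension.kotlin.asContextElement
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

val tracer = GlobalOpenTelemetry.getTracer("billing-service") // illustrative name

fun main() = runBlocking {
    val span = tracer.spanBuilder("process-invoice").startSpan()
    // OpenTelemetry's active context lives in a ThreadLocal, which a coroutine may hop off of.
    // Carrying the Context as a CoroutineContext element keeps child work attached to the trace.
    launch(Context.current().with(span).asContextElement()) {
        Span.current().addEvent("running inside a coroutine, still part of the trace")
    }.join()
    span.end()
}
```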
Are there ways to control the volume of data and the cost?
This was the final phase of our strategic plan.
After a successful implementation, your costs will rise with the increase in data. Each vendor prices differently, but controlling the volume of data is a necessary step in your journey if you are to be successful. There are various ways to address this; as usual, it depends on your use case and which information is important to your teams.
The collector provides filter capabilities that can be used to entirely drop tracing information.
Another approach is to sample data. This is the approach we implemented, because we are mostly interested in outliers. Successful operations can be sampled and annotated with meta information.
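As a rough sketch of both approaches, the filter and tail_sampling processors from the collector-contrib distribution could be configured along these lines and then added to the traces pipeline; the route, threshold, and percentage are placeholders, not our production values:

```yaml
processors:
  # Drop spans we never want to pay for, e.g. health checks (OTTL-based filter syntax).
  filter/drop-noise:
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'   # placeholder route

  # Keep errors and slow requests; sample only a fraction of everything else.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 500        # placeholder threshold
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10  # placeholder percentage
```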
Tip #6: become familiar with the sampling strategies and features of the Otel Collector.
Summary
We hope this article helps you improve your observability and provides some food for thought in your approach.
The outlook for OpenTelemetry is promising and I highly recommend keeping an eye on community updates.
Here at Jimdo, we have already celebrated a few wins thanks to trace signals (performance and security improvements, for example) and if we ever need to switch vendors or introduce a new signal, we can be up and running extremely quickly.