Moving from a monolith to a services architecture is a complex, demanding task. It is a journey that requires continuous investment to bear fruit. If the destination isn’t reached, the outcome is likely to be worse than the starting point. Switching from one architectural design to another is risky, especially for an application with persistent state.

With Split serving over one trillion feature flags every month, we must continuously improve our platform in a sustainable manner. While Split’s architecture contains multiple components, there are aspects of it that require modernization. Built with different technologies and optimized for unique needs, some facets of the architecture could be considered small monoliths.

Split’s journey to move away from a monolith incrementally was heavily powered by its own feature flags. This is chapter one of a two-part article, and we will describe how we safely transitioned our codebase over time, highlighting our goals in doing so. The process didn’t always go well, but feature flags helped us minimize customer impact and recover quickly.

Beginning the Journey: Identifying What to Do First

Moving from a monolith to microservices is not easy. The first step we took was to identify which domain should be the first one to decouple. We did a detailed analysis of pros and cons related to different domains. Then we balanced that with the company’s priorities in order to make a choice.

The Domain

Split provides a set of SDKs in different languages that our customers can run in their own application. Developers are able to leverage our services in order to fetch feature flag definitions (splits), amongst other capabilities.

To evaluate a split for a given user (i.e., to decide whether a feature is on or off), our SDKs provide a getTreatment function. This function receives the split name and a unique key attribute corresponding to the user.

The key value is typically associated with a customer’s users and is mapped to their internal ids. This attribute is used to evaluate a split and to serve different treatments to different users based on the split’s rules.

Figure 1

Targeting can be customized in many ways, such as allocating traffic by percentage. The key is used to ensure that the same user always obtains the same result. You can also target groups of keys based on constraints such as exact match and is in list, among others. Split’s documentation lists all the types of targeting rules it supports.
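To illustrate why a stable key makes percentage allocation repeatable, here is a minimal sketch. It assumes a hash-based bucketing scheme (SHA-256 over the split name and key, mapped to buckets 1 to 100). The function names `bucket_for` and `treatment_for` are hypothetical, and this is not a description of Split’s actual algorithm.

```python
import hashlib


def bucket_for(split_name: str, key: str) -> int:
    """Map a (split, key) pair to a stable bucket in 1..100 (assumed scheme)."""
    digest = hashlib.sha256(f"{split_name}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 + 1


def treatment_for(split_name: str, key: str, on_percentage: int) -> str:
    """Serve ON to roughly `on_percentage` percent of keys and OFF to the rest."""
    return "ON" if bucket_for(split_name, key) <= on_percentage else "OFF"
```

Because the bucket depends only on the split name and the key, repeated evaluations for the same user return the same treatment, as long as the configured percentage does not change.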

Consider the example below showing a Split with a simple set of targeting rules:

  • default treatment OFF
  • if user is in list (UUID_1, UUID_3) serve ON

The default treatment OFF specifies that, if no other rule applies, the split evaluates to OFF for all users. The second rule says that for users UUID_1 and UUID_3 the result will be ON.
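The two rules above can be sketched in a few lines. This is only an illustrative model, not the SDK’s implementation: the `InListRule` and `Split` classes and the `evaluate` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class InListRule:
    """Hypothetical rule: serve `treatment` when the key is in `keys`."""
    keys: FrozenSet[str]
    treatment: str

    def matches(self, key: str) -> bool:
        return key in self.keys


@dataclass
class Split:
    """Hypothetical split definition: ordered rules plus a default treatment."""
    name: str
    default_treatment: str
    rules: List[InListRule] = field(default_factory=list)


def evaluate(split: Split, key: str) -> str:
    """Return the treatment of the first matching rule, else the default."""
    for rule in split.rules:
        if rule.matches(key):
            return rule.treatment
    return split.default_treatment


example = Split(
    name="example_split",
    default_treatment="OFF",
    rules=[InListRule(frozenset({"UUID_1", "UUID_3"}), "ON")],
)
```

With this definition, `evaluate(example, "UUID_1")` returns "ON", while `evaluate(example, "UUID_2")` falls through to the default and returns "OFF".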

With just a little configuration, you can decide which treatment to serve by listing users. This works well with a small set of users. However, let’s think about configuring and maintaining these kinds of rules for hundreds or even thousands of users. Or, let’s try to replicate them among other splits and be sure that all of them are in sync. 

This is where segments enter the picture. A segment is a way to group a set of keys so they can be used directly when configuring a split:

Figure 2

Now, with this segment configuration we can configure the split like this:

  • default treatment OFF
  • if user is in segment SegmentA serve ON

This makes it possible to configure different splits with rules based on segments, instead of listing keys one by one. You can just use the segment’s name instead. When a user is added to or removed from a segment, all splits that reference that segment are automatically updated!
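A small sketch shows why a single segment change propagates to every split that references it. The in-memory `segments` and `splits` dictionaries, the split names, and the `evaluate` function are assumptions made for illustration, as is the membership of SegmentA (UUID_1 and UUID_3).

```python
from typing import Dict, Set

# Hypothetical segment store: segment name -> set of keys.
segments: Dict[str, Set[str]] = {"SegmentA": {"UUID_1", "UUID_3"}}

# Two hypothetical splits that share the same segment-based rule.
splits = {
    "split_one": {"default": "OFF", "rules": [("SegmentA", "ON")]},
    "split_two": {"default": "OFF", "rules": [("SegmentA", "ON")]},
}


def evaluate(split_name: str, key: str) -> str:
    """Return the treatment of the first segment rule containing the key."""
    split = splits[split_name]
    for segment_name, treatment in split["rules"]:
        if key in segments.get(segment_name, set()):
            return treatment
    return split["default"]


before = evaluate("split_one", "UUID_2"), evaluate("split_two", "UUID_2")

# Adding the key to the segment once changes the result for both splits.
segments["SegmentA"].add("UUID_2")
after = evaluate("split_one", "UUID_2"), evaluate("split_two", "UUID_2")
```

Here `before` is `("OFF", "OFF")` and `after` is `("ON", "ON")`: the rules themselves were never edited, only the segment they point to.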

The Architecture

We now have more context about what segments are and what benefits they bring. Let’s see how Split internally handles them in order to achieve these powerful capabilities.

We can identify two main requirements for segments. First, we need to manage them: creating and deleting segments, as well as adding keys to or removing keys from them. Second, we need to provide segment information to all of our SDKs in an optimized manner.

Our initial architecture looked like this (simplified):

Figure 3

The components to keep an eye on are:

  • A main backend service, exposing REST APIs for different domains. Each domain contains its own API, domain logic, and data access layer. All the management of segments is done through this service.
  • An sdks-feeder service, which is consumed by SDKs in order to fetch the data required to evaluate splits in customer applications. Each SDK fetches from the sdks-feeder all the segments associated with the key configured on it.

Because Split has millions of SDK instances running globally, the traffic load handled by the sdks-feeder is substantial. This leads to heavy load on all downstream dependencies as well, the database engine included.

Our main backend service contained logic not only for Segments, but for many other domains. Plus, our backend service shared the database with other applications including the sdks-feeder. The combination led to several challenges from development complexity to runtime performance issues. Decomposing this into different services, well encapsulated and isolated, became a necessity.

After doing an exhaustive data analysis of our segment-related traffic, we discovered something interesting. Almost 80 percent of the requests are for keys that have no segments associated with them. This makes sense.

Not everyone separates their users into different segments. Moreover, those who do don’t necessarily segment all of their users. This means that segments are often requested for keys that do not belong to any segment.

Capturing the actual data based on our traffic was fundamental for us to understand the problem and make informed decisions. We started looking into solutions that would provide:

  • Low memory and storage consumption, to handle the large amount of data.
  • A fast way to determine whether a key has any segments associated with it, to avoid unnecessary calls to the database.

Leveraging a Cuckoo Filter

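After investigating different options, we decided to leverage a cuckoo filter. A cuckoo filter is a compact probabilistic structure for set membership: it can report false positives but, for keys that were inserted, no false negatives. This would allow us to answer requests in a few milliseconds and with minimal memory consumption. An initial attempt at fitting it into our architecture would look like this: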

Figure 4

In this scenario, every write operation in the segments domain requires the main backend service to keep the cuckoo filter in sync with the latest state of the data. Meanwhile, the sdks-feeder queries the cuckoo filter first and goes to the database only when the filter indicates that the key may have segments.
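The read path described above can be sketched as follows. This is a simplified, textbook-style cuckoo filter written for illustration, not the implementation Split uses. The class, the `segments_for_key` function, the `load_from_db` callback, and all sizing parameters are assumptions.

```python
import hashlib
import random


class CuckooFilter:
    """Minimal cuckoo filter: small fingerprints stored in two candidate buckets."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count keeps the XOR-based alternate index reversible.
        assert num_buckets & (num_buckets - 1) == 0, "num_buckets must be a power of two"
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, key: str) -> int:
        return self._hash(b"fp:" + key.encode("utf-8")) % 65535 + 1  # 1..65535

    def _index(self, key: str) -> int:
        return self._hash(key.encode("utf-8")) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, key: str) -> bool:
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # Treated as full; a real system would resize or rebuild.

    def might_contain(self, key: str) -> bool:
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]


def segments_for_key(key, cuckoo, load_from_db):
    """Feeder-side lookup: ask the filter first, query the database only on a possible hit."""
    if not cuckoo.might_contain(key):
        return []  # Definitely no segments, so the database call is skipped.
    return load_from_db(key)  # Possible false positive; the database stays authoritative.
```

In this sketch the write path would call `insert` whenever a key gains its first segment. Since most requests are for keys without segments, most lookups would return from the first branch without touching the database.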

This could work, but with some caveats. Introducing the cuckoo filter this way into our initial architecture leads to challenges. These include data ownership, scalability, error handling, consistency of data semantics, implementation of circuit breakers, and maintainability, among others.

Since we already wanted to decompose the main backend service into domain services, we saw an opportunity to start that separation here, which led to the following architecture:

Figure 5

The segment service should own all segment-related data, encapsulating all access to the database. Note that in the architecture shown above, access from the different APIs still goes through the main backend service, which keeps its own segment-related logic and communicates with the segment service. This is a transitional architecture, not the goal.

This centralization of segment management into a single service, used by multiple applications, provided multiple benefits: the ability to collect more metrics about usage and the freedom to run experiments. We can also change the underlying technology and introduce new mechanisms to handle our use cases.

The natural next step toward fully separating the service was to move the domain logic into the segment service itself. We also moved the external API access directly to it, removing the main backend service from the picture.

Figure 6

As you can see, this is a journey that requires multiple steps, and that needs to be done incrementally. In chapter two of “Safely Moving Away from Monoliths,” we’ll talk through how we achieved this, showcasing techniques that can also be applied to similar re-architecting journeys. Stay tuned until next time.
