4 big data challenges and solutions

Over the years we have encountered a fair share of challenges in which we had to provide fast search access to huge amounts of documents. The scenarios varied, based on the content, use case and number of documents, but we're usually talking about terabytes of data and billions of documents or records. To make things even more interesting, typical scenarios include enriching the documents while indexing, advanced searching and filtering capabilities, performing analytics on the data, finding relationships between documents, performing on the fly calculations and all of this in under half a second.

Use cases

1. Searching through 4 billion documents

We worked for 6 months on a really interesting project in which we had to provide a way to search through billions of text documents as fast as possible, with complex filters. Each search result should also be accompanied by several visualizations that give the user an overview of the result data. The sheer volume of data was a challenge, as on a typical day we would get between 1.5 and 2 million new records, while the goal we had was to store at least 5 years of fast accessible data (anything on top of this would be nice to have). The solution we implemented consisted of several parts:

first, a process for indexing the content, in which we also did sentiment analysis, language detection and other classifications and calculations that enrich the data;
second, an elastic cluster that would index and store all the data across multiple nodes and provide search capabilities;
and third, an API that would provide endpoints for searching the data as a wrapper around the elastic API that would handle authorization, validation, and conversion of specific Boolean dialect used in the UI.

This approach meant that we had loose coupling and could easily change parts of the system without disrupting normal work. Due to the huge number of documents, some things could not have been foreseen. For example, we found that some important characters like @ were not indexed, and we had to go back, change mappings and analysers, reindex everything we had and do this without any downtime by using aliases. These fine tunings, changes and adaptations took maybe 60% of all the time on the project, but we were able to deliver them iteratively and constantly improve the search, going from PoC to MVP and then gradually releasing to all the users. The end result: search through over 4 billion documents with complex boolean queries in under 500ms.

2. CQRS with elastic as a secondary storage

We had a fast growing multi-tenant .net application that needed a big performance boost - some search requests lasted over 5 or 10 seconds depending on the filters being used. The data was pulled from a SQL server database using entity framework and stored procedures in areas where performance was important and EF could not deliver. This was sufficient for several years, but as time passed the number of users and records grew substantially and no matter how many optimisations were done on database level we couldn't reach a sufficient performance watermark.

Enter elastic search and CQRS. Our idea was to use elastic as a secondary storage that we will always be able to recreate from the database data; then we can use this storage only for searching, while all the write requests would continue to work as they did and hit the database. The search cluster would then be updated with the changes with some latency which is transparent to the end user and is not really important when you search. We first identified which entities were accessed the most and in which areas of the application we had the most performance bottlenecks, and started with those. We flattened the structure so that the data stored in elastic had no nested objects. The PoC was then launched for a single search endpoint and one tenant, gradually releasing to more.

When we finished, not only we improved the search requests, but we took off a huge CPU and memory load away from the database - thus improving all the write requests and providing room for further growth. Keeping an eye on system uptime and stability, we kept the original approach with stored procedures as a backup solution, controlled by a feature flag per tenant per entity type, meaning that we were always able to switch the implementation - turn off elastic, fix or change or upgrade, and switch it on again - without the users noticing. The result of the whole process: search requests executed in 300ms, improved overall system health and scalability.

3. Caching availability and rates for thousands of hotel rooms

We were building a service that would operate in a microservice architecture with a single purpose: store and provide fast access to availability and rates for hotel rooms (conveniently called ARI data in the travel tech). The platform integrated with dozens of different property management systems and channel managers, and our job was to create a central location that can give super fast access to ARI for travel agencies. Briefly on ARI: it might look simple at the beginning: a flat model with a date, a number representing available rooms, a number representing a price and maybe a string for currency. However, in reality it's much more complex than this. One hotel usually has around 5 room types, each room type has between 3 and 10 different rates, each of those rates can have various restrictions and rules, and we have to store this data on a daily basis for at least two years in the future and all the data in the past - you can see how the number of documents exponentially grows, while each document has a nested hierarchical data structure.

Elastic is not primarily used in scenarios where you store numbers and dates, as its strong point is text indexing. However, we found it to be the perfect match for this scenario for several reasons: we can index thousands of documents at a time, if there are unchanged documents while indexing we can see that in the result of the API call and stop propagating them in the flow, we can calculate aggregate functions really fast to find the minimal availability and price between dates. Elastic is distributed, meaning it is easy for us to scale and we have no downtime - which is crucial since we can’t easily replay the stream of ARI updates. Manipulating with indices also made it possible to store all the ARI updates in a history indices and use that data for analytics.

Our approach was to flatten the structure of the documents as much as possible, so that we get the most performant API. All the incoming updates were put in a queue before entering elastic so that we can control the priority, as all the updates for the near future have a higher likelihood of being used before the ones a year or two ahead. The queue also enabled graceful error handling mechanisms so that updates have 3 attempts before being sent to manual processing. All the operations had clear usage statistics and we created a dashboard where we can follow the processes and adapt. On top of elastic we built a spring boot API that serves as a wrapper for all search requests.

The end result is a service that provides fast ARI access and has serviced billions of requests through the last 3 years.

4. Full Fledged Application Monitoring

Fact: Every web application needs monitoring (not to confuse this with infrastructure monitoring). The definition of application monitoring is rather vague. Lots of different factors are involved and everyone has a different approach on this. Our approach, summarized as briefly as possible:

Trace every unique web request from start to finish with all relevant information
Provide proper security and access levels to log data
Centralize the logs between application environments and use them in CI/CD
Never lose a log line as they are crucial for auditing purposes
Always have an automated backup strategy and disk sizing strategy
Minimize performance impact on the application that is being monitored
Provide dashboards and alerts for support teams

If we look at all these standards we can quickly see that adding monitoring to an application is far from a trivial task; First, we have to prepare the application and put the proper filters and appenders in place, in a generic way. We always log a unique identifier together with the request body, the user information, the ip address, the endpoint information, environment, application version, execution time and much more. Then we have to provision the infrastructure where we will host elastic and store all the logs in a secure and scalable way. Then we have to define index lifecycle and snapshot policies that will guarantee we store all the data, but only the logs from the recent period are in elastic and everything else is backed up on a secure location and ready to be used when needed. Lastly, configuring the right dashboard is not trivial. It must be user friendly and yet show a summary of all the application logs.

To achieve this we have used elastic on dozens of different applications, together with Kibana to provide access to the logs, dashboards and alerts. Another tool we have used is elastic APM, which can give even more information about performance monitoring on top of the standard application monitoring.

When done right, monitoring has always had a drastic impact on reducing support time, incident handling, gathering user feedback, finding performance bottlenecks and proactively planning the architecture and infrastructure of each application.

Summary

Handling huge numbers of documents and providing fast and feature-rich access is a big challenge. Our experience with various tools and approaches has clearly shown the strong points of elastic search when done right. Our standards regarding scalability, zero downtime deployments and maintenance, observability, security, cost optimization, backup and recovery, ease of use, performance and much more - have all been successfully ticked in the various implementations we have done. We have used elastic for text documents, logs, domain entities, and even flexible structures with no text - achieving the goal of having fast and stable solutions time and time again.

Elastic is pretty easy to start with, as you can quickly set it up, index and search through the API and do a PoC. But taking it from the playground to a high standard production environment is another ball game. You have to meticulously perform the task of cluster sizing and make difficult decisions about node count and size. Putting the right mappings and analyzers in place is also very important and can be difficult if you don’t have knowledge and control over the data that is indexed. Next you have to take care of index separation, sharding and replication, summarized by index lifecycle and snapshot policies. Finally, provisioning a scripted infrastructure that is capable of scaling, fine tuning important elastic configurations, and putting monitoring on top of the whole thing summarizes our process.