[Java][Profiling][Hazelcast] Performance tuning of Hazelcast SQL engine
This blog post was originally posted on Hazelcast’s website.
Hazelcast in version 5.1 contains two querying APIs:
- Predicate API
- SQL engine
Predicate API is an older Java-based API. Even though it contains
sqlPredicate, which allows using SQL-like
syntax for the
WHERE clause, the syntax is non-standard, for example
NULL handling doesn’t
support the (in)famous ternary logic. It fetches the results in one batch, which limits the supported result size.
On the other hand, SQL Engine is a more modern engine, it uses standard SQL, a cost-based optimizer,
and is available in all programming languages. It also supports
UNION operators, which don’t have an equivalent in the Predicate API. It streams the results to the
client, so the result size isn’t limited (though this is also a restriction, because it’s not possible to restart
the query if it fails mid-way).
In the next major release, we plan to deprecate the Predicate API. For this we need feature parity, and also match the performance of Predicate API. The performance is the focus of this blog post: we’ll describe our journey in benchmarking and fixing some performance issues we had.
Here is a list of benchmarks that we did for that comparison:
The fifth benchmark was run with and without index on
value field. For benchmark number three we used three different
For the rest of the benchmarks we used
IdentifiedDataSerializable. If you are not familiar with our serialization options
check this page.
We didn’t benchmark joins and streaming queries since they cannot be executed with Predicate API. These functionalities are available in SQL engine only.
All the benchmarks were run in throughput and latency modes since we cared about both. In this post I’m going to cover only the latency part of those benchmarks.
First results showed that the SQL was slower in most of them. Next step was to run those benchmarks again with the attached profiler. I used Async-profiler, and since I was tracing latency issues, I chose wall-clock mode. You can read about different modes in my previous post.
All of those benchmarks were run on our testing lab, so we knew that the results were valid (no noisy-neighbor issue). The setup for the benchmarks was:
- Hazelcast cluster size: 4
- Machines with Intel Xeon CPU E5-2687W
- Heap size: 10 GB
How does Hazelcast SQL work?
Each of the cluster members contains two major modules: IMDG and JET. IMDG is a module where we store our data. JET is a distributed batch and stream processing engine. Additionally, for parsing the SQL queries we use Apache Calcite.
When a client sends a query to a cluster, that query is just a string with parameters. The query is sent to a random member of the cluster. That member is called a coordinator of that query. The coordinator needs to:
- Create a query plan
- Convert the query plan to JET job
- Distribute the JET job to all the members
After that is done all the members are executing the query, and the results are streamed by the coordinator to the client.
If you do sampling profiling you need to visualize the results. The results are nothing more than a set of stacktraces. My favorite way of visualization is a flame graph. The easiest way to understand what flame graphs are is to understand how they are created.
First part is to draw a rectangle for each frame of each stacktrace. The stacktraces are drawn bottom-up and sorted alphabetically. For example, such a graph:
corresponds to set of stactraces:
- 3 samples -
a() -> h()
- 5 samples -
b() -> d() -> e() -> f()
- 2 samples -
b() -> d() -> e() -> g()
- 2 samples -
b() -> d()
- 2 samples -
The next step is joining the rectangles with the same method name to one bar:
The flame graph usually shows you how some resource is utilized by your application. The resource is utilized by the top methods of that graph (visualized with green bar):
So in this example method
b() is not utilizing the resource at all, it just invokes methods that do it. Flame graphs
are commonly used to present the CPU utilization, but CPU is just one of the resources that we can visualize this way.
If you use wall-clock mode then your resource is time. If you use allocation mode then your resource is
SELECT __key, this FROM IMap
- Values in IMap have fields:
int, serialized with
- IMap size - 100_000 entries
- Latency test - 24 queries per second
Let’s start with profiling the client side. Here is a wall-clock flame graph of the method that fetches the data on the client’s side:
Let’s look at the bottom of the flame graph, method
testSelect() iterates over the results and for each row
it executes three things:
accept()- that executes the deserializon of the row - each SQL row travels to the client through the network as a byte array, so we need to deserialize it to the object format
ClientIterator.hasNext()- that one returns the next row if it is already fetched, or waits to receive the next portion of the result set. In the flame graph we see the waiting part. At this point in time the execution is done in the cluster - nothing to improve on the client’s side.
- Some third method, let’s zoom in that part of the flame graph
Here we can see that the method
getColumnValueForClient() (3rd from the top) executes
which consumes a lot of time. Let’s go to the source code:
columns is a field of type
List<List<?>>, and the problematic method executes
LinkedList.get(index) is known to have bad performance, since it has
The instance is not created here, it is passed in the constructor, so we need to debug the application to find the
origin of that
LinkedList. Two breakpoints later I found a place where it was created. One solution is to
ArrayList like this:
But it is not always that easy. The code where that list was created was done many years ago and is used not only by SQL engine. We need to be careful here. I asked colleagues from the other team if we can do it, they did their benchmarks and agreed to that change:
How did that affect our performance? Well it depends on the number of columns and whether we do deserialization on the client’s side (it is done lazily). Here is the latency distribution for two columns and no deserialization:
Was it a mistake to choose
LinkedList in that code? No. It was a coherent decision made by the state of Hazelcast at the
moment of creation of that code.
Thing to remember is that performance is a living organism. It evolves with new features. What was efficient yesterday could be a bottleneck tomorrow.
Cluster side - sum()
SELECT sum(\”value\”) FROM IMap
- Values in IMap have fields: one
Long, and one
int, serialized with
- IMap size - 1_000_000 entries
- Latency test - 6 queries per second
In this benchmark we need to deserialize all the instances in IMap, get a single field from that deserialized object and aggregate it. The majority of work is deserialization. It was strange to me that the Predicate API had different performance than SQL.
Here is a flame graph for Predicate API:
And here is one for SQL:
Now let’s do something a bit silly. Open both graphs in separate browser tabs (right mouse click on the graph -> open graphics in a new tab) and then just switch between those two tabs very quickly and try to find a difference. Did you spot it? It is at the top. The SQL graph has a lot of yellow/red code there, the Predicate API does not. Let’s see what it is about:
The frames that we can see in SQL only are
Class.forName(). We had a class definition cache in Hazelcast already,
and it was strange that it wasn’t used for SQL. We can highlight all invocations of that method to see how much time it
So that loading of classes consumed ~25% of deserialization of Java objects. Let’s look at the code of our class loading:
We have four layers of caches here, each
tryLoadClass() checks if the definition is in the cache, if not it
loads it. If none of the classloaders loaded the class we called
The difference between Predicate API and SQL was that in SQL the
null, since you
cannot use user classes with SQL query. We decided to add another level of cache:
How did that improve the performance?
Thing to remember is that “If it looks stupid but works, it isn’t stupid” – that also applies to the performance analysis. Comparing two flame graphs by quickly switching the browser tabs looks silly, but hey, it works like a charm if you want to compare resource utilization that should look the same.
Cluster side - sum() with JSON
SELECT sum(json_value(this, '$.value' returning integer)) FROM cache
- Values in IMap have two fields:
int, serialized as
- IMap size - 1_000_000 entries
- Latency test - 5 queries per second
Let’s go straight to the flame graph of evaluation a value from JSON:
The big left part of that graph is acquiring a lock, it utilizes:
~70% of the time of evaluation of the value. That part of code uses Guava cache that uses
internally. The cache contains a mapping from string JSON path to our object that represents that path. The usage of
that cache is usually single insert and multiple reads. For such a usage pattern the
ReentrantLock is not the
ConcurrentHashMap is better for example. Simple switch from Guava cache to CHM did this improvement:
In the end we decided to do two implementations of a cache, one for a single JSON path based on a field, the second for the rest of the cases.
Thing to remember is that the performance bottleneck may be in 3rd party libraries/frameworks. It doesn’t mean that their code is bad, usually we simply do not know the trade-offs there, and that may hurt us.
Cluster side - scan for a single value by index
SELECT __key, this FROM iMap where col=...; // Index on col
- Values in IMap have two fields:
int, serialized with
- IMap size - 10_000_000 entries
- Latency test - 7500 queries per second
This is a benchmark where we test a distributed deployment overhead of a job on a busy cluster. The part after the deployment is easy, we just need to take an index and fetch a single row from it. Unfortunately in such queries we will never be better than Predicate API, since the execution plan for SQL is much bigger than for Predicate API, and serialization and deployment takes more time. What we can do is to speed up that part in SQL to be as fast as possible.
When I ran the first instance of that benchmark the mean latency was around 4,5ms on a stressed cluster. To fight that kind of latency we need to focus on all the resources that are needed for that code to run. Let’s look at the heap allocation:
Over 90% of recorded allocation (during creation of the plan) was done in the constructor of
That class is responsible for sending computation results to other nodes in the cluster. It created a 32k byte
array as a buffer of data to send. The value 32k was ok for a task manipulating multiple entries, but for a task
that processes only one row it was a waste of heap memory.
We went with the solution to have a buffer with two initial sizes:
- Small initial size: 1k
- If that buffer is too small it expands to 32k immediately
That approach didn’t hit larger tasks, they need to waste 1k array at the beginning, but allocation of such an array is much cheaper than 32k, since 32k is often allocated outside the TLAB.
That change lowered the allocation rate of that benchmark from 3,8GB/s to 2,3GB/s, but that was not the
only change in that part of our engine. My colleague pointed out that in some cases we created a
that knew it would never send any data through the network. We can avoid the creation of unnecessary
SenderTasklet. That change lowered the heap allocation rate to 1,5GB/s.
There were plenty of PRs for lowering down that deployment overhead. Current status (mean) is:
We lowered the mean latency from 4.5ms to 1.8ms on a stressed cluster, and we still have ideas how to make it even better.
Thing to remember is that allocation on heap is fast, but no allocation is faster.
PRs: https://github.com/hazelcast/hazelcast/pull/20882 and https://github.com/hazelcast/hazelcast/pull/20940
Our SQL engine in release 5.2 is currently faster with better throughput than Predicate API in most of the benchmarks. The only kind of query where Predicate API is faster is a query that normally takes just milliseconds to execute. As I’ve mentioned in the last example, we still have ideas on how to make the SQL engine faster. We are going to realize those in future releases.
In this post I focused on four different issues that can teach us something. Let me gather those four things to remember:
- Performance is a living organism. It evolves with new features. What was efficient yesterday could be a bottleneck tomorrow.
- “If it looks stupid but works, it isn’t stupid” - every flame graph analysis technique is good as long as it works for you
- The performance bottleneck may be hidden in 3rd party libraries/frameworks
- Allocation on heap is fast, but no allocation is faster
One additional “lesson learned” is that the Async-profiler is a very useful tool for finding such high-level bottlenecks. You need to remember that there are issues in which such a profiler won’t help. Some performance bottlenecks can be understood after analyzing assembly code or with Top-down Microarchitecture Analysis.