
[Java][JMX][Monitoring] The ultimate monitoring of the JVM application - out of the box

Context

Most of the applications I have seen were deployed on virtual machines. Most of them used Spring Boot with Tomcat as the web server and Hikari as the connection pool. In this article I’m going to show you the metrics for that stack. If you are using different technologies, you need to find the corresponding metrics yourself. The JMX MBean names are taken from Java 11 with G1GC.

Mind that some of the following metrics may not make sense in the cloud.

Foreword

This article is very long because there are a lot of metrics you should focus on. In my applications I have all of those metrics gathered in one dashboard. It is the first place I look when there is an outage. That dashboard doesn’t tell me the exact cause of the outage, but it directs me to the source of the problem.

Hardware

Let’s start from the bottom. We have four crucial hardware components that almost all applications use: the CPU, RAM, disk, and network.

Mind that there are other components, like disk controllers, bridges and so on, but I’m going to focus on those four.

CPU

CPU utilization chart (7 days)


Possible failures:

In this situation you need to check whether that resource is utilized by your JVM. The first one can be easily diagnosed via JMX; I’m going to cover that later. The second one I usually diagnose at the OS level using pidstat -t <time interval>.

RAM

Available memory chart (7 days)

[2 charts]

You need to understand the difference between available and free memory. I suggest you read the beginning of man free on Linux. Long story short, your OS has internal caches and buffers. They are used to improve the performance of your applications. Your OS can use them as long as there is enough memory available. If your application needs memory currently used by a cache/buffer, the OS will drop that cache/buffer to make the memory available to you.

Simplifying:

Possible failures:

The first situation is shown in the second chart above: the available memory dropped to 0 and the OOM killer killed the JVM. In that situation you need to check the OS logs to find out which process ate the memory.

The second situation is easy to diagnose when you also look at the next resource.

Disk

IO operations (reads/writes) chart (1 day)

[2 charts]

Possible failures:

IO operations don’t appear out of nowhere. If an increase in IO reads correlates with a reduction in available memory, then you probably have a situation where a crucial cache/buffer was dropped (shown in the second chart above). In that situation you need to check RAM consumption to see which process is eating the memory.

The increase in IO operations may also be caused by your JVM; you can check it with pidstat -d <time interval> on Linux. If that is your problem, I suggest using async-profiler in wall mode. With that profiler you can easily find the part of your code that is doing the IO.

Network

Net transfer (in/out) chart (1 day)


Possible failures:

In the first situation, you first need to check whether that traffic is generated by your JVM. There are multiple tools to check it; I like nethogs. If the traffic is generated by the JVM then, again, async-profiler in wall mode will show you which part of your application generates it.

If there is no traffic, then:

JVM (from JMX)

CPU (again)

CPU utilization chart (7 days)

[3 charts]

The JVM reports two metrics through JMX: the CPU utilization at the OS level (which is the same as the first metric in this article), and how much of that utilization is caused by this particular JVM. Comparing those two values tells you whether your application is responsible for the utilization. The last chart above shows a case where the application cannot perform well because another process consumes a whole CPU.
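
If you want to read those two values in-process instead of from a JMX console, a minimal sketch could look like this. It relies on the com.sun.management extension of OperatingSystemMXBean (HotSpot-specific); over remote JMX the same values are exposed as the SystemCpuLoad and ProcessCpuLoad attributes of java.lang:type=OperatingSystem.

```java
import java.lang.management.ManagementFactory;

import com.sun.management.OperatingSystemMXBean;

public class CpuLoadSample {
    public static void main(String[] args) throws InterruptedException {
        // HotSpot-specific subinterface of java.lang.management.OperatingSystemMXBean
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        // Both values are fractions between 0.0 and 1.0 (or negative when not available yet).
        for (int i = 0; i < 5; i++) {
            double wholeOs = os.getSystemCpuLoad();  // whole machine (getCpuLoad() on Java 14+)
            double thisJvm = os.getProcessCpuLoad(); // this JVM only
            System.out.printf("os=%.1f%% jvm=%.1f%%%n", wholeOs * 100, thisJvm * 100);
            Thread.sleep(1_000);
        }
    }
}
```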

Heap after GC

I’ve already written why that metric is useful in the previous article.

Heap after GC chart (7 days)

[3 charts]

The committed and the max heap size can be obtained from:

The heap after GC can be obtained from:

I use that chart to find out whether there is a heap memory leak. Check the previous article for details. The last chart above shows a memory leak. If you have a memory leak in your application, then you have to analyze a heap dump.
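
Which MBeans you use depends on your collector and setup. As a rough in-process equivalent, the committed/max sizes and the usage right after the last collection can be read with the standard java.lang.management API; a sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class HeapAfterGcSample {
    public static void main(String[] args) {
        // Committed and max heap size come from the Memory MXBean
        // (java.lang:type=Memory, attribute HeapMemoryUsage over remote JMX).
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("committed: %d MB, max: %d MB%n",
                heap.getCommitted() >> 20, heap.getMax() >> 20);

        // "Heap after GC" is roughly the sum of CollectionUsage over the heap pools;
        // CollectionUsage is the pool usage measured right after the most recent collection.
        long usedAfterGc = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                usedAfterGc += afterGc.getUsed();
            }
        }
        System.out.printf("heap after last GC: %d MB%n", usedAfterGc >> 20);
    }
}
```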

Heap before GC

I’ve already written why that metric is useful in the previous article.

Heap before GC chart (7 days)

[2 charts]

The committed and the max heap size can be obtained from:

The heap before GC can be obtained from:

I use that chart to find out whether the heap is wasted because the GC starts its work too early. Check the previous article for details. The second chart above shows an application where the GC runs inefficiently. If you have such a situation, then you need to analyze the GC log.
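
If you prefer to capture these values in-process, GC notifications carry both the before and the after usage. A sketch using the HotSpot-specific com.sun.management notification API (the filtering to heap pools only is my own assumption, to keep the numbers comparable with the heap charts):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

import com.sun.management.GarbageCollectionNotificationInfo;

public class HeapBeforeGcListener {

    public static void install() {
        // Names of the heap pools (for G1: Eden, Survivor, Old Gen), so that non-heap
        // pools (Metaspace, code heaps) are ignored in the notification data.
        Set<String> heapPools = ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(pool -> pool.getType() == MemoryType.HEAP)
                .map(MemoryPoolMXBean::getName)
                .collect(Collectors.toSet());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    return;
                }
                GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                        .from((CompositeData) notification.getUserData());
                long before = sumHeap(info.getGcInfo().getMemoryUsageBeforeGc(), heapPools);
                long after = sumHeap(info.getGcInfo().getMemoryUsageAfterGc(), heapPools);
                System.out.printf("%s: heap before=%d MB, after=%d MB%n",
                        info.getGcName(), before >> 20, after >> 20);
            }, null, null);
        }
    }

    private static long sumHeap(Map<String, MemoryUsage> usage, Set<String> heapPools) {
        return usage.entrySet().stream()
                .filter(entry -> heapPools.contains(entry.getKey()))
                .mapToLong(entry -> entry.getValue().getUsed())
                .sum();
    }
}
```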

Full GC count (G1GC)

Full GC count chart (7 days)

[5 charts]

The best situation is presented in the first chart: 0 full GCs. Possible failures:

If you have the first situation, and it is a problem for you, then you need to analyze the GC logs. That is where you will find the information on why G1GC had to run that phase.

If you have the second situation, it usually means that your application runs a part of code that needs almost all of the free heap. What I usually do in that situation is check the access log and the application log to find out what was going on at the time of the full GCs. Continuous profiling is also useful in that case.

The last situation usually means that your application runs a part of code that needs more heap than you have. It usually ends with an OutOfMemoryError. I strongly recommend enabling -XX:+HeapDumpOnOutOfMemoryError; it will create a heap dump when such a situation occurs, which is the best way to find the cause.
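
For reference, the full GC count itself can be read from the standard GarbageCollector MXBeans. A sketch; with G1 on Java 11 I assume the collector names "G1 Young Generation" and "G1 Old Generation", and exactly which events the old-generation bean counts may differ between JDK versions:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCountSample {
    public static void main(String[] args) {
        // With G1 on Java 11 this typically prints two beans:
        // "G1 Young Generation" and "G1 Old Generation" (the latter covers full GCs).
        // Over remote JMX: java.lang:type=GarbageCollector,name=<collector name>.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```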

Code cache size

Code cache size chart (7 days)


Possible failures:

The code cache contains the JIT compilations. If that part of memory cannot fit a new compilation, then the JIT compiler is disabled entirely. After that, the performance of your application starts to degrade. You need to remember that the code cache can become fragmented. As a rule of thumb, I assume that if the application uses >=90% of that part of memory, then it needs to be increased.
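
If you want to check that 90% rule in-process, the code heaps are exposed as memory pools. A sketch; the pool names below assume the segmented code cache used since Java 9:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class CodeCacheSample {
    public static void main(String[] args) {
        // With the segmented code cache the pools are named "CodeHeap 'non-nmethods'",
        // "CodeHeap 'profiled nmethods'" and "CodeHeap 'non-profiled nmethods'";
        // older setups expose a single "Code Cache" pool.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (!name.startsWith("CodeHeap") && !name.equals("Code Cache")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            double usedOfMax = 100.0 * usage.getUsed() / usage.getMax();
            System.out.printf("%s: %d KB of %d KB (%.1f%%)%n",
                    name, usage.getUsed() >> 10, usage.getMax() >> 10, usedOfMax);
        }
    }
}
```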

Off-heap usage

Off-heap usage charts (7 days)

[2 charts]

Possible failures:

Most of the time the off-heap is used by frameworks/libraries. These metrics are useful for checking that there is no memory leak in this area.
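
If you gather these values yourself, the direct and mapped buffer pools and the JVM’s overall non-heap usage are available through standard MXBeans. A minimal sketch (memory allocated by native code outside the JVM’s accounting will not show up here):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class OffHeapSample {
    public static void main(String[] args) {
        // Direct and mapped ByteBuffers are reported by the BufferPool MXBeans
        // (java.nio:type=BufferPool,name=direct / name=mapped over remote JMX).
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s buffers: count=%d, used=%d MB%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed() >> 20);
        }

        // The JVM's own non-heap areas (Metaspace, code heaps, ...) are summarized here.
        System.out.printf("non-heap used: %d MB%n",
                ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed() >> 20);
    }
}
```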

To find out which part of your code allocates memory with these two mechanisms you can use:

Loaded class count

Loaded class count chart (7 days)


Possible failures:

This problem was common in EJB servers. If you do not use hot swap or context reload, there is little chance that you have this leak. You need to remember that it is completely normal for that chart to be ascending. There are mechanisms (like the evil reflection) that create classes at runtime. If you think there is a memory leak, the first thing you should look at are the class loader logs.

The heap dump is also useful in that situation. It contains all class definitions and can tell you why they are alive.
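
For reference, this chart can be fed from the standard ClassLoading MXBean (java.lang:type=ClassLoading over remote JMX); a minimal sketch:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class LoadedClassesSample {
    public static void main(String[] args) {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("currently loaded=%d, loaded since start=%d, unloaded=%d%n",
                classLoading.getLoadedClassCount(),
                classLoading.getTotalLoadedClassCount(),
                classLoading.getUnloadedClassCount());
    }
}
```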

Hikari active connection count

Hikari active connection count chart (7 days)


This metric is useful for scaling and sizing your application. It tells you how many active connections to your database you need. If this value is too big for you, your code may be spending too much time with an open transaction. The way I find such situations in the Spring framework is to take the async-profiler output in wall mode and look for long methods wrapped by the invokeWithinTransaction method.
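
If you want to read the pool metrics yourself, Hikari can expose them over JMX when registerMbeans is enabled. A sketch; the object name format and the default pool name "HikariPool-1" are assumptions that may differ in your setup:

```java
import java.lang.management.ManagementFactory;

import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.zaxxer.hikari.HikariPoolMXBean;

public class HikariPoolSample {
    public static void main(String[] args) throws Exception {
        // Requires registerMbeans=true in the Hikari configuration.
        // "HikariPool-1" is the default pool name; replace it with your poolName.
        // (Inside the application you can skip JMX entirely and call
        // HikariDataSource#getHikariPoolMXBean() instead.)
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.zaxxer.hikari:type=Pool (HikariPool-1)");
        HikariPoolMXBean pool = JMX.newMXBeanProxy(server, name, HikariPoolMXBean.class);

        System.out.printf("active=%d, idle=%d, total=%d, threads awaiting connection=%d%n",
                pool.getActiveConnections(),
                pool.getIdleConnections(),
                pool.getTotalConnections(),
                pool.getThreadsAwaitingConnection());
    }
}
```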

Hikari threads awaiting connection

Hikari threads awaiting connection chart (7 days)


Possible failures:

The second situation is simple: your database connection pool cannot serve all the requests. Either the pool is simply too small, or you have the situation I explained in the previous chapter.

The first situation (second chart above) means that you have one of the following issues:

The last situation can be diagnosed the same way I described above. The easiest way to diagnose the second option is to start at the database level; there are dedicated tools to monitor that area, like Oracle Enterprise Manager.

Tomcat connection count

Tomcat connection count charts (7/5 days)

[2 charts]

Possible failures:

This metric tells you how many connections are established to your server. If it is increasing rapidly, then most probably you have encountered one of the following situations:

You need to remember that if your application slows down, the number of incoming requests doesn’t decrease. The only way for an application server to handle the same number of requests with a slower application is to create more worker threads and accept more connections. If that is your case (the second chart above), then you need to find out why your application got slower; the best way, from my experience, is async-profiler in wall mode.
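
If you want to pull this metric yourself, Tomcat can expose its thread pools over JMX. A sketch; the domain, the Spring Boot property and the attribute names below are assumptions that may differ between Tomcat versions and setups:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TomcatConnectionsSample {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Embedded Tomcat registers under the "Tomcat" domain (in Spring Boot only when
        // server.tomcat.mbeanregistry.enabled=true); a standalone Tomcat uses "Catalina".
        Set<ObjectName> pools = server.queryNames(new ObjectName("Tomcat:type=ThreadPool,*"), null);
        for (ObjectName pool : pools) {
            System.out.printf("%s: connections=%s, busy threads=%s%n",
                    pool.getKeyProperty("name"),
                    server.getAttribute(pool, "connectionCount"),
                    server.getAttribute(pool, "currentThreadsBusy"));
        }
    }
}
```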

Created thread count

Created thread count chart (7 days)


If the number of created threads keeps increasing, it means a thread pool is missing somewhere. I’ve written an article explaining how to find the part of the code that lacks such a pool.

Current thread count

Current thread count charts (7 days)

[3 charts]

Possible failures:

There are some situations where the application has a thread leak; they are very rare. That situation can be diagnosed the same way as creating too many threads. A huge increase in threads over a short period means pretty much the same as an increasing Tomcat connection count: the application needs more threads to do its work. We can find the reason with async-profiler in wall mode.
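
Both this chart and the previous one can be fed from the standard Threading MXBean (java.lang:type=Threading over remote JMX); a minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountSample {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.printf("live=%d (daemon=%d), peak=%d, started since JVM start=%d%n",
                threads.getThreadCount(),
                threads.getDaemonThreadCount(),
                threads.getPeakThreadCount(),
                threads.getTotalStartedThreadCount());
    }
}
```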

Afterword

As I said in the foreword, a dashboard with all those metrics helps me handle outages. Here are some examples:

… and so on.

Those charts do not solve the outage, but they show you where to focus your diagnosis. Remember, I covered only the technology I use. You may have other metrics worth gathering (MQ, Kafka, custom thread pools and so on).

After fifteen years of handling different kinds of outages I can tell you one thing: I never start the diagnosis from the application log.