[Java][JVM Logs] Monday with JVM logs - Classloader logs, can attaching the JFR (and other profilers) cause the outage?

JVM logs

If you enable class log at info level (-Xlog:class+unload,class+load) you can find such an entries:

[class,load           ] org.springframework.jdbc.core.simple.SimpleJdbcCallOperations source: jar:file:/opt/app.jar!/BOOT-INF/lib/spring-jdbc-5.2.12.RELEASE.jar!/
[class,load           ] java.net.Authenticator source: jrt:/java.base
[class,load           ] jdk.internal.reflect.GeneratedSerializationConstructorAccessor301 source: __JVM_DefineClass__
[class,load           ] sun.nio.ch.FileChannelImpl source: __VM_RedefineClasses__
[class,unload         ] unloading class jdk.internal.reflect.GeneratedConstructorAccessor528 0x0000000800b64c40

For loading class we can find the class name, and the source of a loaded definition. The source can be a jar file, or it can be __JVM_DefineClass__ if the class is created at runtime. You can also find the __VM_RedefineClasses__ source, which is a class redefinition, usually made by a profiler through JVMTI.

Attaching the JFR

Let’s run a simple HelloWorld:

import java.io.IOException;

class HelloWorld {
    public static void main(String[] args) throws IOException {
        System.out.println("Hello world");
        System.in.read();
    }
}

java -Xlog:safepoint,class+load,jit+compilation=debug HelloWorld

Let’s start the JFR with JMC:

alt text

The first thing I want you to see is the JIT compilation count.

alt text

You can see that after I run the JFR the compilation count dropped to 0. Let’s look at the logs to find out why could that happen:

[24,434s][info ][class,load     ] java.lang.Throwable source: __VM_RedefineClasses__
[24,435s][info ][class,load     ] java.lang.Error source: __VM_RedefineClasses__
[24,435s][info ][safepoint      ] Application time: 0,0628897 seconds
[24,435s][info ][safepoint      ] Entering safepoint region: RedefineClasses
[24,437s][debug][jit,compilation]   22       1       java.lang.module.ModuleDescriptor::name (5 bytes)   made not entrant
[24,437s][debug][jit,compilation]   23       1       java.lang.module.ModuleReference::descriptor (5 bytes)   made not entrant
[24,437s][debug][jit,compilation]   14       4       java.lang.StringLatin1::hashCode (42 bytes)   made not entrant
... multiple made not entrant
[24,456s][info ][safepoint      ] Leaving safepoint region
[24,456s][info ][safepoint      ] Total time for which application threads were stopped: 0,0209087 seconds, Stopping threads took: 0,0000114 seconds

From that log we can read that everything started with a class redefinition. The JFR instruments Throwable and Error classes. That redefinition started the stop the world phase RedefineClasses in which all the compilations were removed.

To understand why the JVM does that we need to look at its sources: jvmtiRedefineClasses.cpp:

void VM_RedefineClasses::redefine_single_class(jclass the_jclass,
       InstanceKlass* scratch_class, TRAPS) {
...
  // Deoptimize all compiled code that depends on this class
  flush_dependent_code(the_class, THREAD);
...
}

void VM_RedefineClasses::flush_dependent_code(InstanceKlass* ik, TRAPS) {
  assert_locked_or_safepoint(Compile_lock);

  // All dependencies have been recorded from startup or this is a second or
  // subsequent use of RedefineClasses
  if (JvmtiExport::all_dependencies_are_recorded()) {
    CodeCache::flush_evol_dependents_on(ik);
  } else {
    CodeCache::mark_all_nmethods_for_deoptimization();
...
  }
}

If it is the first redefinition, and the recording of dependencies was not enabled at startup, then all the methods are marked as made not entrant. Here is a quote from a very nice comment in the JVM sources:

Deoptimize all compiled code that depends on this class.

If the can_redefine_classes capability is obtained in the onload phase then the compiler has recorded all dependencies from startup. In that case we need only deoptimize and throw away all compiled code that depends on the class.

If can_redefine_classes is obtained sometime after the onload phase then the dependency information may be incomplete. In that case the first call to RedefineClasses causes all compiled code to be thrown away. As can_redefine_classes has been obtained then all future compilations will record dependencies so second and subsequent calls to RedefineClasses need only throw away code that depends on the class.

It’s not only the JFR

This problem occurs in every profiler that:

is attached at runtime
instruments a single class

Why can this cause the outage?

It is a simple snowball effect, here is an example of one scenario I have seen (REST service):

The JFR is run to profile the application
All the JIT compilations are removed
The JIT compiler needs to recompile every method/loop -> it needs the CPU to do it
There are no JIT compilations. so the whole application is running in the interpreter -> it is slower and consumes more CPU
The application is slower, but it doesn’t mean that the incoming traffic is any lower
The application needs to create more threads to handle the traffic -> it needs the CPU and the memory
More requests are processed at the same time -> it needs more memory on the heap and causes the bigger allocation rate
The increase of allocation rate needs more often garbage collections -> it needs the CPU

All that sudden CPU and memory consumption may cause the outage. The application may be killed by Kubernetes, may be removed from a cluster by load balancer and so on.