Java's New Performance Paradigm: Mastering Virtual Threads and Native Image
Java's performance landscape has fundamentally shifted with the introduction of virtual threads for massive I/O concurrency and GraalVM Native Image for instant startup. This report deconstructs these technologies and provides a framework for architects to navigate this new dichotomy.
For over two decades, high-performance Java has been synonymous with the sophisticated dance of the Just-In-Time (JIT) compiler, a world of warmup cycles, tiered compilation, and intricate garbage collection tuning. That era of a monolithic performance model is over. The contemporary Java landscape has fundamentally bifurcated, presenting architects and engineers with two distinct, powerful, and sometimes overlapping axes of optimization.
The first axis is throughput scalability for I/O-bound workloads, masterfully addressed by Project Loom's virtual threads. Finalized in Java 21, virtual threads liberate developers from the constraints of costly operating system (OS) threads, enabling massive concurrency with minimal resource overhead on the standard Java Virtual Machine (JVM).¹ This innovation revitalizes the simple and intuitive "thread-per-request" programming model, allowing it to scale to millions of concurrent tasks.
The second axis is startup speed and footprint efficiency, championed by GraalVM's Native Image technology. Through Ahead-of-Time (AOT) compilation, Native Image transforms Java applications into lean, self-contained native executables that start in milliseconds and consume a fraction of the memory of their JVM-based counterparts.²
The modern architect's challenge—and critical skill—is no longer just about tuning a single runtime but about navigating this new dichotomy. These two models are not mutually exclusive; they represent a spectrum of trade-offs. Understanding when to leverage the massive concurrency of virtual threads on the JVM, when to prioritize the instant-on efficiency of a native executable, and when to combine them is now the central question in building high-performance systems on the Java platform.⁴ This report deconstructs the mechanics of each of these transformative technologies, provides a practical framework for making architectural decisions, examines their integration into the modern framework ecosystem, and looks toward their eventual convergence in the future of the Java platform.
Deconstruction I: Virtual Threads and the Conquest of I/O Throughput
The Problem Loom Solves: The Tyranny of the Platform Thread
At the heart of Project Loom's innovation is a solution to a long-standing problem: the prohibitive cost of Java's traditional threads. Historically, every instance of java.lang.Thread has been a thin wrapper around a heavyweight OS thread, creating a rigid one-to-one mapping.¹ These OS threads are a scarce resource, managed and scheduled by the kernel. Their creation entails significant overhead, most notably the allocation of a large, fixed-size memory block (often 1 MB or more) for the thread's stack.⁷
For the classic "thread-per-request" server architecture, this model imposes a severe scalability ceiling. An application could only handle as many concurrent requests as it could afford to create OS threads, leading to thread pool exhaustion and high memory consumption long before the underlying hardware (CPU, network) was saturated.⁸ This fundamental limitation forced the Java ecosystem into a decade-long detour towards complex asynchronous and reactive programming paradigms. Frameworks built on CompletableFuture and reactive streams were developed not necessarily because they were more intuitive, but as a necessary workaround to share a small pool of precious OS threads among many concurrent I/O operations.⁷ This came at a high cost: code became harder to write, read, and debug, with fragmented stack traces and the infamous "callback hell".¹⁰
The Mechanics of Lightness: Continuations and Heap-Based Stacks
Virtual threads dismantle this limitation by introducing a lightweight, JVM-managed implementation of java.lang.Thread.⁴ The core innovation that makes them "light" is the decoupling of the thread from its stack's storage. Instead of relying on a monolithic stack block allocated by the OS, a virtual thread's stack frames are stored as small objects on the Java heap.⁹
This is enabled by a low-level JVM mechanism called Continuations. A continuation allows the runtime to capture the execution state of a piece of code at any point, save it, and resume it later, all within user space without involving the OS kernel.¹² This means a virtual thread's memory footprint starts at a mere few hundred bytes and grows on the heap only as its call stack deepens.⁹ This efficiency is why a single JVM can now support millions of virtual threads, where it previously struggled with a few thousand platform threads.¹³
It is crucial to understand, however, that virtual threads are a tool for increasing scale (higher throughput), not speed (lower latency). They do not execute bytecode any faster than platform threads. Their power lies in dramatically increasing the number of concurrent tasks an application can handle, especially when those tasks spend most of their time waiting.⁶
This paradigm shift redefines best practices. Thread pools were invented to amortize the high cost of creating and destroying platform threads.⁸ Since virtual threads are exceptionally cheap, pooling them is an anti-pattern that introduces unnecessary complexity and overhead. The correct model is to create a new virtual thread for each concurrent task and let it run to completion, a fundamental departure from decades of established Java concurrency patterns.¹⁶
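To make the thread-per-task model concrete, the following minimal sketch launches ten thousand tasks, each on its own virtual thread, using the JDK's built-in per-task executor (the simulated one-second "I/O wait" stands in for a real network or database call):

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class VirtualThreadScaleDemo {
    public static void main(String[] args) {
        // Each submitted task gets its own new virtual thread; no pool sizing or reuse.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 10_000).forEach(i ->
                executor.submit(() -> {
                    // Simulated blocking I/O; the virtual thread unmounts from its carrier here.
                    Thread.sleep(Duration.ofSeconds(1));
                    return i;
                }));
        } // the executor's close() waits for all submitted tasks to complete
    }
}
```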
The Scheduler's Dance: M:N Scheduling in Practice
Virtual threads are managed by a scheduler within the JDK, which implements an M:N scheduling model. A large number (M) of virtual threads are scheduled to run on a small, fixed-size pool (N) of platform threads, which are referred to as carrier threads in this context.⁹ By default, this scheduler is a java.util.concurrent.ForkJoinPool operating in a First-In, First-Out (FIFO) manner, with a parallelism level (the number of carrier threads) equal to the number of available CPU cores.¹
The process is an elegant dance of "mounting" and "unmounting." When a virtual thread is ready to execute, the scheduler mounts it onto an available carrier thread. The carrier thread then executes the virtual thread's code. The magic happens when the code encounters a blocking I/O operation, such as a socket read or a call to BlockingQueue.take(). The JDK's networking and concurrency APIs are now Loom-aware. They intercept this blocking call, save the virtual thread's state (its continuation) to the heap, and unmount it from the carrier. The carrier thread is instantly freed to be used by another runnable virtual thread.⁴
When the I/O operation completes (e.g., data arrives on the socket), the virtual thread becomes eligible to run again. The scheduler then places it back in the queue to be mounted on any available carrier thread to resume its execution exactly where it left off.¹² This entire cycle is transparent to the application code, which can be written in a simple, blocking, sequential style while achieving the high efficiency of non-blocking I/O under the hood.
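One way to observe the mount/unmount cycle is to print a virtual thread's identity before and after a blocking call; in current JDKs its toString() includes the carrier thread, which may differ across the block. A small sketch (the thread-name format and the jdk.virtualThreadScheduler.parallelism property are implementation details, not guaranteed API):

```java
import java.time.Duration;

public class CarrierThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // The carrier pool size defaults to the CPU count; it can be tuned (implementation-specific)
        // with -Djdk.virtualThreadScheduler.parallelism=N.
        Thread vt = Thread.ofVirtual().name("my-virtual-thread").start(() -> {
            // Typically prints something like:
            // VirtualThread[#23,my-virtual-thread]/runnable@ForkJoinPool-1-worker-1
            System.out.println("Before blocking: " + Thread.currentThread());
            try {
                Thread.sleep(Duration.ofMillis(100)); // blocking: the virtual thread unmounts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // After resuming, the "...worker-N" suffix (the carrier) may be a different platform thread.
            System.out.println("After blocking:  " + Thread.currentThread());
        });
        vt.join();
    }
}
```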
The Achilles' Heel: Understanding and Mitigating Thread Pinning
The scalability of virtual threads hinges on their ability to unmount from carrier threads during blocking operations. However, there are situations where a virtual thread can become pinned to its carrier. Pinning occurs when a virtual thread enters a blocking operation but cannot be unmounted, effectively monopolizing the carrier thread for the duration of the block. This turns the lightweight virtual thread back into a heavyweight platform thread, undermining the entire scalability model.⁶ If all available carrier threads become pinned, no other virtual threads can make progress, leading to thread starvation and potential application-wide freezes.¹⁸
In JDK 21, the two primary causes of pinning are:
- Executing code inside a synchronized block or method.⁶
- Executing a native method or a foreign function via JNI.⁶
Fortunately, the ecosystem is rapidly evolving to address these challenges. The most critical mitigation strategy is to replace synchronized blocks that guard I/O or other long-running blocking operations with locks from the java.util.concurrent package, such as ReentrantLock. These locks are integrated with the Loom scheduler and will properly unmount the virtual thread while it waits to acquire the lock.⁶
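As a sketch of this mitigation (callBlockingService is a hypothetical stand-in for any blocking I/O operation), the synchronized guard is replaced with a ReentrantLock:

```java
import java.util.concurrent.locks.ReentrantLock;

class RemoteClient {
    private final ReentrantLock lock = new ReentrantLock();

    // Before (problematic on JDK 21): a synchronized method pins the virtual thread
    // to its carrier for the entire duration of the blocking call.
    // synchronized String fetchPinned() { return callBlockingService(); }

    // After: ReentrantLock is Loom-aware, so the virtual thread can unmount both while
    // waiting to acquire the lock and while blocked inside the guarded I/O call.
    String fetch() {
        lock.lock();
        try {
            return callBlockingService();
        } finally {
            lock.unlock();
        }
    }

    private String callBlockingService() {
        // Placeholder for a real network or database call.
        return "response";
    }
}
```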
Furthermore, developers can diagnose pinning issues using JDK Flight Recorder (JFR), which emits a jdk.VirtualThreadPinned event, or by using the system property -Djdk.tracePinnedThreads=full to get a stack trace whenever a thread blocks while pinned.⁶ It is also noteworthy that JEP 491, delivered in JDK 24, has largely eliminated pinning for synchronized blocks by re-engineering monitor locking, making the platform significantly more virtual-thread-friendly out of the box.²¹
Code in Action: A High-Concurrency I/O-Bound Service
Consider a web service endpoint that must concurrently fetch data from two separate downstream services—one for user details and another for order history—before combining the results.
With virtual threads, this logic can be expressed with straightforward, imperative code using an ExecutorService created via Executors.newVirtualThreadPerTaskExecutor(). For robust error handling and lifecycle management, StructuredTaskScope (incubated in JDK 19 and available as a preview feature since JDK 21) provides an elegant solution:
// Simplified conceptual example (JDK 21 structured concurrency preview; requires --enable-preview)
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    // Fork two concurrent subtasks, each on a new virtual thread
    StructuredTaskScope.Subtask<User> userSubtask = scope.fork(() -> fetchUserDetails(userId));
    StructuredTaskScope.Subtask<List<Order>> ordersSubtask = scope.fork(() -> fetchOrderHistory(userId));

    // Wait for both subtasks to complete or for one to fail
    scope.join();
    scope.throwIfFailed(); // Propagate the exception if any subtask failed

    // If successful, combine the results
    User user = userSubtask.get();
    List<Order> orders = ordersSubtask.get();
    return new UserProfile(user, orders);
}
This code is easy to read, debug, and maintain. It uses the familiar try-with-resources construct, and results are retrieved with plain method calls once both subtasks have completed. The StructuredTaskScope ensures that if the main thread is interrupted or one task fails, the other is automatically cancelled, preventing resource leaks.²² This stands in stark contrast to the equivalent implementation using CompletableFuture, which would involve a chain of lambda expressions and complex error-handling logic.¹⁰
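For comparison, a rough CompletableFuture equivalent (assuming the same fetch methods, here taken to throw only unchecked exceptions) pushes the coordination into callbacks:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Rough asynchronous equivalent of the structured example above.
UserProfile loadProfile(String userId, ExecutorService executor) {
    CompletableFuture<User> userFuture =
            CompletableFuture.supplyAsync(() -> fetchUserDetails(userId), executor);
    CompletableFuture<List<Order>> ordersFuture =
            CompletableFuture.supplyAsync(() -> fetchOrderHistory(userId), executor);

    // Results are combined in a callback; failures surface as wrapped CompletionExceptions,
    // and cancelling the surviving sibling on failure must be wired up by hand.
    return userFuture.thenCombine(ordersFuture, UserProfile::new).join();
}
```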
| Feature | Platform Thread | Virtual Thread |
|---|---|---|
| Management | OS-Managed ¹ | JVM-Managed ¹² |
| Resource Cost | Heavyweight ¹ | Lightweight ¹² |
| Memory Footprint | Large, fixed stack (~1MB+) ⁶ | Small, heap-based, resizable stack (starts ~few hundred bytes) ⁹ |
| Scheduling Model | 1:1 with OS threads ⁶ | M:N scheduled on carrier threads ¹¹ |
| Creation Strategy | Pool for reuse ⁸ | Create new per task (anti-pooling) ¹⁶ |
| Ideal Workload | CPU-bound or limited concurrency ¹⁵ | High-concurrency, I/O-bound ⁴ |
Deconstruction II: Native Image and Startup Supremacy
The Mechanics of AOT: The "Closed World" Assumption
GraalVM Native Image represents a radical departure from Java's traditional runtime model. It is an Ahead-of-Time (AOT) compilation technology that transforms Java bytecode into a platform-specific, standalone native executable.² This process is orchestrated by the native-image build tool.
At its core, the builder operates under a "closed world" assumption. It performs a comprehensive static analysis, starting from the application's entry points (like the main method), to discover every single class, method, and field that is reachable during execution.² Any code that is not found during this analysis is considered "dead code" and is eliminated from the final binary. This aggressive dead code elimination applies not only to the application code but also to its dependencies and the JDK itself, resulting in a minimal binary that contains only what is strictly necessary.²
This static, build-time analysis is both the source of Native Image's power and its primary challenge. Java's dynamic features—such as reflection, JNI, dynamic proxies, and classpath resource loading—are often invisible to static analysis. To accommodate them, developers must provide the native-image tool with explicit configuration, typically in the form of JSON files, that declare which dynamic elements need to be included in the closed world.²
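The most common form of this configuration is reachability-metadata JSON (e.g., a reflect-config.json under META-INF/native-image), but the same registrations can be expressed programmatically with GraalVM's Feature API. A hedged sketch, assuming a hypothetical com.example.PaymentDto that is only ever instantiated via reflection:

```java
import org.graalvm.nativeimage.hosted.Feature;
import org.graalvm.nativeimage.hosted.RuntimeReflection;

// Passed to the builder, e.g. with --features=com.example.ReflectionRegistrationFeature
public class ReflectionRegistrationFeature implements Feature {
    @Override
    public void beforeAnalysis(BeforeAnalysisAccess access) {
        // Tell the closed-world analysis to keep this class and its members,
        // even though no statically reachable code references them.
        Class<?> dto = access.findClassByName("com.example.PaymentDto");
        RuntimeReflection.register(dto);
        RuntimeReflection.register(dto.getDeclaredConstructors());
        RuntimeReflection.register(dto.getDeclaredMethods());
        RuntimeReflection.register(dto.getDeclaredFields());
    }
}
```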
The Performance Profile Trade-off: Startup vs. Throughput
The closed-world approach yields a dramatically different performance profile compared to a traditional JIT-enabled JVM.
- Startup and Memory: Native images achieve near-instantaneous startup (measured in milliseconds) and consume a fraction of the memory.² This is because the expensive work of class loading, linking, and initialization is performed at build time. The resulting state of the application, including initialized static fields, is snapshotted into an "image heap" that is loaded directly into memory at runtime.²⁶ With no JIT compiler to run and no classes to interpret, the application is ready for execution almost immediately. This makes native images exceptionally well-suited for serverless functions, command-line tools, and resource-constrained containerized environments where fast startup and low memory footprint are paramount.²
- Peak Throughput: Conversely, for long-running, performance-critical applications, a well-warmed-up JVM can often achieve higher peak throughput than a standard native image.²⁵ The reason for this performance gap is information asymmetry. A JIT compiler operates at runtime, observing the application's live behavior. It collects detailed profiling data on which code paths are "hot," how often branches are taken, and the concrete types seen at polymorphic call sites. It uses this rich, dynamic profile to perform highly aggressive, speculative optimizations that an AOT compiler, lacking this runtime context, cannot safely make.²⁵ The JIT has the luxury of deoptimizing and falling back to interpreted mode if its assumptions prove wrong, a safety net the AOT compiler does not have.²⁵
Bridging the Throughput Gap: Profile-Guided Optimization (PGO)
To address this throughput gap, GraalVM provides Profile-Guided Optimization (PGO), a powerful technique that brings runtime insights to the AOT compiler.³¹ The workflow involves three distinct steps:
- First, the developer builds an instrumented version of the native executable using the --pgo-instrument flag.
- Next, this instrumented binary is run against a representative workload that mimics real-world production traffic. This run generates a profile file (e.g., default.iprof) containing rich data on method call frequencies, branch probabilities, and other runtime behaviors.
- Finally, the application is rebuilt, with the native-image tool consuming the generated profile via the --pgo flag. Armed with this data, the AOT compiler can now perform JIT-like optimizations, such as inlining hot methods and optimizing branches, based on actual usage patterns.³¹
PGO can significantly boost the performance of native executables, making them highly competitive with the JIT for a much broader range of long-running services and closing the peak throughput gap.³⁰
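Concretely, the three-step workflow might look like the following on the command line (the application name, classpath layout, and main class are illustrative; --pgo-instrument and --pgo are the documented GraalVM flags, and PGO itself requires Oracle GraalVM rather than the Community Edition):

```bash
# 1. Build an instrumented executable
native-image --pgo-instrument -cp target/classes -o myservice-instrumented com.example.Main

# 2. Exercise it with a representative workload; a default.iprof profile is written on exit
./myservice-instrumented

# 3. Rebuild, feeding the collected profile back into the AOT compiler
native-image --pgo=default.iprof -cp target/classes -o myservice com.example.Main
```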
Code in Action: A Serverless Function
A classic use case for Native Image is a serverless function, for example, on AWS Lambda, where cold start latency is a critical performance metric.²⁷ A simple Java function can be compiled into a native executable using Maven or Gradle with the GraalVM build tools plugin.²
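The function itself is ordinary Java; a minimal sketch of a handler (the class name and the aws-lambda-java-core dependency are assumptions for illustration):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Map;

// Compiled into the native binary, so no JVM starts at invocation time.
public class GreetingHandler implements RequestHandler<Map<String, String>, String> {
    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        String name = event.getOrDefault("name", "world");
        return "Hello, " + name;
    }
}
```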
The build process would look like this (using Maven):
# Compile Java to bytecode
mvn package
# Build the native executable using the native profile
mvn package -Pnative
The result is a self-contained binary. When deployed as a custom Lambda runtime, this function can exhibit cold start times under 100 ms and consume as little as 50-70 MB of memory, a 10-30x improvement over the same function deployed on a standard JVM, which might take several seconds to start and require hundreds of megabytes of memory.³
| Characteristic | Standard JIT JVM | GraalVM Native Image |
|---|---|---|
| Startup Time | Slow (seconds) ³ | Extremely Fast (milliseconds) ² |
| Time to Peak Performance | Requires warmup (JIT compilation) ²⁹ | Instantaneous (no warmup) ² |
| Peak Throughput | Potentially higher due to runtime profiling ²⁵ | Potentially lower (can be improved with PGO) ²⁵ |
| Memory Usage | Higher (JVM overhead, JIT, etc.) ³ | Significantly Lower ² |
| Package Size | Smaller (JAR file) | Larger (self-contained binary) ³⁶ |
| Dynamic Features | Fully supported out-of-the-box | Requires build-time configuration ² |
The Architect's Crucible: A Decision Framework
Navigating the dual performance models of virtual threads and native images requires a systematic approach. The optimal choice depends entirely on the specific characteristics and requirements of the application workload.
Step 1: Characterize Your Workload
Before choosing a technology, an architect must answer four fundamental questions about the application:
- Nature of Tasks: Is the workload primarily I/O-bound, spending most of its time waiting for network responses, database queries, or message queues? Or is it CPU-bound, focused on intensive computation, data processing, or complex algorithms?⁴
- Process Lifetime: Is the application a short-lived process like a serverless function, a command-line tool, or a batch job that starts, runs, and terminates quickly? Or is it a long-running service such as a monolith, an API gateway, or a stateful application that is expected to run for days or weeks?²
- Concurrency Requirements: Does the application need to handle a massive number of concurrent connections (thousands or more), or does it operate with a small, predictable level of concurrency?¹⁴
- Resource Constraints: Is fast startup a critical business requirement, for instance, to handle traffic spikes via rapid auto-scaling or to minimize cold start penalties in a serverless environment? Is the application deployed in a memory-constrained environment where every megabyte counts?²
Step 2: The Decision Flow
These characteristics can be mapped to a decision flow to identify the optimal performance model. The key is to recognize that virtual threads and native images solve orthogonal problems. Virtual threads address the challenge of scaling I/O concurrency, while Native Image addresses the challenges of startup time and memory footprint. They are not competing solutions but rather distinct tools for different jobs.
- Is near-instant startup (<1s) or minimal memory footprint the primary driver?
  - If Yes, the application is a strong candidate for GraalVM Native Image. This path is typical for serverless functions, CLI tools, and services where rapid scaling is essential.
  - If No, and the application is a long-running service where sustained throughput is more critical than initial startup, the standard JVM is the appropriate starting point.
- For the JVM Path: Does the application handle high-concurrency I/O-bound tasks?
  - If Yes, the optimal choice is Virtual Threads on the JVM. This configuration provides maximum throughput scalability for I/O-heavy workloads while maintaining a simple, imperative coding style. For many modern greenfield microservices, this is becoming the new default choice.
  - If No (e.g., the workload is CPU-bound or has low concurrency), traditional Platform Threads on the JVM remain the best choice. Virtual threads offer no performance benefit for CPU-bound tasks and can introduce a slight overhead.²³
- For the Native Image Path: Is the application a long-running service where peak throughput is also a critical concern?
  - If Yes, the optimal choice is Native Image with Profile-Guided Optimization (PGO). PGO helps close the performance gap with the JIT, making it suitable for high-performance, long-running native services.
  - If No (e.g., it's a short-lived function or CLI tool), a standard Native Image build is sufficient. The overhead of the PGO workflow is unnecessary.
- The "Both" Scenario: Does the Native Image application also perform concurrent I/O?
  - If Yes, the most powerful combination is Native Image with Virtual Threads. This model delivers the best of both worlds: the near-instant startup and low memory footprint of AOT compilation, combined with the massive I/O scalability of virtual threads. This is the ideal architecture for highly efficient, scalable, and responsive microservices.
| Application Archetype | Primary Concern(s) | Recommended Model | Rationale & Key Considerations |
|---|---|---|---|
| High-Traffic API Gateway / BFF | Throughput, High I/O Concurrency | Virtual Threads on JVM | Handles thousands of blocking I/O calls efficiently. Startup time is less critical for a continuously running process. |
| Serverless Function (e.g., AWS Lambda) | Cold Start Time, Memory Cost | GraalVM Native Image | Sub-second startup and low memory footprint directly translate to better performance and lower operational costs.²⁷ |
| Command-Line Interface (CLI) Tool | Instant Execution, User Experience | GraalVM Native Image | Eliminates JVM startup latency for a responsive, native-like feel.² |
| Scalable Cloud-Native Microservice | Fast Startup for Scaling, High I/O Concurrency | Native Image with Virtual Threads | The best of both worlds: fast to scale up/down and efficient at handling requests once running. |
| CPU-Intensive Batch Job / Scientific Computing | Raw Computation Speed, Peak Throughput | Platform Threads on JIT JVM | The JIT compiler can achieve the highest peak performance for long-running, pure computation. Virtual threads provide no benefit.²³ |
Ecosystem in Action: Frameworks Embracing the Duality
While Project Loom and GraalVM provide the foundational technologies, it is the modern Java frameworks that make these advanced performance models accessible and practical for everyday developers. They act as crucial abstraction layers, handling the complexities of native image configuration and thread management, allowing developers to target either deployment model with minimal code changes.
Quarkus: Native-First, Virtual Threads as an Opt-in
Quarkus was designed from the ground up with GraalVM Native Image as a primary target. Its architecture is heavily optimized for fast startup and low memory usage. For I/O-bound workloads, Quarkus allows developers to easily opt into virtual threads on a per-method basis using the @RunOnVirtualThread annotation. This annotation instructs the framework to offload the execution of a specific REST endpoint to a new virtual thread, enabling blocking I/O without stalling the main request processing threads.⁴ Quarkus seamlessly handles the compilation of applications using this feature into a native executable, making the combination of native performance and virtual thread concurrency a first-class citizen.⁴
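A sketch of what this looks like in a Quarkus REST resource (the path, method, and repository call are illustrative):

```java
import io.smallrye.common.annotation.RunOnVirtualThread;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

@Path("/orders")
public class OrderResource {

    @GET
    @RunOnVirtualThread // each request to this endpoint runs on a new virtual thread
    public String listOrders() {
        // Plain blocking code (JDBC, REST clients, etc.) can block freely here
        // without stalling Quarkus's event-loop threads.
        return fetchOrdersFromDatabase();
    }

    private String fetchOrdersFromDatabase() {
        // Hypothetical blocking call standing in for a real repository.
        return "[]";
    }
}
```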
Helidon: Virtual Threads by Default
Oracle's Helidon framework took a bold step with version 4, rewriting its core from the ground up to be built on virtual threads.³⁸ Its lightweight web server, Níma, uses virtual threads as the default execution model for handling incoming requests, fully embracing the thread-per-request paradigm.³⁶ This design choice simplifies development by allowing developers to write straightforward, blocking code by default. For cases where a CPU-bound task needs to be isolated, Helidon provides the @ExecuteOn annotation, giving developers explicit control to dispatch a specific method to a managed pool of platform threads.⁴¹ Helidon also offers full support for GraalVM Native Image, delivering impressive metrics for startup time and memory footprint in a native context.³⁶
Spring Boot: Mainstream Adoption
As the dominant framework in the Java ecosystem, Spring Boot's adoption of these technologies is a major indicator of their maturity. With version 3.2 and Java 21, enabling virtual threads across the framework has become exceptionally simple, requiring only a single property in application.properties: spring.threads.virtual.enabled=true.⁴³ This single flag reconfigures the embedded Tomcat to use virtual threads for request processing and enables virtual thread execution for @Async methods, @Scheduled tasks, and various messaging listeners.⁴³ Combined with Spring's robust, production-ready support for GraalVM Native Image, developers in the Spring ecosystem now have a clear and well-supported path to leverage either or both of these modern performance models.⁴³
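A minimal sketch of what this looks like in practice (the controller and downstream URL are illustrative; the property is the documented Spring Boot flag):

```java
// application.properties: spring.threads.virtual.enabled=true  (Java 21+, Spring Boot 3.2+)

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestClient;

@RestController
class ProfileController {

    private final RestClient restClient = RestClient.create("https://downstream.example.com");

    // With virtual threads enabled, the embedded Tomcat dispatches each request to a new
    // virtual thread, so this blocking call no longer ties up a scarce platform worker.
    @GetMapping("/profile")
    String profile() {
        return restClient.get().uri("/users/me").retrieve().body(String.class);
    }
}
```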
Conclusion: The Future is Convergent
The current dichotomy between the high-throughput JVM and the fast-starting native executable is not a permanent state. The future of Java performance is one of convergence, with the lines between these two models blurring, largely driven by the efforts of Project Leyden within OpenJDK.
The Road Ahead with Project Leyden
Project Leyden's stated goal is to improve the startup time, time to peak performance, and footprint of Java programs—the very same domains where GraalVM Native Image currently excels.⁴⁴ However, Leyden's approach is fundamentally different and aims to be less disruptive. Instead of requiring a separate, "closed-world" AOT compilation step, Leyden seeks to enhance the standard JVM. It introduces the concept of "condensing," where computations traditionally performed at runtime (like class loading, linking, and even JIT profiling) can be shifted to an earlier phase, such as build time or a "training run".⁴⁵
The output of this condensing phase is a cache that the JVM can use on subsequent runs to start much faster and reach peak performance sooner, all while preserving the full dynamic capabilities of the Java platform.⁴⁷ Several JEPs from this project have already been delivered, including JEP 483 (Ahead-of-Time Class Loading & Linking) and JEP 515 (Ahead-of-Time Method Profiling), with AOT code compilation on the roadmap.⁴⁴
The Strategic Shift: GraalVM and Leyden's Relationship
A pivotal development in this landscape is Oracle's strategic decision to align the development of GraalVM's AOT technologies with Java and contribute them to OpenJDK under the banner of Project Leyden.⁵⁰ This move is highly significant. GraalVM Native Image served as a powerful proof-of-concept, demonstrating the immense demand for and viability of AOT compilation in the Java world. Project Leyden is now poised to become the standardized, integrated solution for this capability within the JDK itself.⁴⁶
This signals a future where the hard trade-off between a dynamic JIT and a static AOT binary may no longer be necessary. Instead, the Java platform will likely offer a spectrum of performance profiles. An architect could choose to deploy a standard application with a Leyden-generated cache that pre-loads class data for a significant startup boost, while still retaining the JIT compiler's ability to perform further runtime optimizations to achieve maximum peak throughput.
Toward a Unified Performance Model
Ultimately, the two distinct performance models are converging. The future Java platform will likely offer a unified model where AOT optimizations become a standard feature of the runtime, not an alternative to it. Virtual threads will continue to serve as the premier solution for concurrency on this enhanced platform. The combination of a Leyden-optimized JVM for fast startup and efficient resource usage, coupled with virtual threads for scalable I/O, is poised to become the dominant architectural pattern for cloud-native Java.
The deconstruction of these technologies reveals a clear trajectory. Mastering the modern Java performance landscape requires a deep understanding of both concurrency scaling and startup/footprint efficiency. The ability to reason about these trade-offs and apply the right tool—or combination of tools—is no longer an advanced topic but a foundational skill for any architect or engineer building the next generation of performant, efficient, and scalable systems on the Java platform.