An Experiment: Java Multi-Threading on Heavy CPU Bound workload
Java Multithreading
A Field Study: Thread Pools, Cache Hierarchies, and Memory Bottlenecks
What textbooks describe but systems make you feel
1. Introduction
Sometimes you revisit something you thought you understood — and discover you only understood it academically. This document is a record of one such weekend: matrix chain multiplication as a CPU-heavy workload, a fixed-size thread pool, and a series of experiments that produced results far more interesting than expected.
The intention was simple: compare sequential execution to a parallel thread pool, observe the speedup, and confirm the textbook. What actually happened was a journey through deadlocks, memory bus saturation, cache eviction cascades, and a lesson about the gap between the spec sheet and the real machine.
2. Experimental Setup
2.1 Workload: Matrix Chain Multiplication
The workload is multiplication of a chain of N square matrices of identical size. Each matrix contains double-precision floating-point values (64-bit IEEE 754). The operation is purely CPU-bound: no disk I/O, no network, no synchronisation between threads other than task submission and future resolution.
Matrix size is expressed as dimension D, where each matrix contains D × D elements. Memory consumed per matrix:
bytes = D × D × 8
So a 200×200 matrix consumes 320 KB; a 800×800 matrix consumes 5.12 MB.
2.2 Hardware
All experiments were run on a MacBook Pro with the following specifications:
| Parameter | Value |
|---|---|
| CPU model | Intel Core i7-8750H @ 2.20 GHz |
| Physical cores | 6 |
| Logical processors | 12 (hyperthreading, 2 per core) |
| L1 cache (per core) | 32 KB instruction + 32 KB data |
| L2 cache (per core) | 256 KB |
| L3 cache (shared) | 9 MB — shared across all cores |
| RAM | 16 GB DDR4-2400, dual channel |
| Theoretical RAM bandwidth | ~38 GB/s |
2.3 Thread Pool Configuration
Java ExecutorService with a fixed thread pool was used throughout. Task submission followed
an iterative model — see Section 3 for why the initial recursive model was abandoned.
Measurements report the following for each configuration:
- seqMs — wall-clock time for sequential execution of the full chain
- parMs — wall-clock time with the fixed thread pool
- speedup — ratio
seqMs / parMs
3. First Lesson: Thread Pool Starvation
Before any meaningful measurements could be taken, the first implementation produced a deadlock. The design
was recursive task splitting inside multiplyRange:
Callable<Matrix> leftTask = () -> multiplyRange(chain, l, mid, executor);
Callable<Matrix> rightTask = () -> multiplyRange(chain, mid + 1, r, executor);
Future<Matrix> leftFuture = executor.submit(leftTask);
Future<Matrix> rightFuture = executor.submit(rightTask);
Matrix left = leftFuture.get(); // parent thread blocked here
Matrix right = rightFuture.get(); // never reached
The result was thread pool starvation. Parent tasks occupied threads and blocked on
.get(). Child tasks queued for a free thread. All threads were held by waiting
parents. Nobody moved.
3.1 Why This Deadlocks
The thread pool has a fixed number of threads. Each parent task occupies one thread and blocks
calling .get(). The child tasks are submitted to the same pool and wait in the queue
for a free thread. But all threads are occupied by parents waiting for children. The pool is
exhausted.
3.2 Resolution
The fix is to remove the recursive dependency: restructure to iterative task submission, where tasks are independent and no task waits on another inside the same pool. With iterative submission, every thread completes independently.
This is why ForkJoinPool exists: it uses work-stealing to allow a blocked thread
to execute child tasks, preventing starvation. With a plain FixedThreadPool,
recursive dependencies require either a separate executor for child tasks, or restructuring to
eliminate the dependency entirely.
4. Main Experiment: Varying Matrix Size
With iterative submission in place, the main experiment swept matrix size from 100×100 to 1600×1600 with chain length 64 and 8 threads.
4.1 Results
| Matrix size | seqMs | parMs | Speedup | L3 pressure (8 threads) |
|---|---|---|---|---|
| 100×100 | 48 | 22 | 2.18× | ~0.6 MB |
| 200×200 | 246 | 84 | 2.93× | ~2.4 MB |
| 400×400 | 2583 | 1501 | 1.72× | ~9.6 MB |
| 800×800 | 24940 | 27361 | 0.91× | ~40 MB |
| 1600×1600 | 355618 | 262949 | 1.35× | >160 MB |
Speedup peaks at 200×200 and collapses below 1.0 at 800×800 — parallel execution becomes slower than sequential. The partial recovery at 1600×1600 is discussed in Section 4.3.
4.2 Root Cause: L3 Cache Saturation
The collapse is not a bug in the benchmark. It is a direct consequence of the L3 cache architecture.
At 200×200 (320 KB per matrix), eight threads working simultaneously require roughly 2.4 MB of working data. This fits comfortably in the 9 MB shared L3 cache. Each thread finds its data warm on access.
At 800×800 (5 MB per matrix), the same eight threads require approximately 40 MB of data. This is more than 4× the total L3 capacity. Every thread's data must reside in RAM.
4.3 Cache Eviction Cascade
The degradation is not merely flat — it is active and monotonic. This is the cache eviction cascade:
- Thread A loads its matrix data into L3
- Thread B, needing its own data, evicts Thread A's cache lines from L3
- Thread A must now reload from RAM on next access
- That reload evicts Thread C's data from L3
- The cycle repeats for all threads simultaneously
More threads means more evictions per unit time. Each eviction triggers a RAM fetch at 100–200 ns latency versus ~10 ns for an L3 hit. With eight threads actively evicting each other, the total latency penalty compounds.
The partial recovery at 1600×1600 occurs because, at very large matrix sizes, the fixed overhead of thread coordination becomes a smaller fraction of the total runtime. Both sequential and parallel are equally memory-bound, but parallel still utilises multiple cores for arithmetic between stalls.
5. Thread Count Sweep at Fixed Matrix Size
The second experiment fixed matrix size at 800×800 and swept thread count from 1 to 24, with chain length 84.
| Threads | seqMs | parMs | Speedup |
|---|---|---|---|
| 1 | 38,653 | 38,612 | 1.00× |
| 2 | 43,831 | 40,238 | 1.09× |
| 4 | 41,111 | 38,734 | 1.06× |
| 8 | 43,975 | 37,969 | 1.16× |
| 16 | 48,797 | 39,113 | 1.25× |
| 24 | 44,016 | 37,564 | 1.17× |
Two observations are immediately visible:
- Parallel times cluster tightly between 37–40 seconds regardless of thread count. Adding threads from 1 to 24 moves parallel time by less than 3 seconds.
- Sequential times vary by 26% across runs (38k–49k ms) with identical workloads.
5.1 Flat Parallel Time: Memory Bandwidth Ceiling
The flat parallel time signature is the fingerprint of memory bandwidth saturation. At 800×800, even a single thread nearly fills the L3 cache. Every thread is spending more time waiting on RAM fetches than performing arithmetic. Adding more threads adds more memory requests, but the memory controller is already the bottleneck. There is no idle bandwidth to utilise.
This is the key insight for production systems: no thread count will fix a memory-bandwidth-bound workload. The correct intervention is to reduce the working set per task, not to add threads.
5.2 Sequential Variance: JVM Noise
The 26% variance in sequential time reflects JVM non-determinism: GC pauses, JIT compilation scheduling, OS scheduler preemption, and thermal throttling on a laptop CPU.
Because speedup is computed as seqMs / parMs, a high sequential baseline
inflates apparent speedup; a low baseline deflates it. The speedup values in this table are
substantially an artifact of this noise rather than genuine parallel gains. Proper
benchmarking with JMH and multiple warmed-up iterations would isolate this.
6. Theoretical Boundary and Empirical Validation
Having identified the mechanism, the boundary can be derived analytically. The optimal thread count for a CPU-bound task is the largest N such that the aggregate working set fits in the shared L3 cache:
optimal_threads = floor(L3_size / working_set_per_task)
For a 400×400 matrix (1.2 MB) on a 9 MB L3 machine:
floor(9 MB / 1.2 MB) = 7 threads
The predicted cache boundary — where 4 threads saturate L3 — is:
max_safe_size = sqrt(L3 / (threads × bytes_per_element))
= sqrt(9×1024×1024 / 32)
= sqrt(294912)
≈ 543×543
6.1 Fine-Grained Boundary Sweep
| Matrix size | 4-thread speedup | 6-thread speedup | L3 pressure (4 threads) |
|---|---|---|---|
| 500×500 | 1.89× | 1.37× | 8.0 MB — just inside |
| 540×540 | 1.57× | 1.24× | 9.3 MB — crossing boundary |
| 560×560 | 1.36× | 1.25× | 9.9 MB — over |
| 580×580 | 1.26× | 1.02× | 10.8 MB — over |
| 600×600 | 1.06× | 1.13× | 11.5 MB — over |
| 650×650 | 1.08× | 0.93× | 13.5 MB — over |
| 800×800 | 0.99× | 0.82× | 20.5 MB — fully RAM-bound |
Theory predicted the wall at ~543×543. The empirical data shows degradation beginning between 500 and 540. The agreement is within one data point.
The 4-thread curve is cleaner — monotonically declining — than the 6-thread curve, which is noisy above ~560×560. At 6 threads, each thread receives only 1.5 MB of effective L3, increasing sensitivity to OS and JVM cache usage patterns between runs.
7. The Real Denominator: Java is a Tenant
The theoretical formula uses L3 cache size as the denominator. In practice, the Java application does not own the entire L3. Multiple co-tenants consume cache and CPU continuously:
| Co-tenant | Resource consumed |
|---|---|
| OS kernel code and data structures | ~1–2 MB L3 always resident |
| JIT compiler (C2 compilation threads) | Active during warmup, causes CPU spikes |
| GC threads (G1GC concurrent marking) | Background CPU usage, cache pollution |
| JVM code cache (compiled hot methods) | ~50–200 KB per hot method |
| Thread stacks (default 512 KB–1 MB each) | RAM and TLB pressure |
| Class metadata (Metaspace) | Off-heap but competes for TLB entries |
The practical adjustment is to use 70–75% of the nominal L3 size when applying the formula:
effective_L3 ≈ 9 MB × 0.75 = 6.75 MB
practical_boundary = sqrt(6.75×1024×1024 / (4 × 8)) ≈ 470×470
The observed degradation beginning around 500×500 is consistent with this corrected value, not with the 543×543 theoretical value.
8. Production Reference Card
The following table summarises the design rules derived from these experiments for the specific hardware used (i7-8750H, 9 MB L3):
| Parameter | Spec sheet value | Effective (production) |
|---|---|---|
| Physical cores | 6 | 5 — reserve 1 for OS and JVM |
| L3 cache | 9 MB | ~6.5–7 MB |
| RAM bandwidth | ~38 GB/s | ~28–32 GB/s |
| L2 safe matrix size per core | ~178×178 | ~170×170 |
| L3 per-core safe matrix size | ~433×433 | ~380×380 |
| Full L3 safe size (single thread) | ~1060×1060 | ~915×915 |
| CPU-bound thread pool | 6 threads | 4–5 threads |
| I/O-bound thread pool (ETL) | Little's Law | 100–150 threads |
8.1 General Formula
For any CPU-bound workload on any machine:
safe_threads = floor(effective_L3 / working_set_per_task)
effective_L3 = nominal_L3 × 0.75
if safe_threads > (physical_cores - 1):
safe_threads = physical_cores - 1
The thread pool size and the task data size are not independent decisions. Fixing one constrains the other. In production systems, both must be co-designed against the cache hierarchy of the target hardware.
9. Next Experiment: I/O-Bound ETL
All experiments described here involve CPU-bound work. The behaviour of threading changes fundamentally for I/O-bound workloads such as ETL pipelines — extract from a source, transform in memory, load to a destination.
In I/O-bound work, threads spend most of their time blocked waiting for network or disk responses. The CPU is idle during that wait. Other threads can use it. This means:
- Threads do not compete for L3 cache in the same way — working sets are smaller and access patterns are different
- The optimal thread count is governed by Little's Law rather than cache size
- Thread pools of 100–200 are legitimate and correct on a 6-core machine
9.1 Little's Law for I/O-Bound Thread Sizing
L = λ × W
L = threads needed
λ = target throughput (records per second)
W = average latency per record (seconds)
If each ETL record takes 200 ms end-to-end and the target throughput is 50 records per second:
L = 50 × 0.2 = 10 threads
The next experiment will use real network I/O and disk I/O — not simulated sleeps — to observe this behaviour empirically. The expectation is a linear scaling region where throughput scales with thread count, followed by a plateau where the downstream system (database connection pool, API rate limit, or NIC bandwidth) becomes the new bottleneck. More data to follow.
10. Conclusion
The central finding of these experiments can be stated simply:
The practical lessons:
-
Recursive task submission in a shared fixed thread pool causes starvation deadlock.
Use iterative submission or
ForkJoinPool. -
The formula
optimal_threads = floor(L3 / working_set)predicts the cache boundary accurately. - Spec sheet numbers overstate available resources by 25–30%. Budget accordingly.
- JVM benchmarking requires warmup and multiple trials. Single-run measurements are dominated by JIT and GC noise.
- Thread pool size and task data size are co-dependent. Design them together.
The formulas for all of this exist in textbooks. The value of running the experiment is that the data makes the mechanism visceral: watching speedup fall from 2.93× to 0.91× as a single variable crosses a cache boundary is a different kind of knowing.
Appendix: Full Dataset
A.1 Matrix Size Sweep (chain=64, threads=8)
| Matrix size | seqMs | parMs | Speedup |
|---|---|---|---|
| 100×100 | 48 | 22 | 2.18× |
| 200×200 | 246 | 84 | 2.93× |
| 400×400 | 2,583 | 1,501 | 1.72× |
| 800×800 | 24,940 | 27,361 | 0.91× |
| 1600×1600 | 355,618 | 262,949 | 1.35× |
A.2 Thread Count Sweep (chain=84, size=800×800)
| Threads | seqMs | parMs | Speedup |
|---|---|---|---|
| 1 | 38,653 | 38,612 | 1.00× |
| 2 | 43,831 | 40,238 | 1.09× |
| 4 | 41,111 | 38,734 | 1.06× |
| 8 | 43,975 | 37,969 | 1.16× |
| 16 | 48,797 | 39,113 | 1.25× |
| 24 | 44,016 | 37,564 | 1.17× |
A.3 Fine-Grained Boundary Sweep (chain=64, randomised order)
| Matrix size | 4-thread speedup | 6-thread speedup | L3 pressure (4 threads) |
|---|---|---|---|
| 500×500 | 1.89× | 1.37× | 8.0 MB |
| 540×540 | 1.57× | 1.24× | 9.3 MB |
| 560×560 | 1.36× | 1.25× | 9.9 MB |
| 580×580 | 1.26× | 1.02× | 10.8 MB |
| 600×600 | 1.06× | 1.13× | 11.5 MB |
| 650×650 | 1.08× | 0.93× | 13.5 MB |
| 800×800 | 0.99× | 0.82× | 20.5 MB |
A.4 Thread Count at 400×400 and 800×800 (chain=64)
| Threads | seqMs | parMs | Speedup |
|---|---|---|---|
| 1 | 853 | 921 | 0.93× |
| 2 | 897 | 512 | 1.75× |
| 4 | 862 | 337 | 2.56× |
| 6 | 897 | 365 | 2.46× |
| 8 | 902 | 387 | 2.33× |
| 12 | 867 | 594 | 1.46× |
| 16 | 900 | 696 | 1.29× |
| Threads | seqMs | parMs | Speedup |
|---|---|---|---|
| 1 | 7,200 | 7,538 | 0.96× |
| 2 | 7,326 | 6,767 | 1.08× |
| 4 | 7,252 | 7,195 | 1.01× |
| 6 | 7,262 | 9,481 | 0.77× |
| 8 | 7,309 | 10,859 | 0.67× |
| 12 | 7,146 | 11,938 | 0.60× |
| 16 | 7,328 | 13,000 | 0.56× |
Comments
Post a Comment