An Experiment: Java Multi-Threading on Heavy CPU Bound workload

Java Multithreading — A Field Study

Java Multithreading

A Field Study: Thread Pools, Cache Hierarchies, and Memory Bottlenecks

What textbooks describe but systems make you feel

1. Introduction

Sometimes you revisit something you thought you understood — and discover you only understood it academically. This document is a record of one such weekend: matrix chain multiplication as a CPU-heavy workload, a fixed-size thread pool, and a series of experiments that produced results far more interesting than expected.

The intention was simple: compare sequential execution to a parallel thread pool, observe the speedup, and confirm the textbook. What actually happened was a journey through deadlocks, memory bus saturation, cache eviction cascades, and a lesson about the gap between the spec sheet and the real machine.

2. Experimental Setup

2.1 Workload: Matrix Chain Multiplication

The workload is multiplication of a chain of N square matrices of identical size. Each matrix contains double-precision floating-point values (64-bit IEEE 754). The operation is purely CPU-bound: no disk I/O, no network, no synchronisation between threads other than task submission and future resolution.

Matrix size is expressed as dimension D, where each matrix contains D × D elements. Memory consumed per matrix:

bytes = D × D × 8

So a 200×200 matrix consumes 320 KB; a 800×800 matrix consumes 5.12 MB.

2.2 Hardware

All experiments were run on a MacBook Pro with the following specifications:

Hardware Specifications — i7-8750H MacBook Pro
Parameter Value
CPU model Intel Core i7-8750H @ 2.20 GHz
Physical cores 6
Logical processors 12 (hyperthreading, 2 per core)
L1 cache (per core) 32 KB instruction + 32 KB data
L2 cache (per core) 256 KB
L3 cache (shared) 9 MB — shared across all cores
RAM 16 GB DDR4-2400, dual channel
Theoretical RAM bandwidth ~38 GB/s

2.3 Thread Pool Configuration

Java ExecutorService with a fixed thread pool was used throughout. Task submission followed an iterative model — see Section 3 for why the initial recursive model was abandoned.

Measurements report the following for each configuration:

  • seqMs — wall-clock time for sequential execution of the full chain
  • parMs — wall-clock time with the fixed thread pool
  • speedup — ratio seqMs / parMs

3. First Lesson: Thread Pool Starvation

Before any meaningful measurements could be taken, the first implementation produced a deadlock. The design was recursive task splitting inside multiplyRange:

Callable<Matrix> leftTask  = () -> multiplyRange(chain, l, mid, executor);
Callable<Matrix> rightTask = () -> multiplyRange(chain, mid + 1, r, executor);

Future<Matrix> leftFuture  = executor.submit(leftTask);
Future<Matrix> rightFuture = executor.submit(rightTask);

Matrix left  = leftFuture.get();   // parent thread blocked here
Matrix right = rightFuture.get();  // never reached

The result was thread pool starvation. Parent tasks occupied threads and blocked on .get(). Child tasks queued for a free thread. All threads were held by waiting parents. Nobody moved.

3.1 Why This Deadlocks

The thread pool has a fixed number of threads. Each parent task occupies one thread and blocks calling .get(). The child tasks are submitted to the same pool and wait in the queue for a free thread. But all threads are occupied by parents waiting for children. The pool is exhausted.

3.2 Resolution

The fix is to remove the recursive dependency: restructure to iterative task submission, where tasks are independent and no task waits on another inside the same pool. With iterative submission, every thread completes independently.

This is why ForkJoinPool exists: it uses work-stealing to allow a blocked thread to execute child tasks, preventing starvation. With a plain FixedThreadPool, recursive dependencies require either a separate executor for child tasks, or restructuring to eliminate the dependency entirely.

4. Main Experiment: Varying Matrix Size

With iterative submission in place, the main experiment swept matrix size from 100×100 to 1600×1600 with chain length 64 and 8 threads.

4.1 Results

Speedup vs Matrix Size — chain=64, threads=8
Matrix size seqMs parMs Speedup L3 pressure (8 threads)
100×100 48 22 2.18× ~0.6 MB
200×200 246 84 2.93× ~2.4 MB
400×400 2583 1501 1.72× ~9.6 MB
800×800 24940 27361 0.91× ~40 MB
1600×1600 355618 262949 1.35× >160 MB

Speedup peaks at 200×200 and collapses below 1.0 at 800×800 — parallel execution becomes slower than sequential. The partial recovery at 1600×1600 is discussed in Section 4.3.

4.2 Root Cause: L3 Cache Saturation

The collapse is not a bug in the benchmark. It is a direct consequence of the L3 cache architecture.

At 200×200 (320 KB per matrix), eight threads working simultaneously require roughly 2.4 MB of working data. This fits comfortably in the 9 MB shared L3 cache. Each thread finds its data warm on access.

At 800×800 (5 MB per matrix), the same eight threads require approximately 40 MB of data. This is more than 4× the total L3 capacity. Every thread's data must reside in RAM.

4.3 Cache Eviction Cascade

The degradation is not merely flat — it is active and monotonic. This is the cache eviction cascade:

  1. Thread A loads its matrix data into L3
  2. Thread B, needing its own data, evicts Thread A's cache lines from L3
  3. Thread A must now reload from RAM on next access
  4. That reload evicts Thread C's data from L3
  5. The cycle repeats for all threads simultaneously

More threads means more evictions per unit time. Each eviction triggers a RAM fetch at 100–200 ns latency versus ~10 ns for an L3 hit. With eight threads actively evicting each other, the total latency penalty compounds.

The partial recovery at 1600×1600 occurs because, at very large matrix sizes, the fixed overhead of thread coordination becomes a smaller fraction of the total runtime. Both sequential and parallel are equally memory-bound, but parallel still utilises multiple cores for arithmetic between stalls.

5. Thread Count Sweep at Fixed Matrix Size

The second experiment fixed matrix size at 800×800 and swept thread count from 1 to 24, with chain length 84.

Thread Count Sweep — chain=84, size=800×800
Threads seqMs parMs Speedup
138,65338,6121.00×
243,83140,2381.09×
441,11138,7341.06×
843,97537,9691.16×
1648,79739,1131.25×
2444,01637,5641.17×

Two observations are immediately visible:

  • Parallel times cluster tightly between 37–40 seconds regardless of thread count. Adding threads from 1 to 24 moves parallel time by less than 3 seconds.
  • Sequential times vary by 26% across runs (38k–49k ms) with identical workloads.

5.1 Flat Parallel Time: Memory Bandwidth Ceiling

The flat parallel time signature is the fingerprint of memory bandwidth saturation. At 800×800, even a single thread nearly fills the L3 cache. Every thread is spending more time waiting on RAM fetches than performing arithmetic. Adding more threads adds more memory requests, but the memory controller is already the bottleneck. There is no idle bandwidth to utilise.

This is the key insight for production systems: no thread count will fix a memory-bandwidth-bound workload. The correct intervention is to reduce the working set per task, not to add threads.

5.2 Sequential Variance: JVM Noise

The 26% variance in sequential time reflects JVM non-determinism: GC pauses, JIT compilation scheduling, OS scheduler preemption, and thermal throttling on a laptop CPU.

Because speedup is computed as seqMs / parMs, a high sequential baseline inflates apparent speedup; a low baseline deflates it. The speedup values in this table are substantially an artifact of this noise rather than genuine parallel gains. Proper benchmarking with JMH and multiple warmed-up iterations would isolate this.

6. Theoretical Boundary and Empirical Validation

Having identified the mechanism, the boundary can be derived analytically. The optimal thread count for a CPU-bound task is the largest N such that the aggregate working set fits in the shared L3 cache:

optimal_threads = floor(L3_size / working_set_per_task)

For a 400×400 matrix (1.2 MB) on a 9 MB L3 machine:

floor(9 MB / 1.2 MB) = 7 threads

The predicted cache boundary — where 4 threads saturate L3 — is:

max_safe_size = sqrt(L3 / (threads × bytes_per_element))
             = sqrt(9×1024×1024 / 32)
             = sqrt(294912)
             ≈ 543×543

6.1 Fine-Grained Boundary Sweep

Boundary Sweep — chain=64, 4 threads vs 6 threads
Matrix size 4-thread speedup 6-thread speedup L3 pressure (4 threads)
500×500 1.89× 1.37× 8.0 MB — just inside
540×540 1.57× 1.24× 9.3 MB — crossing boundary
560×560 1.36× 1.25× 9.9 MB — over
580×580 1.26× 1.02× 10.8 MB — over
600×600 1.06× 1.13× 11.5 MB — over
650×650 1.08× 0.93× 13.5 MB — over
800×800 0.99× 0.82× 20.5 MB — fully RAM-bound

Theory predicted the wall at ~543×543. The empirical data shows degradation beginning between 500 and 540. The agreement is within one data point.

The 4-thread curve is cleaner — monotonically declining — than the 6-thread curve, which is noisy above ~560×560. At 6 threads, each thread receives only 1.5 MB of effective L3, increasing sensitivity to OS and JVM cache usage patterns between runs.

7. The Real Denominator: Java is a Tenant

The theoretical formula uses L3 cache size as the denominator. In practice, the Java application does not own the entire L3. Multiple co-tenants consume cache and CPU continuously:

JVM and OS Co-tenants Consuming Cache and CPU
Co-tenant Resource consumed
OS kernel code and data structures ~1–2 MB L3 always resident
JIT compiler (C2 compilation threads) Active during warmup, causes CPU spikes
GC threads (G1GC concurrent marking) Background CPU usage, cache pollution
JVM code cache (compiled hot methods) ~50–200 KB per hot method
Thread stacks (default 512 KB–1 MB each) RAM and TLB pressure
Class metadata (Metaspace) Off-heap but competes for TLB entries

The practical adjustment is to use 70–75% of the nominal L3 size when applying the formula:

effective_L3 ≈ 9 MB × 0.75 = 6.75 MB

practical_boundary = sqrt(6.75×1024×1024 / (4 × 8)) ≈ 470×470

The observed degradation beginning around 500×500 is consistent with this corrected value, not with the 543×543 theoretical value.

8. Production Reference Card

The following table summarises the design rules derived from these experiments for the specific hardware used (i7-8750H, 9 MB L3):

Production Reference — i7-8750H MacBook Pro
Parameter Spec sheet value Effective (production)
Physical cores 6 5 — reserve 1 for OS and JVM
L3 cache 9 MB ~6.5–7 MB
RAM bandwidth ~38 GB/s ~28–32 GB/s
L2 safe matrix size per core ~178×178 ~170×170
L3 per-core safe matrix size ~433×433 ~380×380
Full L3 safe size (single thread) ~1060×1060 ~915×915
CPU-bound thread pool 6 threads 4–5 threads
I/O-bound thread pool (ETL) Little's Law 100–150 threads

8.1 General Formula

For any CPU-bound workload on any machine:

safe_threads   = floor(effective_L3 / working_set_per_task)
effective_L3   = nominal_L3 × 0.75

if safe_threads > (physical_cores - 1):
    safe_threads = physical_cores - 1

The thread pool size and the task data size are not independent decisions. Fixing one constrains the other. In production systems, both must be co-designed against the cache hierarchy of the target hardware.

9. Next Experiment: I/O-Bound ETL

All experiments described here involve CPU-bound work. The behaviour of threading changes fundamentally for I/O-bound workloads such as ETL pipelines — extract from a source, transform in memory, load to a destination.

In I/O-bound work, threads spend most of their time blocked waiting for network or disk responses. The CPU is idle during that wait. Other threads can use it. This means:

  • Threads do not compete for L3 cache in the same way — working sets are smaller and access patterns are different
  • The optimal thread count is governed by Little's Law rather than cache size
  • Thread pools of 100–200 are legitimate and correct on a 6-core machine

9.1 Little's Law for I/O-Bound Thread Sizing

L = λ × W

L  =  threads needed
λ  =  target throughput (records per second)
W  =  average latency per record (seconds)

If each ETL record takes 200 ms end-to-end and the target throughput is 50 records per second:

L = 50 × 0.2 = 10 threads

The next experiment will use real network I/O and disk I/O — not simulated sleeps — to observe this behaviour empirically. The expectation is a linear scaling region where throughput scales with thread count, followed by a plateau where the downstream system (database connection pool, API rate limit, or NIC bandwidth) becomes the new bottleneck. More data to follow.

10. Conclusion

The central finding of these experiments can be stated simply:

The practical lessons:

  1. Recursive task submission in a shared fixed thread pool causes starvation deadlock. Use iterative submission or ForkJoinPool.
  2. The formula optimal_threads = floor(L3 / working_set) predicts the cache boundary accurately.
  3. Spec sheet numbers overstate available resources by 25–30%. Budget accordingly.
  4. JVM benchmarking requires warmup and multiple trials. Single-run measurements are dominated by JIT and GC noise.
  5. Thread pool size and task data size are co-dependent. Design them together.

The formulas for all of this exist in textbooks. The value of running the experiment is that the data makes the mechanism visceral: watching speedup fall from 2.93× to 0.91× as a single variable crosses a cache boundary is a different kind of knowing.

Appendix: Full Dataset

A.1 Matrix Size Sweep (chain=64, threads=8)

Full Results — Matrix Size Sweep
Matrix size seqMs parMs Speedup
100×10048222.18×
200×200246842.93×
400×4002,5831,5011.72×
800×80024,94027,3610.91×
1600×1600355,618262,9491.35×

A.2 Thread Count Sweep (chain=84, size=800×800)

Full Results — Thread Count Sweep
Threads seqMs parMs Speedup
138,65338,6121.00×
243,83140,2381.09×
441,11138,7341.06×
843,97537,9691.16×
1648,79739,1131.25×
2444,01637,5641.17×

A.3 Fine-Grained Boundary Sweep (chain=64, randomised order)

Full Results — Fine-Grained Boundary Sweep
Matrix size 4-thread speedup 6-thread speedup L3 pressure (4 threads)
500×5001.89×1.37×8.0 MB
540×5401.57×1.24×9.3 MB
560×5601.36×1.25×9.9 MB
580×5801.26×1.02×10.8 MB
600×6001.06×1.13×11.5 MB
650×6501.08×0.93×13.5 MB
800×8000.99×0.82×20.5 MB

A.4 Thread Count at 400×400 and 800×800 (chain=64)

Full Results — Thread Count at 400×400
Threads seqMs parMs Speedup
18539210.93×
28975121.75×
48623372.56×
68973652.46×
89023872.33×
128675941.46×
169006961.29×
Full Results — Thread Count at 800×800
Threads seqMs parMs Speedup
17,2007,5380.96×
27,3266,7671.08×
47,2527,1951.01×
67,2629,4810.77×
87,30910,8590.67×
127,14611,9380.60×
167,32813,0000.56×

Java Multithreading — A Field Study. Experiments conducted on Intel i7-8750H @ 2.20 GHz, 9 MB L3, 16 GB DDR4-2400. Java thread pool: Executors.newFixedThreadPool(n).

Next: ETL I/O-bound threading experiment with real network and disk I/O.

Comments