May 27, 2026

An Experiment: Java Multi-Threading on Heavy CPU Bound workload

Java Multithreading — A Field Study

Java Multithreading

A Field Study: Thread Pools, Cache Hierarchies, and Memory Bottlenecks

What textbooks describe but systems make you feel

1. Introduction

Sometimes you revisit something you thought you understood — and discover you only understood it academically. This document is a record of one such weekend: matrix chain multiplication as a CPU-heavy workload, a fixed-size thread pool, and a series of experiments that produced results far more interesting than expected.

The intention was simple: compare sequential execution to a parallel thread pool, observe the speedup, and confirm the textbook. What actually happened was a journey through deadlocks, memory bus saturation, cache eviction cascades, and a lesson about the gap between the spec sheet and the real machine.

2. Experimental Setup

2.1 Workload: Matrix Chain Multiplication

The workload is multiplication of a chain of N square matrices of identical size. Each matrix contains double-precision floating-point values (64-bit IEEE 754). The operation is purely CPU-bound: no disk I/O, no network, no synchronisation between threads other than task submission and future resolution.

Matrix size is expressed as dimension $D$, where each matrix contains ($D \times D$) elements. Memory consumed per matrix:

bytes = D × D × 8

So a 200×200 matrix consumes 320 KB; a 800×800 matrix consumes 5.12 MB.

Because all matrices in the experimental chain share identical square dimensions
($D \times D$), the computational complexity remains strictly invariant regardless of the order of evaluation. This specific design choice completely eliminates any possibility of associative parenthesization optimization (such as via Dynamic Programming algorithms). It ensures that both the sequential baseline and the parallel execution paths perform the exact same number of scalar operations: $(N - 1) \times D^3$, isolating hardware bottlenecks from algorithmic shortcuts.

2.2 Hardware

All experiments were run on a MacBook Pro with the following specifications:

Hardware Specifications — i7-8750H MacBook Pro
Parameter	Value
CPU model	Intel Core i7-8750H @ 2.20 GHz
Physical cores	6
Logical processors	12 (hyperthreading, 2 per core)
L1 cache (per core)	32 KB instruction + 32 KB data
L2 cache (per core)	256 KB
L3 cache (shared)	9 MB — shared across all cores
RAM	16 GB DDR4-2400, dual channel
Theoretical RAM bandwidth	~38 GB/s

2.3 Memory Traversal Layout

The core multiplication engine utilizes a standard, naive nested loop layout ($I \rightarrow J \rightarrow K$). In computer architecture, this specific traversal introduces inherent spatial locality challenges for the column-traversed secondary matrix, leading to single-threaded cache thrashing at larger scales. This choice was deliberate: by using a baseline that does not leverage hardware-specific cache tiling or loop-reordering ($I \rightarrow K \rightarrow J$), we can more clearly isolate how raw multi-threaded memory bus contention acts as a distinct performance throttle, separate from localized, single-thread cache management.

2.4 Thread Pool Configuration

Java ExecutorService with a fixed thread pool was used throughout. Task submission followed an iterative model — see Section 3 for why the initial recursive model was abandoned.

Measurements report the following for each configuration:

seqMs — wall-clock time for sequential execution of the full chain
parMs — wall-clock time with the fixed thread pool
speedup — ratio seqMs / parMs

3. First Lesson: Thread Pool Starvation

Before any meaningful measurements could be taken, the first implementation produced a deadlock. The design was recursive task splitting inside multiplyRange:

Callable<Matrix> leftTask  = () -> multiplyRange(chain, l, mid, executor);
Callable<Matrix> rightTask = () -> multiplyRange(chain, mid + 1, r, executor);

Future<Matrix> leftFuture  = executor.submit(leftTask);
Future<Matrix> rightFuture = executor.submit(rightTask);

Matrix left  = leftFuture.get();   // parent thread blocked here
Matrix right = rightFuture.get();  // never reached

The result was thread pool starvation. Parent tasks occupied threads and blocked on .get(). Child tasks queued for a free thread. All threads were held by waiting parents. Nobody moved.

3.1 Why This Deadlocks

The thread pool has a fixed number of threads. Each parent task occupies one thread and blocks calling .get(). The child tasks are submitted to the same pool and wait in the queue for a free thread. But all threads are occupied by parents waiting for children. The pool is exhausted.

3.2 Resolution

The fix is to remove the recursive dependency: restructure to iterative task submission, where tasks are independent and no task waits on another inside the same pool. With iterative submission, every thread completes independently.

This is why ForkJoinPool exists: it uses work-stealing to allow a blocked thread to execute child tasks, preventing starvation. With a plain FixedThreadPool, recursive dependencies require either a separate executor for child tasks, or restructuring to eliminate the dependency entirely.

4. Main Experiment: Varying Matrix Size

With iterative submission in place, the main experiment swept matrix size from 100×100 to 1600×1600 with chain length 64 and 8 threads.

4.1 Results

Speedup vs Matrix Size — chain=64, threads=8
Matrix size	seqMs	parMs	Speedup	L3 pressure (8 threads)
100×100	48	22	2.18×	~0.6 MB
200×200	246	84	2.93×	~2.4 MB
400×400	2583	1501	1.72×	~9.6 MB
800×800	24940	27361	0.91×	~40 MB
1600×1600	355618	262949	1.35×	>160 MB

Section 4.1 — speedup vs matrix size (chain=64, threads=8)

speedup (×) 1× baseline

Speedup crosses below 1× at 800×800 — parallel is slower than sequential at this size

Speedup peaks at 200×200 and collapses below 1.0 at 800×800 — parallel execution becomes slower than sequential. The partial recovery at 1600×1600 is discussed in Section 4.3.

4.2 Root Cause: L3 Cache Saturation

The collapse is not a bug in the benchmark. It is a direct consequence of the L3 cache architecture.

At 200×200 (320 KB per matrix), eight threads working simultaneously require roughly 2.4 MB of working data. This fits comfortably in the 9 MB shared L3 cache. Each thread finds its data warm on access.

At 800×800 (5 MB per matrix), the same eight threads require approximately 40 MB of data. This is more than 4× the total L3 capacity. Every thread's data must reside in RAM.

4.3 Cache Eviction Cascade

The degradation is not merely flat — it is active and monotonic. This is the cache eviction cascade:

Thread A loads its matrix data into L3
Thread B, needing its own data, evicts Thread A's cache lines from L3
Thread A must now reload from RAM on next access
That reload evicts Thread C's data from L3
The cycle repeats for all threads simultaneously

More threads means more evictions per unit time. Each eviction triggers a RAM fetch at 100–200 ns latency versus ~10 ns for an L3 hit. With eight threads actively evicting each other, the total latency penalty compounds.

The minor performance rebound observed at $1600 \times 1600$ ($1.35\times$) highlights a fundamental shift in Operational Intensity. Matrix multiplication is inherently an $\mathcal{O}(D^3)$ operation for arithmetic computation, while the memory footprint scales at $\mathcal{O}(D^2)$. As the matrix dimension ($D$) doubles from 800 to 1600, the arithmetic workload increases eightfold, whereas the data size only quadruples.

At this scale, the CPU spends vastly more clock cycles performing floating-point arithmetic on data once it is loaded into registers, allowing the hardware prefetchers and out-of-order execution engines to effectively hide memory latency. The workload transitions from being completely memory-bus starved back toward being arithmetically bound, allowing the multi-threaded implementation to re-assert its computational advantage.

5. Thread Count Sweep at Fixed Matrix Size

This experiment fixed the matrix size at 800×800 and swept thread count from 1 to 24, with chain length 84. The goal was to see whether more threads help once the working set already exceeds L3 cache.

Thread count sweep — chain=84, size=800×800. Stabilised speedup uses the median sequential time (42,471 ms) as a fixed baseline.
Threads	seqMs (raw)	parMs	Raw speedup	Stabilised speedup
1	38,653	38,612	1.00×	1.10×
2	43,831	40,238	1.09×	1.06×
4	41,111	38,734	1.06×	1.10×
8	43,975	37,969	1.16×	1.12×
16	48,797	39,113	1.25×	1.09×
24	44,016	37,564	1.17×	1.13×

Two things stand out immediately:

Parallel times cluster tightly between 37–40 seconds across all thread counts. Going from 1 to 24 threads moves parallel time by less than 3 seconds.
Raw sequential times vary by 26% (38k–49k ms) across identical workloads — a sign of JVM noise, not real work variation.

Section 5 — raw vs stabilised speedup at 800×800 (chain=84)

raw speedup (JVM noise) stabilised speedup (median baseline) 1× baseline

Stabilised speedup exposes the truth: adding threads from 1 → 24 gains virtually nothing at this matrix size

5.1 Why parallel time barely moves: memory bandwidth

At 800×800, even a single thread nearly saturates the L3 cache. Every thread is waiting on RAM fetches more than doing arithmetic. Adding more threads only adds more memory requests — the memory controller is already the bottleneck, so there is nothing to gain.

The stabilised speedup column makes this concrete: after removing JVM noise, speedup stays in a flat 1.06–1.13× band regardless of thread count. Going from 1 thread to 24 buys essentially nothing.

The fix for a memory-bound workload is not more threads — it is a smaller working set per task.

5.2 Why raw sequential time jumps around: JVM noise

The 26% swing in raw sequential times comes from GC pauses, JIT compilation, OS scheduling, and thermal throttling — all outside your control. Because raw speedup is seqMs / parMs, a high sequential baseline inflates the number and a low one deflates it.

Pinning speedup to a fixed median baseline (42,471 ms) removes that noise. For rigorous benchmarking, JMH with multiple warmed-up iterations is the right tool.

6. Theoretical Boundary and Empirical Validation

Having identified the mechanism, the boundary can be derived analytically. The optimal thread count for a CPU-bound task is the largest N such that the aggregate working set fits in the shared L3 cache:

optimal_threads = floor(L3_size / working_set_per_task)

For a 400×400 matrix (1.2 MB) on a 9 MB L3 machine:

floor(9 MB / 1.2 MB) = 7 threads

The predicted cache boundary — where 4 threads saturate L3 — is:

max_safe_size = sqrt(L3 / (threads × bytes_per_element))
             = sqrt(9×1024×1024 / 32)
             = sqrt(294912)
             ≈ 543×543

6.1 Fine-Grained Boundary Sweep

Boundary Sweep — chain=64, 4 threads vs 6 threads
Matrix size	4-thread speedup	6-thread speedup	L3 pressure (4 threads)
500×500	1.89×	1.37×	8.0 MB — just inside
540×540	1.57×	1.24×	9.3 MB — crossing boundary
560×560	1.36×	1.25×	9.9 MB — over
580×580	1.26×	1.02×	10.8 MB — over
600×600	1.06×	1.13×	11.5 MB — over
650×650	1.08×	0.93×	13.5 MB — over
800×800	0.99×	0.82×	20.5 MB — fully RAM-bound

Theory predicted the wall at ~543×543. The empirical data shows degradation beginning between 500 and 540. The agreement is within one data point.

The 4-thread curve is cleaner — monotonically declining — than the 6-thread curve, which is noisy above ~560×560. At 6 threads, each thread receives only 1.5 MB of effective L3, increasing sensitivity to OS and JVM cache usage patterns between runs.

7. The Real Denominator: Java is a Tenant

The theoretical formula uses L3 cache size as the denominator. In practice, the Java application does not own the entire L3. Multiple co-tenants consume cache and CPU continuously:

JVM and OS Co-tenants Consuming Cache and CPU
Co-tenant	Resource consumed
OS kernel code and data structures	~1–2 MB L3 always resident
JIT compiler (C2 compilation threads)	Active during warmup, causes CPU spikes
GC threads (G1GC concurrent marking)	Background CPU usage, cache pollution
JVM code cache (compiled hot methods)	~50–200 KB per hot method
Thread stacks (default 512 KB–1 MB each)	RAM and TLB pressure
Class metadata (Metaspace)	Off-heap but competes for TLB entries

The practical adjustment is to use 70–75% of the nominal L3 size when applying the formula:

effective_L3 ≈ 9 MB × 0.75 = 6.75 MB

practical_boundary = sqrt(6.75×1024×1024 / (4 × 8)) ≈ 470×470

The observed degradation beginning around 500×500 is consistent with this corrected value, not with the 543×543 theoretical value.

8. Production Reference Card

The following table summarises the design rules derived from these experiments for the specific hardware used (i7-8750H, 9 MB L3):

Production Reference — i7-8750H MacBook Pro
Parameter	Spec sheet value	Effective (production)
Physical cores	6	5 — reserve 1 for OS and JVM
L3 cache	9 MB	~6.5–7 MB
RAM bandwidth	~38 GB/s	~28–32 GB/s
L2 safe matrix size per core	~178×178	~170×170
L3 per-core safe matrix size	~433×433	~380×380
Full L3 safe size (single thread)	~1060×1060	~915×915
CPU-bound thread pool	6 threads	4–5 threads
I/O-bound thread pool (ETL)	Little's Law	100–150 threads

8.1 General Formula

For any CPU-bound workload on any machine:

safe_threads   = floor(effective_L3 / working_set_per_task)
effective_L3   = nominal_L3 × 0.75

if safe_threads > (physical_cores - 1):
    safe_threads = physical_cores - 1

The thread pool size and the task data size are not independent decisions. Fixing one constrains the other. In production systems, both must be co-designed against the cache hierarchy of the target hardware.

9. Next Experiment: I/O-Bound ETL

All experiments described here involve CPU-bound work. The behaviour of threading changes fundamentally for I/O-bound workloads such as ETL pipelines — extract from a source, transform in memory, load to a destination.

In I/O-bound work, threads spend most of their time blocked waiting for network or disk responses. The CPU is idle during that wait. Other threads can use it. This means:

Threads do not compete for L3 cache in the same way — working sets are smaller and access patterns are different
The optimal thread count is governed by Little's Law rather than cache size
Thread pools of 100–200 are legitimate and correct on a 6-core machine

9.1 Little's Law for I/O-Bound Thread Sizing

L = λ × W

L  =  threads needed
λ  =  target throughput (records per second)
W  =  average latency per record (seconds)

If each ETL record takes 200 ms end-to-end and the target throughput is 50 records per second:

L = 50 × 0.2 = 10 threads

The next experiment will use real network I/O and disk I/O — not simulated sleeps — to observe this behaviour empirically. The expectation is a linear scaling region where throughput scales with thread count, followed by a plateau where the downstream system (database connection pool, API rate limit, or NIC bandwidth) becomes the new bottleneck. More data to follow.

10. Conclusion

The central finding of these experiments can be stated simply:

The practical lessons:

Recursive task submission in a shared fixed thread pool causes starvation deadlock. Use iterative submission or ForkJoinPool.
The formula optimal_threads = floor(L3 / working_set) predicts the cache boundary accurately.
Spec sheet numbers overstate available resources by 25–30%. Budget accordingly.
JVM benchmarking requires warmup and multiple trials. Single-run measurements are dominated by JIT and GC noise.
Thread pool size and task data size are co-dependent. Design them together.

The formulas for all of this exist in textbooks. The value of running the experiment is that the data makes the mechanism visceral: watching speedup fall from 2.93× to 0.91× as a single variable crosses a cache boundary is a different kind of knowing.

Appendix: Full Dataset

A.1 Matrix Size Sweep (chain=64, threads=8)

Full Results — Matrix Size Sweep
Matrix size	seqMs	parMs	Speedup
100×100	48	22	2.18×
200×200	246	84	2.93×
400×400	2,583	1,501	1.72×
800×800	24,940	27,361	0.91×
1600×1600	355,618	262,949	1.35×

A.2 Thread Count Sweep (chain=84, size=800×800)

Full Results — Thread Count Sweep
Threads	seqMs	parMs	Speedup
1	38,653	38,612	1.00×
2	43,831	40,238	1.09×
4	41,111	38,734	1.06×
8	43,975	37,969	1.16×
16	48,797	39,113	1.25×
24	44,016	37,564	1.17×

A.3 Fine-Grained Boundary Sweep (chain=64, randomised order)

Full Results — Fine-Grained Boundary Sweep
Matrix size	4-thread speedup	6-thread speedup	L3 pressure (4 threads)
500×500	1.89×	1.37×	8.0 MB
540×540	1.57×	1.24×	9.3 MB
560×560	1.36×	1.25×	9.9 MB
580×580	1.26×	1.02×	10.8 MB
600×600	1.06×	1.13×	11.5 MB
650×650	1.08×	0.93×	13.5 MB
800×800	0.99×	0.82×	20.5 MB

A.4 Thread Count at 400×400 and 800×800 (chain=64)

Full Results — Thread Count at 400×400
Threads	seqMs	parMs	Speedup
1	853	921	0.93×
2	897	512	1.75×
4	862	337	2.56×
6	897	365	2.46×
8	902	387	2.33×
12	867	594	1.46×
16	900	696	1.29×

Full Results — Thread Count at 800×800
Threads	seqMs	parMs	Speedup
1	7,200	7,538	0.96×
2	7,326	6,767	1.08×
4	7,252	7,195	1.01×
6	7,262	9,481	0.77×
8	7,309	10,859	0.67×
12	7,146	11,938	0.60×
16	7,328	13,000	0.56×

Search This Blog

Software Engineering - Experimental Way

An Experiment: Java Multi-Threading on Heavy CPU Bound workload

Java Multithreading

1. Introduction

2. Experimental Setup

2.1 Workload: Matrix Chain Multiplication

2.2 Hardware

2.3 Memory Traversal Layout

2.4 Thread Pool Configuration

3. First Lesson: Thread Pool Starvation

3.1 Why This Deadlocks

3.2 Resolution

4. Main Experiment: Varying Matrix Size

4.1 Results

4.2 Root Cause: L3 Cache Saturation

4.3 Cache Eviction Cascade

5. Thread Count Sweep at Fixed Matrix Size

5.1 Why parallel time barely moves: memory bandwidth

5.2 Why raw sequential time jumps around: JVM noise

6. Theoretical Boundary and Empirical Validation

6.1 Fine-Grained Boundary Sweep

7. The Real Denominator: Java is a Tenant

8. Production Reference Card

8.1 General Formula

9. Next Experiment: I/O-Bound ETL

9.1 Little's Law for I/O-Bound Thread Sizing

10. Conclusion

Appendix: Full Dataset

A.1 Matrix Size Sweep (chain=64, threads=8)

A.2 Thread Count Sweep (chain=84, size=800×800)

A.3 Fine-Grained Boundary Sweep (chain=64, randomised order)

A.4 Thread Count at 400×400 and 800×800 (chain=64)

Comments

Post a Comment