High Throughput Computing: Bigger isn’t always better

I recently had the opportunity to play with a bunch of hardware to test and optimize Nitro, pushing as many tasks per second through it as possible. What I found led to some interesting insights into selecting servers for HTC.

I ran tests on five different Intel Core i7 systems.  Here’s the configuration of the machines:

Machine    Sockets   Cores per Socket   Hyper-Threaded   CPU Threads
Desktop    1         6                  Yes              12
Server 1   4         18                 Yes              144
Server 2   4         18                 No               72
Server 3   2         10                 No               20
Server 4   1         4                  Yes              8

My test consisted of running 500,000 “sleep 0” tasks on each machine using Nitro. Since the tasks do no real work, I was measuring Nitro’s raw scheduling throughput. On each machine I found the optimum number of concurrent tasks to spread across the cores, watching htop to confirm that the load was distributed as evenly as possible and that utilization was acceptable, and then measured the average tasks per second for the test. The raw numbers were impressive, but I’ll save the tasks-per-second and tasks-per-second-per-core figures for the marketing department. Instead, I want to focus on the merits of various server nodes in a high-throughput scenario, and how to get the most from your hardware investment.
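Nitro itself isn’t shown here, but the measurement is easy to sketch. This is a minimal Python stand-in, assuming nothing about Nitro’s internals: blast no-op tasks through a process pool and report average tasks per second (the task and worker counts are illustrative, not the ones from my test).

```python
import time
from concurrent.futures import ProcessPoolExecutor

def noop(_):
    """Stand-in for a 'sleep 0' task: no real work, so only scheduling
    overhead is measured."""
    return None

def measure_throughput(n_tasks=10_000, workers=4):
    """Return average tasks per second for n_tasks no-op tasks."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches submissions so IPC overhead doesn't dominate
        list(pool.map(noop, range(n_tasks), chunksize=256))
    elapsed = time.perf_counter() - start
    return n_tasks / elapsed

if __name__ == "__main__":
    print(f"{measure_throughput():,.0f} tasks/sec")
```

As in the real test, the interesting number is how this rate changes as you vary the worker count and task concurrency, not the absolute figure.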

Here’s the graph of throughput, broken down by tasks per second per core and overall tasks per second for each server:


The winners

Server 4 (8 CPU threads) was the fastest system, by a small margin, when measured in tasks per second per core. Server 3 (20 CPU threads) was 98.5% as fast, but the two large quad-socket machines managed only 45-46% of the per-core rate of the fastest server.

Server 2 (72 CPU threads, no hyper-threading) provided the highest raw throughput on a system-wide basis, completing approximately four times the workload of Server 4 (8 threads).

Server 3 (20 CPU threads) wins the sweet-spot award. It ran at 98.5% of Server 4’s per-core rate while delivering 59.1% of the raw throughput of each of the big quad-socket machines; just two of these servers would outstrip either monster machine’s total throughput while sustaining more than twice its tasks per second per core.
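The sweet-spot arithmetic is easy to check. Here is a back-of-envelope version using the relative rates above, with Server 4’s per-thread rate normalized to 1.0; the 0.455 figure for the big machine is my midpoint of the reported 45-46% range, so treat these as approximations, not benchmark data.

```python
# Per-CPU-thread rates, normalized to Server 4 = 1.0 (approximate).
per_thread_rate = {"Server 4": 1.0, "Server 3": 0.985, "Server 2": 0.455}
cpu_threads = {"Server 4": 8, "Server 3": 20, "Server 2": 72}

# Raw system throughput = per-thread rate * thread count.
raw = {name: per_thread_rate[name] * cpu_threads[name] for name in cpu_threads}

# Two Server 3 nodes out-deliver one 72-thread machine:
print(raw["Server 3"] * 2, ">", raw["Server 2"])
```

The same numbers also reproduce the roughly 4x raw-throughput gap between Server 2 and Server 4 quoted above.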

Divide and Conquer

I had to fine-tune the test on each system to get the most throughput. Early runs showed a bias towards the virtual cores: htop showed them getting twice the work of the physical cores. So I separated the tasks into task groups and pinned each group to a set of CPU threads, making sure no cores sat idle or underutilized. I ran multiple Nitro workers on each server, each pinned to its own set of CPU threads, to keep the load across the cores as even as possible.
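On Linux the pinning itself is one system call. A minimal sketch, assuming a Linux host (Nitro’s own pinning mechanism may differ; the commented worker range is hypothetical):

```python
import os

def pin_to_cpus(cpu_ids):
    """Pin the calling process (and any children it later forks) to the given
    CPU thread IDs, using the Linux scheduler's affinity mask."""
    os.sched_setaffinity(0, cpu_ids)
    return sorted(os.sched_getaffinity(0))

# Hypothetical example: a worker that should own the first 12 CPU threads
# of the 72-thread server would pin itself before spawning tasks:
# pin_to_cpus(range(12))
```

Because children inherit the mask, pinning the worker once is enough to keep every short-lived task it forks on that worker’s slice of the machine.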


Resource contention plays a vital role in server task throughput. If your tasks are very short and are hitting your server at mach 60, your server is bound to be creating a lot of process forks. In Linux, page table entries use copy-on-write semantics, so there isn’t much cost associated with the memory those tables represent, but the OS still has to lock the page tables to perform the copy, an operation that causes contention when more than 100 processes try to fork at the same time.
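You can get a feel for the fork ceiling with a toy measurement. This sketch times fork+exit cycles from a single parent on a Unix-like system; it won’t show the kernel-side contention itself (that needs many processes forking at once), but it puts a number on the per-fork cost that contention then multiplies.

```python
import os
import time

def fork_rate(n=200):
    """Time n fork+exit cycles from one parent and return cycles per second.
    Each child exits immediately, like a 'sleep 0' task."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)          # child does no work at all
        os.waitpid(pid, 0)       # reap the child before the next fork
    return n / (time.perf_counter() - start)
```

Running several copies of this concurrently, pinned to different cores, is a rough way to watch the aggregate rate scale worse than linearly as forking processes pile up.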

To mitigate the OS contention problem you’ll need to find the right balance between task threads and CPU threads. Most of the servers in my test worked best when loaded with 1.3 to 5 task threads per CPU thread, but the surprise was Server 4, which ran best with task threads at 32 times the number of CPU threads.
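Since the best multiplier varies this much per machine, it’s worth sweeping it empirically. A sketch of such a sweep, where `run_benchmark` is a placeholder for whatever measures your tasks per second at a given concurrency:

```python
import os

def best_multiplier(run_benchmark, multipliers=(1, 1.3, 2, 5, 32)):
    """Try several task-thread multipliers and return the one with the highest
    measured throughput. run_benchmark(n_task_threads) -> tasks per second."""
    n_cpu = os.cpu_count()
    scores = {m: run_benchmark(int(m * n_cpu)) for m in multipliers}
    return max(scores, key=scores.get)
```

The candidate list above just brackets the range I saw in my tests; on a new machine you’d widen it until the peak is clearly inside the range.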

You may need multiple task schedulers (in my case, Nitro workers) per node so that scheduling overhead doesn’t get in the way of throughput. If you do, separate them into pools of CPU threads so they aren’t trying to use the same cores, and keep each pool size an even divisor of the CPU thread count so the pools align with whole cores.
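Carving the machine into pools is simple bookkeeping. This sketch splits the CPU thread IDs into contiguous, equal pools and prints a `taskset` launch line per worker; `<worker-cmd>` is a placeholder, since the actual worker invocation depends on your scheduler.

```python
def cpu_pools(n_cpu_threads, n_workers):
    """Split CPU thread IDs 0..n-1 into contiguous, equal-size pools, one per
    worker, so no two workers contend for the same cores. n_workers must
    divide the thread count evenly to keep the pools aligned to whole cores."""
    assert n_cpu_threads % n_workers == 0, "pick a worker count that divides the CPU threads"
    size = n_cpu_threads // n_workers
    return [list(range(i * size, (i + 1) * size)) for i in range(n_workers)]

# e.g. the 72-thread server split across 6 workers:
for i, pool in enumerate(cpu_pools(72, 6)):
    print(f"taskset -c {pool[0]}-{pool[-1]} <worker-cmd>   # worker {i}")
```

On a hyper-threaded box you’d also want each pool to keep sibling hyper-threads together, which contiguous ranges usually do under Linux’s default CPU numbering.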

Now it’s up to you to figure out what the cost per server and throughput per server will be to find the right hardware for your organization. If you have high throughput computing needs, give us a call – we know how to get things done fast!