
“The Delphi Geek” recently blogged about a performance bottleneck he had identified in FastMM when used with a particular conditional define. Although not directly related, his post reminded me of an experience I had many years ago, working on a highly complex multi-threaded system (long before FastMM), and of the strategy we found we had to employ to get optimal performance from our threading code when encountering different numbers of CPUs (these days, “cores”).

The System

The system in question was a highly complex power station simulation/emulation which had to scale from single-processor (development/test machines) through dual-processor (commissioning and QA) to quad-processor (deployment) hardware. To our surprise, threading that worked very well on a single CPU started to behave quite erratically on multiple CPUs. Our synchronization was fine (surprisingly, in some respects, given that we were developing on single-CPU systems, which effectively serialised our code) – no, the issue was performance.

The system was a training simulation for a physical power station being constructed in Germany – when you turn one of these things on, you need operators who have experience of running it – our job was to create the entire power station in software for these training purposes!

We had to facilitate highly efficient scheduling between different threads. Some threads executed a mathematical model of the power station itself; others emulated control hardware in the real system. On top of that sat a multi-threaded terminal emulation layer (emulating Unix X-Windows terminals on Windows NT) which communicated with a Primary Processing Unit, itself comprising a number of threads that marshalled signals and data between the various hardware emulation threads and the terminal emulation.

The Problem and Our First Attempt at Solving It

As complex as all this sounds, allowing Windows to schedule the threads as it saw fit worked fine on 1 CPU (duh), but on 2 CPUs we found that the system bogged down. It didn’t lock up, as would be expected from a fundamental synchronization issue; things just slowed to a crawl. Of course, synchronization was initially suspected and we spent some time investigating it before satisfying ourselves that we had got things right in that area at least. We eventually concluded that the problem lay in the overhead of context switching threads between the 2 CPUs.

We found that we could massively improve throughput by explicitly affining threads to each CPU using the Windows SetThreadAffinityMask() mechanism, simply sharing them equally across all (i.e. both) CPUs. As each thread came into being we counted the number of active threads: each odd-numbered thread went to one CPU, each even-numbered thread to the other, and this worked like a charm.

Of course, as naive as this approach sounds (it was a “quick fix” to see if it helped – and it did), we didn’t just assume 2 CPUs: we used a modulo algorithm to pick the CPU for each thread, so that it would “scale” to however many CPUs we found in a system, along the lines of the sketch below.
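For illustration, here is a minimal Delphi sketch of that round-robin scheme, assuming a Windows target. The procedure name and the shared thread counter are my own reconstruction, not the original code, and the return values of the API calls are ignored for brevity:

uses
  Windows;

var
  ThreadCount: Integer = 0;  // number of threads started so far

// Affine the calling thread to one CPU, round-robin across all CPUs
// reported by the OS. A hypothetical reconstruction of the "quick
// fix" described above, not the original code.
procedure AffineCurrentThreadRoundRobin;
var
  sysInfo: TSystemInfo;
  cpu: Integer;
begin
  GetSystemInfo(sysInfo);

  // InterlockedIncrement keeps the counter safe when several threads
  // start up concurrently
  cpu := InterlockedIncrement(ThreadCount)
           mod Integer(sysInfo.dwNumberOfProcessors);

  // The affinity mask has one bit per CPU: set just this thread's bit
  SetThreadAffinityMask(GetCurrentThread, 1 shl cpu);
end;

Each thread calls this as its first action; on a 2-CPU box the modulo lands alternate threads on alternate CPUs, which is exactly the odd/even split described above.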

Or so we thought.

It turned out that – in our application – this approach failed miserably on 4 CPUs, for reasons that were not immediately obvious. So we went back to the drawing board. CPU affinity still seemed to be the solution; we just needed to be smarter about how we distributed our threads across the hardware.

Smarter than the OS and smarter than we ourselves had previously been.

The Solution

We eventually landed on the “right” approach. By analysing what the different threads were doing and which resources they contended over (something no OS thread scheduler or memory manager can divine for itself – not with any current technology I am aware of, at any rate) we identified different classes (or groups, or pools – call them what you will) of threads, each with different contention characteristics with respect to its own members and to the other groups.

There was one group that we determined should be affined to just one of the CPUs, comprising threads which were unlikely to be active simultaneously and so could efficiently share a single CPU. A second group was highly likely to be active when the first group was also active, but its threads were less likely to contend with each other, so these could efficiently share two of the remaining CPUs.

There was then a third group containing very few threads which were likely to be running all the time; these were affined to 3 of the CPUs.

(Within each group we also identified a base priority level for the threads in that group, with the ability to specify a higher or lower priority for further identified sub-groups of threads if required. We found this had less effect on performance – in our case – than affinity did, however.)

The 4th CPU was left free for the OS to use as it saw fit – we kept our threads off this CPU entirely, the theory being that the OS would then use that 4th CPU for the other threads and processes we could not (easily or reliably) constrain directly. A sketch of how these group masks might look follows.
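Expressed as affinity masks on a 4-CPU machine, the scheme might look something like the following Delphi sketch. The group names and the exact mask values are my reconstruction from the description above, not the original configuration:

uses
  Windows;

const
  // Bit 0 = CPU 0 .. bit 3 = CPU 3 (values are illustrative)
  MASK_GROUP1 = $01;  // rarely-concurrent threads share CPU 0
  MASK_GROUP2 = $06;  // the second group shares CPUs 1 and 2
  MASK_GROUP3 = $07;  // always-busy threads may run on CPUs 0..2
  // CPU 3 (bit $08) is deliberately never used by our threads

// Apply a group's affinity mask and base priority (for example
// THREAD_PRIORITY_NORMAL) to the calling thread
procedure JoinGroup(aMask: DWORD; aBasePriority: Integer);
begin
  SetThreadAffinityMask(GetCurrentThread, aMask);
  SetThreadPriority(GetCurrentThread, aBasePriority);
end;

With the 4th CPU reserved for the OS, the three CPUs available to the third group necessarily overlap those used by the first two: exactly the kind of trade-off that only knowledge of the specific threads can justify.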

The net result of this exercise was that we externalised these various configurations: we identified the thread “pools” and allowed the executable to be “tuned” by making entries in a configuration file (an INI file, back in those days). At startup, each thread would look in the config file to determine which CPU(s) it should affine itself to, and set its affinity accordingly.
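Here is a sketch of that startup step, again in Delphi for a Windows target. The INI layout, section names and key names are hypothetical, reconstructed from the description above:

uses
  Windows,
  SysUtils,
  IniFiles;

// Read the affinity mask and base priority for the named thread pool
// from <exename>.ini and apply them to the calling thread.
procedure ApplyPoolConfig(const aPoolName: String);
var
  ini: TIniFile;
  mask: Integer;
  priority: Integer;
begin
  ini := TIniFile.Create(ChangeFileExt(ParamStr(0), '.ini'));
  try
    // e.g.  [model]
    //       affinity=3      (mask for CPUs 0 and 1)
    //       priority=0      (THREAD_PRIORITY_NORMAL)
    mask     := ini.ReadInteger(aPoolName, 'affinity', 1);
    priority := ini.ReadInteger(aPoolName, 'priority', THREAD_PRIORITY_NORMAL);
  finally
    ini.Free;
  end;

  SetThreadAffinityMask(GetCurrentThread, mask);
  SetThreadPriority(GetCurrentThread, priority);
end;

Re-tuning then required nothing more than editing the INI file and restarting the simulation.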

In addition to the tuning mechanism we also built in performance monitoring metrics. If we found that performance was not meeting the required throughput levels (effectively realtime operation of the simulation) on a particular machine (hardware/software configuration), we could re-tune the configuration of our threads with almost infinite variety, requiring nothing more than a restart of the simulation software, and then see through those same metrics how effective – or otherwise – our re-tuning had been.
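To illustrate the kind of measurement involved, here is a minimal Delphi sketch that times units of simulation work against the Windows high-resolution performance counter. The function and its parameters are hypothetical illustrations, not the original instrumentation:

uses
  Windows,
  SysUtils;

// Measure how many units of work the calling thread completes per
// second, using the high-resolution performance counter. TProcedure
// (from SysUtils) stands in for one cycle of real simulation work.
function MeasureCyclesPerSecond(aIterations: Integer;
                                aWork: TProcedure): Double;
var
  freq, t0, t1: Int64;
  i: Integer;
begin
  QueryPerformanceFrequency(freq);
  QueryPerformanceCounter(t0);

  for i := 1 to aIterations do
    aWork;  // one unit of simulation work

  QueryPerformanceCounter(t1);
  Result := aIterations / ((t1 - t0) / freq);
end;

Comparing a rate like this against the realtime requirement shows whether a given affinity configuration is keeping up.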

We arrived at a base configuration for each of the common cases: 1, 2 and 4 CPUs. But if someone had won the lottery and given us an 8-CPU machine to play with, we could have devised a further configuration for that exotic hardware too.

The result: one build of the system, with an almost infinite variety of performance profiles, able to cope with any number of CPUs (and whatever background load) our system encountered.

In Conclusion

When it comes to complex multi-threaded applications, the only thing that the compiler, the OS and the runtime know is that you have a complex multi-threaded application. The only place where precise, detailed knowledge of how the various threads in your system will interact during their lifetimes can be found is in the heads of the system’s designers and developers.

Synchronization between threads is one expression of those interactions that ensures that the system behaves correctly.

Scheduling of those threads is another expression of those interactions, one that ensures that the system behaves optimally.

Just as we already know that we cannot rely on the runtime to automatically divine the necessary synchronization between threads, it really should not come as much of a surprise that, sometimes at least, the runtime also fails to divine the optimal scheduling.

Just as we have to put effort into getting synchronization right, attention to scheduling can sometimes be just as important.

Getting Across the Platforms

It should also be noted, in these days of cross-platform support in Delphi, that not all operating systems provide the same degree of control over thread scheduling as Windows does. This is not necessarily indicative of less sophistication in those other OSes – perhaps quite the opposite: if low-level controls aren’t provided, it may simply be because they aren’t found to be needed.

I have no idea whether the approach we settled on in this system would have been possible, or even necessary, on OS X or Linux for example, but I am sure the principle can be adapted and applied to whatever degree may be necessary and appropriate in those cases.

7 thoughts on “Use Knowledge of Your Own Threads to Extract Optimal Performance…”

  1. Actually, you have hit on some of my preferences for modern OSes as the number of available CPUs starts to climb.

    First – The OS should reserve a core exclusively for its own usage. I don’t care if one of the cores in a 4-core system (or 2 of 8 in a hyperthreaded system) twiddles its thumbs most of the time. In fact, this is often the case anyway. So long as the system remains highly responsive, I consider this money and resources well spent.

    Second – I wish Windows would stop ping-ponging tasks between processors. Run just one single-threaded, high-CPU-usage task and you can watch various versions of Windows try to share the load between 2 cores. It is DEMENTED madness. Park threads on CPUs and only move them when a move becomes absolutely required to maintain responsiveness. Honestly, you can leave 2 cores in HLT mode if all my apps only bring one of the cores to 50% usage. All stirring the pot does is waste power and cycles as the tasks are moved between cores (expensive to do).

  2. Nice article – I have definitely used the same approach for some server-based applications. Setting a per-thread core affinity and letting the threads be “specialized” (in disk processing, RAM access, network handling…) is one key to scaling and better HW use.

    At the code level, you may also use some rules to make your application more multi-thread friendly.
    See e.g. what I wrote in this blog article.

    When it comes to speed optimization, I enjoy quoting these programmers:

    Here are some famous quotes about optimization – nice rules to add to my too detailed list:
    Make it right before you make it fast. Make it clear before you make it faster. Keep it right when you make it faster. — Kernighan and Plauger, Elements of Programming Style.
    Premature optimization is the root of all evil. — Donald Knuth, quoting C. A. R. Hoare
    The key to performance is elegance, not battalions of special cases. The terrible temptation to tweak should be resisted. — Jon Bentley and Doug McIlroy
    The rules boil down to: “1. Don’t optimize early. 2. Don’t optimize until you know that it’s needed. 3. Even then, don’t optimize until you know what’s needed, and where.” — Herb Sutter

    1. @A. Bouchez: Thanks for the comment. Some nice quotes in there too which should inspire some further reading I am sure. 🙂

  3. The Windows scheduler seems much better from Vista onward, and also better on 64-bit than 32-bit. I find that my app runs better if we don’t affinitise and just let the scheduler decide.

    1. @David: Yep, as mentioned in the post, the specific experience being related was from some years ago – before Windows XP, let alone Vista. The general principle is the main point: no scheduler, no matter how good, can possibly know as well as a developer how the threads in a process will interact, and if absolute performance is a requirement then we have the tools we need to get that performance; we just need to use them.

    1. @Tommi: Yes, although SetThreadIdealProcessor() differs from SetThreadAffinityMask() in that it nominates just one specific CPU or core, and only as a preference (a scheduling hint) rather than a hard constraint. If your needs are such that you will only ever wish to steer a thread towards a single CPU then this would be a slightly easier alternative to use, I suppose, but an affinity mask is more flexible.
