In the previous post in this series, we saw that the number of threads that a given process could support was determined by a number of factors, of which the stack size reserved for each thread was key.
We also saw how we could change the stack size used by our application and how this could increase the number of threads that our process could support. But if you thought it seemed a bit crude to have to set a single stack size for all the threads in a process (including the main thread), then you would be right, and we can do something about this.
If we go back and look at our testlimit.dpr code, we can see a number of parameters supplied to the Windows CreateThread function:
if CreateThread(NIL, 0, @ThreadProc, NIL, 0, id) = 0 then
The two 0’s are what we are interested in: the 2nd and 5th parameters.
The 2nd parameter is actually stack size !! By leaving this as zero in our test code, we were indicating that we wanted each thread to use the default stack size – i.e. that established for the process itself, as specified in the Linker options or by the MINSTACKSIZE/MAXSTACKSIZE directives.
But with this stack size parameter to the CreateThread function we can override the process default stack size with a size (in bytes) more appropriate to the needs of the particular thread we are creating !
But wait! There are two stack sizes – a minimum and a maximum. But there is only one stack size parameter ! How come ?
The answer is that the stack size parameter is in effect “overloaded”.
By default, the stack size specified in the 2nd parameter is used to establish the commit charge (minimum stack size) for the thread stack. But as we saw in the previous instalment, it is the larger, reservation value (maximum size) that is most important in this exercise.
This is where the 5th parameter comes in.
This parameter accepts flags that we can pass to control certain aspects of the thread creation process. CREATE_SUSPENDED is one such flag. Another (indeed the only other, currently as far as I know) is the long and rather cumbersomely named STACK_SIZE_PARAM_IS_A_RESERVATION flag.
If we pass this flag then the stack size parameter value will now be interpreted as establishing the reservation size for the thread’s stack (the maximum).
What you will quickly have noticed however, is that you cannot specify both the minimum and maximum stack size for a thread. It’s one or the other. Whichever one you set, the other will still be taken from your application defaults (there are some funny rules in play that mean that you won’t always get what you ask for, but we shall look at those later).
For now, it’s enough that in most cases the stack size that you will wish to limit for a given thread will be the maximum (reservation).
So, to test the effect this has, let us restore the process default limits to the usual Delphi defaults by changing the stack size directives that we added to the code back to the initial values we used (16 KB and 1 MB respectively):
{$MINSTACKSIZE 16384} {$MAXSTACKSIZE 1048576}
Just to make sure, compile and run (for Win32 remember) and ensure that your thread limit is reported as being back down around 1567 (or whatever result you originally saw with the 1 MB stack size in your case).
Now let’s make the changes needed to exercise some control over the individual threads.
First, we must declare STACK_SIZE_PARAM_IS_A_RESERVATION because this flag is not actually defined for us in the Windows unit, or any other unit for that matter (this and the fact that we aren’t using TThread for these tests should have set an alarm bell ringing if it wasn’t already – I think you can guess where this is headed…).
We will also add a constant for the desired thread stack size, let’s start with 64 KB. Finally, we need to apply both our desired thread stack size and the flag to the CreateThread call, ending up with:
program testlimit2; {$APPTYPE CONSOLE} {$MINSTACKSIZE 16384} {$MAXSTACKSIZE 1048576} uses SysUtils, Windows; function ThreadProc(aParam: Cardinal): Integer; begin Sleep(INFINITE); result := 0; end; const STACK_SIZE_PARAM_IS_A_RESERVATION = $10000; THREAD_STACK = 65536; var i: Integer = 0; id: Cardinal; begin try while TRUE do begin if CreateThread(NIL, THREAD_STACK, @ThreadProc, NIL, STACK_SIZE_PARAM_IS_A_RESERVATION, id) = 0 then ABORT; Inc(i); end; except WriteLn(i, ' threads is the limit'); end; end.
Again, click to get the (zipped) code.
If you compile and run this, you should now see your thread limit restored to the higher limit (6076 in my case – your mileage may vary), despite the fact that the default stack size, and the stack for the main thread in the process itself, is still 1 MB.
In the next article in this series we shall look at how we sometimes will not get what we ask for, and consider why.
But to close this article, I shall address a couple of points raised in the comments on the first installment.
Raising The 2GB Process Memory Limit on Win32
A couple of people pointed out that you can increase the address space for a process even on 32-bit Windows by applying the LARGE_ADDRESS_AWARE flag in the PE header for the executable.
This is true, and as Mark Russinovich covers in his very detailed article on the subject, this will indeed increase the number of threads you can create in any given process.
However, the object of the exercise in this series is to identify what the limiting factors are, not to figure out ways of increasing those limits. Quite apart from anything else, in this case, the ability to create the maximum possible amount of threads in a process is not the goal – the goal is to make the best possible use of the available resources, and for threads (generally speaking) the key resource is CPU, not RAM.
Indeed, the more threads you add to a process, the more time your process will potentially spend context switching between those threads.
Is The Problem The Stack Memory, or Creating The Threads ?
Remember that the original impetus for this series of posts was an observation that the Indy framework on which DataSnap is based suffers severe performance problems as a result of the threading model that it uses (by default).
Gaetan Maerten commented on the previous post, saying that he thought it was the creation/destruction of the threads that was the cause of the performance bottleneck in Indy, rather than the memory usage per se.
Timing the code to create the maximum number of threads, and then replacing the thread creation with a simple memory allocation of the same amount of memory as would be required for the stacks for those threads (AllocMem(THREAD_STACK)) and timing the same number of allocations, we can then compare those times.
What I found was that the code that simply allocated a given number of memory blocks took only 25%-40% of the time required to create the same number of threads.
So Gaetan is correct – it is the creation and destruction of threads that is the source of the majority of the bottleneck, with the cost of the stack allocation for those threads accounting, in crude terms, for between 25% and 40% of the total thread creation cost.
This points the way toward a more efficient, scalable parallel architecture which we shall look at in a later post.
Less Is [Sometimes] More
Both of these points – context switching and thread creation overhead – mean that in many cases the most efficient use of threads involves using fewer threads, not more, although even that is not always a hard and fast rule.
Before I conclude this series I shall share an experience from many years ago that taught me how reducing the number of threads, even in a relatively simple yet theoretically highly parallel, architecture, actually led to significant improvements in performance.
But before anyone latches on to that as a touchstone for multi-threading success, we also found in the same system that sometimes we needed to increase the number of threads.
But, as I say, that’s for a later post.
Nice article, hope there are more parts to this topic 😉
btw I think it was BeOS that had very fast (cheap) thread creation (I think it was BeOS, but I might be wrong). Someone of the Os-team made demo for it, to make simple video processing tool which created new thread for each effect on every single frame in Real Time!, that is pretty good performance on that part.
Also I remember some news, maybe 2-3 years ago (Not sure) of some Linux (kernel) patch which made Linux thread creation something like 10-100x faster.
I’ve read some places that the creating the new thread in Windows is an expensive operation. For sure it depends what you are doing, for how long time and with how many threads (And of course on what hardware, and maybe Windows version).
Increasing the number of threads does not make much sense with our current PC hardware. We have only up to 8 cores per CPU. So only 8 (or 16) threads could be run at the same time, if each thread uses 100% of the power.
IOCP is the solution here: you have a limited number of threads, then you use non blocking Input/Output. Each thread get a request from the input queue, handle it at full speed, then send back the response asynchronously. This is what we use in mORMot (and also WCF), via http.sys kernel-mode server. That is, asynchronous process is made in kernel mode, not in user mode; Microsoft ensure this is as fast as possible. See http://blog.synopse.info/post/2011/03/11/HTTP-server-using-fast-http.sys-kernel-mode-server
Then, to handle sessions, mORMot does not need one thread per client, but just one small in-memory structure per client. This is achieved in addition to secure per-URI authentication, and optionally about instance life time of the service implementation interface.
Trying to solve the thread contention will always be less efficient that changing the design.
But in all cases, what you discuss here about low-level thread initialization is quite interesting.
One additional remark: if you release a thread handle, it may be re-used for the next corresponding thread creation. So it should be faster to create then destroy threads in a line, than creating a huge number of threads then release them.
As you say, the number of 100% utilisation threads is limited by the number of cores available to usefully execute them. But many threads in a multi-threaded process spend lots of time idle, so whilst the constraining factor is CPU it isn’t necessarily a 1:1 constraint.
More to come on this. 🙂
Interesting that you have choosen a THREAD_STACK of 64KB,
which incidentaly is the Address space allocation granularity
of Windows – thanks to the legacy brief support for DEC Alpha
architecture, back in the NT4 days (1996).
http://blogs.msdn.com/b/oldnewthing/archive/2003/10/08/55239.aspx
http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx
“(…) so each thread’s stack occupies 64KB of address
space even if only 4KB of it is used (allocated?)”.
Just to give some historical context on why is it so 😉
It is interesting isn’t it, but is it a coincidence … ? 😉
On the number of threads for good CPU usage:
I do not have the source link anymore, but I have seen an interesting series of tests on that matter, where the conclusion was: As a rule of thumb, if all threads are fully busy at the same time and in a general case, 2*[CPU Cores] gives the best execution times. (This held true for Hyperthreaded cores too.)
In a calculation-heavy application I wrote in C# I tested this thesis myself, and indeed this gave me the best performance. I have not bothered to work out the exact reasons though.
This in the end leads me to the conclusion, that a thread-pooling framework which internally gives out this number of threads might provide a reasonable speedup. For two reasons: If you pool your threads, they only need to be created once, and if it is a maximum of 2*Cores (and leave requests that would lead to more threads pending until another one is free), you will get the most out of it even when heavily loaded with work.
(You would need to make sure your granularity is fine enough to prevent excessive blocks though, probably making this inapplicable in some special cases.)
I made it a habit to try to design for always-running threads, that get sent a flag and some data when a job needs doing. In the meantime, it’ll happily cycle a Sleep(1)-loop and consume 0% CPU time doing that. And since I limit the number of threads with this, these idle ones won’t notably clutter my CPU either.
actually a really productive article, much thanks congratulation 😉