Gravity Moth

So, first things first, threading is tricky, and we don’t make it easy for devs trying to learn. There’s a ton of names floating around for thread-primitives, many of which are confusing. I’ll start by defining terms for this article, to help keep things straight.

Defining Terms

Mutexes (Locks) are a way to ensure only one thread can run some section of code, or touch some resource for a period of time. If you use pthreads, you might use a pthread_mutex, or for MSVC, a CriticalSection can be used to cover the same need.

Condition Variables are used for data signalling with locking, often used for managing thread pools. All threadpool workers might sleep on a condition variable, and when work is pushed the workers can be woken up sequentially, getting an exclusive lock to grab a task from the pool.

So, what’s a Futex then? Futexes are a very handy thread primitive, provided by most major kernels (Windows, Linux, OSX, BSDs). We’ll get into it with some examples.

Futexes For Great Good

We’re going to use C11 atomics in x86-land, because they keep things on the dev-side relatively simple.

Ready to Go? Futex Flags

Let’s start with a ready flag! These can be handy for making threads wait after start up, so that main has some time to pass them all things to do.

A few things to know before we get into the code, futex_wait(a, b) is a “mr.kernel, check my assumptions please” syscall. If your assumptions about variable state are correct when the kernel checks, it will put your thread to sleep, ready to be woken by another thread in the future. In pseudocode, the kernel does something something vaguely like this.

if a == b {
	sleep_until_woken()
}
return

In practice, futexes on many platforms can wake up unexpectedly, so you have to catch/retry if the value they get is not what you’re expecting. For these examples however, we’ll pretend futex_wait handles that for us.

futex_broadcast(a) wakes up all the threads sleeping on address a.

Ok, here we go:

void *thread_worker(void *args) {
	int32_t *ready = (int32_t *)args;

	printf("Checking to see if main is ready yet\n");
	futex_wait(ready, 0);

	printf("Doing thread things now!\n");
	for (int i = 0; i < 5; i++) {
		printf("thread-work %d\n", i);
	}

	return NULL;
}

int main(int argc, char **argv) {
	_Atomic int32_t ready = 0;

	// Spin up all the threads first
	Thread t1, t2, t3;
	t1 = thread_create(thread_worker, &ready);
	t2 = thread_create(thread_worker, &ready);
	t3 = thread_create(thread_worker, &ready);

	printf("Doing fake work in main to prepare!\n");
	for (int i = 0; i < 10; i++) {
		printf("pre-work %d\n", i);
	}

	// Ok, we're ready to go!
	ready++;
	futex_broadcast(&ready);

	thread_end(t1);
	thread_end(t2);
	thread_end(t3);

	return 0;
}

Typically, you’d use a condition variable / conditional wait for this, but due to the way that API is designed, it’s not particularly efficient. You don’t want to have to manage an unused mutex with your condition, all you really want is a wait condition and a trigger.

Futex-Mutex: Lock it Down

On to the next fun exercise, building a lock! Locks are built out of two new important parts: atomic_compare_exchange_strong and futex_signal.

atomic_compare_exchange_strong(a, b, c) attempts to atomically swap a with c, if a == b. By swapping in locked(1), or unlocked(0), you ensure only that one thread can grab or release the lock at a time. (this is sometimes referred to as compare-and-swap) (on x86, it compiles down to cmpxchg with the lock prefix)

futex_signal(a) wakes up only 1 thread sleeping on the lock (a).

Meat and potatoes time:

typedef _Atomic int32_t Mutex;

void mutex_lock(Mutex *lock) {
	int32_t locked   = 1;
	int32_t unlocked = 0;

	for (;;) {
		bool ok = atomic_compare_exchange_strong(lock, &unlocked, locked);
		if (ok) {
			return;
		}

		futex_wait(lock, locked);
	}
}

void mutex_unlock(Mutex *lock) {
	int32_t locked   = 1;
	int32_t unlocked = 0;

	bool ok = atomic_compare_exchange_strong(lock, &locked, unlocked);
	if (!ok) {
		printf("Double unlock?\n");
		exit(1);
	}

	futex_signal(lock);
}

The futex_wait in this example makes use of the kernel’s atomic assumption check, so that it doesn’t get stuck in a sleep if the lock is unlocked between when it tried to lock and went to sleep.

Lock-building strategies vary wildly depending on workload, you don’t always want to hop right into the futex right after a failed compare-and-swap. If your lock is wrapped around a very cheap operation, a spin-lock that turns into a futex-lock after a few tries can give you a big win. (A spin-lock runs the compare-and-swap in a while-loop, trying over and over again to grab the lock)

Fancy Tricks with Race Conditions

I used futexes in a work-stealing threadpool I wrote for the Odin compiler, to help it scale better for many-core machines.

One of the clever tricks there, I use an atomic int to track remaining tasks in the pool, and then use a futex() to both put the thread to sleep if there’s really no more work, or wake up immediately if a new task right after my check came back empty. To make it work properly I leverage a race condition, trading a little bit of efficiency when the pool is nearly dry, for more throughput when the pool is saturated. Defining constraints for your problem can be a huge win for performance and simplicity, and when dealing with threads, simplicity is critical.

A Few Constraints

Main must participate in the work
- Debugging task bugs gets way easier if you can shut off all child threads, run the pool’s work on only the main thread, and still hit most of the threadpool’s code.
Workers can only push tasks to their own queues
- Threads always know the max number of active tasks in their queues, so they won’t sleep before it’s all finished
All threads except main can steal tasks
- Main doesn’t need to steal, this simplifies main’s runner a little.
The worker doing the last task in the pool must try to wake main
- If the thread with the last task doesn’t wake main, and main didn’t do the last task, your program will never finish
Any time new work gets added to any queue, threads get a wake-up signal
- This lets sleeping threads know that “hey, you might have work to steal!”

Threadpool code to dig through

Spot the race condition in the code? It’s a little tricky!

There’s no guarantee that threads will be woken up when new tasks get added. If threads are done with their own tasks and done stealing, but not yet asleep when the signal comes in, they may go to sleep when they could have stolen a task. That only happens when there’s not quite enough work to do though, so it’s a little loss in efficiency (work is more serial than it could be), to avoid locking and unlocking every time I check or adjust the remaining task count. In practice for Odin, tasks are typically expensive enough and adding new tasks takes long enough that this very rarely occurs. Even when we hit the race condition, all work gets done, it just takes a little longer.

References

Ok, now for a pile of papers and cool things using them.

Initial work from the 2002 Ottawa Linux Symposium
For recursive mutexes and optimization, Ulrich talks about tricky bugs in his paper
More musing on futexes and drepper’s work lives over at locklessinc

Futexes and You

Fri, 13 Jan 2023