Anatomy of the Futex: Wait & Notify

Why does the Futex exist? A number of years ago I was pondering this very question - along with why developers looking for a fast, low-latency lock so often reach for this primitive. Unfortunately, resources on the topic and details of the implementation were relatively scarce. The details can certainly be found, but I have yet to find a single resource that covers most of what you need to know in one place. If, like my younger self, you are looking for details on what a Futex is and how it is implemented, you have come to the right place!


What is a Futex?

Let us begin by quoting a snippet from the Futex Wikipedia page:

In computing, a futex (short for "fast userspace mutex") is a kernel system call that programmers can use to implement basic locking, or as a building block for higher-level locking abstractions such as semaphores and POSIX mutexes or condition variables.

While this snippet is somewhat misleading, it does point us in the right direction. Futex is short for "fast userspace mutex": it is a Mutex that is fast and lives in userspace. We all know what is meant by fast, but what does userspace mean in this context?

Userspace refers to any code that is not executed within the operating system kernel. There is a cost to transitioning our execution context into and out of the kernel (along with not knowing when execution will be handed back to us), which generally leads to increased latency. The goal, then, is to remain in userspace as much as possible - this is what makes a Futex fast.

Mutex Breakdown

Now that we know what a Futex is, let's take a look at how we can split a Mutex into its constituent parts and review how each part could be optimised for low latency. All modern desktop operating systems provide a range of kernel-level synchronisation primitives, including a Mutex.

The consistent issue shared by these primitives is that using them forces us into kernel space when acquiring or releasing the lock, and so we have to pay the associated performance and latency cost. This cost can, however, be vastly reduced by using lighter-weight primitives.

The first part required to implement Mutex behaviour is a piece of state to determine whether the lock is currently in use. A Mutex must be capable of being used by effectively any number of threads and so our locked state must be atomic:

#include <atomic>

class futex {
	std::atomic_bool _is_locked;
};

We then need a lock (or acquire) method to take exclusive ownership of the lock, and an unlock (or release) method to hand it back so other threads can acquire it:

#include <atomic>

class futex {
	std::atomic_bool _is_locked;

public:
	futex() = default;

	// other constructors ...

	void lock() {
		// Spin until we atomically flip the flag from false (unlocked) to true (locked).
		while (true) {
			bool expected = false;
			if (_is_locked.compare_exchange_weak(
					expected, true,
					std::memory_order_acquire,
					std::memory_order_relaxed)) {
				break;
			}
		}
	}

	void unlock() {
		// Release the lock; the release ordering publishes our writes to the next acquirer.
		_is_locked.store(false, std::memory_order_release);
	}
};

This implementation is technically already complete, assuming it is not misused (such as by calling lock recursively or calling unlock when the lock isn't held). It is also incredibly fast in low-contention scenarios, because locking and unlocking each require only a single atomic operation.

The issue is that all waiting threads enter a spin-loop until the lock becomes available, leading to wasted CPU cycles and core starvation. We need a way to put the thread to sleep while it waits for the lock and wake it up once the lock becomes available.

Atomic Wait & Notify

One option would be to use the operating system primitives mentioned above to put waiting threads to sleep, but we can do better! All desktop operating systems also provide a low-cost primitive for sleeping and waking threads on the value of an address: the futex syscall on Linux, WaitOnAddress / WakeByAddressSingle on Windows, and os_sync_wait_on_address / os_sync_wake_by_address_any on macOS.

The title of this section is "Atomic Wait & Notify" because in C++20 (released in 2020), these low-cost sleep / wake primitives were added to the standard as building blocks intended for implementing primitives like the Futex. The APIs in question are std::atomic::wait, std::atomic::notify_one and std::atomic::notify_all, all documented on cppreference.
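
As a rough sketch of how these calls fit together (my own example, using only the standard std::atomic::wait / notify_one APIs): one thread blocks on a value it expects to change, and another thread changes the value and notifies it.

#include <atomic>
#include <thread>

int main() {
	std::atomic<int> value{0};

	std::thread waiter([&value] {
		// Blocks while value is still 0; returns once it has been
		// notified and the value has changed.
		value.wait(0, std::memory_order_acquire);
	});

	value.store(1, std::memory_order_release);
	value.notify_one(); // wake the waiting thread

	waiter.join();
}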

Interestingly, this has been a problem area for standard library implementers on Mac for some time, because the underlying syscalls were private APIs whose use violates Apple's App Store guidelines. The public Mac APIs mentioned above were introduced with Xcode 15.4, released in May 2024. The implementations of std::atomic::wait and std::atomic::notify_one / notify_all on Mac are approximated using a non-Futex primitive.

Putting it all together

Now that we have all of the tools required to implement a Futex, let's take a look at a complete draft implementation:

#include <atomic>
#include <cstdint>

class futex {
	constexpr static uint8_t status_unlocked = 0;
	constexpr static uint8_t status_locked = 1;
	constexpr static uint8_t status_contended = 2;

	std::atomic<uint8_t> _status;

public:
	futex() = default;

	// other constructors ...

	void lock() {
		uint8_t expected = status_unlocked;
		// Fast path: a single atomic operation acquires an uncontended lock.
		if (_status.compare_exchange_strong(expected, status_locked,
				std::memory_order_acquire, std::memory_order_relaxed)) {
			return;
		}

		// Slow path: acquire (or mark) the lock in the contended state so that
		// the eventual unlock() notifies any waiters that are still asleep.
		while (true) {
			if (expected != status_contended) {
				if (_status.compare_exchange_weak(expected, status_contended,
						std::memory_order_acquire, std::memory_order_relaxed)) {
					if (expected == status_unlocked) {
						return;
					}
				} else {
					continue;
				}
			}

			// Sleep until notified; wait returns once _status is no longer contended.
			_status.wait(status_contended, std::memory_order_acquire);
			expected = _status.load(std::memory_order_relaxed);
		}
	}

	void unlock() {
		if (_status.exchange(status_unlocked,
				std::memory_order_release) == status_contended) {
			_status.notify_one();
		}
	}
};

The main changes to our previous implementation are as follows:

The first thread to acquire the lock will succeed at the compare exchange strong (fast path) and exit early; in low-contention scenarios acquiring the lock will be incredibly fast. If there is contention, however, a single thread will succeed quickly while the others fall into the while loop (slow path). These other threads mark the lock as contended and use atomic wait to sleep until the _status field changes.

Once the first thread has completed its critical section, it calls unlock, which atomically exchanges the status back to unlocked. exchange returns the previous value of the atomic; if that value was contended, atomic notify is called to wake a single waiting thread. The idea is that if the state was merely locked, no threads were waiting and the notify call can be skipped - significantly reducing the cost of locking and unlocking when there is no contention.

As a final note: we must assume that a waiting thread can wake up without unlock having been called - this is known as a "spurious wake-up". We protect against this by trapping execution in a loop and carefully restricting its exit conditions.
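
To see the class in action, here is a small usage sketch of my own (not part of the implementation above): because futex exposes lock() and unlock() it satisfies the standard's BasicLockable requirements, so it can be dropped straight into std::lock_guard.

#include <mutex>  // std::lock_guard
#include <thread>
#include <vector>

// Assumes the futex class defined above is visible here.
futex counter_lock;
long counter = 0;

int main() {
	std::vector<std::thread> threads;
	for (int i = 0; i < 4; ++i) {
		threads.emplace_back([] {
			for (int j = 0; j < 100000; ++j) {
				std::lock_guard<futex> guard(counter_lock); // lock() on entry, unlock() on exit
				++counter;
			}
		});
	}
	for (auto& t : threads) {
		t.join();
	}
	// counter is now 400000; without the lock the result would be unpredictable.
}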

What's Next

Let's be clear: this is not a production-ready implementation, and many areas could be improved - protection against recursive locking and double unlocking, for example, as well as performance optimisations such as a limited, low-power spin loop that attempts to catch cases where the lock is only held momentarily.
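
To give a flavour of that spin-loop idea, here is a hedged sketch of a helper that lock() could try before falling into its wait-based slow path. The function name, spin count and use of std::this_thread::yield are illustrative choices of mine, not taken from any particular implementation.

	// Illustrative member function for the futex class above (requires <thread>).
	// It attempts the fast path a bounded number of times so that briefly-held
	// locks can be acquired without ever putting the thread to sleep on the futex.
	bool try_lock_spin(int spins = 64) {
		for (int i = 0; i < spins; ++i) {
			uint8_t expected = status_unlocked;
			if (_status.compare_exchange_weak(expected, status_locked,
					std::memory_order_acquire, std::memory_order_relaxed)) {
				return true; // lock acquired
			}
			std::this_thread::yield(); // or a CPU pause hint on the target architecture
		}
		return false; // caller falls back to the wait-based slow path
	}

lock() would attempt this spin first and only then enter its waiting loop; unlock() would remain unchanged.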

If you would like to review something production-ready, the Rust standard library's implementation is open source and easy to read and understand: futex.rs