We intend to build a lock-free, priority-aware work-stealing deque for dynamic task scheduling in parallel computing systems. Each worker thread maintains a priority deque, pushing and popping tasks locally while stealing tasks from other threads based on priority. This is achieved using atomic operations (compare_exchange, fetch_add) with OpenMP, and optionally MPI. We aim to benchmark the implementation on shared-memory systems (8–128 cores) and on multi-node systems using MPI, measuring throughput, steal attempts, and priority correctness.
Work stealing allows idle threads to dynamically balance load by taking tasks from busy peers. Traditional work-stealing deques are extended here with priority awareness, and atomic operations replace locks to avoid blocking and reduce contention. The system supports fine-grained concurrency, priority-driven task selection, and hybrid OpenMP+MPI environments for scalability across nodes. Benchmarking covers throughput, steal success rate, and accuracy of priority execution.
while not done:
    task = pop_local_highest_priority_task()
    if task != NULL:
        execute(task)
    else:
        victim = check_other_threads()
        task = steal_lowest_priority_task_from(victim)
        if task != NULL:
            execute(task)
        else:
            increment local_priority
            spin_wait()
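As a concrete rendering of this loop, the following C/C++ sketch shows one worker draining its own deques and escalating its priority level when probing peers fails. The helper functions (pop_local, steal_from, pick_victim, execute, spin_wait) and the task_t type are hypothetical placeholders rather than the project's actual API, and the sketch probes peers at the worker's current priority level, as in the implementation discussion further below.

/* Sketch of the worker loop; all helpers declared here are hypothetical. */
typedef struct task task_t;

task_t *pop_local(int prio);               /* pop from own deque at this priority */
task_t *steal_from(int victim, int prio);  /* CAS-based steal from a peer's deque */
int     pick_victim(int self, int i);      /* choose the i-th peer to probe */
void    execute(task_t *t);
void    spin_wait(void);

void worker_loop(int self, int num_workers, int num_priorities,
                 volatile int *done) {
    int prio = 0;                          /* 0 = highest priority level */
    while (!*done) {
        task_t *t = pop_local(prio);
        if (!t) {
            /* Local deque at this priority is empty: probe the peers. */
            for (int i = 0; i < num_workers - 1 && !t; i++)
                t = steal_from(pick_victim(self, i), prio);
        }
        if (t) {
            execute(t);
        } else {
            if (prio + 1 < num_priorities)
                prio++;                    /* fall back to the next lower priority */
            spin_wait();                   /* brief back-off before re-probing */
        }
    }
}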
Each thread handles 1000 tasks with mixed priorities, and after executing k tasks it may steal from others. We compare two approaches: a baseline in which a thread steals only when its local deques are empty, and a proactive variant in which, after every k executed tasks, it attempts to steal higher-priority tasks from a random peer. Stealing with priorities can cause priority inversion if not carefully controlled, so we explore ways to balance fairness, correctness, and efficiency under saturation and priority skew.
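The proactive variant's trigger can be sketched as follows; the helper steal_higher_priority_task, the constant K (corresponding to k above), and the random-victim policy are illustrative assumptions, not the project's actual code.

/* After every K locally executed tasks, probe one random peer for a task of
 * higher priority than the level currently being run. Names are illustrative. */
#include <cstdlib>

#define K 64

typedef struct task task_t;

task_t *steal_higher_priority_task(int victim, int below_prio);
void    execute(task_t *t);

void maybe_proactive_steal(long executed_count, int current_prio,
                           int self, int num_workers) {
    if (executed_count % K != 0)
        return;                            /* only probe on every K-th completion */
    int victim = std::rand() % num_workers;
    if (victim == self)
        return;                            /* skip ourselves this round */
    task_t *t = steal_higher_priority_task(victim, current_prio);
    if (t)
        execute(t);                        /* run the stolen higher-priority task first */
}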
Our implementation will be in C/C++, optionally with MPI for distributed memory. We refer to the literature on lock-free data structures and atomics, and consult documentation on std::atomic, the __sync_* builtins, and the __atomic_* intrinsics. Testing is done on multi-core machines and the GHC clusters.
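For reference, the following standalone snippet exercises the CAS and fetch-and-add operations in each of the three APIs named above; it is a demonstration only, not project code.

// Demo of std::atomic, __atomic_*, and __sync_* forms of CAS and fetch_add.
#include <atomic>
#include <cstdio>

int main() {
    // C++11 std::atomic
    std::atomic<long> top(0);
    long expected = 0;
    bool won  = top.compare_exchange_strong(expected, 1);  // CAS: 0 -> 1
    long prev = top.fetch_add(1);                           // returns the old value

    // GCC/Clang __atomic_* intrinsics on a plain integer
    long raw = 0, want = 0;
    __atomic_compare_exchange_n(&raw, &want, 1, /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    __atomic_fetch_add(&raw, 1, __ATOMIC_SEQ_CST);

    // Legacy __sync_* builtins (imply a full memory barrier)
    long legacy = 0;
    __sync_bool_compare_and_swap(&legacy, 0, 1);
    __sync_fetch_and_add(&legacy, 1);

    std::printf("won=%d prev=%ld raw=%ld legacy=%ld\n",
                (int)won, prev, raw, legacy);
    return 0;
}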
Concretely, we plan to:
- Use atomic operations (compare_exchange, fetch_add) to avoid synchronization locks across threads when stealing tasks from the deque.
- Maintain n deques per processor, each storing tasks tagged with its priority, with all operations lock-free; task stealing should choose the best-priority task available on that processor (a minimal data-layout sketch follows this list).
- Benchmark the system on n tasks at once, analyzing time taken, synchronization overhead, communication time, and other relevant metrics.
- Evaluate a variant in which a processor, after executing n tasks, tries to steal higher-priority tasks from a random processor's deques, and determine whether this improves high-priority task completion without harming low-priority tasks.
Platform choice: x86-64 multi-core architecture, with C/C++ for implementation.
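A minimal data-layout sketch of what these goals imply per processor is shown below; the type and field names, the fixed three priority levels, and the benchmark counters are illustrative choices, not the exact definitions in our code.

/* Illustrative per-processor state: one bounded lock-free deque per priority. */
#define NUM_PRIORITIES 3           /* high, medium, low */
#define DEQUE_CAP      4096        /* fixed capacity, for simplicity */

typedef struct {
    void (*run)(void *arg);        /* task body */
    void *arg;
    int   priority;                /* 0 = highest */
} task_t;

typedef struct {
    void         *buf[DEQUE_CAP];  /* entries point to task_t objects */
    volatile long top;             /* remote processors steal (CAS) at this end */
    volatile long bottom;          /* the owner pushes and pops at this end */
} pdeque_t;

typedef struct {
    pdeque_t deques[NUM_PRIORITIES];  /* deques[0] holds the highest priority */
    int      current_priority;        /* level this processor is currently draining */
    long     steal_attempts;          /* counters collected for benchmarking */
    long     steal_successes;
    long     cas_failures;
} processor_t;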
Each processor in our system maintains a set of independent lock-free deques, with each deque dedicated to tasks of a fixed priority level; in our implementation, we use three priority levels. These deques are implemented using CAS operations, enabling processors to push and pop tasks without locks: local operations occur at one end of the deque, while remote processors steal from the opposite end, reducing contention. To ensure correct memory ordering and visibility of updates across all cores, we employ explicit memory barriers using __sync_synchronize() at key points in the push, pop, and steal operations.

Each processor keeps track of the priority level it is currently executing and initially works on tasks from its highest-priority deque. When a processor's local deque for the current priority becomes empty, it attempts to steal tasks of the same priority from other processors. If no tasks are found after probing all peers, the processor increments its local priority and begins working on the next lower level. CAS failures during stealing, which may occur due to contention, are tolerated by the system: a failed CAS does not guarantee that the deque is empty, only that another processor accessed it concurrently. This design accepts occasional priority inversions in favor of minimizing overhead from excessive retries, maintaining eventual consistency as processors repeatedly probe for work. Our implementation carefully balances the overhead of aggressive stealing against the need to enforce strong priority-execution guarantees, and we evaluate this balance using metrics such as CAS failure rates, priority-inversion occurrences, and average stealing effort.
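To make the push, pop, and steal protocol concrete, here is a compact sketch of a single priority level's deque under the scheme described above (owner at the bottom end, thieves performing CAS on the top end, __sync_synchronize() as the barrier). It is a simplified, bounded-buffer, Chase-Lev-style sketch rather than our exact code; resizing and overflow recovery are omitted.

/* One priority level's deque: the owner pushes/pops at bottom, thieves CAS on top. */
#include <cstddef>

#define DEQUE_CAP 4096

typedef struct {
    void         *buf[DEQUE_CAP];
    volatile long top;            /* next slot a thief will try to claim */
    volatile long bottom;         /* next free slot for the owner */
} pdeque_t;

/* Owner-only push. */
static int pdeque_push(pdeque_t *q, void *task) {
    long b = q->bottom, t = q->top;
    if (b - t >= DEQUE_CAP) return 0;        /* full */
    q->buf[b % DEQUE_CAP] = task;
    __sync_synchronize();                    /* publish the task before exposing it */
    q->bottom = b + 1;
    return 1;
}

/* Owner-only pop from the bottom end; races with thieves only on the last task. */
static void *pdeque_pop(pdeque_t *q) {
    long b = q->bottom - 1;
    q->bottom = b;
    __sync_synchronize();                    /* order the bottom update before reading top */
    long t = q->top;
    if (b < t) {                             /* deque was already empty */
        q->bottom = t;
        return NULL;
    }
    void *task = q->buf[b % DEQUE_CAP];
    if (b > t) return task;                  /* more than one task left: no race possible */
    /* Exactly one task left: settle the race with thieves via CAS on top. */
    if (!__sync_bool_compare_and_swap(&q->top, t, t + 1))
        task = NULL;                         /* a thief won the race */
    q->bottom = t + 1;
    return task;
}

/* Called by remote processors. */
static void *pdeque_steal(pdeque_t *q) {
    long t = q->top;
    __sync_synchronize();                    /* read top before bottom */
    long b = q->bottom;
    if (t >= b) return NULL;                 /* appears empty */
    void *task = q->buf[t % DEQUE_CAP];
    if (!__sync_bool_compare_and_swap(&q->top, t, t + 1))
        return NULL;                         /* lost to the owner or another thief */
    return task;
}

As in the description above, a NULL return from pdeque_steal is treated as "try elsewhere" rather than proof of emptiness, which is what allows occasional priority inversions in exchange for low retry overhead.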
Our project implements a scalable, lock-free work-stealing task system designed to efficiently distribute work across multiple processor cores. We evaluated performance by measuring the total execution time required to complete a fixed set of tasks with varying workloads and distributions on systems with varying core counts, specifically 4, 8, and 12 cores for most tests. The results demonstrate strong scalability when increasing from 4 to 8 cores, with execution times decreasing by more than half. This indicates excellent parallel efficiency, where the benefits of parallelizing work significantly outweigh the overhead of synchronization and task stealing. For smaller task counts, however, the performance gains diminish, while larger workloads on higher core counts continue to show good improvement with comparatively low overhead. As the system scales, it transitions from being computation-bound to overhead-bound if the workload size does not grow proportionally. Our findings highlight the importance of balancing task granularity, workload size, and core count when designing scalable parallel systems. We also benchmarked all of our tests against a single lock-free deque that is not priority-aware; our results show that the overhead of our solution is 15-25 percent across workloads.