We intend to build a lock-free, priority-aware work-stealing deque for dynamic task scheduling in parallel computing systems. Each worker thread maintains a priority deque, pushing and popping tasks locally while stealing tasks from other threads based on priority. This is achieved using atomic operations (compare_exchange, fetch_add) with OpenMP, and optionally MPI. We aim to benchmark the implementation on shared-memory systems (8–128 cores) and on multi-node systems using MPI, measuring throughput, steal attempts, and priority correctness.
Work stealing allows idle threads to dynamically balance load by taking tasks from busy peers. Traditional work-stealing deques are extended here with priority awareness, and atomic operations replace locks to avoid blocking and reduce contention. The system supports fine-grained concurrency, priority-driven task selection, and hybrid OpenMP+MPI environments for scalability across nodes. Benchmarking covers throughput, steal success rate, and accuracy of priority execution.
while not done:
    task = pop_local_highest_priority_task()
    if task != NULL:
        execute(task)
    else:
        victim = check_other_threads()
        task = steal_lowest_priority_task_from(victim)
        if task != NULL:
            execute(task)
        else:
            increment local_priority
            spin_wait()
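As a concrete rendering of this loop, the following C/C++ sketch shows one worker draining its own deques and escalating its priority level when probing peers fails. The helper functions (pop_local, steal_from, pick_victim, execute, spin_wait) and the task_t type are hypothetical placeholders rather than the project's actual API, and the sketch probes peers at the worker's current priority level, as in the implementation discussion further below.

/* Sketch of the worker loop; all helpers declared here are hypothetical. */
typedef struct task task_t;

task_t *pop_local(int prio);               /* pop from own deque at this priority */
task_t *steal_from(int victim, int prio);  /* CAS-based steal from a peer's deque */
int     pick_victim(int self, int i);      /* choose the i-th peer to probe */
void    execute(task_t *t);
void    spin_wait(void);

void worker_loop(int self, int num_workers, int num_priorities,
                 volatile int *done) {
    int prio = 0;                          /* 0 = highest priority level */
    while (!*done) {
        task_t *t = pop_local(prio);
        if (!t) {
            /* Local deque at this priority is empty: probe the peers. */
            for (int i = 0; i < num_workers - 1 && !t; i++)
                t = steal_from(pick_victim(self, i), prio);
        }
        if (t) {
            execute(t);
        } else {
            if (prio + 1 < num_priorities)
                prio++;                    /* fall back to the next lower priority */
            spin_wait();                   /* brief back-off before re-probing */
        }
    }
}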
Each thread handles 1000 tasks with mixed priorities, and after executing k tasks it may steal from others. We compare two approaches: a baseline in which a thread steals only when its local deques are empty, and a proactive variant in which, after every k executed tasks, it attempts to steal higher-priority tasks from a random peer. Stealing with priorities can cause priority inversion if not carefully controlled, so we explore ways to balance fairness, correctness, and efficiency under saturation and priority skew.
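The proactive variant's trigger can be sketched as follows; the helper steal_higher_priority_task, the constant K (corresponding to k above), and the random-victim policy are illustrative assumptions, not the project's actual code.

/* After every K locally executed tasks, probe one random peer for a task of
 * higher priority than the level currently being run. Names are illustrative. */
#include <cstdlib>

#define K 64

typedef struct task task_t;

task_t *steal_higher_priority_task(int victim, int below_prio);
void    execute(task_t *t);

void maybe_proactive_steal(long executed_count, int current_prio,
                           int self, int num_workers) {
    if (executed_count % K != 0)
        return;                            /* only probe on every K-th completion */
    int victim = std::rand() % num_workers;
    if (victim == self)
        return;                            /* skip ourselves this round */
    task_t *t = steal_higher_priority_task(victim, current_prio);
    if (t)
        execute(t);                        /* run the stolen higher-priority task first */
}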
Our implementation will be in C/C++, optionally with MPI for distributed memory. We refer to the literature on lock-free data structures and atomics, and consult documentation on std::atomic, the __sync_* builtins, and the __atomic_* intrinsics. Testing is done on multi-core machines and the GHC clusters.
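For reference, the following standalone snippet exercises the CAS and fetch-and-add operations in each of the three APIs named above; it is a demonstration only, not project code.

// Demo of std::atomic, __atomic_*, and __sync_* forms of CAS and fetch_add.
#include <atomic>
#include <cstdio>

int main() {
    // C++11 std::atomic
    std::atomic<long> top(0);
    long expected = 0;
    bool won  = top.compare_exchange_strong(expected, 1);  // CAS: 0 -> 1
    long prev = top.fetch_add(1);                           // returns the old value

    // GCC/Clang __atomic_* intrinsics on a plain integer
    long raw = 0, want = 0;
    __atomic_compare_exchange_n(&raw, &want, 1, /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    __atomic_fetch_add(&raw, 1, __ATOMIC_SEQ_CST);

    // Legacy __sync_* builtins (imply a full memory barrier)
    long legacy = 0;
    __sync_bool_compare_and_swap(&legacy, 0, 1);
    __sync_fetch_and_add(&legacy, 1);

    std::printf("won=%d prev=%ld raw=%ld legacy=%ld\n",
                (int)won, prev, raw, legacy);
    return 0;
}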
Concretely, we plan to:
- Use atomic operations (compare_exchange, fetch_add) to avoid synchronization locks across threads when stealing tasks from the deque.
- Maintain n deques per processor, each storing tasks tagged with its priority, with all operations lock-free; task stealing should choose the best-priority task available on that processor (a minimal data-layout sketch follows this list).
- Benchmark the system on n tasks at once, analyzing time taken, synchronization overhead, communication time, and other relevant metrics.
- Evaluate a variant in which a processor, after executing n tasks, tries to steal higher-priority tasks from a random processor's deques, and determine whether this improves high-priority task completion without harming low-priority tasks.
Platform choice: x86-64 multi-core architecture, with C/C++ for implementation.
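A minimal data-layout sketch of what these goals imply per processor is shown below; the type and field names, the fixed three priority levels, and the benchmark counters are illustrative choices, not the exact definitions in our code.

/* Illustrative per-processor state: one bounded lock-free deque per priority. */
#define NUM_PRIORITIES 3           /* high, medium, low */
#define DEQUE_CAP      4096        /* fixed capacity, for simplicity */

typedef struct {
    void (*run)(void *arg);        /* task body */
    void *arg;
    int   priority;                /* 0 = highest */
} task_t;

typedef struct {
    void         *buf[DEQUE_CAP];  /* entries point to task_t objects */
    volatile long top;             /* remote processors steal (CAS) at this end */
    volatile long bottom;          /* the owner pushes and pops at this end */
} pdeque_t;

typedef struct {
    pdeque_t deques[NUM_PRIORITIES];  /* deques[0] holds the highest priority */
    int      current_priority;        /* level this processor is currently draining */
    long     steal_attempts;          /* counters collected for benchmarking */
    long     steal_successes;
    long     cas_failures;
} processor_t;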
Each processor in our system maintains a set of independent lock-free deques, with each deque dedicated to tasks of a fixed priority level; in our implementation, we use three priority levels. These deques are implemented using CAS operations, enabling processors to push and pop tasks without locks: local operations occur at one end of the deque, while remote processors steal from the opposite end, reducing contention. To ensure correct memory ordering and visibility of updates across all cores, we employ explicit memory barriers using __sync_synchronize() at key points in the push, pop, and steal operations.

Each processor keeps track of the priority level it is currently executing and initially works on tasks from its highest-priority deque. When a processor's local deque for the current priority becomes empty, it attempts to steal tasks of the same priority from other processors. If no tasks are found after probing all peers, the processor increments its local priority and begins working on the next lower level. CAS failures during stealing, which may occur due to contention, are tolerated by the system: a failed CAS does not guarantee that the deque is empty, only that another processor accessed it concurrently. This design accepts occasional priority inversions in favor of minimizing overhead from excessive retries, maintaining eventual consistency as processors repeatedly probe for work. Our implementation carefully balances the overhead of aggressive stealing against the need to enforce strong priority-execution guarantees, and we evaluate this balance using metrics such as CAS failure rates, priority-inversion occurrences, and average stealing effort.
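To make the push, pop, and steal protocol concrete, here is a compact sketch of a single priority level's deque under the scheme described above (owner at the bottom end, thieves performing CAS on the top end, __sync_synchronize() as the barrier). It is a simplified, bounded-buffer, Chase-Lev-style sketch rather than our exact code; resizing and overflow recovery are omitted.

/* One priority level's deque: the owner pushes/pops at bottom, thieves CAS on top. */
#include <cstddef>

#define DEQUE_CAP 4096

typedef struct {
    void         *buf[DEQUE_CAP];
    volatile long top;            /* next slot a thief will try to claim */
    volatile long bottom;         /* next free slot for the owner */
} pdeque_t;

/* Owner-only push. */
static int pdeque_push(pdeque_t *q, void *task) {
    long b = q->bottom, t = q->top;
    if (b - t >= DEQUE_CAP) return 0;        /* full */
    q->buf[b % DEQUE_CAP] = task;
    __sync_synchronize();                    /* publish the task before exposing it */
    q->bottom = b + 1;
    return 1;
}

/* Owner-only pop from the bottom end; races with thieves only on the last task. */
static void *pdeque_pop(pdeque_t *q) {
    long b = q->bottom - 1;
    q->bottom = b;
    __sync_synchronize();                    /* order the bottom update before reading top */
    long t = q->top;
    if (b < t) {                             /* deque was already empty */
        q->bottom = t;
        return NULL;
    }
    void *task = q->buf[b % DEQUE_CAP];
    if (b > t) return task;                  /* more than one task left: no race possible */
    /* Exactly one task left: settle the race with thieves via CAS on top. */
    if (!__sync_bool_compare_and_swap(&q->top, t, t + 1))
        task = NULL;                         /* a thief won the race */
    q->bottom = t + 1;
    return task;
}

/* Called by remote processors. */
static void *pdeque_steal(pdeque_t *q) {
    long t = q->top;
    __sync_synchronize();                    /* read top before bottom */
    long b = q->bottom;
    if (t >= b) return NULL;                 /* appears empty */
    void *task = q->buf[t % DEQUE_CAP];
    if (!__sync_bool_compare_and_swap(&q->top, t, t + 1))
        return NULL;                         /* lost to the owner or another thief */
    return task;
}

As in the description above, a NULL return from pdeque_steal is treated as "try elsewhere" rather than proof of emptiness, which is what allows occasional priority inversions in exchange for low retry overhead.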
Our project implements a scalable, lock-free work-stealing task system designed to efficiently distribute work across multiple processor cores. We evaluated performance by measuring the total execution time required to complete a fixed set of tasks with varying workloads and distributions on systems with varying core counts, specifically 4, 8, and 12 cores for most tests. The results demonstrate strong scalability when increasing from 4 to 8 cores, with execution times decreasing by more than half. This indicates excellent parallel efficiency, where the benefits of parallelizing work significantly outweigh the overhead of synchronization and task stealing. For smaller task counts, however, the performance gains diminish, while larger workloads on higher core counts continue to show good improvement with comparatively low overhead. As the system scales, it transitions from being computation-bound to overhead-bound if the workload size does not grow proportionally. Our findings highlight the importance of balancing task granularity, workload size, and core count when designing scalable parallel systems. We also benchmarked all of our tests against a single lock-free deque that is not priority-aware; our results show that the overhead of our solution is 15-25 percent across workloads.