Processors with hundreds of threads of execution and GPUs with thousands of cores are among the state of the art in high-end computing systems. This transition to many-core computing has required the community to develop new algorithms that overcome significant latency bottlenecks through massive concurrency. Implementing efficient parallel runtimes that can scale up to hundreds of threads with extremely fine-grained tasks (less than 100 microseconds) remains a challenge. We propose XQueue, a novel lockless concurrent queueing system that can scale up to hundreds of threads. We integrate XQueue into LLVM OpenMP and implement X-OpenMP, a library for lightweight tasking on modern many-core systems with hundreds of cores. We show that it is possible to implement a parallel execution model using lockless techniques that enables applications to scale strongly on many-core architectures. While the fork-join model is suitable for on-node parallelism, its joins and synchronization induce artificial dependencies, which can lead to underutilization of resources. Data-flow-based parallelism is crucial to overcoming the limitations of fork-join parallelism because it specifies dependencies at a finer granularity. It is also crucial for parallel runtime systems to support heterogeneous platforms in order to better utilize the hardware resources available in modern supercomputers. Existing parallel programming environments that support distributed memory either discover the entire DAG on all processes, which limits scalability, or introduce explicit communication, which increases programming complexity. We implement Template Task Graph (TTG), a novel programming model, and its C++ implementation, by marrying the ideas of control-flow and data-flow graph programming.
TTG can address the issues of performance portability without sacrificing scalability or programmability by providing higher-level abstractions than those conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution, as well as data and resources, efficiently. The TTG implementation currently supports distributed-memory execution over two task runtimes, PaRSEC and MADNESS.