- OmpOptiMem: Optimized Memory Movement for Heterogeneous Computing. Prithayan Barua, Jisheng Zhao, and Santosh Pande. To appear in Euro-Par 2020, August 2020. [ bib ]
- BlankIt Library Debloating: Getting What You Want Instead of Cutting What You Don’t. Chris Porter, Girish Mururu, Prithayan Barua, and Santosh Pande. To appear in PLDI 2020, June 2020. [ bib ]
OMPSan: Static Verification of OpenMP's Data Mapping Constructs [Best
Prithayan Barua, Jun Shirako, Whitney Tsang, Jeeva Paudel, Wang Chen,
and Vivek Sarkar.
In Xing Fan, Bronis R. de Supinski, Oliver Sinnen, and Nasser
Giacaman, editors, OpenMP: Conquering the Full Hardware Spectrum, pages
3--18, Cham, 2019. Springer International Publishing.
[ bib ]
OpenMP offers directives for offloading computations from CPU hosts to accelerator devices such as GPUs. A key underlying challenge is efficiently managing the movement of data between the host and the accelerator. User experience has shown that memory management in OpenMP programs with offloading capabilities is non-trivial and error-prone.
T2S-Tensor: Productively Generating High-Performance Spatial Hardware for
Dense Tensor Computations.
N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao,
Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen,
G. Lowney, A. Herr, C. Hughes, T. Mattson, and P. Dubey.
In 2019 IEEE 27th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 181--189, April
[ bib ]
We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels, MTTKRP, TTM, and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.
Cost-driven Thread Coarsening for GPU Kernels.
Prithayan Barua, Jun Shirako, and Vivek Sarkar.
In Proceedings of the 27th International Conference on Parallel
Architectures and Compilation Techniques, PACT '18, pages 32:1--32:14, New
York, NY, USA, 2018. ACM.
[ bib ]
Directive-based programming models like OpenACC provide a higher-level abstraction and a low-overhead approach to porting existing applications to GPGPUs and other heterogeneous HPC hardware. Such programming models widen the design space that the compiler can explore to exploit specific features of different architectures. We observed that traditional applications designed for latency-optimized, out-of-order pipelined CPUs do not efficiently exploit the throughput-optimized, in-order pipelined GPU architecture. In this paper we develop a model to estimate the memory throughput of a given application. We then use the loop interleave transformation to improve the memory bandwidth utilization of a given kernel.
We developed a heuristic to estimate the optimal loop interleave factor and implemented it in the OpenARC compiler for OpenACC. We evaluated our approach on over 216 kernels and achieved a geometric-mean speedup of 1.32×.
Our compiler optimization aims to provide the right balance between performance, portability and productivity.
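As a rough illustration of what a coarsening or interleave factor means, the sketch below models GPU threads as loop iterations in plain Python. It is not the OpenARC implementation or the paper's cost model: the kernel (`saxpy_baseline`, `saxpy_coarsened`) and the strided assignment of elements to threads are assumptions chosen only to show that each logical thread takes on `factor` elements while neighbouring threads still touch neighbouring elements.

```python
def saxpy_baseline(a, x, y):
    """One logical thread per element: thread i computes out[i]."""
    n = len(x)
    out = [0.0] * n
    for tid in range(n):  # each iteration models one GPU thread
        out[tid] = a * x[tid] + y[tid]
    return out


def saxpy_coarsened(a, x, y, factor):
    """Each logical thread handles up to `factor` elements, interleaved by stride."""
    n = len(x)
    num_threads = (n + factor - 1) // factor  # ceil(n / factor)
    out = [0.0] * n
    for tid in range(num_threads):
        # Thread tid touches tid, tid + num_threads, tid + 2 * num_threads, ...
        # so at each step k, adjacent threads access adjacent elements.
        for k in range(factor):
            i = tid + k * num_threads
            if i < n:  # guard the tail when n is not a multiple of factor
                out[i] = a * x[i] + y[i]
    return out
```

Both functions compute the same result; only the mapping of work to threads changes, which is the degree of freedom a factor-selection heuristic would tune.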
Binary Debloating for Security via Demand Driven Loading.
Girish Mururu, Chris Porter, Prithayan Barua, and Santosh Pande.
[ bib ]
Modern software systems heavily use C/C++-based libraries. Because of the weak memory model of C/C++, libraries may suffer from vulnerabilities which can expose the applications to potential attacks. For example, a very large number of return-oriented programming gadgets exist in glibc that allow stitching together semantically valid but malicious Turing-complete programs. In spite of significant advances in attack detection and mitigation, full defense is unrealistic against an ever-growing set of possibilities for generating such malicious programs. In this work, we create a defense mechanism by debloating libraries to reduce the dynamic functions linked so that the possibilities of constructing malicious programs diminish significantly. The key idea is to locate each library call site within an application, and in each case to load only the set of library functions that will be used at that call site. This approach of demand-driven loading relies on an input-aware oracle that predicts a near-exact set of library functions needed at a given call site during the execution. The predicted functions are loaded just in time, and the complete call chain (of function bodies) inside the library is purged after returning from the library call back into the application. We present a decision-tree-based predictor, which acts as an oracle, and an optimized runtime system, which works directly with library binaries like GNU libc and libstdc++. We show that on average, the proposed scheme cuts the exposed code surface of libraries by 97.2%, reduces ROP gadgets present in linked libraries by 97.9%, achieves a prediction accuracy in most cases of at least 97%, and adds a small runtime overhead of 18% on all libraries (16% for glibc, 2% for others) across all benchmarks of SPEC 2006, suggesting this scheme is practical.
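The load-predict-purge cycle described above can be modelled in a few lines of Python. This is a toy sketch, not the paper's runtime: the class `DemandDrivenLibrary`, the call-site label `"main:12"`, the three stand-in library functions, and the lookup-table predictor (used here in place of the decision-tree oracle) are all hypothetical.

```python
class DemandDrivenLibrary:
    """Toy model: library functions are callable only while 'loaded'."""

    def __init__(self, functions, predictor):
        self._functions = dict(functions)  # name -> callable (the full library)
        self._predictor = predictor        # (call_site, args) -> set of names
        self.loaded = set()                # names currently exposed

    def _invoke(self, name, *args):
        if name not in self.loaded:
            raise RuntimeError(f"{name} is not loaded")
        return self._functions[name](self, *args)

    def call(self, call_site, name, *args):
        # Load only what the predictor expects this call site to need.
        self.loaded = set(self._predictor(call_site, args))
        try:
            return self._invoke(name, *args)
        finally:
            self.loaded.clear()  # purge the whole call chain on return


# Hypothetical stand-ins for library functions.
def lib_strlen(lib, s):
    return len(s)


def lib_puts(lib, s):
    return lib._invoke("strlen", s) + 1  # calls into another library function


def lib_system(lib, cmd):
    return "executed " + cmd


def table_predictor(call_site, args):
    # Assumption: a fixed table replaces the input-aware oracle.
    table = {"main:12": {"puts", "strlen"}}
    return table.get(call_site, set())


lib = DemandDrivenLibrary(
    {"strlen": lib_strlen, "puts": lib_puts, "system": lib_system},
    table_predictor,
)
```

In this model, `lib.call("main:12", "puts", "hi")` succeeds because both `puts` and its callee `strlen` were predicted, while reaching `system` from the same call site raises an error because it was never loaded.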
A Cryptosystem for Encryption and Decryption of Long Confidential
Debasis Giri, Prithayan Barua, P. D. Srivastava, and Biswapati Jana.
In Samir Kumar Bandyopadhyay, Wael Adi, Tai-hoon Kim, and Yang Xiao,
editors, Information Security and Assurance, pages 86--96, Berlin,
Heidelberg, 2010. Springer Berlin Heidelberg.
[ bib ]
In this paper, we propose a cryptosystem which can encrypt and decrypt long (text) messages in an efficient manner. The proposed cryptosystem is a combination of symmetric-key and asymmetric-key cryptography, where asymmetric-key cryptography is used to transmit the secret key to an intended receiver and the sender/receiver encrypts/decrypts messages using that secret key. In 2002, Hwang et al. proposed a scheme for encrypting long messages. The main drawback of their scheme is its high computational overhead. Our proposed scheme is computationally more efficient than theirs. Our scheme is a block cipher: long messages are broken into fixed-length plaintext blocks for encryption. It supports parallel computation, since encryption/decryption of all the blocks of plaintext/ciphertext are independent and thus can be carried out simultaneously. In addition, our scheme retains the same security level as their scheme.
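The two structural ideas in this abstract, wrapping a secret key with an asymmetric scheme and processing message blocks independently, can be sketched with a toy example. This is not the scheme from the paper and is not secure: the textbook RSA parameters (p = 61, q = 53), the SHA-256-derived keystream, and all function names are assumptions made purely for illustration. The last block may also be shorter than the others, unlike the fixed-length blocks the paper describes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 32  # bytes; equals the SHA-256 digest length used as keystream

# Textbook RSA with tiny primes (p=61, q=53): illustration only, not secure.
N, E, D = 3233, 17, 2753


def wrap_key(secret):
    """Sender side: protect the shared secret with the public key (E, N)."""
    return pow(secret, E, N)


def unwrap_key(wrapped):
    """Receiver side: recover the shared secret with the private key (D, N)."""
    return pow(wrapped, D, N)


def _process_block(secret, index, block):
    # The keystream depends only on the secret and the block index,
    # so no block needs the result of any other block.
    seed = secret.to_bytes(2, "big") + index.to_bytes(8, "big")
    keystream = hashlib.sha256(seed).digest()
    return bytes(b ^ k for b, k in zip(block, keystream))


def process_message(secret, data):
    """XOR is its own inverse, so this both encrypts and decrypts."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor() as pool:
        # Blocks are handed to worker threads; map() preserves their order.
        out = list(pool.map(
            lambda item: _process_block(secret, item[0], item[1]),
            enumerate(blocks),
        ))
    return b"".join(out)
```

A sender would transmit `wrap_key(secret)` together with `process_message(secret, plaintext)`; the receiver calls `unwrap_key` and then `process_message` again with the recovered secret. The secret must be a non-negative integer smaller than `N` for this toy to work.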
- (2020) Working with a Georgia Tech team to develop an emergency ventilator for the COVID-19 crisis. I helped design and program the control circuit, which provides closed-loop feedback to control various parameters of the ventilator and makes our design less dependent on health care professionals. News Article
- (2019 Summer) Intern with Xinmin Tian, GPU offloading Compiler Team at Intel Corporation
- (2016–ongoing) Research assistant at Georgia Tech, PhD student with Vivek Sarkar
- (2017 Summer) Intern with Memory Solutions group at Samsung Semiconductor
- (2015–2016) GPU Compiler Architect with VOLTA architecture group at Nvidia
- (2011–2015) R&D Engineer with Synphony C Compiler, a High Level Synthesis compiler, at Synopsys
- The Bird Watcher. flickr album
- Website template: Thanks to Tiago Cogumbreiro.