Works (108)

Updated: April 4th, 2024 14:54

2023 journal article

An Intelligent Framework for Oversubscription Management in CPU-GPU Unified Memory

JOURNAL OF GRID COMPUTING, 21(1).

By: X. Long*, X. Gong*, B. Zhang* & H. Zhou n

author keywords: Discrete CPU-GPU system; Unified virtual memory; Oversubscription; Deep learning
TL;DR: This paper proposes a novel framework for UVM oversubscription management in discrete CPU-GPU systems that consists of an access pattern classifier followed by a pattern-specific transformer-based model using a novel loss function aiming to reduce page thrashing. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: March 20, 2023

2023 journal article

Deep learning based data prefetching in CPU-GPU unified virtual memory

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 174, 19–31.

By: X. Long*, X. Gong*, B. Zhang* & H. Zhou n

author keywords: Data prefetching; Graphics processing unit; Unified virtual memory; Deep learning; Transformer
TL;DR: A novel approach for page prefetching for UVM through deep learning that outperforms the state-of-the-art UVM framework, improving the performance by 10.89%, improving the device memory page hit rate by 16.98%, and reducing the CPU-GPU interconnect traffic by 11.05%. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: May 1, 2023

2023 article

Enhancing Virtual Distillation with Circuit Cutting for Quantum Error Mitigation

2023 IEEE 41ST INTERNATIONAL CONFERENCE ON COMPUTER DESIGN, ICCD, pp. 94–101.

By: P. Li n, J. Liu*, H. Patil n, P. Hovland* & H. Zhou n

author keywords: Quantum Error Mitigation; Virtual Distillation; Quantum Circuit Cutting
TL;DR: This work proposes an error mitigation strategy that uses circuit-cutting technology to cut the entire circuit into fragments, which can reduce the noise accumulation and enhance the effectiveness of the virtual distillation technique. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: February 19, 2024

2023 article

PBVR: Physically Based Rendering in Virtual Reality

2023 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION, IISWC, pp. 77–86.

By: Y. Tozlu n & H. Zhou n

TL;DR: It is shown that a handful of renderpasses consume the most time and that readily available foveated rendering solutions, such as Variable Rate Shading, might not provide significant advantages, and that eye tracking can incur a significant overhead on the graphics processing unit (GPU). (via Semantic Scholar)
UN Sustainable Development Goal Categories
9. Industry, Innovation and Infrastructure (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: January 2, 2024

2023 article

Plutus: Bandwidth-Efficient Memory Security for GPUs

2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA, pp. 543–555.

By: R. Abdullah n, H. Zhou n & A. Awad n

TL;DR: This work proposes a novel design, Plutus, which enables low-overhead secure GPU memory and uses smaller block sizes for security metadata caches to optimize the number of security metadata memory requests. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: June 5, 2023

2023 article

SecPB: Architectures for Secure Non-Volatile Memory with Battery-Backed Persist Buffers

2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA, pp. 677–690.

By: A. Freij n, H. Zhou n & Y. Solihin*

TL;DR: This paper proposes secure persistent buffers (SecPB), a battery-backed persistent structure that moves the point of secure data persistency from the memory controller closer to the core and analyzes the metadata dependency chain required in securing PM to expose optimization opportunities. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: June 5, 2023

2022 journal article

A Survey of GPU Multitasking Methods Supported by Hardware Architecture

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 33(6), 1451–1463.

By: C. Zhao*, W. Gao*, F. Nie* & H. Zhou n

author keywords: Graphics processing units; Multitasking; Kernel; Hardware; Computer architecture; Registers; Task analysis; GPU multitasking; survey; hardware architecture; temporal multitasking; spatial multitasking; simultaneous multitasking (SMK)
TL;DR: The features of some commercial GPU architectures to support multitasking and the common metrics used for evaluating the performance of GPU multitasking methods are introduced, and the main problems to be solved for each type of hardware GPU multitasking method are illustrated. (via Semantic Scholar)
Sources: ORCID, Web Of Science, NC State University Libraries
Added: October 29, 2021

2022 article

Adaptive Security Support for Heterogeneous Memory on GPUs

2022 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA 2022), pp. 213–228.

By: S. Yuan n, A. Awad n, A. Yudha*, Y. Solihin* & H. Zhou n

author keywords: GPUs; secure memory; heterogeneous memory; encryption; integrity check; security metadata cache
TL;DR: The security guarantees used to defend against physical attacks are analyzed, and it is observed that a heterogeneous GPU memory system may not always need all the security mechanisms to achieve those guarantees. (via Semantic Scholar)
UN Sustainable Development Goal Categories
16. Peace, Justice and Strong Institutions (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 29, 2022

2022 conference paper

Exploiting Quantum Assertions for Error Mitigation and Quantum Program Debugging

2022 IEEE 40TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2022), 124–131.

By: P. Li n, J. Liu n, Y. Li* & H. Zhou n

Event: IEEE 40th International Conference on Computer Design (ICCD) at Olympic Valley, CA, USA on October 23-26, 2022

author keywords: quantum computing; error mitigation; debugging; assertion
TL;DR: This paper presents the development of quantum assertion schemes and shows how they are used for hardware error mitigation and software debugging, and shows that besides detecting program bugs, dynamic assertion circuits can mitigate noise effects via post-selection of the assertion results. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: March 20, 2023

2022 article

LITE: A Low-Cost Practical Inter-Operable GPU TEE

PROCEEDINGS OF THE 36TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, ICS 2022.

By: A. Yudha*, J. Meyer*, S. Yuan n, H. Zhou n & Y. Solihin*

author keywords: GPU TEE; software encryption; memory encryption; GPU enclave
TL;DR: This paper proposes a flexible GPU memory encryption design called LITE that relies on software memory encryption aided by small architecture support and shows that GPU applications can be adapted to the use of LITE encryption APIs without major changes. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: November 13, 2023

2022 article

Not All SWAPs Have the Same Cost: A Case for Optimization-Aware Qubit Routing

2022 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA 2022), pp. 709–725.

By: J. Liu n, P. Li n & H. Zhou n

author keywords: quantum computing; compiler optimization; qubit routing
TL;DR: NASSC (Not All Swaps have the Same Cost) is the first algorithm that considers the subsequent optimizations during the routing step, and optimization-aware qubit routing leads to better routing decisions and benefits subsequent optimizations. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 29, 2022

2021 article

Analyzing Secure Memory Architecture for GPUs

2021 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS 2021), pp. 59–69.

By: S. Yuan n, A. Yudha*, Y. Solihin* & H. Zhou n

author keywords: GPUs; security; secure memory; memory encryption; memory integrity; metadata cache
TL;DR: It is shown that the massive-threaded nature of GPUs makes them latency-tolerant and the performance impact due to the extra encryption/decryption latency is limited, and direct encryption can be a promising alternative for GPU secure memory. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 16, 2021

2021 article

PILOT: a Runtime System to Manage Multi-tenant GPU Unified Memory Footprint

2021 IEEE 28TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS (HIPC 2021), pp. 442–447.

By: J. Ravi n, T. Nguyen n, H. Zhou n & M. Becchi n

TL;DR: Three methods are proposed to transparently mitigate memory interference through kernel preemption and scheduling policies, which would enable new OS-managed scheduling policies to be implemented for GPU kernels to dynamically handle resource contention and offer consistent performance. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: May 2, 2022

2021 article

Relaxed Peephole Optimization: A Novel Compiler Optimization for Quantum Circuits

CGO '21: PROCEEDINGS OF THE 2021 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), pp. 301–314.

By: J. Liu n, L. Bello* & H. Zhou n

author keywords: quantum computing; peephole optimization
TL;DR: A novel quantum compiler optimization, named relaxed peephole optimization (RPO) for quantum computers, which leverages the single-qubit state information that can be determined statically by the compiler and extends the approach to optimize the quantum gates when some input qubits are in known pure states. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: July 26, 2021

2021 article

Systematic Approaches for Precise and Approximate Quantum State Runtime Assertion

2021 27TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA 2021), pp. 179–193.

By: J. Liu n & H. Zhou n

author keywords: quantum computing; runtime assertion
TL;DR: This work proposes two systematic approaches for dynamic quantum state assertion that can assert a much broader range of quantum states, including both pure states and mixed states, and introduces the idea of approximate quantum state assertion for cases where programmers have only limited knowledge of the quantum states. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: July 26, 2021

2020 journal article

Exploring Convolution Neural Network for Branch Prediction

IEEE Access, 8, 152008–152016.

By: Y. Mao n, H. Zhou n, X. Gui* & J. Shen n

author keywords: History; Neural networks; Machine learning; Convolution; Predictive models; Prediction algorithms; Correlation; Branch prediction; CNN; deep learning; VGG; ResNet
TL;DR: This paper treats branch prediction as a classification problem and employs both deep convolutional neural networks (CNNs), ranging from LeNet to ResNet-50, and deep belief network (DBN) for branch prediction, and analyzes the impact of the depth of CNNs on the misprediction rates. (via Semantic Scholar)
UN Sustainable Development Goal Categories
4. Quality Education (OpenAlex)
Source: ORCID
Added: August 27, 2020

2020 journal article

Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 112, 1093–1105.

By: C. Zhao*, W. Gao*, F. Nie*, F. Wang* & H. Zhou n

author keywords: GPU; Concurrent kernels; Warp scheduling; Cache blocking; Interference
TL;DR: A fair and cache blocking aware warp scheduling (FCBWS) approach is proposed to ameliorate the contention on the data cache and improve SMK on GPUs; it outperforms previous multitasking methods. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: September 28, 2020

2020 conference paper

Quantum Circuits for Dynamic Runtime Assertions in Quantum Computation

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 1017–1030.

By: J. Liu n, G. Byrd n & H. Zhou n

Contributors: J. Liu n, G. Byrd n & H. Zhou n

Event: Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems

author keywords: Quantum Computing; Runtime Assertion
Sources: Web Of Science, NC State University Libraries, ORCID
Added: May 8, 2020

2020 article

Reliability Modeling of NISQ-Era Quantum Computers

2020 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2020), pp. 94–105.

By: J. Liu n & H. Zhou n

author keywords: NISQ quantum computer; reliability model; neural network
TL;DR: This paper treats the NISQ quantum computer as a black box and derives a reliability estimation model using polynomial fitting and a shallow neural network, and proposes randomized benchmarks with random numbers of qubits and basic gates to generate a large data set for neural network training. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: June 10, 2021

2020 article

Scalable and Fast Lazy Persistency on GPUs

2020 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2020), pp. 252–263.

By: A. Yudha*, K. Kimura*, H. Zhou n & Y. Solihin*

TL;DR: This paper proposes mapping Lazy Persistency to GPUs and identifies the design space of such mapping, and proposes a hash table-less method that performs well on hundreds and thousands of threads, achieving persistency with nearly negligible slowdown for a variety of representative benchmarks. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: June 10, 2021

2019 journal article

Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 16(3).

By: Z. Lin n, H. Dai n, M. Mantor* & H. Zhou n

author keywords: GPGPU; TLP; bandwidth management; concurrent kernel execution
TL;DR: A coordinated approach for CTA combination and bandwidth partitioning that dynamically detects co-running kernels as latency sensitive or bandwidth intensive and allocates more CTA resources to latency-sensitive kernels and more NoC/DRAM bandwidth resources to NoC-/DRAM-intensive kernels. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: December 2, 2019

2019 conference paper

Exploring Memory Persistency Models for GPUs

28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 310–322.

By: Z. Lin n, M. Alshboul n, Y. Solihin* & H. Zhou n

Event: International Conference on Parallel Architectures and Compilation Techniques at Seattle, WA on September 21-25, 2019

TL;DR: This paper adapts, re-architects, and optimizes CPU persistency models for GPUs, designs a pragma-based compiler scheme for expressing persistency models for GPUs, and identifies that the thread hierarchy in GPUs offers intuitive scopes to form epochs and durable transactions. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 10, 2020

2019 conference paper

In-Place Zero-Space Memory Protection for CNN

In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32). San Mateo, CA: Morgan Kaufmann Publishers.

By: H. Guan, L. Ning, Z. Lin, X. Shen, H. Zhou & S. Lim

Ed(s): H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox & R. Garnett

Source: NC State University Libraries
Added: November 24, 2020

2019 journal article

Quantum Circuits for Dynamic Runtime Assertions in Quantum Computation

IEEE Computer Architecture Letters, 18(2), 111–114.

By: H. Zhou n & G. Byrd n

Contributors: H. Zhou n & G. Byrd n

author keywords: Quantum computing; assertions; quantum circuits; debugging; quantum error detection
TL;DR: This paper designs quantum circuits to assert classical states, entanglement, and superposition states through ancilla qubits, which are used to indirectly collect the information of the qubits of interest. (via Semantic Scholar)
Sources: Web Of Science, ORCID, NC State University Libraries, Crossref
Added: September 23, 2019

2019 article

Quantum Circuits for Dynamic Runtime Assertions in Quantum Computation

Liu, J., Byrd, G., & Zhou, H. (2019, December 9).

By: J. Liu, G. Byrd & H. Zhou*

Source: ORCID
Added: December 30, 2019

2019 article

Scatter-and-Gather Revisited: High-Performance Side-Channel-Resistant AES on GPUs

12TH WORKSHOP ON GENERAL PURPOSE PROCESSING USING GPUS (GPGPU 12), pp. 2–11.

By: Z. Lin n, U. Mathur n & H. Zhou n

TL;DR: This paper revisits the scatter-and-gather (SG) approach and makes a case for using this approach to implement table-based cryptographic algorithms on GPUs to achieve both high performance and strong resistance to side channel attacks. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: July 22, 2019

2018 conference paper

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

By: H. Dai n, Z. Lin n, C. Li n, C. Zhao*, F. Wang*, N. Zheng*, H. Zhou n

TL;DR: The proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries, ORCID
Added: September 22, 2019

2018 journal article

Developing Noise-Resistant Three-Dimensional Single Particle Tracking Using Deep Neural Networks

ANALYTICAL CHEMISTRY, 90(18), 10748–10757.

By: Y. Zhong n, C. Li n, H. Zhou n & G. Wang n

MeSH headings : Fluorescent Dyes / chemistry; Imaging, Three-Dimensional; Microscopy, Fluorescence; Neural Networks, Computer; Particle Size; Signal-To-Noise Ratio
TL;DR: This work tests deep neural networks (DNNs) in recognizing and differentiating very similar image patterns incurred in 3D SPT and shows that for high S/N images, both DNNs and the conventional correlation coefficient-based method perform well; however, when the S/N drops close to 1, conventional methods completely fail while DNNs show strong resistance to both artificial and experimental noises. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: October 16, 2018

2018 journal article

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 15(1).

By: Z. Lin n, M. Mantor* & H. Zhou n

author keywords: GPGPU; TLP; context switching; latency hiding
TL;DR: A novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between L1 D-caches and L2 partitions, shows that the interconnect bandwidth is a critical bound for GPU performance scalability. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2017 article

Developing Dynamic Profiling and Debugging Support in OpenCL for FPGAs

PROCEEDINGS OF THE 2017 54TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC).

By: A. Verma*, H. Zhou n, S. Booth*, R. King*, J. Coole*, A. Keep*, J. Marshall*, W. Feng*

author keywords: OpenCL; FPGA; Debugging; Profiling; Framework; Code Patterns
TL;DR: This paper presents efforts in developing dynamic profiling and debugging support in OpenCL for FPGAs, and proposes primitive code patterns, including a timestamp and an event-ordering function, and develops a framework which can be plugged easily into OpenCL kernels, to dynamically collect and process run-time information. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2017 conference paper

EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 3–16.

By: G. Chen n, Y. Zhao n, X. Shen n & H. Zhou n

TL;DR: EffiSha is presented, a pure software framework that enables preemptive scheduling of GPU kernels with very low overhead, and demonstrates significant potential for reducing the average turnaround time and improving the system overall throughput of programs that time share a modern GPU. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: November 21, 2020

2017 report

Exploring deep neural networks for branch prediction

[Technical Report]. https://people.engr.ncsu.edu/hzhou/CNN_DBN_zhou_2017.pdf

By: Y. Mao, H. Zhou & X. Gui

Source: NC State University Libraries
Added: November 21, 2020

2017 journal article

Methylation specific targeting of a chromatin remodeling complex from sponges to humans

SCIENTIFIC REPORTS, 7.

By: J. Cramer*, D. Pohlmann*, F. Gomez*, L. Mark*, B. Kornegay*, C. Hall*, E. Siraliev-Perez*, N. Walavalkar* ...

MeSH headings : Amino Acid Sequence; Animals; Chromatin Assembly and Disassembly; DNA / chemistry; DNA / metabolism; DNA Methylation; DNA-Binding Proteins / chemistry; DNA-Binding Proteins / genetics; DNA-Binding Proteins / metabolism; Gene Knockdown Techniques; Humans; Models, Molecular; Nucleic Acid Conformation; Phenotype; Porifera / genetics; Porifera / metabolism; Protein Conformation
TL;DR: A model in which the MBD2/3 methylation-dependent functional role emerged with the earliest multicellular organisms and has been maintained to varying degrees across animal evolution is supported. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2017 conference paper

The Demand for a Sound Baseline in GPU Memory Architecture Research

14th Annual Workshop on Duplicating, Deconstructing and Debunking (WDDD). Presented at the Workshop on Duplicating, Deconstructing and Debunking, Toronto, Canada. https://people.engr.ncsu.edu/hzhou/Hongwen_WDDD2017.pdf

By: H. Dai, C. Li, Z. Lin & H. Zhou

Event: Workshop on Duplicating, Deconstructing and Debunking at Toronto, Canada on June 25, 2017

Source: NC State University Libraries
Added: November 21, 2020

2016 journal article

A Cross-Platform SpMV Framework on Many-Core Architectures

ACM Transactions on Architecture and Code Optimization, 13(4), 1–25.

By: Y. Zhang*, S. Li*, S. Yan* & H. Zhou n

author keywords: SpMV; segmented scan; BCCOO; OpenCL; CUDA; GPU; Intel MIC; parallel algorithms
TL;DR: A highly efficient matrix-based segmented sum/scan algorithm for SpMV, which eliminates global synchronization, is proposed, and an autotuning framework to choose optimization parameters is introduced. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2016 article

A Model-Driven Approach to Warp/Thread-Block Level GPU Cache Bypassing

2016 ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC).

By: H. Dai n, C. Li n, H. Zhou n, S. Gupta*, C. Kartsaklis* & M. Mantor*

TL;DR: This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks to bypass the cache, and designs a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2016 conference paper

Enabling efficient preemption for SIMT architectures with lightweight context switching

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 898–908.

By: Z. Lin n, L. Nyland* & H. Zhou

TL;DR: Three complementary ways to reduce and compress the architectural states to achieve lightweight context switching on SIMT processors with compiler and hardware co-design are proposed. (via Semantic Scholar)
Source: NC State University Libraries
Added: August 6, 2018

2016 conference paper

OpenCL-based erasure coding on heterogeneous architectures

IEEE International Conference on Application-Specific Systems, 7, 33–40.

By: G. Chen n, H. Zhou, X. Shen n, J. Gahm*, N. Venkat*, S. Booth*, J. Marshall*

TL;DR: This work exploits state-of-art heterogeneous architectures, including GPUs, APUs, and FPGAs, to accelerate erasure coding using the OpenCL framework and proposes code optimizations for each target architecture given their different hardware characteristics. (via Semantic Scholar)
Sources: NC State University Libraries, ORCID
Added: August 6, 2018

2016 conference paper

Optimizing memory efficiency for deep convolutional neural networks on GPUs

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 633–644.

By: C. Li n, Y. Yang*, M. Feng*, S. Chakradhar* & H. Zhou

TL;DR: This work studies the memory efficiency of various CNN layers and reveals the performance implication from both data layouts and memory access patterns, which shows the universal effect of the proposed optimizations on both single layers and various networks. (via Semantic Scholar)
Source: NC State University Libraries
Added: August 6, 2018

2016 conference paper

Selective GPU Cache Bypassing for Un-Coalesced Loads

In X. Liao (Ed.), 22nd IEEE International Conference on Parallel and Distributed Systems : ICPADS 2016 : proceedings : 13-16 December 2016, Wuhan, Hubei, China.

Ed(s): X. Liao

Event: 22nd IEEE International Conference on Parallel and Distributed Systems at Wuhan, Hubei, China on December 13-16, 2016

TL;DR: A simple yet effective GPU cache Bypassing scheme for Un-Coalesced Loads (BUCL), which achieves 36% and 5% performance improvement over the baseline GPU for memory un-coalesced and memory coherent benchmarks, and also significantly outperforms prior GPU cache bypassing and warp throttling schemes. (via Semantic Scholar)
Source: NC State University Libraries
Added: January 30, 2021

2016 conference paper

Tuning stencil codes in OpenCL for FPGAs

Proceedings of the 34th IEEE International Conference on Computer Design (ICCD), 249–256.

By: Q. Jia n & H. Zhou

TL;DR: This paper explores OpenCL code optimizations for stencil computations on FPGAs in both the Single-Task and NDRange modes and proposes tuning processes that can achieve up to two orders of magnitude performance improvement over the naïve kernels. (via Semantic Scholar)
Source: NC State University Libraries
Added: August 6, 2018

2015 conference paper

An Optimized AMPM-based Prefetcher Coupled with Configurable Cache Line Sizing

JILP Workshop on Computer Architecture Competitions (JWAC): 2nd Data Prefetching Championship (DPC2).

By: Q. Jia, M. Padia, K. Amboju & H. Zhou

Source: NC State University Libraries
Added: January 30, 2021

2015 conference paper

Analyzing graphics processor unit (GPU) instruction set architectures

IEEE International Symposium on Performance Analysis of Systems and, 155–156.

By: K. Mayank n, H. Dai n, J. Wei* & H. Zhou

TL;DR: There are few studies and analyses on GPU instruction set architectures (ISAs), although it is well known that the ISA is a fundamental design issue of all modern processors including GPUs. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Source: NC State University Libraries
Added: August 6, 2018

2015 conference paper

Automatic data placement into GPU on-chip memory resources

2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 23–33.

By: C. Li n, Y. Yang*, Z. Lin n & H. Zhou

TL;DR: This paper focuses on programs that have already been reasonably optimized, either manually by programmers or automatically by compiler tools, and proposes compiler algorithms that refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. (via Semantic Scholar)
Source: NC State University Libraries
Added: August 6, 2018

2015 journal article

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 30(1), 3–19.

By: Y. Yang, C. Li n & H. Zhou n

author keywords: GPGPU; nested parallelism; compiler; local memory
TL;DR: This paper first studies a set of GPGPU benchmarks that contain parallel loops, highlights that the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead, and presents a solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2015 conference paper

Locality-Driven Dynamic GPU Cache Bypassing

ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, 61–77.

By: C. Li n, S. Song*, H. Dai n, A. Sidelnik*, S. Hari* & H. Zhou n

Event: 29th International conference on supercomputing at Newport Beach/Irvine, CA on June 8-11, 2015

author keywords: GPU architecture Optimization; Locality; Cache Bypassing
TL;DR: This paper presents a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: January 30, 2021

2015 conference paper

Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture

Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 121–130.

By: P. Xiang n, Y. Yang*, M. Mantor, N. Rubin* & H. Zhou n

Event: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing at Shenzhen, China on May 4-7, 2015

author keywords: GPGPU; Heterogeneous; ILP; Energy
TL;DR: Given the workload-dependent impact from ILP, a heterogeneous GPGPU architecture is proposed, consisting of both the cores designed for high TLP and those customized with ILP techniques, which achieves high throughput as well as high energy and area-efficiency compared to homogeneous designs. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: February 6, 2021

2015 article

Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing

2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP), pp. 150–159.

By: S. Gupta* & H. Zhou n

author keywords: shared last level cache; cache partitioning; spatial locality; cache management; high bandwidth memory
TL;DR: This work highlights that exploiting spatial locality enables much more effective cache sharing and proposes a simple yet effective mechanism to measure both spatial and temporal locality at run-time, which significantly outperforms the existing approaches. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2014 conference paper

A Case for a Flexible Scalar Unit in SIMT Architecture

Proceedings of 2014 IEEE 28th International Parallel and Distributed Processing Symposium. Presented at the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ.

By: Y. Yang*, P. Xiang n, M. Mantor*, N. Rubin*, L. Hsu*, Q. Dong*, H. Zhou n

Event: 2014 IEEE 28th International Parallel and Distributed Processing Symposium at Phoenix, AZ on May 19-23, 2014

TL;DR: The proposed flexible scalar unit is extended so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream, and results show that significant performance gains can be achieved compared to the state-of-art SIMT style processing. (via Semantic Scholar)
Source: NC State University Libraries
Added: February 6, 2021

2014 chapter

A Highly Efficient FFT Using Shared-Memory Multiplexing

In Numerical Computations with GPUs (pp. 363–377).

By: Y. Yang n & H. Zhou n

TL;DR: This work proposes the pure software approaches to enable multiple thread blocks to time-multiplex shared memory so as to increase the number of concurrent thread blocks/workgroups on each streaming multiprocessor/compute unit. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2014 journal article

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

ACM SIGPLAN NOTICES, 49(8), 93–105.

By: Y. Yang & H. Zhou n

author keywords: Performance; Design; Experimentation; Languages; GPGPU; nested parallelism; compiler; local memory
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2014 conference paper

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs

IEEE international symposium on performance analysis of systems and, 231–241.

By: C. Li, Y. Yang, H. Dai, S. Yan, F. Mueller & H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2014 conference paper

Warp-level divergence in GPUs: Characterization, impact, and mitigation

International symposium on high-performance computer, 284–295.

By: P. Xiang n, Y. Yang & H. Zhou

TL;DR: This paper proposes to allocate and release resources at the warp level, which effectively increase the number of active warps without actually increasing the size of critical resources, and presents its lightweight architectural support for the proposed warp-level resource management. (via Semantic Scholar)
Source: NC State University Libraries
Added: August 6, 2018

2014 conference paper

yaSpM: Yet Another SpMV Framework on GPUs

Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 49(8), 107–118.

By: S. Yan*, C. Li n, Y. Zhang* & H. Zhou n

Event: 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming at Orlando, FL

author keywords: SpMV; Segmented Scan; BCCOO; OpenCL; CUDA; GPU; Parallel algorithms
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2013 article

Adaptive Cache Bypassing for Inclusive Last Level Caches

IEEE 27TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2013), pp. 1243–1253.

By: S. Gupta n, H. Gao* & H. Zhou n

author keywords: Last level cache; cache bypassing; cache replacement policy; inclusion property
TL;DR: The key insight is that the lifetime of a bypassed line, assuming a well-designed bypassing algorithm, should be short in upper level caches and is most likely dead when its tag is evicted from the bypass buffer. (via Semantic Scholar)
UN Sustainable Development Goal Categories
10. Reduced Inequalities (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2013 journal article

Analyzing locality of memory references in GPU architectures

MSPC '13: Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, 6.

By: S. Gupta n, P. Xiang n & H. Zhou n

Event: ACM SIGPLAN Workshop on Memory Systems Performance and Correctness at Seattle, WA

TL;DR: This paper investigates the locality of reference at different cache levels in the memory hierarchy of GPGPU kernels and shows that the locality analysis accurately captures some interesting and counter-intuitive behavior of the memory accesses. (via Semantic Scholar)
Source: NC State University Libraries
Added: February 6, 2021

2013 journal article

Architecting against Software Cache-Based Side-Channel Attacks

IEEE TRANSACTIONS ON COMPUTERS, 62(7), 1276–1288.

By: J. Kong*, O. Aciicmez*, J. Seifert* & H. Zhou n

author keywords: Cache memories; private/public key cryptosystems; side-channel attacks; architectural support for computer security
TL;DR: This paper proposes hardware-software integrated approaches to defend against software cache-based attacks comprehensively and shows that the proposed schemes not only provide strong security protection but also incur small performance overhead. (via Semantic Scholar)
UN Sustainable Development Goal Categories
16. Peace, Justice and Strong Institutions (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2013 conference paper

Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement

Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, 433–442.

By: P. Xiang n, Y. Yang*, M. Mantor*, N. Rubin*, L. Hsu* & H. Zhou n

Event: 27th International ACM Conference on International Conference on Supercomputing at Eugene, Oregon

TL;DR: This paper shows that besides redundancy within a uniform vector, different vectors can also have the identical values, and proposes detailed architecture designs to exploit both types of redundancy. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Source: NC State University Libraries
Added: February 6, 2021

2013 journal article

Locality principle revisited: A probability-based quantitative approach

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 73(7), 1011–1027.

By: S. Gupta n, P. Xiang n, Y. Yang n & H. Zhou n

author keywords: Locality of references; Probability; Memory hierarchy; Last level cache; Cache replacement policy; Data prefetching; Locality optimizations
TL;DR: This paper revisits the fundamental concept of the locality of references and proposes to quantify it as a conditional probability: in an address stream, how likely the same address or an address within its neighborhood will be accessed in the near future is defined. (via Semantic Scholar)
UN Sustainable Development Goal Categories
11. Sustainable Cities and Communities (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2013 journal article

The Implementation of a High Performance GPGPU Compiler

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 41(6), 768–781.

By: Y. Yang n & H. Zhou n

author keywords: GPU; Compiler; Optimization; Vectorization; OpenCL
TL;DR: This paper presents the experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework, which achieves very high performance, either superior or very close to highly fine-tuned libraries. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2012 journal article

A Unified Optimizing Compiler Framework for Different GPGPU Architectures

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 9(2).

By: Y. Yang n, P. Xiang n, J. Kong*, M. Mantor* & H. Zhou n

author keywords: Performance; Experimentation; Languages; GPGPU; OpenCL; CUDA; CUBLAS; GPU Computing
TL;DR: A novel optimizing compiler for general purpose computation on graphics processing units (GPGPU) that addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2012 conference paper

CPU-assisted GPGPU on fused CPU-GPU architectures

International symposium on high-performance computer, 103–114.

By: Y. Yang, P. Xiang, M. Mantor & H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2012 conference paper

Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs

2012 41st International Conference on Parallel Processing. Presented at the 2012 41st International Conference on Parallel Processing (ICPP).

By: Y. Yang n, P. Xiang n, M. Mantor* & H. Zhou n

Event: 2012 41st International Conference on Parallel Processing (ICPP)

TL;DR: This work conducts an empirical study on GPGPU programs from ten open-source projects, finding various performance 'bugs', i.e., code segments leading to inefficient use of GPU hardware. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2012 conference paper

Locality principle revisited: A probability-based quantitative approach

2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS), 995–1009.

By: S. Gupta n, P. Xiang n, Y. Yang n & H. Zhou

UN Sustainable Development Goal Categories
11. Sustainable Cities and Communities (Web of Science; OpenAlex)
Source: NC State University Libraries
Added: August 6, 2018

2012 conference paper

Shared Memory Multiplexing: A Novel Way to Improve GPGPU Throughput

Proceedings of the 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT). Presented at the 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, USA.

By: Y. Yang, P. Xiang, M. Mantor, N. Rubin & H. Zhou

Event: 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT) at Minneapolis, MN, USA on September 19-23, 2012

Source: NC State University Libraries
Added: February 7, 2021

2011 journal article

Combining Local and Global History for High Performance Data Prefetching

Journal of Instruction-Level Parallelism (JILP), 13, 1–14.

By: M. Dimitrov & H. Zhou

Event: Data Prefetching Championship (DPC-1) held with 15th International Symposium on High Performance Computer Architecture (HPCA-15) at Raleigh, NC on February 14-18, 2009

Source: NC State University Libraries
Added: August 6, 2018

2011 conference paper

Developing a High Performance GPGPU Compiler using Cetus

Proceedings of the Cetus Users and Compiler Infrastructure Workshop, International Conference on Parallel Architectures and Compilation Techniques (PACT’11). Presented at the International Conference on Parallel Architectures and Compilation Techniques (PACT’11).

By: Y. Yang & H. Zhou

Event: International Conference on Parallel Architectures and Compilation Techniques (PACT’11)

Source: NC State University Libraries
Added: February 7, 2021

2011 journal article

Exploring Correlation for Indirect Branch Prediction

2nd JILP Workshop on Computer Architecture Competitions (JWAC-2): Championship Branch Prediction. Presented at the 2nd JILP Workshop on Computer Architecture Competitions (JWAC-2): Championship Branch Prediction, held with ISCA-38.

By: N. Bhansali, C. Panirwla & H. Zhou

Event: 2nd JILP Workshop on Computer Architecture Competitions (JWAC-2): Championship Branch Prediction, held with ISCA-38

Source: NC State University Libraries
Added: February 7, 2021

2011 conference paper

Time-Ordered Event Traces: A New Debugging Primitive for Concurrency Bugs

2011 IEEE International Parallel & Distributed Processing Symposium. Presented at the 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS).

By: M. Dimitrov* & H. Zhou n

Event: 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)

TL;DR: The results show that exposing the time-ordered information, function calls/returns in particular, to the programmer is highly beneficial for diagnosing the root causes of these bugs. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2010 article

A GPGPU Compiler for Memory Optimization and Parallelism Management

Yang, Y., Xiang, P., Kong, J., & Zhou, H. (2010, June). ACM SIGPLAN NOTICES, Vol. 45, pp. 86–97.

By: Y. Yang n, P. Xiang*, J. Kong* & H. Zhou n

author keywords: Performance; Experimentation; Languages; GPGPU; Compiler
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2010 conference paper

Accelerating MATLAB Image Processing Toolbox Functions on GPUs

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 75–85.

By: J. Kong*, M. Dimitrov*, Y. Yang n, J. Liyanage*, L. Cao, J. Staples*, M. Mantor, H. Zhou n

Event: 3rd Workshop on General-Purpose Computation on Graphics Processing Units at Pittsburgh, Pennsylvania, USA

TL;DR: This paper ported a dozen representative functions from IPT and, based on their inherent characteristics, grouped them into four categories: data independent, data sharing, algorithm dependent and data dependent, which reveals interesting insights on how to efficiently optimize the code for GPUs. (via Semantic Scholar)
Source: NC State University Libraries
Added: February 7, 2021

2010 article

An Optimizing Compiler for GPGPU Programs with Input-Data Sharing

Yang, Y., Xiang, P., Kong, J., & Zhou, H. (2010, May). ACM SIGPLAN NOTICES, Vol. 45, pp. 343–344.

By: Y. Yang n, P. Xiang*, J. Kong* & H. Zhou n

author keywords: Performance; Experimentation; Languages; GPGPU; Compiler
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2010 article

An Optimizing Compiler for GPGPU Programs with Input-Data Sharing

PPOPP 2010: PROCEEDINGS OF THE 2010 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, pp. 343–344.

By: Y. Yang n, P. Xiang*, J. Kong* & H. Zhou n

author keywords: GPGPU; Compiler
TL;DR: A novel compiler to optimize GPGPU programs is introduced, which takes a naive GPU kernel function, which is functionally correct but without any consideration for performance optimization, and generates optimized code. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2010 conference paper

Improving privacy and lifetime of PCM-based main memory

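2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). Presented at the 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

By: J. Kong* & H. Zhou n

Event: 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN)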

TL;DR: This paper first adopts counter-mode encryption for privacy protection and shows that encryption significantly reduces the effectiveness of some previously proposed wear-leveling techniques for PRAM, and proposes simple, yet effective extensions to the encryption scheme. (via Semantic Scholar)
UN Sustainable Development Goal Categories
16. Peace, Justice and Strong Institutions (OpenAlex)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2009 conference paper

Anomaly-based bug prediction, isolation, and validation

Proceeding of the 14th international conference on Architectural support for programming languages and operating systems - ASPLOS '09. Presented at the 14th international conference on Architectural support for programming languages and operating systems (ASPLOS '09).

By: M. Dimitrov* & H. Zhou*

Event: 14th international conference on Architectural support for programming languages and operating systems (ASPLOS '09)

TL;DR: Compared to state-of-art debugging techniques, the proposed approach pinpoints the defect locations more accurately and presents the user with a much smaller code set to analyze. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2009 conference paper

Hardware-software integrated approaches to defend against software cache-based side channel attacks

2009 IEEE 15th International Symposium on High Performance Computer Architecture. Presented at the HPCA - 15 2009. IEEE 15th International Symposium on High Performance Computer Architecture.

By: J. Kong*, O. Aciicmez*, J. Seifert* & H. Zhou*

Event: HPCA - 15 2009. IEEE 15th International Symposium on High Performance Computer Architecture

TL;DR: This paper proposes three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead and proposes novel software permutation to replace the random permutation hardware in the RPcache. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2009 conference paper

Understanding software approaches for GPGPU reliability

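Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-2. Presented at the 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2).

By: M. Dimitrov*, M. Mantor & H. Zhou*

Event: 2nd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-2)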

TL;DR: The findings, based on six commonly used applications, indicate that the benefits of complex software approaches are both application and architecture dependent, and it is argued that the cost is not justified to protect memories with ECC/parity bits. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2008 conference paper

Address-branch correlation: A novel locality for long-latency hard-to-predict branches

2008 IEEE 14th International Symposium on High Performance Computer Architecture. Presented at the 2008 IEEE 14th International Symposium on High Performance Computer Architecture (HPCA).

By: H. Gao*, Y. Ma*, M. Dimitrov* & H. Zhou*

Event: 2008 IEEE 14th International Symposium on High Performance Computer Architecture (HPCA)

TL;DR: A novel program locality that can be exploited to handle long-latency hard-to-predict branches by exploiting address-branch correlation and it is shown that certain memory-intensive benchmarks, especially those with heavy pointer chasing, exhibit this locality. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2008 conference paper

Deconstructing new cache designs for thwarting software cache-based side channel attacks

Proceedings of the 2nd ACM workshop on Computer security architectures - CSAW '08. Presented at the 2nd ACM workshop on Computer security architectures (CSAW '08).

By: J. Kong*, O. Aciicmez*, J. Seifert* & H. Zhou*

Event: 2nd ACM workshop on Computer security architectures (CSAW '08)

TL;DR: This paper analyzes two new cache designs to defeat cache-based side channel attacks by eliminating/obfuscating cache interferences and identifies significant vulnerabilities and shortcomings. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2007 journal article

Optimizing dual-core execution for power efficiency and transient-fault recovery

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 18(8), 1080–1093.

By: Y. Ma*, H. Gao*, M. Dimitrov* & H. Zhou*

author keywords: multiple data stream architectures; fault tolerance; low-power design
TL;DR: Experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can effectively achieve transient-fault tolerance or significant performance enhancement in a power/energy-efficient way. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

2007 journal article

PMPM: Prediction by combining multiple partial matches

Journal of Instruction-Level Parallelism, 9, 1–18.

By: H. Gao & H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2007 conference paper

Unified Architectural Support for Soft-Error Protection or Software Bug Detection

16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007). Presented at the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

By: M. Dimitrov* & H. Zhou*

Event: 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007)

Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2006 conference paper

Efficient Transient-Fault Tolerance for Multithreaded Processors Using Dual-Thread Execution

2006 International Conference on Computer Design. Presented at the 2006 International Conference on Computer Design.

By: Y. Ma* & H. Zhou*

Event: 2006 International Conference on Computer Design

TL;DR: DTE is derived from the recently proposed fault-tolerant dual-core execution (FTDCE) paradigm, in which two processor cores on a single chip perform redundant execution to improve both reliability and performance, and achieves full-coverage transient-fault tolerance. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2006 conference paper

Improving software security via runtime instruction-level taint checking

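Proceedings of the 1st workshop on Architectural and system support for improving software dependability - ASID '06. Presented at the 1st workshop on Architectural and system support for improving software dependability (ASID '06).

By: J. Kong*, C. Zou* & H. Zhou*

Event: 1st workshop on Architectural and system support for improving software dependability (ASID '06)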

TL;DR: This paper presents a generic instruction-level runtime taint checking architecture for handling non-control data attacks and demonstrates effective usages of the architecture to detect buffer overflow and format string attacks. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2006 conference paper

Locality-based Information Redundancy for Processor Reliability

2nd Workshop on Architectural Reliability (WAR-2) held in conjunction with 39th International Symposium on Microarchitecture (MICRO-39), 29–36.

By: M. Dimitrov & H. Zhou

Source: NC State University Libraries
Added: February 8, 2021

2006 conference paper

PMPM: Prediction by Combining Multiple Partial Matches

2nd Championship Branch Prediction (CBP-2) held with the 39th International Symposium on Microarchitecture (MICRO-39), 19–24.

By: H. Gao & H. Zhou

Source: NC State University Libraries
Added: February 8, 2021

2006 journal article

Using index functions to reduce conflict aliasing in branch prediction tables

IEEE Transactions on Computers, 55(8), 1057–1061.

By: Y. Ma, H. Gao & H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2005 journal article

A case for fault tolerance and performance enhancement using chip multi-processors

IEEE Computer Architecture Letters, 4, 1–4.

By: H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2005 journal article

Adaptive information processing: an effective way to improve perceptron branch predictors

Journal of Instruction-Level Parallelism, 7, 1–10.

By: H. Gao & H. Zhou

Source: NC State University Libraries
Added: August 6, 2018

2005 conference paper

Code size efficiency in global scheduling for ILP processors

Proceedings Sixth Annual Workshop on Interaction between Compilers and Computer Architectures. Presented at the Sixth Annual Workshop on Interaction between Compilers and Computer Architectures.

By: H. Zhou n & T. Conte n

Event: Sixth Annual Workshop on Interaction between Compilers and Computer Architectures

TL;DR: A quantitative measure of the code size efficiency at compile time for any code size related optimization, based on the efficiency of tail duplication, to derive a simple, yet robust threshold scheme finding it. (via Semantic Scholar)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2005 conference paper

Detecting global stride locality in value streams

30th Annual International Symposium on Computer Architecture, 2003. Proceedings. Presented at the ISCA 2003: 30th International Symposium on Computer Architecture.

By: H. Zhou n, J. Flanagan n & T. Conte n

Event: ISCA 2003: 30th International Symposium on Computer Architecture

TL;DR: The value delay issue in an out-of-order (OOO) execution pipeline model is studied and a new hybrid scheme is proposed to maximize the exploitation of the global stride locality. (via Semantic Scholar)
UN Sustainable Development Goal Categories
13. Climate Action (OpenAlex)
Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2005 conference paper

Dual-core execution: building a highly scalable single-thread instruction window

14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05). Presented at the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

By: H. Zhou*

Event: 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05)

Sources: Crossref, NC State University Libraries
Added: January 28, 2020

2005 journal article

Enhancing memory-level parallelism via recovery-free value prediction

IEEE Transactions on Computers, 54, 897–912.

By: H. Zhou & T. Conte n

Source: NC State University Libraries
Added: August 6, 2018

2004 conference paper

Adaptive Information Processing: An Effective Way to Improve Perceptron Branch Predictors

1st Championship Branch Prediction (CBP-1) held with the 37th International Symposium on Microarchitecture (MICRO-37).

By: H. Gao & H. Zhou

Source: NC State University Libraries
Added: February 8, 2021

2003 journal article

Adaptive mode control: A static-power-efficient cache design

ACM Transactions on Embedded Computing Systems, 2(3), 347–372.

By: H. Zhou, M. Toburen n, E. Rotenberg n & T. Conte n

UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Source: NC State University Libraries
Added: August 6, 2018

2003 report

Code size aware compilation for real-time applications

[Technical Report]. Computer Science Department, University of Central Florida.

By: H. Zhou

Source: NC State University Libraries
Added: February 8, 2021

2003 conference paper

Enhancing Memory Level Parallelism via Recovery-Free Value Prediction

The 2003 International Conference on Supercomputing (ICS'03), 326–335.

By: H. Zhou & T. Conte

Source: NC State University Libraries
Added: February 8, 2021

2003 report

Performance modeling of memory latency hiding techniques

[Technical Report]. Raleigh, NC: Department of Electrical and Computer Engineering, North Carolina State University.

By: H. Zhou & T. Conte

Source: NC State University Libraries
Added: February 20, 2021

2003 chapter

Tree Traversal Scheduling: A Global Instruction Scheduling Technique for VLIW/EPIC Processors

In Languages and Compilers for Parallel Computing (Vol. 2624, pp. 223–238).

By: H. Zhou n, M. Jennings n & T. Conte n

TL;DR: This paper presents a new global scheduling algorithm using treegions called Tree Traversal Scheduling (TTS), and considers them analogous to superblocks with the same amount of code expansion as the base treegion. (via Semantic Scholar)
Sources: Web Of Science, NC State University Libraries, Crossref
Added: August 6, 2018

2002 report

Using Performance Bounds to Guide Pre-scheduling Code Optimizations

[Technical Report]. Raleigh, NC: Department of Electrical and Computer Engineering, North Carolina State University.

By: H. Zhou & T. Conte

Source: NC State University Libraries
Added: February 20, 2021

2001 report

A Treegion-based Unified Approach to Speculation and Predication in Global Instruction Scheduling

[Technical Report]. Raleigh, NC: Department of Electrical and Computer Engineering, North Carolina State University.

By: M. Jennings, H. Zhou & T. Conte

Source: NC State University Libraries
Added: February 20, 2021

2001 report

A study of value speculative execution and mispeculation recovery in superscalar microprocessors

[Technical Report]. Raleigh, NC: Department of Electrical and Computer Engineering, North Carolina State University.

By: H. Zhou, C. Fu, E. Rotenberg & T. Conte

Source: NC State University Libraries
Added: February 20, 2021

2001 conference paper

Adaptive mode control: A static-power-efficient cache design

2001 International Conference on Parallel Architectures and Compilation Techniques: Proceedings: 8-12 September, 2001, Barcelona, Catalunya, Spain, 61–70.

By: H. Zhou, M. Toburen n, E. Rotenberg n & T. Conte n

TL;DR: Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches, and this work proposes applying sleep mode only to the data store and not the tag store. (via Semantic Scholar)
UN Sustainable Development Goal Categories
7. Affordable and Clean Energy (OpenAlex)
Source: NC State University Libraries
Added: August 6, 2018

2000 report

Adaptive Mode Control: A Low-Leakage Power-Efficient Cache Design

[Technical Report]. Raleigh, NC: Department of Electrical and Computer Engineering, North Carolina State University.

By: H. Zhou, M. Toburen, E. Rotenberg & T. Conte

Source: NC State University Libraries
Added: February 20, 2021

2000 journal article

Automatic IC orientation checks

Machine Vision and Applications, 12(3), 107–112.

By: A. Kassim*, H. Zhou & S. Ranganath

Source: NC State University Libraries
Added: August 6, 2018

1998 journal article

A fast algorithm for detecting die extrusion defects in IC packages

MACHINE VISION AND APPLICATIONS, 11(1), 37–41.

By: H. Zhou*, A. Kassim & S. Ranganath

author keywords: IC package inspection; die extrusion defects; linear feature extraction; feature enhancement
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018

1996 journal article

Test sequencing and diagnosis in electronic system with decision table

MICROELECTRONICS AND RELIABILITY, 36(9), 1167–1175.

By: H. Zhou*, L. Qu* & A. Li*

TL;DR: The algorithm for building optimal decision trees, which embody the solution for the test sequencing and diagnosis problem, is analyzed, and the conditional probabilities are included in the decision table, which is called a conditional decision table. (via Semantic Scholar)
UN Sustainable Development Goal Categories
16. Peace, Justice and Strong Institutions (OpenAlex)
Sources: Web Of Science, NC State University Libraries
Added: August 6, 2018
