About Me
I am a senior researcher in the System Group at Microsoft Research Asia (MSRA). I received my Ph.D. in Computer Science from Harbin Institute of Technology (HIT) in 2021 through a joint Ph.D. program with MSRA. Prior to that, I earned my B.E. in Computer Science from HIT in 2016. From 2015 to 2021, I was a long-term intern in MSRA's systems area, mentored by Dr. Ningyi Xu and Dr. Lintao Zhang.
My research interests lie at the intersection of computer systems/architecture and deep learning, including domain-specific architectures, software-hardware co-design, and deep learning compression and acceleration. More recently, my research has focused on model-chip co-design for LLMs, with a particular emphasis on low-bit quantization and sparsity techniques.
I am actively recruiting talent for both full-time positions and research internships throughout the year. Please feel free to contact me.
News
- 2025/07 I am serving on the HPCA 2026 program committee.
- 2025/06 We released SeerAttention-R [code], a sparse attention framework aimed at improving long-decoding efficiency for reasoning models. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. Trained on just 0.4B tokens, SeerAttention-R maintains near-lossless reasoning accuracy on the AIME benchmark with a 4K token budget under large sparse attention block sizes (64/128). AttnGate adaptors are available on Hugging Face. A minimal sketch of the block-gating idea appears after this news list.
- 2025/05 bitnet.cpp is accepted to ACL 2025.
- 2025/03 LUT Tensor Core is accepted to ISCA 2025.
- 2024/10 We released SeerAttention [code], a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in the attention map and treats the remaining blocks as sparse. By natively learning attention sparsity in LLMs, SeerAttention achieves a remarkable 90% sparsity ratio at a 32K context length with minimal perplexity loss, offering a 7.3× speedup over FlashAttention-2 (see the block-gating sketch after this list).
- 2024/08 T-MAC [code] is accepted to EuroSys 2025. T-MAC is a kernel library that directly supports mixed-precision matrix multiplication (int1/2/3/4 × int8/fp16/fp32) without dequantization by using lookup tables (LUTs). Specifically, T-MAC provides the LUT-based kernel foundation for bitnet.cpp. A toy LUT-based GEMV sketch follows this news list.
- 2024/05 BitDistiller [code] is accepted to the ACL 2024 main conference.
- 2024/04 We released BitBLAS and T-MAC, libraries that support mixed-precision matrix multiplication on GPUs and CPUs, respectively, designed specifically for low-bit LLM deployment.
- 2024/03 Our paper Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [code] is accepted to OSDI 2024.
- 2024/03 Our paper Pre-gated MoE is accepted to ISCA 2024.
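To make the SeerAttention/SeerAttention-R gating idea mentioned above more concrete, here is a minimal, illustrative sketch of block-level sparse attention with a learnable gate. It is not the released implementation: the function name `block_sparse_attention`, the mean-pooled block representations, the single-head layout, and the `keep_ratio`/`block_size` values are all illustrative assumptions, and causal masking and local blocks are omitted for brevity.

```python
# Illustrative sketch (NOT the SeerAttention code): a learnable gate scores
# query/key block pairs, only the top-k key blocks per query block are kept,
# and attention is computed under the resulting block mask.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, gate_q_proj, gate_k_proj,
                           block_size=64, keep_ratio=0.1):
    """q, k, v: [batch, seq_len, head_dim] (single head, seq_len % block_size == 0)."""
    B, S, D = q.shape
    nblk = S // block_size

    # Pool queries/keys per block and score block pairs with the learnable gate.
    q_blk = q.view(B, nblk, block_size, D).mean(dim=2)                       # [B, nblk, D]
    k_blk = k.view(B, nblk, block_size, D).mean(dim=2)                       # [B, nblk, D]
    gate_scores = gate_q_proj(q_blk) @ gate_k_proj(k_blk).transpose(1, 2)    # [B, nblk, nblk]

    # Keep only the top-k key blocks per query block ("block sparsity").
    k_keep = max(1, int(keep_ratio * nblk))
    topk = gate_scores.topk(k_keep, dim=-1).indices                          # [B, nblk, k_keep]
    block_mask = torch.zeros_like(gate_scores, dtype=torch.bool)
    block_mask.scatter_(-1, topk, True)

    # Expand the block mask to token resolution and run masked attention.
    token_mask = block_mask.repeat_interleave(block_size, dim=1) \
                           .repeat_interleave(block_size, dim=2)             # [B, S, S]
    attn = (q @ k.transpose(1, 2)) / D ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

# Example usage with random tensors and randomly initialized gate projections.
if __name__ == "__main__":
    B, S, D = 1, 256, 64
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    gate_q = torch.nn.Linear(D, D, bias=False)
    gate_k = torch.nn.Linear(D, D, bias=False)
    print(block_sparse_attention(q, k, v, gate_q, gate_k).shape)  # torch.Size([1, 256, 64])
```

In the real systems the gate is trained (so the selected blocks track the true attention distribution), and the masked attention is executed by a block-sparse kernel rather than a dense matmul with `-inf` masking; the sketch only shows where the learnable gate sits relative to standard attention.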
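The T-MAC news item above describes LUT-based mixed-precision matmul; the following toy GEMV sketches that idea. It is not the T-MAC kernel: the function name `lut_gemv_int2`, the group size of 4, the unsigned int2 weights, and the NumPy layout are illustrative assumptions, whereas real kernels work on packed data with SIMD table lookups.

```python
# Illustrative sketch (NOT T-MAC itself): low-bit weights are split into
# bit-planes, and each group of weight bits indexes a precomputed table of
# activation partial sums, so no dequantization or per-element multiply is needed.
import numpy as np

def lut_gemv_int2(w_int2, x, group=4):
    """y = W @ x for unsigned 2-bit weights W (values 0..3) and float activations x."""
    out_f, in_f = w_int2.shape
    assert in_f % group == 0
    n_groups = in_f // group

    # For every group of `group` activations, precompute the partial sum for
    # each of the 2**group possible weight-bit patterns (one small table per group).
    patterns = np.array([[(p >> i) & 1 for i in range(group)]
                         for p in range(2 ** group)], dtype=np.float32)   # [16, group]
    luts = x.reshape(n_groups, group) @ patterns.T                        # [n_groups, 16]

    y = np.zeros(out_f, dtype=np.float32)
    for bit in range(2):                                  # bit-plane decomposition of int2
        bit_plane = (w_int2 >> bit) & 1                   # [out_f, in_f], values in {0, 1}
        # Pack each group of weight bits into a table index (LSB = first element).
        idx = (bit_plane.reshape(out_f, n_groups, group)
               * (1 << np.arange(group))).sum(axis=2)     # [out_f, n_groups]
        # Table lookups replace multiply-accumulate; bit-planes combine with a shift.
        picked = luts[np.arange(n_groups)[None, :], idx]  # [out_f, n_groups]
        y += float(1 << bit) * picked.sum(axis=1)
    return y

# Quick check against a plain float matmul.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 4, size=(8, 16), dtype=np.uint8)
    x = rng.standard_normal(16).astype(np.float32)
    print(np.allclose(lut_gemv_int2(W, x), W.astype(np.float32) @ x, atol=1e-4))
```

The point of the design is that each table lookup covers a whole group of low-bit weights at once, so compute scales with the weight bit width rather than requiring the weights to be expanded back to fp16/fp32.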
Selected Publications
*Interns/Students. †Corresponding Author
-
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning.
Yizhao Gao*, Shuming Guo*, Shijie Cao†, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang.
arXiv Preprint
-
LUT Tensor Core: Lookup Table Enables Efficient Low-bit LLM Inference.
Zhiwen Mo*, Lei Wang*, Jianyu Wei*, Zhichen Zeng*, Shijie Cao†, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang.
ISCA 2025
-
T-MAC: CPU Renaissance via Table Lookup for Low-bit LLM Deployment on Edge.
Jianyu Wei*, Shijie Cao†, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang.
EuroSys 2025
-
Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation.
Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, Mao Yang.
OSDI 2024
-
Pre-gated MoE: An Algorithm-System Co-design for Fast and Scalable Mixture-of-Expert Inference.
Ranggi Hwang*, Jianyu Wei*, Shijie Cao†, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang.
ISCA 2024
-
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.
Yizhao Gao*, Zhichen Zeng*, Dayou Du*, Shijie Cao†, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang.
arXiv Preprint
-
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation.
Dayou Du*, Yijia Zhang*, Shijie Cao†, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu.
ACL 2024
-
ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM.
Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu.
arXiv Preprint
-
BitNet.cpp: Efficient Edge Inference for Ternary LLMs.
Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei.
ACL 2025
-
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.
Dayou Du*, Shijie Cao†, Jianyi Cheng, Ting Cao, Mao Yang.
arXiv Preprint
-
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models.
Yijia Zhang*, Lingran Zhao*, Shijie Cao†, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu.
ICME 2024
-
Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning.
Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao†, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang.
MLSys 2023
-
Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity.
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, Lintao Zhang.
FPGA 2019
-
Balanced Sparsity for Efficient DNN Inference on GPU.
Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie.
AAAI 2019
-
SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization.
Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, Zhi Yang.
CVPR 2019
Interns/Students I have had the privilege to mentor and work with
- Yijia Zhang (PhD at SJTU)
- Ranggi Hwang (PhD at KAIST)
- Jianyu Wei (PhD at USTC)
- Zhiwen Mo (PhD at Imperial College London)
- Dayou Du (MPhil at HKUST(GZ), now PhD at University of Edinburgh)
- Zhichen Zeng (Undergraduate at USTC, now PhD at University of Washington)
- Yizhao Gao (PhD at HKU)
- Shuming Guo (Undergraduate at HUST)