About Me
I am a senior researcher in the System Group at Microsoft Research Asia (MSRA). I received my Ph.D. in Computer Science from Harbin Institute of Technology (HIT) in 2021 through a joint Ph.D. program with MSRA. Prior to that, I earned my B.E. in Computer Science from HIT in 2016. From 2015 to 2021, I was a long-term intern in MSRA's systems area, mentored by Dr. Ningyi Xu and Dr. Lintao Zhang.
My research interests lie at the intersection of computer systems/architecture and deep learning, including domain-specific architectures, software-hardware co-design, and deep learning compression and acceleration. More recently, my research has focused on model-chip co-design for LLMs, with a particular emphasis on low-bit quantization and sparsity techniques.
I am actively recruiting talent for both full-time positions and research internships throughout the year. Please feel free to contact me.
News
- 2025/07 I am serving on the HPCA 2026 program committee.
- 2025/06 We released SeerAttention-R [code], a sparse attention framework aimed at improving long-decoding efficiency for reasoning models. With a lightweight plug-in gate, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. Trained on just 0.4B tokens, SeerAttention-R maintains near-lossless reasoning accuracy on the AIME benchmark with a 4K token budget under large sparse attention block sizes (64/128). AttnGate adaptors are available on Hugging Face. A minimal sketch of the block-gating idea appears after this news list.
- 2025/05 bitnet.cpp is accepted to ACL 2025.
- 2025/03 LUT Tensor Core is accepted to ISCA 2025.
- 2024/10 We released SeerAttention [code], a new attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in the attention map and treats the remaining blocks as sparse. By natively learning attention sparsity in LLMs, SeerAttention achieves a remarkable 90% sparsity ratio at a 32K context length with minimal perplexity loss, offering a 7.3× speedup over FlashAttention-2 (see the block-gating sketch after this list).
- 2024/08 T-MAC [code] is accepted to EuroSys 2025. T-MAC is a kernel library that directly supports mixed-precision matrix multiplication (int1/2/3/4 × int8/fp16/fp32) without dequantization by using lookup tables (LUTs). Specifically, T-MAC provides the LUT-based kernel foundation for bitnet.cpp. A toy LUT-based GEMV sketch follows this news list.
- 2024/05 BitDistiller [code] is accepted to the ACL 2024 main conference.
- 2024/04 We released BitBLAS and T-MAC, libraries that support mixed-precision matrix multiplication on GPUs and CPUs, respectively, designed specifically for low-bit LLM deployment.
- 2024/03 Our paper Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation [code] is accepted to OSDI 2024.
- 2024/03 Our paper Pre-gated MoE is accepted to ISCA 2024.
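To make the SeerAttention/SeerAttention-R gating idea mentioned above more concrete, here is a minimal, illustrative sketch of block-level sparse attention with a learnable gate. It is not the released implementation: the function name `block_sparse_attention`, the mean-pooled block representations, the single-head layout, and the `keep_ratio`/`block_size` values are all illustrative assumptions, and causal masking and local blocks are omitted for brevity.

```python
# Illustrative sketch (NOT the SeerAttention code): a learnable gate scores
# query/key block pairs, only the top-k key blocks per query block are kept,
# and attention is computed under the resulting block mask.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, gate_q_proj, gate_k_proj,
                           block_size=64, keep_ratio=0.1):
    """q, k, v: [batch, seq_len, head_dim] (single head, seq_len % block_size == 0)."""
    B, S, D = q.shape
    nblk = S // block_size

    # Pool queries/keys per block and score block pairs with the learnable gate.
    q_blk = q.view(B, nblk, block_size, D).mean(dim=2)                       # [B, nblk, D]
    k_blk = k.view(B, nblk, block_size, D).mean(dim=2)                       # [B, nblk, D]
    gate_scores = gate_q_proj(q_blk) @ gate_k_proj(k_blk).transpose(1, 2)    # [B, nblk, nblk]

    # Keep only the top-k key blocks per query block ("block sparsity").
    k_keep = max(1, int(keep_ratio * nblk))
    topk = gate_scores.topk(k_keep, dim=-1).indices                          # [B, nblk, k_keep]
    block_mask = torch.zeros_like(gate_scores, dtype=torch.bool)
    block_mask.scatter_(-1, topk, True)

    # Expand the block mask to token resolution and run masked attention.
    token_mask = block_mask.repeat_interleave(block_size, dim=1) \
                           .repeat_interleave(block_size, dim=2)             # [B, S, S]
    attn = (q @ k.transpose(1, 2)) / D ** 0.5
    attn = attn.masked_fill(~token_mask, float("-inf"))
    return F.softmax(attn, dim=-1) @ v

# Example usage with random tensors and randomly initialized gate projections.
if __name__ == "__main__":
    B, S, D = 1, 256, 64
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    gate_q = torch.nn.Linear(D, D, bias=False)
    gate_k = torch.nn.Linear(D, D, bias=False)
    print(block_sparse_attention(q, k, v, gate_q, gate_k).shape)  # torch.Size([1, 256, 64])
```

In the real systems the gate is trained (so the selected blocks track the true attention distribution), and the masked attention is executed by a block-sparse kernel rather than a dense matmul with `-inf` masking; the sketch only shows where the learnable gate sits relative to standard attention.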
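The T-MAC news item above describes LUT-based mixed-precision matmul; the following toy GEMV sketches that idea. It is not the T-MAC kernel: the function name `lut_gemv_int2`, the group size of 4, the unsigned int2 weights, and the NumPy layout are illustrative assumptions, whereas real kernels work on packed data with SIMD table lookups.

```python
# Illustrative sketch (NOT T-MAC itself): low-bit weights are split into
# bit-planes, and each group of weight bits indexes a precomputed table of
# activation partial sums, so no dequantization or per-element multiply is needed.
import numpy as np

def lut_gemv_int2(w_int2, x, group=4):
    """y = W @ x for unsigned 2-bit weights W (values 0..3) and float activations x."""
    out_f, in_f = w_int2.shape
    assert in_f % group == 0
    n_groups = in_f // group

    # For every group of `group` activations, precompute the partial sum for
    # each of the 2**group possible weight-bit patterns (one small table per group).
    patterns = np.array([[(p >> i) & 1 for i in range(group)]
                         for p in range(2 ** group)], dtype=np.float32)   # [16, group]
    luts = x.reshape(n_groups, group) @ patterns.T                        # [n_groups, 16]

    y = np.zeros(out_f, dtype=np.float32)
    for bit in range(2):                                  # bit-plane decomposition of int2
        bit_plane = (w_int2 >> bit) & 1                   # [out_f, in_f], values in {0, 1}
        # Pack each group of weight bits into a table index (LSB = first element).
        idx = (bit_plane.reshape(out_f, n_groups, group)
               * (1 << np.arange(group))).sum(axis=2)     # [out_f, n_groups]
        # Table lookups replace multiply-accumulate; bit-planes combine with a shift.
        picked = luts[np.arange(n_groups)[None, :], idx]  # [out_f, n_groups]
        y += float(1 << bit) * picked.sum(axis=1)
    return y

# Quick check against a plain float matmul.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 4, size=(8, 16), dtype=np.uint8)
    x = rng.standard_normal(16).astype(np.float32)
    print(np.allclose(lut_gemv_int2(W, x), W.astype(np.float32) @ x, atol=1e-4))
```

The point of the design is that each table lookup covers a whole group of low-bit weights at once, so compute scales with the weight bit width rather than requiring the weights to be expanded back to fp16/fp32.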
Selected Publications
*Interns/Students. †Corresponding Author
-
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning.
Yizhao Gao*, Shuming Guo*, Shijie Cao†, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang.
arXiv Preprint
-
LUT Tensor Core: Lookup Table Enables Efficient Low-bit LLM Inference.
Zhiwen Mo*, Lei Wang*, Jianyu Wei*, Zhichen Zeng*, Shijie Cao†, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang.
ISCA 2025
-
T-MAC: CPU Renaissance via Table Lookup for Low-bit LLM Deployment on Edge.
Jianyu Wei*, Shijie Cao†, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang.
EuroSys 2025
-
Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation.
Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, Mao Yang.
OSDI 2024
-
Pre-gated MoE: An Algorithm-System Co-design for Fast and Scalable Mixture-of-Expert Inference.
Ranggi Hwang*, Jianyu Wei*, Shijie Cao†, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang.
ISCA 2024
-
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.
Yizhao Gao*, Zhichen Zeng*, Dayou Du*, Shijie Cao†, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang.
arXiv Preprint
-
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation.
Dayou Du*, Yijia Zhang*, Shijie Cao†, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu.
ACL 2024
-
ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM.
Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu.
arXiv Preprint
-
BitNet.cpp: Efficient Edge Inference for Ternary LLMs.
Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei.
ACL 2025
-
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.
Dayou Du*, Shijie Cao†, Jianyi Cheng, Ting Cao, Mao Yang.
arXiv Preprint
-
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models.
Yijia Zhang*, Lingran Zhao*, Shijie Cao†, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu.
ICME 2024
-
Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning.
Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao†, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang.
MLSys 2023
-
Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity.
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, Lintao Zhang.
FPGA 2019
-
Balanced Sparsity for Efficient DNN Inference on GPU.
Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie.
AAAI 2019
-
SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization.
Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, Zhi Yang.
CVPR 2019
Interns/Students I have had the privilege to mentor and work with
- Yijia Zhang (PhD at SJTU)
- Ranggi Hwang (PhD at KAIST)
- Jianyu Wei (PhD at USTC)
- Zhiwen Mo (PhD at Imperial College London)
- Dayou Du (MPhil at HKUST(GZ), now PhD at University of Edinburgh)
- Zhichen Zeng (Undergraduate at USTC, now PhD at University of Washington)
- Yizhao Gao (PhD at HKU)
- Shuming Guo (Undergraduate at HUST)