Shijie Cao (曹士杰)

Senior Researcher @ System Group, Microsoft Research Asia
Email: caoshijie0501 [at] gmail [dot] com
Google Scholar

About Me

I am a senior researcher in the System Group at Microsoft Research Asia. I received my Ph.D. degree in Computer Science from Harbin Institute of Technology (HIT) in 2021 through a joint Ph.D. program with MSRA. Prior to that, I earned my B.E. degree in Computer Science from HIT in 2016. From 2015 to 2021, I was a long-term intern in MSRA's systems area, mentored by Dr. Ningyi Xu and Dr. Lintao Zhang.

My research interests lie at the intersection of computer systems/architecture and deep learning, including domain-specific architectures, software-hardware co-design, and deep learning compression and acceleration. More recently, my research has focused on model-chip co-design for LLMs, with particular emphasis on low-bit quantization and sparsity techniques.

I am actively seeking talented candidates for both full-time positions and research internships throughout the year. Please feel free to contact me.

News

Media

Selected Publications

*Interns/Students. †Corresponding Author

  1. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning.
    Yizhao Gao*, Shuming Guo*, Shijie Cao†, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang.
    arXiv preprint
  2. LUT Tensor Core: Lookup Table Enables Efficient Low-bit LLM Inference.
    Zhiwen Mo*, Lei Wang*, Jianyu Wei*, Zhichen Zeng*, Shijie Cao†, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang.
    ISCA 2025
  3. T-MAC: CPU Renaissance via Table Lookup for Low-bit LLM Deployment on Edge.
    Jianyu Wei*, Shijie Cao†, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang.
    EuroSys 2025
  4. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation.
    Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, Mao Yang.
    OSDI 2024
  5. Pre-gated MoE: An Algorithm-System Co-design for Fast and Scalable Mixture-of-Expert Inference.
    Ranggi Hwang*, Jianyu Wei*, Shijie Cao†, Changho Hwang, Xiaohu Tang, Ting Cao, Mao Yang.
    ISCA 2024
  6. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.
    Yizhao Gao*, Zhichen Zeng*, Dayou Du*, Shijie Cao†, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang.
    arXiv preprint
  7. BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation.
    Dayou Du*, Yijia Zhang*, Shijie Cao†, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu.
    ACL 2024
  8. ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM.
    Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu.
    arXiv preprint
  9. BitNet.cpp: Efficient Edge Inference for Ternary LLMs.
    Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei.
    arXiv preprint
  10. BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.
    Dayou Du*, Shijie Cao†, Jianyi Cheng, Ting Cao, Mao Yang.
    arXiv preprint
  11. Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models.
    Yijia Zhang*, Lingran Zhao*, Shijie Cao†, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu.
    ICME 2024
  12. Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning.
    Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao†, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang.
    MLSys 2023
  13. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity.
    Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxing Liu, Ming Wu, Lintao Zhang.
    FPGA 2019
  14. Balanced Sparsity for Efficient DNN Inference on GPU.
    Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie.
    AAAI 2019
  15. SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization.
    Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, Zhi Yang.
    CVPR 2019

Interns and students I have had the privilege to mentor and work with