DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

2026

1Westlake University, Hangzhou, China 2The Hong Kong University of Science and Technology, Hong Kong, China 3Rochester Institute of Technology, Rochester, USA
Corresponding author: wanghuan@westlake.edu.cn

Overview of DICE. The framework enhances the robustness of CUDA kernel generation in dLLMs by leveraging TraceRL. This hierarchical approach integrates: (1) a Bi-phase Curated Reinforcement Learning (BiC-RL) framework, a progressive RL training strategy consisting of a kernel infilling stage and an end-to-end kernel generation stage that ensure the functional correctness and high performance of the generated CUDA kernels, and (2) Data Scheduling, which transitions the training data from basic single operations to complex whole-model structures across the two RL stages. A valid reward is returned only if the generated CUDA kernel compiles and functions correctly.
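As a rough illustration of this compile-and-correctness gated reward, the Python sketch below compiles a candidate kernel, checks it against a reference implementation, and only then assigns a speedup-based reward. The helper names, reward values, and tolerances are our own assumptions, not the authors' implementation.

```python
# Hedged sketch of a compile-and-correctness gated reward.
# Reward values, tolerances, and the "forward" entry point are illustrative assumptions.
import torch
from torch.utils.cpp_extension import load_inline


def _time_ms(fn, inputs, iters: int = 20) -> float:
    """Average CUDA wall time of fn(*inputs) in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


def kernel_reward(cuda_src: str, cpp_src: str, ref_fn, example_inputs,
                  atol: float = 1e-4) -> float:
    """Return 0.0 unless the generated kernel compiles AND matches the
    reference output; otherwise the reward scales with measured speedup."""
    try:
        # Compile the generated CUDA source into a loadable extension.
        mod = load_inline(name="generated_kernel",
                          cpp_sources=cpp_src,
                          cuda_sources=cuda_src,
                          functions=["forward"],
                          verbose=False)
    except Exception:
        return 0.0  # compilation failure -> no reward

    try:
        out = mod.forward(*example_inputs)
        ref = ref_fn(*example_inputs)
        if not torch.allclose(out, ref, atol=atol):
            return 0.0  # functional mismatch -> no reward
    except Exception:
        return 0.0  # runtime error -> no reward

    # Correct kernel: reward grows with the speedup over the reference op.
    speedup = _time_ms(ref_fn, example_inputs) / _time_ms(mod.forward, example_inputs)
    return max(speedup, 1e-3)
```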

Abstract

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well suited to code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs to CUDA kernel generation remains challenging, obstructed not only by the highly specialized nature of CUDA programming but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an expanded dataset of high-performance CUDA kernels, and propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion language models designed for CUDA kernel generation, spanning three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both AR LLMs and existing dLLMs of comparable scale, establishing a new state of the art for CUDA kernel generation.
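To make the bi-phase curriculum concrete, the following sketch lays out the two RL stages and the accompanying data scheduling described above. The field names and example task descriptions are purely illustrative assumptions, not the actual CuKe format.

```python
# Hypothetical outline of the BiC-RL curriculum; names and descriptions are
# illustrative assumptions, not the authors' training configuration.
STAGES = [
    {
        "name": "kernel_infilling",
        "task": "fill masked spans of an existing CUDA kernel",
        "data": "basic single operations",
    },
    {
        "name": "end_to_end_generation",
        "task": "generate a complete CUDA kernel from the given module",
        "data": "complex whole-model structures",
    },
]

for stage in STAGES:
    print(f"[{stage['name']}] task: {stage['task']} | data: {stage['data']}")
```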

Inference Paradigm of dLLMs

The inference paradigm of diffusion large language models. Left: The sequence is divided into several blocks, with a block length of four in this figure. The block-diffusion mechanism lets the model generate autoregressively across blocks while decoding tokens in parallel within each block; the KV cache from all previous blocks is reused. Right: An actual step-by-step generation trajectory for an example CUDA kernel. While the overall trend remains autoregressive, substantial non-autoregressive behavior is clearly visible during the generation process.
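A minimal, self-contained sketch of this block-diffusion decoding loop is shown below: blocks are produced left to right, tokens within a block are unmasked in parallel over a few refinement steps, and the cache of finished blocks is carried forward. The toy model, mask id, and confidence-based unmasking schedule are stand-ins, not the DICE implementation.

```python
# Toy simulation of block-diffusion decoding; the denoiser is a random stand-in.
import torch

MASK_ID = -1         # placeholder id for a still-masked token
BLOCK_LEN = 4        # block length, matching the figure
STEPS_PER_BLOCK = 2  # parallel refinement steps inside each block
VOCAB_SIZE = 16      # tiny vocabulary for the stand-in model


def toy_model(block: torch.Tensor, kv_cache: list) -> torch.Tensor:
    """Stand-in denoiser: returns random logits over a tiny vocabulary.
    A real dLLM would attend to the KV cache of previous blocks here."""
    return torch.rand(block.shape[0], VOCAB_SIZE)


def generate(num_blocks: int) -> list:
    sequence, kv_cache = [], []
    for _ in range(num_blocks):                       # autoregressive across blocks
        block = torch.full((BLOCK_LEN,), MASK_ID)
        for _ in range(STEPS_PER_BLOCK):              # parallel denoising within a block
            logits = toy_model(block, kv_cache)
            confidence, candidates = logits.max(dim=-1)
            masked = (block == MASK_ID).nonzero(as_tuple=True)[0]
            if len(masked) == 0:
                break
            # Unmask the most confident half of the still-masked positions.
            k = max(1, len(masked) // 2)
            keep = masked[confidence[masked].topk(k).indices]
            block[keep] = candidates[keep]
        block[block == MASK_ID] = candidates[block == MASK_ID]  # finalize leftovers
        sequence.extend(block.tolist())
        kv_cache.append(block)                        # cached and reused by later blocks
    return sequence


print(generate(num_blocks=3))
```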

Results

Table 1: Main results on KernelBench for 8B-scale and comparably sized models. We report Execution Correctness (Exec) and the speedup metrics fast1 and fast2. We compare autoregressive and diffusion LLMs, covering general, code, and reasoning models. The best and second-best results are highlighted in bold and underlined, respectively. Results of commercial models are shown in gray for reference.
Table 2: Main results on KernelBench for 4B-scale and comparably sized models. We report Execution Correctness (Exec) and the speedup metrics fast1 and fast2. We compare autoregressive and diffusion LLMs, covering general, code, and reasoning models. The best and second-best results are highlighted in bold and underlined, respectively.
Table 3: Main results on KernelBench for 1.7B-scale and comparably sized models. We report Execution Correctness (Exec) and the speedup metrics fast1 and fast2. We compare autoregressive and diffusion LLMs, covering general, code, and reasoning models. The best and second-best results are highlighted in bold and underlined, respectively.
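For reference, the sketch below shows one way these metrics could be computed from per-problem results, assuming the usual KernelBench-style fast_p definition (the fraction of problems whose generated kernel is both correct and more than p times faster than the reference); the data structure and example values here are our own.

```python
# Hedged sketch of Exec and fast_p aggregation; the exact KernelBench
# definitions may differ in detail.
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemResult:
    correct: bool    # kernel compiled and matched the reference output
    speedup: float   # reference time / generated-kernel time (0 if incorrect)


def exec_rate(results: List[ProblemResult]) -> float:
    """Execution correctness (Exec): share of problems solved correctly."""
    return sum(r.correct for r in results) / len(results)


def fast_p(results: List[ProblemResult], p: float) -> float:
    """fast_p: share of problems that are correct AND > p-times faster."""
    return sum(r.correct and r.speedup > p for r in results) / len(results)


results = [ProblemResult(True, 1.4), ProblemResult(True, 0.8), ProblemResult(False, 0.0)]
print(exec_rate(results), fast_p(results, 1.0), fast_p(results, 2.0))
```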

BibTeX

@article{bai2026dice,
  title={DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels},
  author={Bai, Haolei and Kong, Lingcheng and Chen, Xueyi and Wang, Jiamian and Tao, Zhiqiang and Wang, Huan},
  journal={arXiv preprint arXiv:2602.11715},
  year={2026}
}