PTX.jl
A Julia interface for NVIDIA PTX. Composes additively with CUDA.jl — CUDA.jl owns launch, memory, and control flow; PTX.jl owns instruction emission.
using PTX, CUDACore
function add_kernel!(c, a, b)
    # %tid.x is a 0-based thread index; Julia arrays are 1-based.
    tid = ptx"mov.u32"(sreg"%tid.x")
    i = Int(tid) + 1
    c[i] = ptx"add.f32"(a[i], b[i])
    return
end
let
    n = 128
    # cu converts the Float64 input to Float32, matching add.f32.
    a = cu(randn(n))
    b = cu(randn(n))
    c = similar(a)
    @cuda threads=n add_kernel!(c, a, b)
    c == a + b
end
Why
PTX is the lowest authoring tier in the NVIDIA GPU stack. PTX.jl covers what CUDA.jl leaves out: full TensorCore shape coverage (including TF32, FP8, and sub-byte types), TMA descriptors, cluster APIs, mbarriers, FP8 conversions, setmaxnreg, match.sync, prmt, and the rest of what <mma.h>, <cuda_pipeline.h>, and <cuda/barrier> ultimately compile down to.
When porting modern CUTLASS / Triton / cuDNN-style kernels, PTX.jl is on the critical path. CUDA.jl alone leaves a Julia user stuck for any kernel that uses tensor cores beyond CUDA.WMMA's limited coverage, async pipelines, cluster ops, or TMA. PTX.jl + CUDA.jl together close that gap.
Two surfaces
- Authoring. Write PTX directly in Julia via the @ptx_str string macro. See the Chain DSL page.
- Transpiling. Turn an existing .ptx file into idiomatic Julia via ptx_to_julia. See the Transpiler page.
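For readers new to Julia's non-standard string literals: a literal written as `ptx"add.f32"` is ordinary Julia syntax for invoking a macro named `@ptx_str` with the string as its argument. The sketch below illustrates that mechanism on the CPU only. It is not PTX.jl's implementation: `@toyptx_str`, `TOY_TYPES`, and `TOY_OPS` are hypothetical names, and the opcode and type tables are assumptions chosen for illustration.

```julia
# Hypothetical CPU-only sketch of the string-macro mechanism.
# Nothing here is PTX.jl API; all names are illustrative.

# Assumed mapping from a PTX-style type suffix to a Julia type.
const TOY_TYPES = Dict("f32" => Float32, "u32" => UInt32, "s32" => Int32)

# Assumed mapping from an opcode to a plain Julia function standing in for it.
const TOY_OPS = Dict("add" => +, "mul" => *, "mov" => identity)

# toyptx"add.f32" expands, at macro-expansion time, into an anonymous function.
macro toyptx_str(s)
    parts = String.(split(s, '.'))
    length(parts) == 2 || error("expected \"opcode.type\", got \"$s\"")
    op = TOY_OPS[parts[1]]
    T = TOY_TYPES[parts[2]]
    # The closure converts every operand to T, then applies the op.
    return esc(:((args...) -> $op(map($T, args)...)))
end

add_f32 = toyptx"add.f32"
result = add_f32(1.5, 2.25)   # operands are converted to Float32 first
```

Because the string is parsed when the macro expands, a malformed opcode fails before the code runs. A real implementation would emit an instruction rather than call a Julia function.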
Pages
- Getting started — install, first kernel, how PTX.jl composes with CUDA.jl.
- Chain DSL — @ptx_str / @sreg_str semantics: return-type inference, side-effect classification, operand conventions.
- Wrappers — hand-written wrapper families for ops whose operand shape breaks the chain default (mma, wgmma, ldmatrix, TMA, mbarrier, …).
- Transpiler — parse PTX, transform IR, emit Julia.
- Reference — public API.
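The "parse PTX, transform IR, emit Julia" flow named in the Transpiler entry can be pictured with a toy, single-instruction pipeline. This is a minimal sketch under stated assumptions, not the actual transpiler: `ToyInstr`, `toy_parse`, `toy_name`, and `toy_emit` are hypothetical names, the input grammar is limited to `opcode dest, src, ...;`, and the emitted text simply mimics the call style of the first example on this page.

```julia
# Hypothetical sketch of a parse -> IR -> emit pipeline for one PTX line.
# None of these names are PTX.jl API.

# Minimal IR node for a single instruction.
struct ToyInstr
    opcode::String          # e.g. "add.f32"
    dest::String            # destination register, e.g. "%f3"
    srcs::Vector{String}    # source operands
end

# Parse one instruction of the assumed form "opcode dest, src1, src2;".
function toy_parse(line::AbstractString)
    body = strip(rstrip(strip(line), ';'))
    opcode, rest = split(body, ' '; limit=2)
    operands = String.(strip.(split(rest, ',')))
    return ToyInstr(String(opcode), operands[1], operands[2:end])
end

# Turn a PTX register name such as "%f3" into a Julia identifier "f3".
toy_name(reg::AbstractString) = replace(reg, "%" => "")

# Emit one line of Julia source using the ptx"..."(...) call style.
function toy_emit(ins::ToyInstr)
    args = join(toy_name.(ins.srcs), ", ")
    return string(toy_name(ins.dest), " = ptx\"", ins.opcode, "\"(", args, ")")
end

ir = toy_parse("add.f32 %f3, %f1, %f2;")
julia_src = toy_emit(ir)
println(julia_src)   # prints: f3 = ptx"add.f32"(f1, f2)
```

Keeping a small IR between parsing and emission is what leaves room for a transform step, such as renaming registers or grouping instructions, without touching either end.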
Credits
Primary design inspiration: pyptx by Patrick Toulmé. The parser, IR, and several wrappers and example kernels are ported from pyptx (Apache 2.0); see per-file headers and LICENSE.