Chain DSL
The user-facing surface for writing PTX is one string macro:
ptx"opcode.mod1.mod2..."(args...)@ptx_str splits the literal on ., builds an Operation{parts} singleton, and the call site is @generated. From the chain parts plus the argument types, PTX.jl derives:
- the asm template string,
- LLVM constraint letters,
- side-effect classification +
~{memory}clobber, - return type,
and emits a single @asmcall.
ptx"add.f32"(a, b) # Float32 → Float32
ptx"fma.rn.f32"(a, b, c)
ptx"atom.add.gpu.u32"(p, v) # *p += v atomically
ptx"bar.sync"(Val(0)) # immediate baked in
ptx"mov.u32"(sreg"%tid.x") # special-register read
ptx"cp.async.commit_group"()
ptx"cp.async.wait_group"(Val(0))
ptx"fence.acq_rel.gpu"()This handles the vast majority of PTX with no per-op declaration. For ops whose operand layout breaks the chain default — multi-output, mixed address-space pointers, fragment-shape dispatch, descriptor packing — hand-written wrappers register typed methods on the same Operation{...} singleton. Same call site, no user-visible change. See Wrappers.
Modifier syntax
The macro splits on .; each segment becomes one Symbol verbatim. :: (PTX sub-namespace separator), digit-leading tokens (3d, m16n8k32), and underscores in modifier names all flow through cleanly:
ptx"mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32"(a, b, c)
ptx"cp.async.bulk.tensor.3d.shared::cta.global.tile.mbarrier::complete_tx::bytes"(...)Empty parts (consecutive ., leading/trailing ., or empty string) error at expansion.
Return type inference
The terminal modifier of the chain, if it's a recognized PTX dtype suffix, gives the return type:
| Modifier | Julia type |
|---|---|
.f64 / .f32 / .f16 | Float64 / Float32 / Float16 |
.u64 / .u32 / .u16 / .u8 | UInt64 / UInt32 / UInt16 / UInt8 |
.s64 / .s32 / .s16 / .s8 | Int64 / Int32 / Int16 / Int8 |
.b64 / .b32 / .b16 / .b8 | UInt64 / UInt32 / UInt16 / UInt8 |
.bf16 / .tf32 | UInt16 / UInt32 (bit-pattern carrier) |
.pred | Bool |
.f16x2 / .bf16x2 | UInt32 (packed FP carrier) |
.e4m3x2 / .e5m2x2 / .e2m1x2 / .e3m2x2 / .ue8m0x2 / … | UInt16 (packed FP carrier) |
.e4m3x4 / .e5m2x4 / .e2m3x4 / .e3m2x4 / .e2m1x4 | UInt32 (packed FP carrier) |
.f32x2 | UInt64 |
If the trailing modifier isn't a recognized dtype, the chain emits a void asm and returns Nothing.
Three families break this rule and are special-cased:
cvt— grammar iscvt.<modifiers...>.<dst>.<src>, so the destination isparts[end-1].cvt.rn.f16.f32returnsFloat16.setp—setp.<cmp>.<dtype>always returnsBool. The trailing modifier describes the input compare type, not the output.- No-return families —
setmaxnreg.{inc,dec}.sync.aligned.u32,tensormap.replace.tile.<field>...b{32,64}, and thetcgen05sinks (alloc,commit,relinquish_alloc_permit) all carry a trailing width modifier that describes an input operand. The chain treats them as void; otherwise ptxas would reject with "Arguments mismatch".
Packed FP carriers
PTX FP8 / FP6 / FP4 lanes have no native Julia primitive, so the chain returns integer carriers (UInt8 / UInt16 / UInt32) of matching width. NVPTX register classes (h/r/l) match these, and downstream packages can layer a primitive type with getindex for lane extraction on top:
primitive type Float8_E4M3x2 16 end # 2 lanes × 8 bits
function Base.getindex(x::Float8_E4M3x2, i::Int)
bits = reinterpret(UInt16, x)
byte = i == 1 ? UInt8(bits & 0xFF) : UInt8(bits >> 8)
reinterpret(Float8_E4M3FN, byte)
endA 16-bit primitive lowers to LLVM i16, fitting the same h constraint as UInt16. reinterpret between the carrier and the semantic type is zero-cost.
Side-effect classification
Inline asm is opaque to LLVM; without explicit annotation, LLVM may DCE/CSE/fold or reorder it. The chain marks an op nonpure (side_effects = true + ~{memory} clobber) when:
- the opcode prefix is one of:
bar,mbarrier,fence,wgmma,tcgen05,cluster,cp,setmaxnreg,elect,prefetch,tensormap,ld,st,atom,red,ldmatrix,stmatrix,vote,shfl,match,redux,activemask,membar,mapa,getctarank,griddepcontrol,clusterlaunchcontrol,exit; or - any argument is a
SpecialReg(sreg reads are observable).
Warp-collective ops (vote, shfl, match, redux, activemask) need ~{memory} even though they touch no memory: each lane's result depends on every other lane's input, and without the clobber LLVM would hoist or constant-fold them as if they were per-thread pure functions and silently lose the cross-lane semantics.
Constraint letters
Per-arg from a small mapping:
| Julia type | Letter | NVPTX register class |
|---|---|---|
Float64 | d | f64 |
Float32 | f | f32 |
Float16, Int16/UInt16, Int8/UInt8 | h | i16 (NVPTX has no native i8 register) |
Int32/UInt32 | r | i32 |
Int64/UInt64 | l | i64 |
Bool | b | i1 (predicate) |
Core.LLVMPtr{T, AS} | l | i64 for any address space |
Pointer arguments always get l regardless of address space — NVPTX represents non-zero address-space pointers as 64-bit at the LLVM IR level even when the underlying PTX address is 32-bit (shared, param, local). Hand-written wrappers for shared-memory ops override this to r where ptxas wants the 32-bit form.
Special argument shapes
The chain recognizes three argument shapes that don't render as plain $N operands:
Val{N::Integer} — compile-time immediate
Bakes N as a decimal literal into the asm string at the operand position. Consumes no LLVM input slot:
ptx"bar.sync"(Val(0)) # → "bar.sync 0;"
ptx"cp.async.cg.shared.global"(dst, src, Val(16)) # 16-byte size baked
ptx"shl.b32"(x, Val(2)) # shl.b32 r, x, 2SpecialReg{name} — verbatim PTX token
Renders the name verbatim (always with % prefix). Constructed via @sreg_str:
sreg"%tid.x" # → "%tid.x"
sreg"tid.x" # ≡ sreg"%tid.x" (auto-prefixed)
sreg"cluster_ctarank" # → "%cluster_ctarank"
ptx"mov.u32"(sreg"%tid.x") # → "mov.u32 r, %tid.x;"Underscore-bearing names (%cluster_ctarank, %lanemask_eq, %total_smem_size) round-trip losslessly because the macro bakes the exact PTX token.
Homogeneous tuple → braced register-vector
Each tuple element becomes its own LLVM input slot; the asm emits a braced group:
# `add.f32x2 d, {a0, a1}, {b0, b1};`
ptx"add.f32x2"((a0, a1), (b0, b1))Used for any op whose operand layout takes {$N, $N+1, ...}. Many multi-output families (ldmatrix, mma, stmatrix) emit braced operands and are covered by hand-written wrappers — see Wrappers.
Pointer bracketing
Memory-op opcodes render pointer arguments as [$N]; non-memory ops (cvta, mov, …) emit unbracketed $N. The bracketing set:
ld,st,atom,red,cp,mbarrier,ldmatrix,stmatrix,prefetch,tcgen05,tensormap,fence.
fence only takes a pointer in the fence.proxy.tensormap::generic.<acq|rel>.gpu [addr], size form; argument-less fence.sc.gpu forms emit no bracketed operand either way.
See the Reference page for full docstrings of @ptx_str and @sreg_str.