Nathan Barry (@nathanrs)

2025-11-10 | ❤️ 575 | 🔁 54


Added confidence-aware parallel decoding to my tiny text diffusion model!

Before, we had “scheduled iterative refinement” for decoding, where it would go through a masking schedule and resample all positions at each iteration. This didn’t take token probabilities into account and could remask “good” tokens instead of low-confidence ones.
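Roughly what that old loop looked like, as a minimal sketch (the model call, mask_id, and the exact schedule here are placeholders, not the actual code):

```python
import torch
import torch.nn.functional as F

def scheduled_refinement(model, x, mask_id, schedule=(0.75, 0.5, 0.25, 0.0)):
    # x: (B, T) token ids, fully masked at the start
    for frac in schedule:
        logits = model(x)                                   # (B, T, V)
        probs = F.softmax(logits, dim=-1)
        # resample *every* position, even ones that were already good
        x = torch.multinomial(probs.view(-1, probs.size(-1)), 1).view(x.shape)
        if frac > 0:
            # remask roughly `frac` of the positions, chosen at random,
            # with no regard for how confident the model was about them
            remask = torch.rand(x.shape, device=x.device) < frac
            x = torch.where(remask, torch.full_like(x, mask_id), x)
    return x
```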

Now, at each iteration it unmasks every token whose predicted probability is above a confidence threshold (currently 0.9), repeating until everything is unmasked, so only “good” tokens get committed. This unsurprisingly led to a massive jump in output quality.
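A minimal sketch of what the new decode loop could look like, assuming a model that returns per-position logits (the helper names and the progress fallback are my own, not the actual implementation):

```python
import torch
import torch.nn.functional as F

def confidence_aware_decode(model, x, mask_id, threshold=0.9, max_steps=64):
    # x: (B, T) token ids, fully masked at the start
    for _ in range(max_steps):
        masked = (x == mask_id)
        if not masked.any():
            break
        logits = model(x)                           # (B, T, V)
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)              # per-position confidence + argmax token
        # unmask only positions that are still masked AND confident enough
        accept = masked & (conf >= threshold)
        # guarantee progress: if nothing clears the bar, take the single
        # most confident masked position instead
        if not accept.any():
            conf_masked = conf.masked_fill(~masked, -1.0)
            idx = conf_masked.argmax(dim=-1)
            accept = torch.zeros_like(masked)
            accept[torch.arange(x.size(0)), idx] = True
        x = torch.where(accept, pred, x)
    return x
```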

At a glance, the output looks much closer to the quality of the GPT model from @karpathy’s “Let’s build GPT” video (a model of similar size). Still not quite there yet, but close.

I also just find it interesting to look at how different the unmasking pattern is compared to before. You can see the “easier” parts forming first. Feels like there’s an idea to be had here.

[Image attached]


Quoted tweet

Nathan Barry (@nathanrs)

Added context to my tiny diffusion model to enable sequential generation of longer outputs! Currently the context is a quarter of the sequence length (seq_len=256, context_len=64).
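A minimal sketch of how that sequential generation could work, assuming a decode function that only fills masked positions (the names, shapes, and block count here are illustrative, not the actual code):

```python
import torch

def generate_long(model, decode_fn, prompt, seq_len=256, context_len=64,
                  mask_id=0, n_blocks=4):
    # prompt: (1, context_len) seed tokens kept visible in the first block
    out = prompt
    for _ in range(n_blocks):
        # the last context_len tokens stay visible; the rest of the window is masked
        ctx = out[:, -context_len:]
        block = torch.full((1, seq_len - context_len), mask_id,
                           dtype=torch.long, device=out.device)
        window = torch.cat([ctx, block], dim=1)     # (1, seq_len)
        window = decode_fn(model, window, mask_id)  # fill in the masked tail
        out = torch.cat([out, window[:, context_len:]], dim=1)
    return out
```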

I have a theory that the lower the semantic value per token, the worse the “curse of parallel decoding” gets. With parallel decoding, we independently predict multiple tokens in one step.

With the sentence “My poker hand was a ___ ___”, two valid predictions are “two pair” and “straight flush”. Because each token is predicted independently, though, we can end up with a nonsensical output like “two flush”.
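A toy illustration of that failure mode, with made-up marginals for the two blanks:

```python
import torch

# Hypothetical marginals for the two blanks in "My poker hand was a ___ ___",
# assuming the model splits its mass evenly between the two valid completions.
vocab = ["two", "straight", "pair", "flush"]
p_first  = torch.tensor([0.5, 0.5, 0.0, 0.0])   # "two" or "straight"
p_second = torch.tensor([0.0, 0.0, 0.5, 0.5])   # "pair" or "flush"

# Sampling each position independently can mix the two valid answers:
first  = vocab[torch.multinomial(p_first, 1).item()]
second = vocab[torch.multinomial(p_second, 1).item()]
print(first, second)   # can print "two flush" or "straight pair"
```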

This seems to be exacerbated by low semantic value per token, as you now need more tokens to express the same concept. Instead of needing to independently predict two tokens, we might need to predict ten (which is of course much harder).

The model’s output is currently noticeably worse than nanogpt’s (a model of similar size), and I believe this is a major reason why. I’ll try adding confidence-aware parallel decoding (from NVIDIA’s Fast-dLLM paper) and other tricks and see how much they improve generation quality.


[Video attached]

Tags

GenAI AI-ML Dev-Tools