1 article on transformer.
I've been reading ML papers for 10 years. Most don't matter. These architectural choices did. RoPE, GQA, SwiGLU — each one solved a real scaling problem. Here's what practitioners need to know when a new model claims 'better architecture.'