<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Vilhelm Toivonen — Writing</title>
    <link>https://vtoivonen.com/blog</link>
    <atom:link href="https://vtoivonen.com/feed.xml" rel="self" type="application/rss+xml" />
    <description>Doctoral researcher at the University of Helsinki. Distributed AI, edge deployment, knowledge transfer, and AI safety.</description>
    <language>en-us</language>
    <managingEditor>noreply@vtoivonen.com (Vilhelm Toivonen)</managingEditor>
    <lastBuildDate>Mon, 04 May 2026 18:57:45 GMT</lastBuildDate>
    <image>
      <url>https://vtoivonen.com/og/default.png</url>
      <title>Vilhelm Toivonen — Writing</title>
      <link>https://vtoivonen.com/blog</link>
    </image>
    <item>
      <title>Per-user memory in language models: what we tried</title>
      <link>https://vtoivonen.com/blog/per-user-memory-composition</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/per-user-memory-composition</guid>
      <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.]]></description>
      <enclosure url="https://vtoivonen.com/og/per-user-memory-composition.png" type="image/png" length="0" />
      <category>language-models/interpretability</category>
      <category>language-models/inference/kv-cache</category>
      <category>optimization/low-rank</category>
    </item>
    <item>
      <title>Sparse embeddings: a six-week negative result</title>
      <link>https://vtoivonen.com/blog/sparse-embeddings-parameter-golf</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/sparse-embeddings-parameter-golf</guid>
      <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.]]></description>
      <enclosure url="https://vtoivonen.com/og/sparse-embeddings-parameter-golf.png" type="image/png" length="0" />
      <category>language-models/inference/quantization</category>
      <category>language-models/training/pretraining</category>
      <category>ml-systems/storage</category>
    </item>
    <item>
      <title>Quantization is a hardware story (and an entropy puzzle)</title>
      <link>https://vtoivonen.com/blog/quantization-hardware-pareto-entropy</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/quantization-hardware-pareto-entropy</guid>
      <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.]]></description>
      <enclosure url="https://vtoivonen.com/og/quantization-hardware-pareto-entropy.png" type="image/png" length="0" />
      <category>language-models/inference/quantization</category>
      <category>ml-systems/hardware</category>
      <category>math/information-theory</category>
    </item>
    <item>
      <title>State-space models, the edge, and the quiet convergence</title>
      <link>https://vtoivonen.com/blog/state-space-models-edge-and-convergence</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/state-space-models-edge-and-convergence</guid>
      <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.]]></description>
      <enclosure url="https://vtoivonen.com/og/state-space-models-edge-and-convergence.png" type="image/png" length="0" />
      <category>language-models/architecture/sequence-modeling</category>
      <category>language-models/architecture/attention</category>
      <category>language-models/inference/kv-cache</category>
      <category>ml-systems/hardware</category>
    </item>
    <item>
      <title>Attention variants beyond softmax: a 2026 map</title>
      <link>https://vtoivonen.com/blog/attention-variants-beyond-softmax</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/attention-variants-beyond-softmax</guid>
      <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[A short tour of what open models actually ship in place of full softmax attention in 2025-2026: MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.]]></description>
      <enclosure url="https://vtoivonen.com/og/attention-variants-beyond-softmax.png" type="image/png" length="0" />
      <category>language-models/architecture/attention</category>
      <category>language-models/inference/kv-cache</category>
      <category>language-models/architecture/sequence-modeling</category>
      <category>ml-systems/hardware</category>
    </item>
    <item>
      <title>The modern transformer block (everything except attention)</title>
      <link>https://vtoivonen.com/blog/modern-transformer-block-without-attention</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/modern-transformer-block-without-attention</guid>
      <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[Open the code for Llama, Qwen, DeepSeek, or Gemma and you'll find nearly identical design decisions. A look at what has changed since the 2017 paper in normalisation, gating, tokenizer, embeddings, and position, and why each swap stuck.]]></description>
      <enclosure url="https://vtoivonen.com/og/modern-transformer-block-without-attention.png" type="image/png" length="0" />
      <category>language-models/architecture/normalization</category>
      <category>language-models/architecture/feed-forward</category>
      <category>language-models/architecture/sequence-modeling</category>
      <category>language-models/architecture/positional</category>
    </item>
    <item>
      <title>Hi!</title>
      <link>https://vtoivonen.com/blog/hello-world</link>
      <guid isPermaLink="true">https://vtoivonen.com/blog/hello-world</guid>
      <pubDate>Fri, 20 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Vilhelm Toivonen</dc:creator>
      <description><![CDATA[A short note on who's writing this blog and why.]]></description>
      <enclosure url="https://vtoivonen.com/og/hello-world.png" type="image/png" length="0" />
      <category>lab-notes</category>
    </item>
  </channel>
</rss>
