{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Vilhelm Toivonen — Writing",
  "home_page_url": "https://vtoivonen.com/blog",
  "feed_url": "https://vtoivonen.com/feed.json",
  "description": "Doctoral Researcher at University of Helsinki. Distributed AI, edge deployment, knowledge transfer, and AI safety.",
  "icon": "https://vtoivonen.com/og/default.png",
  "favicon": "https://vtoivonen.com/favicon.svg",
  "language": "en",
  "authors": [
    {
      "name": "Vilhelm Toivonen",
      "url": "https://vtoivonen.com"
    }
  ],
  "items": [
    {
      "id": "https://vtoivonen.com/blog/per-user-memory-composition",
      "url": "https://vtoivonen.com/blog/per-user-memory-composition",
      "title": "Per-user memory in language models: what we tried",
      "summary": "Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.",
      "content_html": "<p>Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.</p>",
      "date_published": "2026-05-01T00:00:00.000Z",
      "date_modified": "2026-05-01T00:00:00.000Z",
      "tags": [
        "language-models/interpretability",
        "language-models/inference/kv-cache",
        "optimization/low-rank"
      ],
      "image": "https://vtoivonen.com/og/per-user-memory-composition.png"
    },
    {
      "id": "https://vtoivonen.com/blog/sparse-embeddings-parameter-golf",
      "url": "https://vtoivonen.com/blog/sparse-embeddings-parameter-golf",
      "title": "Sparse embeddings: a six-week negative result",
      "summary": "Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more than that gain. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.",
      "content_html": "<p>Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more than that gain. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.</p>",
      "date_published": "2026-04-24T00:00:00.000Z",
      "date_modified": "2026-04-24T00:00:00.000Z",
      "tags": [
        "language-models/inference/quantization",
        "language-models/training/pretraining",
        "ml-systems/storage"
      ],
      "image": "https://vtoivonen.com/og/sparse-embeddings-parameter-golf.png"
    },
    {
      "id": "https://vtoivonen.com/blog/quantization-hardware-pareto-entropy",
      "url": "https://vtoivonen.com/blog/quantization-hardware-pareto-entropy",
      "title": "Quantization is a hardware story (and an entropy puzzle)",
      "summary": "I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.",
      "content_html": "<p>I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.</p>",
      "date_published": "2026-04-17T00:00:00.000Z",
      "date_modified": "2026-04-17T00:00:00.000Z",
      "tags": [
        "language-models/inference/quantization",
        "ml-systems/hardware",
        "math/information-theory"
      ],
      "image": "https://vtoivonen.com/og/quantization-hardware-pareto-entropy.png"
    },
    {
      "id": "https://vtoivonen.com/blog/state-space-models-edge-and-convergence",
      "url": "https://vtoivonen.com/blog/state-space-models-edge-and-convergence",
      "title": "State-space models, the edge, and the quiet convergence",
      "summary": "Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.",
      "content_html": "<p>Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.</p>",
      "date_published": "2026-04-10T00:00:00.000Z",
      "date_modified": "2026-04-10T00:00:00.000Z",
      "tags": [
        "language-models/architecture/sequence-modeling",
        "language-models/architecture/attention",
        "language-models/inference/kv-cache",
        "ml-systems/hardware"
      ],
      "image": "https://vtoivonen.com/og/state-space-models-edge-and-convergence.png"
    },
    {
      "id": "https://vtoivonen.com/blog/attention-variants-beyond-softmax",
      "url": "https://vtoivonen.com/blog/attention-variants-beyond-softmax",
      "title": "Attention variants beyond softmax: a 2026 map",
      "summary": "A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.",
      "content_html": "<p>A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.</p>",
      "date_published": "2026-04-03T00:00:00.000Z",
      "date_modified": "2026-04-03T00:00:00.000Z",
      "tags": [
        "language-models/architecture/attention",
        "language-models/inference/kv-cache",
        "language-models/architecture/sequence-modeling",
        "ml-systems/hardware"
      ],
      "image": "https://vtoivonen.com/og/attention-variants-beyond-softmax.png"
    },
    {
      "id": "https://vtoivonen.com/blog/modern-transformer-block-without-attention",
      "url": "https://vtoivonen.com/blog/modern-transformer-block-without-attention",
      "title": "The modern transformer block (everything except attention)",
      "summary": "Open the code for Llama, Qwen, DeepSeek, or Gemma and you'll find nearly identical design decisions. What has changed since the 2017 paper in normalisation, gating, tokenization, embeddings, and position encoding, and why each swap stuck.",
      "content_html": "<p>Open the code for Llama, Qwen, DeepSeek, or Gemma and you'll find nearly identical design decisions. What has changed since the 2017 paper in normalisation, gating, tokenization, embeddings, and position encoding, and why each swap stuck.</p>",
      "date_published": "2026-03-27T00:00:00.000Z",
      "date_modified": "2026-03-27T00:00:00.000Z",
      "tags": [
        "language-models/architecture/normalization",
        "language-models/architecture/feed-forward",
        "language-models/architecture/sequence-modeling",
        "language-models/architecture/positional"
      ],
      "image": "https://vtoivonen.com/og/modern-transformer-block-without-attention.png"
    },
    {
      "id": "https://vtoivonen.com/blog/hello-world",
      "url": "https://vtoivonen.com/blog/hello-world",
      "title": "Hi!",
      "summary": "A short note on who's writing this blog and why.",
      "content_html": "<p>A short note on who's writing this blog and why.</p>",
      "date_published": "2026-03-20T00:00:00.000Z",
      "date_modified": "2026-03-20T00:00:00.000Z",
      "tags": [
        "lab-notes"
      ],
      "image": "https://vtoivonen.com/og/hello-world.png"
    }
  ]
}