AI systems researcher | Builder

Vilhelm Toivonen

Distributed LLMs — cognitive core, edge deployment, and tool-using agents.

Doctoral Researcher (distributed LLM inference), University of Helsinki
Consulting AI Architect, Bondata
Founder, Teknet (2019) • Co-founder, Padlo.co (2025)

CURRENT FOCUS 2026

  • BridgeLoRA: distributed fine-tuning across edge adapters and cloud backbones (ICDCS 2026)
  • On-policy distillation: removing the teacher early without quality loss
  • Edge inference for small models: predictive MoE routing and additive hierarchical memory

01 RESEARCH

2024 – 2026

I focus on distributed LLM inference and small, tool-using models that can live on devices. The goal: a “cognitive core” that reasons well, uses tools, and keeps most knowledge offloaded to retrieval instead of parameters. I got into ML early—high-school research on data augmentation for speech recognition—and I still work empirically: publishing benchmarks, code, and measurements on real consumer hardware (iPhone, MacBook, edge servers).

[Figure: Edge / cloud architecture for a cognitive core. A small, tool-using model with adapters runs on the device. A frozen backbone runs in the cloud. The two exchange rollouts and weight updates. Panel 01 Edge (cognitive core): small model + adapters; tools, retrieval; on-device rollouts. Panel 02 Cloud (backbone): frozen weights; off-policy updates; inference at scale.]
Fig. 1 — Edge / cloud collaboration

Recent Papers

  • BridgeLoRA: Privacy-preserving Collaborative Skip-Layer Connectors for Efficient Transformer Fine-tuning at the Edge (accepted at ICDCS 2026)
  • Measuring the True Cost of On-Device Agents (4 devices, 4 models, 300 tasks; MobiHoc 2026 submission)
  • Scaffold-and-Release: When Can We Remove the Teacher from RLVR Training? (COLM 2026 submission)
  • LLM Inference on the Edge (survey; first author, 180 references, in review since April 2026)

Theses

  • Determining User Preference Profiles from Email And User Engagement Data (M.Sc., 2024)
  • Lossless Compression of Deep Neural Networks (B.Sc., 2024)

Current Agenda

BridgeLoRA — Journal Extension

Question: Which transformer layers benefit most from per-task adapters, and where can frozen backbones still serve?
Method: Extending the ICDCS 2026 result with mechanistic interpretability (which layers, which adapters, which datasets) so adapter placement is principled, not heuristic
Output: Journal paper with concrete parameter-efficiency recipes; reference pipelines for layer-targeted adaptation
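To make "layer-targeted adaptation" concrete, here is a minimal sketch of the generic idea: a frozen stack of linear layers where only selected layers carry a trainable low-rank (LoRA-style) update. This is not the BridgeLoRA connector itself, whose design is not described on this page. The names `LoRALinear`, `build_stack`, and `trainable_params`, the choice of adapted layers, and all sizes are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = Wx + (alpha / r) * B(Ax)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen backbone weight, never updated
        # A starts small and random, B starts at zero, so the adapter is a no-op at init.
        self.a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]


def build_stack(weights, adapted_layers, rank=2, alpha=4.0, seed=0):
    """Wrap only the layers listed in `adapted_layers` (a hypothetical selection)."""
    rng = random.Random(seed)
    stack = []
    for i, w in enumerate(weights):
        if i in adapted_layers:
            stack.append(LoRALinear(w, rank, alpha, rng))
        else:
            stack.append(lambda x, w=w: matvec(w, x))  # plain frozen layer
    return stack


def trainable_params(stack):
    """Count adapter parameters; frozen weights are excluded."""
    return sum(
        len(layer.a) * len(layer.a[0]) + len(layer.b) * len(layer.b[0])
        for layer in stack
        if isinstance(layer, LoRALinear)
    )


def run(stack, x):
    for layer in stack:
        x = layer(x)
    return x


# Toy backbone: 4 layers of 8x8 weights, adapters on layers 1 and 3 only.
_rng = random.Random(1)
WEIGHTS = [[[_rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(8)] for _ in range(4)]
STACK = build_stack(WEIGHTS, adapted_layers={1, 3})
FROZEN = build_stack(WEIGHTS, adapted_layers=set())
```

In this toy setup the two adapters hold 64 trainable values against 256 frozen ones, and because B starts at zero the adapted stack initially reproduces the frozen stack's output. Which layers deserve an adapter is exactly the open question above; the set `{1, 3}` here is arbitrary.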

Predictive MoE Routing

Question: Can a lightweight predictor running alongside a Mixture-of-Experts model speed up its inference on edge devices?
Method: Anticipate which experts the model will dispatch to, prefetch the relevant weights, and pre-stage the routing path so memory bandwidth on consumer hardware stops being the bottleneck
Output: Latency improvements that make MoE feasible for on-device deployment
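The prefetching idea can be illustrated with a toy simulation. This is a sketch under loud assumptions, not the predictor from this research: fast memory is modelled as a small LRU cache of expert weights, the predictor is a first-order frequency table (which expert tends to follow which), and the routing trace is synthetic. `ExpertCache`, `TransitionPredictor`, and `hit_rate` are hypothetical names.

```python
from collections import Counter, OrderedDict, defaultdict


class ExpertCache:
    """LRU cache standing in for the fast memory that holds a few experts' weights."""

    def __init__(self, slots):
        self.slots = slots
        self.resident = OrderedDict()  # expert id -> placeholder for its weights

    def load(self, expert):
        """Bring an expert into fast memory, evicting the least recently used one."""
        if expert in self.resident:
            self.resident.move_to_end(expert)
            return
        if len(self.resident) >= self.slots:
            self.resident.popitem(last=False)
        self.resident[expert] = True

    def access(self, expert):
        """Return True on a hit; on a miss, load the expert (a stall in a real system)."""
        hit = expert in self.resident
        self.load(expert)
        return hit


class TransitionPredictor:
    """Hypothetical lightweight predictor: counts observed expert-to-expert transitions."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1

    def predict(self, prev, k=1):
        return [expert for expert, _ in self.counts[prev].most_common(k)]


def hit_rate(trace, slots, prefetch):
    """Fraction of expert accesses served from fast memory."""
    cache, predictor = ExpertCache(slots), TransitionPredictor()
    hits, prev = 0, None
    for expert in trace:
        if prefetch and prev is not None:
            for guess in predictor.predict(prev):
                cache.load(guess)  # staged before the router's choice is known
        hits += cache.access(expert)
        if prev is not None:
            predictor.update(prev, expert)
        prev = expert
    return hits / len(trace)


# Synthetic trace: six experts visited cyclically, with room for only two in fast memory.
TRACE = [0, 1, 2, 3, 4, 5] * 50

if __name__ == "__main__":
    print("no prefetch:", hit_rate(TRACE, slots=2, prefetch=False))
    print("prefetch:   ", hit_rate(TRACE, slots=2, prefetch=True))
```

On this deliberately predictable trace, plain LRU misses every access while the prefetching variant hits almost every access once the transition table is warm. Real routing is far less regular, and the sketch ignores the time a prefetch takes, so it shows the mechanism rather than any expected speedup.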

Hierarchical Memory Bank

Question: Can a small language model gain reliable, composable memory by injecting trainable deltas into its residual stream?
Method: Different fact types target different transformer layers; memories compose additively, so banks can stack on top of one another; tested on hierarchically organized data
Output: A modular memory architecture that scales by addition rather than retraining the base model
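A small sketch can show what "compose additively" means mechanically. It assumes a toy residual stack in place of a transformer and represents each memory bank as a mapping from a target layer index to a fixed delta vector; training the deltas is omitted. The functions `layer` and `forward` and the example banks are hypothetical, not the architecture under study.

```python
import math


def layer(h, scale):
    """Toy stand-in for a transformer block: a residual update h + f(h)."""
    return [x + math.tanh(scale * x) for x in h]


def forward(h, n_layers, banks=()):
    """Run the toy stack, adding each bank's delta to the residual stream
    right after the layer that bank targets."""
    for i in range(n_layers):
        h = layer(h, scale=0.1 * (i + 1))
        for bank in banks:
            delta = bank.get(i)
            if delta is not None:
                h = [x + d for x, d in zip(h, delta)]
    return h


# Two hypothetical banks for different fact types, injected at different layers.
BANK_A = {0: [0.10, 0.00, -0.20]}
BANK_B = {2: [0.00, 0.30, 0.00]}
H0 = [0.5, -1.0, 2.0]

if __name__ == "__main__":
    print("base:   ", forward(H0, 4))
    print("A only: ", forward(H0, 4, [BANK_A]))
    print("A and B:", forward(H0, 4, [BANK_A, BANK_B]))
```

Because every bank only adds to the residual stream, stacking banks never modifies the base model, and the order in which banks are listed does not change the result. The downstream effect of two banks is not guaranteed to be the sum of their individual effects, since later layers are nonlinear; the sketch only demonstrates additive injection.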

02 PROJECTS

2019 – 2026
2025–2026

Vibemetrics → Bondata acquisition

CTO → Head of AI → Consulting AI Architect

Led the platform through acquisition (May 2025). Moved from CTO to Head of AI, shipping RAG-based survey agents and recommendations to production. Transitioned to Consulting AI Architect in 2026 to focus on PhD research while staying engaged with the AI roadmap.

Acquisition closed; AI systems shipped; ongoing advisory role

2025

Padlo

Founder

Founded padlo.co, a live scoreboard and coaching app for padel. Sole coder across mobile, backend, and analytics for player/coach insights.

Launched March 2025 to live tournaments

2026

BridgeLoRA: Skip-Layer Connectors at the Edge

Lead Researcher

Privacy-preserving collaborative fine-tuning: adapters target specific transformer layers and stay on-device while frozen backbones run in the cloud. Mechanistic interpretability drives layer selection — knowing which layers, which adapters, and which datasets to bind.

Accepted at ICDCS 2026; journal extension underway

2025–2026

Measuring the True Cost of On-Device Agents

Lead Researcher

Systematic evaluation of LLM agents on consumer hardware (iPhone, MacBook, edge servers) across 4 devices, 4 models, and 300 tasks.

MobiHoc 2026 submission with public measurements

2025–2026

LLM Inference on the Edge — Survey

First Author

180-reference survey covering serving stacks, hardware, and emerging methods for running language models on-device and at the edge.

In review since April 2026

2025

Stanford CS336 Pretraining Competition

1st Place

Designed and trained a language model achieving the lowest perplexity on the OpenWebText dataset for the CS336 pretraining leaderboard.

1st place finish

2019–

Teknet

Founder, sole operator → Co-owner with brother (2025–)

Continued the company from my grandfather’s legacy. Sole worker for the first ~5 years: sales, manufacturing, packaging, marketing, customer service. Expanded in early 2025 by bringing my brother in as co-owner.

Profitable services business across two generations

03 BACKGROUND

2018 – present

Education

Current

University of Helsinki

Doctoral Researcher

Department of Computer Science

Distributed LLM inference, cognitive core, edge/cloud RL

2023–2024

Aalto University

M.Sc. Computer Science

Machine Learning, Data Science and Artificial Intelligence

LLMs, systems, applied ML

2021–2023

Aalto University

B.Sc. Mathematics and Operations Research

Mathematics, statistical learning, optimization

Completed both B.Sc. and M.Sc. in roughly three years while working in industry roles.

Applied / Embodied Work

Built a go-kart from scratch (moped engine) — practical systems intuition for how parts interact under real constraints.

Ran a small construction company for three summers with a coworker / shareholder — renovations, painting, and small builds; learned end-to-end delivery and hands-on project management of a two-person business.

With the same coworker, sold two products on Amazon US — a deliberate exercise in learning sales, marketing, branding, and end-to-end product building from another angle.

Built an outdoor sauna from scratch — frame, walls, stove, the lot. Same lesson the go-kart taught about real-world constraints, at a different scale.

Competitive Sports

Cross-country Skiing

Level: Competitive (regional)

Club: Pirkkalan Hiihtäjät

Achievements: Multiple regional podium finishes

Orienteering

Level: Club

Club: Kangasala SK

Achievements: Active participant in national competitions

Not the highest national or international level — but the working habits competitive sports demand (tight schedules, knowing limits, pushing through under pressure) translate directly to research and to high-velocity teams.

04 FUTURE

2026 – 2027

Final research push

Three threads to close the PhD: a journal extension of BridgeLoRA, an edge-mesh version of on-policy distillation, and a systems paper unifying the work.

Goals

  • BridgeLoRA → journal extension: which layers, which adapters, which datasets, with mechanistic interpretability driving parameter efficiency
  • Edge-mesh on-policy distillation: students and teachers split across devices, clusters, and even model families
  • Systems paper unifying BridgeLoRA, the measurement work, and the scaffold-and-release framing (COLM submission); this forms the spine of the PhD thesis

Timeline

  • 2026: BridgeLoRA accepted at ICDCS; measurement paper submitted to MobiHoc; on-policy distillation findings being written up
  • Late 2026 – early 2027: edge-mesh distillation manuscript, BridgeLoRA journal extension, systems paper draft
  • Early 2027: PhD defense and graduation

Long-term Vision

I’m not aiming to spend the next decade on fundamental research alone. The role I want combines the work I’ve already shipped (agentic systems, evaluations, production code) with the cognitive-core research I’m doing now, in a small high-agency team where the system actually reaches users. That research only matters if it’s built from the ground up: designed, distilled, trained from scratch, and deployed at scale, which takes real compute and a real team. Three years ago I set a ten-year goal of becoming one of the top hundred AI researchers in the world; seven years remain. I don’t know if I’ll get there, but I want to spend those years in a role where both my wins and my failures show up in something millions of people use every day.

05 WRITING

Language models • Optimization

Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.

Language models • ML systems

Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.

Language models • ML systems • Math

I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.
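The bandwidth point can be made with back-of-envelope arithmetic. In single-stream decoding of a dense model, each generated token requires reading roughly all the weights once, so memory bandwidth divided by weight bytes gives an upper bound on tokens per second. The sketch below encodes that bound; the 7-billion-parameter model and the 100 GB/s bandwidth are hypothetical numbers, and the bound ignores KV-cache reads, activations, and compute.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8


def max_decode_tokens_per_s(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on batch-1 decode speed when every weight is read once per token."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical 7e9-parameter dense model on hardware with 100 GB/s memory bandwidth.
    for bits in (16, 8, 4):
        bound = max_decode_tokens_per_s(7e9, bits, 100)
        print(f"{bits:>2}-bit weights: at most {bound:.1f} tokens/s")
```

Under these assumptions, going from 16-bit to 4-bit weights raises the ceiling fourfold, which is why quantization is at least as much about bytes moved per token as about bytes stored on disk.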

Language models • ML systems

Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.
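A short calculation shows why the KV cache bites first. For standard attention the cache stores one key and one value vector per layer, per KV head, per token, so it grows linearly with context length. A state-space layer instead keeps a fixed-size state. The sketch below uses hypothetical model shapes, and the SSM formula is a simplified Mamba-style assumption (inner width times state size per layer) that ignores convolution state.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Keys and values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2, batch=1):
    """Fixed recurrent state per layer (simplified assumption); independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_value * batch


if __name__ == "__main__":
    # Hypothetical shapes: 32 layers, 8 KV heads of dimension 128, 16-bit values.
    for seq_len in (2048, 8192, 32768):
        mib = kv_cache_bytes(32, 8, 128, seq_len) / 2**20
        print(f"context {seq_len:>6}: KV cache {mib:>7.0f} MiB")
    ssm_mib = ssm_state_bytes(32, d_inner=8192, d_state=16) / 2**20
    print(f"SSM state at any context length: {ssm_mib:.0f} MiB")
```

With these shapes the cache reaches 1 GiB at an 8192-token context and keeps growing, while the assumed SSM state stays at 8 MiB regardless of context. The exact figures depend entirely on the chosen shapes; the linear-versus-constant scaling is the point.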

Language models • ML systems

A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.

Language models

Open the code for Llama, Qwen, DeepSeek, or Gemma, and you'll find nearly identical design decisions. A look at what has changed since the 2017 Transformer paper in normalisation, gating, tokenizers, embeddings, and position encoding, and why each swap stuck.

06 CONTACT

© 2026 Vilhelm Toivonen.