Serving has a memory problem. Stuffing facts into the prompt is the answer everyone uses because the alternatives didn't work. A field report on three things we tried in my mech-interp PhD work — two clean failures, one architecture that's getting traction, and the layer-injection finding that fell out of all of them.
AI systems researcher | Builder
Vilhelm Toivonen
Distributed LLMs — cognitive core, edge deployment, and tool-using agents.
Doctoral Researcher (distributed LLM inference), University of Helsinki
Consulting AI Architect, Bondata
Founder, Teknet (2019) • Co-founder, Padlo.co (2025)
CURRENT FOCUS 2026
- BridgeLoRA: distributed fine-tuning across edge adapters and cloud backbones (ICDCS 2026)
- On-policy distillation: removing the teacher early without quality loss
- Edge inference for small models: predictive MoE routing and additive hierarchical memory
01 RESEARCH
2024 – 2026
I focus on distributed LLM inference and small, tool-using models that can live on devices. The goal: a “cognitive core” that reasons well, uses tools, and keeps most knowledge offloaded to retrieval instead of parameters. I got into ML early (high-school research on data augmentation for speech recognition), and I still work empirically: publishing benchmarks, code, and measurements on real consumer hardware (iPhone, MacBook, edge servers).
Recent Papers
- BridgeLoRA: Privacy-preserving Collaborative Skip-Layer Connectors for Efficient Transformer Fine-tuning at the Edge (accepted at ICDCS 2026)
- Measuring the True Cost of On-Device Agents (4 devices, 4 models, 300 tasks; MobiHoc 2026 submission)
- Scaffold-and-Release: When Can We Remove the Teacher from RLVR Training? (COLM 2026 submission)
- LLM Inference on the Edge, survey (first author, 180 references, in review since April 2026)
Theses
- Determining User Preference Profiles from Email And User Engagement Data (M.Sc., 2024)
- Lossless Compression of Deep Neural Networks (B.Sc., 2024)
Current Agenda
BridgeLoRA — Journal Extension
Predictive MoE Routing
Hierarchical Memory Bank
02 PROJECTS
2019 – 2026
Vibemetrics → Bondata acquisition
CTO → Head of AI → Consulting AI Architect
Led the platform through acquisition (May 2025). Moved from CTO to Head of AI, shipping RAG-based survey agents and recommendations to production. Transitioned to Consulting AI Architect in 2026 to focus on PhD research while staying engaged with the AI roadmap.
Acquisition closed; AI systems shipped; ongoing advisory role
Padlo
Founder
Founded padlo.co, a padel live scoreboard + coaching app. Sole coder across mobile, backend, and analytics for player/coach insights.
Launched March 2025 to live tournaments
BridgeLoRA: Skip-Layer Connectors at the Edge
Lead Researcher
Privacy-preserving collaborative fine-tuning: adapters target specific transformer layers and stay on-device while frozen backbones run in the cloud. Mechanistic interpretability drives layer selection — knowing which layers, which adapters, and which datasets to bind.
Accepted at ICDCS 2026; journal extension underway
Measuring the True Cost of On-Device Agents
Lead Researcher
Systematic evaluation of LLM agents on consumer hardware (iPhone, MacBook, edge servers) across 4 devices, 4 models, and 300 tasks.
MobiHoc 2026 submission with public measurements
LLM Inference on the Edge — Survey
First Author
180-reference survey covering serving stacks, hardware, and emerging methods for running language models on-device and at the edge.
In review since April 2026
Stanford CS336 Pretraining Competition
1st Place
Designed and trained a language model achieving the lowest perplexity on the OpenWebText dataset for the CS336 pretraining leaderboard.
1st place finish
Teknet
Founder, sole operator → Co-owner with brother (2025–)
Continued the company from my grandfather’s legacy. Sole worker for the first ~5 years, covering sales, manufacturing, packaging, marketing, and customer service. Expanded in early 2025 by bringing my brother on as co-owner.
Profitable services business across two generations
03 BACKGROUND
2018 – present
Education
University of Helsinki
Doctoral Researcher
Department of Computer Science
Distributed LLM inference, cognitive core, edge/cloud RL
Aalto University
M.Sc. Computer Science
Machine Learning, Data Science and Artificial Intelligence
LLMs, systems, applied ML
Aalto University
B.Sc. Mathematics and Operations Research
Mathematics, statistical learning, optimization
Completed both B.Sc. and M.Sc. in roughly three years while working in industry roles.
Applied / Embodied Work
Built a go-kart from scratch (moped engine) — practical systems intuition for how parts interact under real constraints.
Ran a small construction company for three summers with a coworker / shareholder — renovations, painting, and small builds; learned end-to-end delivery and hands-on project management of a two-person business.
With the same coworker, sold two products on Amazon US — a deliberate exercise in learning sales, marketing, branding, and end-to-end product building from another angle.
Built an outdoor sauna from scratch — frame, walls, stove, the lot. Same lesson the go-kart taught about real-world constraints, at a different scale.
Competitive Sports
Cross-country Skiing
Level: Competitive (regional)
Club: Pirkkalan Hiihtäjät
Achievements: Multiple regional podium finishes
Orienteering
Level: Club
Club: Kangasala SK
Achievements: Active participant in national competitions
Not the highest national or international level — but the working habits competitive sports demand (tight schedules, knowing limits, pushing through under pressure) translate directly to research and to high-velocity teams.
04 FUTURE
2026 – 2027
Final research push
Three threads to close the PhD: a journal extension of BridgeLoRA, an edge-mesh version of on-policy distillation, and a systems paper unifying the work.
Goals
- BridgeLoRA → journal extension: which layers, which adapters, which datasets, with mechanistic interpretability driving parameter efficiency
- Edge-mesh on-policy distillation: students and teachers split across devices, clusters, and even model families
- Systems paper unifying BridgeLoRA, the measurement work, and the scaffold-and-release framing (COLM submission), forming the PhD thesis spine
Timeline
- 2026: BridgeLoRA accepted at ICDCS; measurement paper submitted to MobiHoc; on-policy distillation finding in writing
- Late 2026 – early 2027: edge-mesh distillation manuscript, BridgeLoRA journal extension, systems paper draft
- Early 2027: PhD defense and graduation
Long-term Vision
I’m not aiming to spend the next decade on fundamental research alone. The role I want combines the work I’ve already shipped (agentic systems, evaluations, production code) with the cognitive-core research I’m doing now, in a small high-agency team where the system actually reaches users. That research only matters if it’s built from the ground up: designed, distilled, trained from scratch, and deployed at scale, which takes real compute and a real team. Three years ago I set a ten-year goal of becoming one of the top hundred AI researchers in the world; seven years remain. I don’t know if I’ll get there, but I want to spend those years in a role where both my wins and my failures show up in something millions of people use every day.
05 WRITING
Doubling the vocab buys ~0.01 BPB. Compressing the doubled embedding back under a 16 MB cap costs more. A tour of every embedding-side trick I tried in parameter-golf and why each one lost to a plain dense INT7 baseline.
I came in thinking quantization was about saving disk space. The bigger story is bandwidth, on both ends of the hardware spectrum, and underneath that there's a small information-theoretic puzzle I didn't expect.
Every time I push a model down to a phone, the KV cache is what kills me first. SSMs are the cleanest answer I've found, and around 2025 the field stopped pretending the SSM-vs-attention split was a competition.
A short tour of what open models actually ship in place of full softmax attention in 2025-2026. MLA, linear (lightning), sparse (NSA / DSA), and the hybrid stacks three labs landed on independently. One cache plot, one throughput plot, one Pareto sketch, and the table that names names.
Open the code for Llama, Qwen, DeepSeek, or Gemma, and you'll find nearly identical design decisions. What has changed since the 2017 Transformer paper in normalisation, gating, tokenization, embeddings, and position encoding, and why each swap stuck.
06 CONTACT
© 2026 Vilhelm Toivonen.
