Building Plaudio — Yuyang Wang

Today I shipped a small thing called Plaudio. It is voice-bank-first speaker labelling for the Plaud Note recorder family, written in Python, AGPL-3.0, macOS Apple Silicon only. v0.1.0 went up on PyPI this morning.

This is a personal project, in the most literal sense. I built it for the specific shape of meetings I actually have. The recorder I use is a Plaud Note Pro. The transcription quality I need is high enough that knowing who said what matters as much as knowing what was said. The meetings I attend tend to have four to seven people, and many of them are colleagues whose voices sound similar to a diariser model that has never met any of them.

That last part is where existing tools failed me.

The cluster-merger problem

The standard local diarisation pipeline (pyannote-audio 3.1 in its default mode) clusters speakers into groups and assigns each group a label. When voices are sufficiently distinct, this works well. When four people in a meeting share similar pitch, accent, and meeting register, pyannote often collapses them into a single SPEAKER cluster. The model is confident. The output is confidently wrong. Every utterance gets attributed to one of the four. The other three vanish from the transcript.

A confidently wrong transcript is worse than no transcript. It is a thing you might cite, summarise, or share, and in doing so attribute a decision to someone who did not make it.

The fix

The Plaudio approach is different. Instead of clustering speakers and matching the cluster to a profile, Plaudio matches per window.

You enrol each frequent speaker once, with a clean 30-second clip recorded with their knowledge. Enrolment produces a 256-dimensional embedding via pyannote's speaker-embedding model. Plaudio stores it in a local JSON file at mode 0600, readable and writable only by your own user account.
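
To make the storage step concrete, here is a minimal sketch of writing a voice bank to an owner-only JSON file. It is not Plaudio's actual code: the function names and the name-to-vector layout are assumptions for illustration, and the embedding is a short stand-in list rather than a real 256-dimensional vector.

```python
import json
import os

def save_voicebank(path, profiles):
    """Write profiles (name -> list of floats) as JSON, owner read/write only."""
    # Hypothetical layout: {"Alice Smith": [0.12, -0.03, ...]}.
    # Passing 0o600 to os.open applies owner-only permissions when the file is created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(profiles, fh)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)

def load_voicebank(path):
    """Read the bank back as a plain dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Creating the file with the restrictive mode up front, rather than writing it and restricting it afterwards, avoids a short window in which the embeddings are readable by other users.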

For every new meeting, Plaudio slides a 2-second window across the audio with a 1-second hop. Each window is embedded by pyannote and cosine-matched against every enrolled profile independently. The window picks its own match. Consecutive same-label windows coalesce into runs. Runs overlay onto the transcript segments.
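
The matching loop above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the shipped implementation: the function names are hypothetical, the embedding call to pyannote is left out (each window is assumed to arrive already embedded), and 0.55 is borrowed from the --threshold example further down on the assumption that it is a cosine-similarity cutoff.

```python
import math

def windows(duration, size=2.0, hop=1.0):
    """Yield (start, end) windows of `size` seconds, advancing by `hop` seconds."""
    start = 0.0
    while start + size <= duration:
        yield (start, start + size)
        start += hop

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(embedding, bank, threshold=0.55):
    """Return the best-scoring enrolled name, or "Unknown" if none reaches the threshold."""
    best_name, best_score = "Unknown", threshold
    for name, profile in bank.items():
        score = cosine(embedding, profile)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

def coalesce(labelled):
    """Merge consecutive same-label windows [(start, end, label), ...] into runs."""
    runs = []
    for start, end, label in labelled:
        if runs and runs[-1][2] == label:
            runs[-1] = (runs[-1][0], end, label)  # extend the current run
        else:
            runs.append((start, end, label))
    return runs
```

Because the windows overlap by one second, neighbouring runs with different labels can overlap too. In this sketch that ambiguity is left for the later overlay step to resolve.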

Similar voices no longer get merged into one cluster, because there is no clustering step. Each window is its own decision. Unmatched windows stay as Unknown, which is the correct answer for someone you have not enrolled yet.
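
One plausible way to overlay runs onto transcript segments is to give each segment the label it shares the most time with, falling back to Unknown when nothing overlaps. The sketch below assumes that rule and a hypothetical segment schema (dicts with start, end, and text); Plaudio's own overlay logic and file format may differ.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def overlay(segments, runs):
    """Give each transcript segment the label with the largest total overlap."""
    labelled = []
    for seg in segments:
        totals = {}
        for start, end, label in runs:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > 0:
                totals[label] = totals.get(label, 0.0) + shared
        # No overlapping run at all means nobody enrolled was heard here.
        speaker = max(totals, key=totals.get) if totals else "Unknown"
        labelled.append({**seg, "speaker": speaker})
    return labelled
```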

The implementation is small, a few hundred lines of Python, but the practical effect is large. Once the bank covers your regulars, a meeting where four similar voices would otherwise collapse into one cluster gets labelled window by window, correctly.

What v0.1 ships

The audio-in half of the pipeline:

plaudio transcribe meeting.mp3
plaudio match meeting.mp3 meeting.plaud.json --threshold 0.55
plaudio label meeting.mp3 meeting.plaud.json --enrol      # bootstrap: when the bank is empty
plaudio enrol clean-30s-clip.mp3 --name "Alice Smith" --start 0 --end 30
plaudio db ingest meeting.plaud.json --meeting-id 2026-05-30-team
plaudio db search "deadline" --speaker "Alice Smith"

The most overlooked of these is plaudio label. The first time you run Plaudio on a meeting, your bank is empty, so there are no profiles to match against and every cluster comes back as Unknown. The interactive labeller solves this: it picks the longest clean monologue from each unknown cluster, plays it through your speakers in the background, prompts you for a name, writes the label back to the transcript, and (with --enrol) adds the voice to the bank in a single pass. After ten or so meetings, the bank covers most of your regulars and the interactive step disappears. If you already know who is who, there is a --batch-label "SPEAKER_00=Alice,SPEAKER_01=Bob" form that skips audio playback entirely.
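
Two pieces of the labeller are easy to picture in code: choosing which stretch of audio to play for each unknown speaker, and reading the --batch-label string. The sketch below is a guess at the shape, not the real implementation. Both function names are hypothetical, and "clean" is reduced here to "longest by duration", which is a simplification.

```python
def longest_turn_per_speaker(turns):
    """turns: [(start, end, speaker), ...] -> {speaker: (start, end)} of the longest turn."""
    best = {}
    for start, end, speaker in turns:
        current = best.get(speaker)
        if current is None or (end - start) > (current[1] - current[0]):
            best[speaker] = (start, end)
    return best

def parse_batch_label(spec):
    """Parse "SPEAKER_00=Alice,SPEAKER_01=Bob" into {"SPEAKER_00": "Alice", ...}."""
    mapping = {}
    for pair in spec.split(","):
        key, sep, name = pair.partition("=")
        if not sep or not key.strip() or not name.strip():
            raise ValueError(f"expected KEY=NAME, got {pair!r}")
        mapping[key.strip()] = name.strip()
    return mapping
```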

Plus a voicebank manager (plaudio voicebank list / export / import / migrate / remove) and a doctor command that tells you exactly which dependency is missing.

The stack is mlx-whisper for ASR (it runs on the Apple Silicon GPU through the MLX framework, around six times realtime on M-series chips), pyannote-audio 3.1 for the embedding model, and SQLite with the FTS5 trigram tokeniser for the searchable corpus. The trigram tokeniser is what makes Chinese-English code-switched search work, which matters for the meetings I have: it indexes overlapping three-character sequences, so it does not depend on spaces between words.
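
For readers who have not used the trigram tokeniser, here is a small self-contained sketch of the idea using Python's built-in sqlite3 module. The table name, columns, and helper functions are assumptions made for the example, not Plaudio's schema. It needs an SQLite build with FTS5 and version 3.34 or later, which is when the trigram tokeniser was added.

```python
import sqlite3

def build_index(rows):
    """rows: [(meeting_id, speaker, text), ...] -> in-memory FTS5 index."""
    conn = sqlite3.connect(":memory:")
    # Only `text` is tokenised; the other columns are stored but not indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE utterances USING fts5("
        "meeting_id UNINDEXED, speaker UNINDEXED, text, tokenize='trigram')"
    )
    conn.executemany("INSERT INTO utterances VALUES (?, ?, ?)", rows)
    return conn

def search(conn, term, speaker=None):
    """Substring-style search over `text`, optionally filtered by speaker."""
    # Quote the term so FTS5 treats it as one string, not as query syntax.
    query = '"' + term.replace('"', '""') + '"'
    sql = "SELECT meeting_id, speaker, text FROM utterances WHERE utterances MATCH ?"
    params = [query]
    if speaker is not None:
        sql += " AND speaker = ?"
        params.append(speaker)
    return conn.execute(sql, params).fetchall()
```

With this tokeniser a query has to be at least three characters long to match anything, so a two-character Chinese word will not be found through MATCH on its own.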

Plaud cloud sync arrives in v0.2. The full README and stack rationale are on GitHub.

What I am learning from building this

Three observations.

First, building for yourself first is a discipline, not a fallback. I had a clear test case from the start (my own meetings). I knew what success looked like (the right names appear in the transcript). I was the first user, the first bug reporter, and the first beneficiary. The temptation when shipping open source is to broaden too early. The discipline is to ship the narrow thing well first, and let other people's interest decide whether to broaden.

Second, the audit gate matters more than the code. I have a private wordlist generated from my work vault: colleague surnames, project codes, internal acronyms. Every commit and CI build scans the diff against that wordlist. The pre-commit hook is mandatory; the --no-verify bypass is forbidden by personal rule. I built this audit infrastructure before I wrote any of the actual Plaudio code, because once code lives in a public repo it is published forever. A leak is harder to recover from than a missing feature.
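
The core of such a gate is small. As a minimal sketch, assuming the diff arrives as unified-diff text and the wordlist is a plain list of strings, a check might look like the following. The function name and the sample terms in the usage note are invented, and the real hook may work differently.

```python
import re

def leaked_terms(diff_text, wordlist):
    """Return the sorted wordlist terms that appear on added lines of a unified diff."""
    # Only lines being added can leak something new; skip the "+++ b/file" headers.
    added = [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    hits = set()
    for term in wordlist:
        # Whole-term, case-insensitive match, so "Lee" does not fire on "sleep".
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        if any(pattern.search(line) for line in added):
            hits.add(term)
    return sorted(hits)
```

A pre-commit hook built on this would feed it the staged diff and exit non-zero whenever the returned list is non-empty, for example when a made-up surname like "Okafor" or a made-up project code like "PRJ-42" shows up in an added line.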

Third, vertical slices ship faster than horizontal layers. Each Plaudio subcommand was built end-to-end before the next one started: tests, implementation, CLI, all in one slice. The temptation in a library extraction is to build all the core algorithms first, then all the CLI wrappers. The discipline is to ship a single working command, end to end, and only then move to the next one.

What's next

v0.1 is live. v0.1.1 has a small backlog of known follow-ups: one default flag is wrong, one error path returns a raw exception instead of a friendly message, two deprecation warnings need cleaning up. v0.2 adds Plaud cloud sync.

If you have a Plaud Note Pro on an Apple Silicon Mac and the same cluster-merger problem, install with pip install plaudio. If you do not, this tool is probably not for you, and that is by design.