What Is Speaker Diarization, and Why Meeting Notes Need It

2026-06-13 · 6 min read

You can have a perfect, word-for-word transcript and still have no idea what happened in the meeting — because you can't tell who said what. The capability that fixes this has an unglamorous name, speaker diarization, and it is the difference between a wall of text and a usable record of a conversation.

Definition in plain English

So, what is speaker diarization? In plain English, speaker diarization is the process of splitting an audio recording into segments and labelling each segment by who was speaking. It answers the question every meeting transcript has to answer to be useful: who said what, and when.

It helps to separate two things that often get confused. Transcription turns sound into words. Diarization works out how many distinct voices are present and assigns each stretch of speech to one of them — usually as anonymous labels first ("Speaker 1", "Speaker 2"), which a person or a later step then maps to real names. A related but separate task, speaker identification, goes further and attaches an actual identity to a voice. Diarization is the "who's talking now" layer; identification is the "and their name is" layer on top.
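
To make the two layers concrete, here is a minimal sketch in Python. The `Segment` type, the `apply_names` helper, and the names and sentences in the sample data are all hypothetical, invented for illustration; this is not auraScribe's data model.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str   # anonymous label produced by diarization
    text: str      # words produced by transcription

def apply_names(segments, names):
    """Replace anonymous labels with real names where a mapping exists."""
    return [
        Segment(s.start, s.end, names.get(s.speaker, s.speaker), s.text)
        for s in segments
    ]

# Diarization layer: anonymous labels only.
segments = [
    Segment(0.0, 4.2, "Speaker 1", "Can we ship by Friday?"),
    Segment(4.2, 7.9, "Speaker 2", "Friday works if legal signs off."),
    Segment(7.9, 9.5, "Speaker 1", "I'll chase legal today."),
]

# Identification layer: a person (or a later step) supplies the names.
named = apply_names(segments, {"Speaker 1": "Dana"})
for s in named:
    print(f"[{s.start:04.1f}-{s.end:04.1f}] {s.speaker}: {s.text}")
```

Labels without a mapping are left as they are, so a partly reviewed transcript still reads correctly: here "Speaker 2" stays anonymous until someone names it.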

If you have ever searched for the term and landed on a dictionary-style definition full of jargon, here is the one-sentence version worth keeping: speaker diarization is the automatic partitioning of a recording by who spoke when. Everything else in this article is a consequence of that one idea.

Why transcripts without speakers are useless

Imagine a forty-minute call between four people delivered to you as one unbroken block of text, with no indication of where one person stops and the next begins. You can read every word and still not be able to reconstruct the meeting. Who agreed to the deadline? Who raised the objection? Who committed to send the contract? Without speaker labels, the transcript records that things were said, but not by whom — and in a meeting, "by whom" is most of the meaning.

This is why speaker attribution in a meeting is not a nice-to-have. Action items belong to people. Commitments belong to people. Decisions are made by people, sometimes over the objection of other people, and the record only matters if it preserves that structure. A who-said-what transcript lets you answer "did the client actually agree to that, or was it our own rep talking himself into it?" A flat transcript cannot.

There is a second, quieter cost. Behavioural signal — who drove the conversation, who went quiet, where the turn-taking broke down — only exists once speech is attributed to speakers. You cannot say "the prospect hesitated before answering" if you don't know which voice was the prospect. Diarization is the foundation the more interesting analysis is built on, which is why we treat it as a first-class step rather than a formatting afterthought. You can read more about how we use it on the auraScribe speaker diarization page.

How diarization works

Under the hood, diarization is a sequence of steps rather than a single trick. First, the system performs voice-activity detection: it finds the parts of the audio that contain speech and discards silence and noise. Then it slices the speech into short segments and converts each one into a numerical fingerprint — an embedding that captures the acoustic characteristics of a voice rather than the words being said. Segments with similar fingerprints are clustered together, and each cluster becomes a speaker. Finally, the speaker labels are aligned with the word-level transcript so that every sentence carries an attribution.
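
The clustering and alignment steps can be sketched in a few lines of Python. This is a toy under stated assumptions, not how auraScribe or any particular library does it: voice-activity detection and embedding extraction are assumed to have already run, the two-dimensional embeddings and word timings are made up, and the greedy threshold clustering (with an arbitrary threshold of 0.8) stands in for whatever clustering a real system uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(embeddings, threshold=0.8):
    """Greedy clustering: each segment joins the most similar existing
    cluster if it is similar enough, otherwise it starts a new one."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            n = counts[best]
            # Update the running mean of the cluster's embeddings.
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels

def align(words, segments):
    """Give each word the speaker of the segment that contains its midpoint."""
    attributed = []
    for word, w_start, w_end in words:
        mid = (w_start + w_end) / 2
        speaker = next((spk for s, e, spk in segments if s <= mid < e), None)
        attributed.append((word, speaker))
    return attributed

# Hypothetical input: speech regions (start, end, embedding) left after VAD.
speech = [
    (0.0, 2.0, (0.90, 0.10)),
    (2.5, 4.0, (0.10, 0.95)),
    (4.0, 6.0, (0.85, 0.20)),
]
labels = cluster([emb for _, _, emb in speech])
segments = [(s, e, f"Speaker {lab + 1}") for (s, e, _), lab in zip(speech, labels)]

# Hypothetical word timings from the transcription step.
words = [("hello", 0.2, 0.6), ("hi", 2.7, 3.0), ("right", 4.5, 4.9)]
attributed = align(words, segments)
print(attributed)
```

The first and third regions end up in the same cluster because their embeddings point in nearly the same direction. A word that falls in a gap between segments gets no speaker (`None`), which is one of the edge cases a real system has to decide how to handle.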

The hard parts are exactly where you'd expect. Two people with similar voices can blur into one cluster. One person on a bad connection can fracture into two. Crosstalk — people speaking over each other — is genuinely difficult, because at that moment the audio contains more than one voice at once. Estimating the number of speakers is its own challenge: guess too few and people get merged, guess too many and one person is split apart. Good diarization is mostly about handling these failure modes gracefully rather than pretending they don't happen.
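
The speaker-count problem is easy to reproduce in miniature. The sketch below is an illustration only: the embeddings are invented, the similarity thresholds are arbitrary, and the first-fit grouping in `count_speakers` is a deliberately simple stand-in rather than a production method. It counts how many speakers fall out of the same six segments as the similarity threshold changes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def count_speakers(embeddings, threshold):
    """Count clusters when a segment joins any cluster whose first
    member it resembles with similarity of at least `threshold`."""
    representatives = []
    for emb in embeddings:
        if not any(cosine(emb, rep) >= threshold for rep in representatives):
            representatives.append(emb)
    return len(representatives)

# Invented data: three true speakers, two segments each.
# Speakers A and B have similar voices; speaker C is clearly different.
embeddings = [
    (1.00, 0.00), (0.98, 0.10),   # speaker A
    (0.90, 0.35), (0.88, 0.40),   # speaker B
    (0.00, 1.00), (0.10, 0.97),   # speaker C
]

for threshold in (0.5, 0.95, 0.999):
    print(threshold, count_speakers(embeddings, threshold))
```

With the loose threshold the two similar voices merge and the count is 2; with the middle one all three speakers are recovered; with the strict one every segment becomes its own speaker and the count is 6. No single setting is right for every recording, which is why the estimate has to be revisable.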

Because of all this, diarization is rarely a single model in isolation. At auraScribe it runs as one stage inside a larger sequence of passes, each refining the last, so that attribution can be revisited with more context instead of being locked in on the first guess. If you're curious about that structure, the multi-pass pipeline page walks through how the stages fit together. Terms like embedding, voice-activity detection and clustering are explained in more depth in our glossary.

The human review step

No diarization system is perfect, and the honest approach is to build for that rather than to hide it. Machine diarization gets you most of the way: it tells you there were, say, four distinct voices and divides the speech among them. What it cannot reliably do on its own is know that Speaker 2 is your client and Speaker 3 is your colleague. That mapping from anonymous cluster to real identity is where a quick human review earns its place.

In practice this is a small, fast step. You skim the recording, confirm or correct the speaker boundaries where the machine was unsure, and attach names to the anonymous labels. The system can carry those names forward, so the same voice in a future meeting is recognised rather than re-labelled from scratch. The goal is not to make you re-do the work; it is to let you correct the handful of places where the audio was genuinely ambiguous, and to keep your own judgement on the parts that matter: the attribution of commitments and decisions to the right people.

We are deliberate about one thing here: we would rather surface a speaker we are uncertain about than silently drop them. An extra speaker is easy for you to merge away in a couple of clicks. A missing speaker — someone whose contribution was folded into another person's — is far harder to notice and recover. So the review step is biased towards showing you everything the audio contained, not towards a tidy-looking result that quietly loses people.
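
The asymmetry is visible in code. In the sketch below, the `merge_speakers` helper and the segment data are hypothetical, and the tuple layout (start, end, label) is an assumption made for illustration; removing an extra speaker is a single relabelling pass.

```python
def merge_speakers(segments, extra, keep):
    """Relabel every segment attributed to `extra` as `keep`."""
    return [(start, end, keep if spk == extra else spk)
            for start, end, spk in segments]

# Hypothetical output where one person on a bad line was split in two.
segments = [
    (0.0, 5.0, "Speaker 1"),
    (5.0, 9.0, "Speaker 2"),
    (9.0, 12.0, "Speaker 3"),   # really Speaker 2 again, after the line dropped
    (12.0, 15.0, "Speaker 1"),
]

merged = merge_speakers(segments, extra="Speaker 3", keep="Speaker 2")
speakers = sorted({spk for _, _, spk in merged})
print(speakers)
```

The reverse operation has no one-line equivalent. Once two people share a label, nothing in the labels says which segments belonged to whom, so someone has to go back to the audio segment by segment.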

Accuracy with many speakers

Diarization accuracy is not a single number, and you should be suspicious of anyone who quotes one as if it were. It depends heavily on the recording: a two-person call recorded on good microphones is a comparatively easy case, while a six-person workshop with overlapping speech, a couple of people on speakerphone and background noise is a hard one. More speakers means more chances to confuse two similar voices, and more crosstalk means more moments where the audio simply does not contain a single clean speaker to attribute.
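
One widely used measure, diarization error rate (DER), shows why. It adds up three kinds of mistake (speech that was missed, non-speech labelled as speech, and speech given to the wrong speaker) and divides by the total amount of real speech. The sketch below assumes those durations have already been measured against a hand-labelled reference; the figures are invented for illustration and are not measurements of any system.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total speech.
    All arguments are durations in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech

# Invented durations, for illustration only.
two_person_call = diarization_error_rate(
    missed=6, false_alarm=3, confusion=9, total_speech=600)
busy_workshop = diarization_error_rate(
    missed=120, false_alarm=48, confusion=264, total_speech=2400)

print(f"{two_person_call:.0%} vs {busy_workshop:.0%}")
```

The same function gives 3% for one recording and 18% for the other, and the mix of error types differs as well. A headline figure quoted without the recording conditions behind it tells you very little.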

What we will say plainly is this: accuracy degrades as the room gets busier, and we design for that reality rather than against it. Clean audio helps enormously — a decent microphone and people not talking over each other will do more for your transcript than any amount of clever modelling. Where the audio is ambiguous, we lean towards keeping speakers distinct and flagging uncertainty for the review step, on the principle above that a recoverable extra speaker beats an invisible lost one.

The practical upshot is that diarization should be judged on the recordings you actually make, not on a benchmark number. The best way to know whether who-said-what attribution holds up on your meetings is to run a few real ones through and look. You can start that for free — try auraScribe on your own recordings and see how the speaker labels hold up across your typical meetings, from the easy two-person call to the messy group session.

