The 3-Pass Pipeline: How auraScribe Processes Audio

auraScribe does not process audio in a single pass. Instead, it runs a 3-pass pipeline, with a lightweight intermediate pass (Pass 1.5) and a human review stage, in which each pass is optimised for a different task. This architecture is designed to improve transcription quality, keep AI costs down, and enable the deep behavioral analysis that differentiates auraScribe.

Pass 1 — Transcription

The first pass focuses exclusively on speed and accuracy: transcribe the audio as fast as possible, with speaker diarization (identifying who said what) but no behavioral analysis.

Pass 1 is one of only two passes that consume the audio itself (Pass 2 is the other). If the audio is longer than the AI can process in one request, a continuation loop automatically picks up where the previous request left off, using timestamp anchoring to seek forward. Every word is processed, with no truncation, regardless of length.
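As a rough illustration of the continuation loop, here is a minimal Python sketch. It is not auraScribe's actual implementation: the `Segment` type, the `transcribe_from` callable (standing in for one model request) and the `max_requests` safety cap are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str   # acoustic label only, e.g. "Voice 1"
    text: str


# Hypothetical stand-in for one model request: transcribe from `offset`
# seconds onward and return as many segments as fit in a single response.
TranscribeFn = Callable[[float], List[Segment]]


def transcribe_all(transcribe_from: TranscribeFn, duration: float,
                   max_requests: int = 100) -> List[Segment]:
    """Call the model repeatedly until the whole recording is covered."""
    transcript: List[Segment] = []
    offset = 0.0  # timestamp anchor: where the next request should resume
    for _ in range(max_requests):
        if offset >= duration:
            break
        chunk = transcribe_from(offset)
        # Keep only segments at or after the anchor so nothing is duplicated.
        fresh = [seg for seg in chunk if seg.start >= offset]
        if not fresh:
            break  # no progress, so stop instead of looping forever
        transcript.extend(fresh)
        offset = max(seg.end for seg in fresh)
    return transcript


# Toy "model" for demonstration: it can only return two segments per request.
_RECORDING = [
    Segment(0.0, 4.0, "Voice 1", "Hello."),
    Segment(4.0, 9.0, "Voice 2", "Hi there."),
    Segment(9.0, 12.0, "Voice 1", "Shall we start?"),
]


def fake_model(offset: float) -> List[Segment]:
    remaining = [seg for seg in _RECORDING if seg.start >= offset]
    return remaining[:2]


result = transcribe_all(fake_model, duration=12.0)
```

In this toy run the loop needs two requests: the first returns two segments and moves the anchor to 9.0 seconds, the second returns the final segment. The "no progress" check is a design choice for the sketch, so that trailing silence or an empty response cannot cause an endless loop.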

The output: a complete, timestamped transcript with acoustic speaker labels (Voice 1, Voice 2, etc.).

Pass 1.5 — Speaker Identification

A lightweight text-only pass that reads the transcript and deduces real names, job titles, companies, and roles from conversational context. By separating acoustic diarization (Pass 1) from name attribution (Pass 1.5), the model avoids the common mistake of confusing voice labels with identity.
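One way to picture the separation between voice labels and identity is to keep the acoustic label as the key and attach the deduced identity as optional data. The sketch below assumes this layout; the `Identity` type, its fields and the `display_name` helper are hypothetical names for illustration, not auraScribe's data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Identity:
    """What a text-only pass might deduce; every field may stay unknown."""
    name: Optional[str] = None
    job_title: Optional[str] = None
    company: Optional[str] = None
    role: Optional[str] = None


def display_name(label: str, identities: Dict[str, Identity]) -> str:
    """Show the deduced name when there is one, but never drop the label."""
    identity = identities.get(label)
    if identity is None or identity.name is None:
        return label  # fall back to the acoustic label from Pass 1
    return f"{identity.name} ({label})"


# Illustrative data only.
identities = {
    "Voice 1": Identity(name="Dana", job_title="Account Executive"),
    "Voice 2": Identity(),  # nothing could be deduced from context
}

named = display_name("Voice 1", identities)
unnamed = display_name("Voice 2", identities)
```

Because the label is never overwritten, a wrong name can later be corrected without touching the diarization itself.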

Pass 2 — Speaker Profiling and Behavioral Analysis

This is where auraScribe's key differentiator — Raw Audio Cues — is generated. Using a context cache containing both the audio and the transcript, Pass 2 performs three tasks simultaneously:

  1. Voice verification: Count distinct voices, detect merged or duplicate speakers
  2. Speaker identification: Verify the names, job titles, roles, and companies deduced in Pass 1.5
  3. Raw Audio Cues generation: The exhaustive chronological behavioral log — micro-signals, tone shifts, hesitations, interruptions, laughter, engagement patterns, and interpersonal dynamics

This is the only pass that analyses the audio for behavior; Pass 1 uses it solely for transcription. The cost-conscious architecture means audio tokens are used exactly twice: once for transcription, once for behavioral analysis.
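The "exactly twice" claim can be checked with simple arithmetic. The sketch below counts which stages send audio to the model and compares that with a hypothetical design in which every model pass re-sends the audio. The token figure is made up for illustration, and the sketch ignores any pricing difference for cached tokens.

```python
# (stage name, whether the stage sends audio to the model)
STAGES = [
    ("Pass 1", True),     # transcription
    ("Pass 1.5", False),  # text-only speaker identification
    ("Pass 2", True),     # profiling and behavioral analysis
    ("Review", False),    # human-in-the-loop, no model call
    ("Pass 3", False),    # text-only summary and analysis
]


def audio_tokens_billed(audio_tokens_per_pass: int, stages=STAGES) -> int:
    """Total audio tokens consumed across all stages that use audio."""
    audio_passes = sum(1 for _, uses_audio in stages if uses_audio)
    return audio_tokens_per_pass * audio_passes


ILLUSTRATIVE_AUDIO_TOKENS = 100_000  # made-up size of one recording

pipeline_total = audio_tokens_billed(ILLUSTRATIVE_AUDIO_TOKENS)

# Hypothetical comparison: every model pass (all but Review) re-sends audio.
naive_stages = [(name, name != "Review") for name, _ in STAGES]
naive_total = audio_tokens_billed(ILLUSTRATIVE_AUDIO_TOKENS, naive_stages)
```

With these illustrative numbers the pipeline consumes 200,000 audio tokens, against 400,000 for the design that re-sends audio in every model pass.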

Review Stage — Human-in-the-Loop

Before generating the final analysis, auraScribe pauses for human review. Users can edit the transcript, correct speaker names, swap or merge speakers, and verify uncertain attributions. Human judgement matters most at this step, because speaker identification is where models frequently stumble.
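The merge and swap corrections can be sketched as pure functions over a list of transcript turns. The turn layout (a dict with "speaker" and "text" keys) and the function names are assumptions for this example only.

```python
from typing import Dict, List

Turn = Dict[str, str]  # assumed shape: {"speaker": ..., "text": ...}


def merge_speakers(turns: List[Turn], keep: str, remove: str) -> List[Turn]:
    """Reassign every turn labelled `remove` to `keep` (duplicate voices)."""
    return [
        {**turn, "speaker": keep if turn["speaker"] == remove else turn["speaker"]}
        for turn in turns
    ]


def swap_speakers(turns: List[Turn], a: str, b: str) -> List[Turn]:
    """Exchange two labels everywhere (attributions were the wrong way round)."""
    mapping = {a: b, b: a}
    return [
        {**turn, "speaker": mapping.get(turn["speaker"], turn["speaker"])}
        for turn in turns
    ]


# Illustrative transcript.
turns = [
    {"speaker": "Voice 1", "text": "Welcome."},
    {"speaker": "Voice 2", "text": "Thanks."},
    {"speaker": "Voice 3", "text": "Glad to be here."},
]

merged = merge_speakers(turns, keep="Voice 2", remove="Voice 3")
swapped = swap_speakers(turns, "Voice 1", "Voice 2")
```

Both functions return new lists and leave the input untouched, which makes it straightforward to offer an undo during review.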

Pass 3 — Summary and Analysis

The final pass uses only text — the reviewed transcript plus the Raw Audio Cues — to generate the complete analysis. No audio re-upload, which means this pass is fast and cheap.

The output includes: executive summary, structured summary, behavioral summary (group dynamics), buyer intent signals (when detected), and per-speaker individual remarks with coaching points.
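To make the text-only nature of Pass 3 concrete, here is a minimal sketch of how its input could be assembled from the two text artefacts. The prompt wording, the `SECTIONS` list and the `build_pass3_prompt` function are illustrative assumptions, not auraScribe's actual prompt.

```python
SECTIONS = [
    "Executive summary",
    "Structured summary",
    "Behavioral summary (group dynamics)",
    "Buyer intent signals (only when detected)",
    "Per-speaker individual remarks with coaching points",
]


def build_pass3_prompt(reviewed_transcript: str, raw_audio_cues: str) -> str:
    """Combine the two text inputs into one prompt; no audio is attached."""
    requested = "\n".join(f"- {section}" for section in SECTIONS)
    return (
        "Produce the following sections:\n"
        f"{requested}\n\n"
        "REVIEWED TRANSCRIPT:\n"
        f"{reviewed_transcript}\n\n"
        "RAW AUDIO CUES:\n"
        f"{raw_audio_cues}\n"
    )


# Illustrative inputs.
prompt = build_pass3_prompt(
    reviewed_transcript="[00:00] Dana: Welcome.",
    raw_audio_cues="[00:00] Dana sounds upbeat.",
)
```

Since the function takes only strings, the audio never has to be uploaded again at this stage.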

Why three passes?

Audio tokens are expensive. By structuring the pipeline as transcribe → cache → analyse → synthesise, auraScribe minimises multimodal API costs while maximising insight depth. The alternative — doing everything in one massive prompt — would sacrifice either detail or cost efficiency.

Stop exporting transcripts. Start delivering.

Try auraScribe free for 14 days. You talk — auraScribe takes it from there.
