
Voice AI Infrastructure: Real-Time Voice Agents with Deepgram & ElevenLabs

Architecture for production-ready voice AI systems: streaming ASR with Deepgram Nova-2, TTS with ElevenLabs Turbo v2.5, WebSocket integration, and latency optimization.



Introduction

The 500-millisecond threshold separates natural from artificial voice interaction. In 2026 we have the tools to get below that limit, but only with the right architecture.


The Voice AI Pipeline

┌─────────────────────────────────────────────────────────────┐
│                 VOICE AI STREAMING PIPELINE                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  [Mikrofon] ──WebSocket──→ [Deepgram Nova-2] ──Text──→     │
│                                 ASR                         │
│                                  │                          │
│                                  ▼                          │
│                            [LLM Agent]                      │
│                           (Claude/GPT)                      │
│                                  │                          │
│                                  ▼                          │
│  [Speaker] ←──Audio Stream──← [ElevenLabs] ←──Text──┘      │
│                               Turbo v2.5                    │
│                                                             │
│  Target latency: < 500ms end-to-end                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Deepgram Nova-2: Speech-to-Text

Why Nova-2?

| Metric              | Nova-2      | Whisper    | Industry avg. |
|---------------------|-------------|------------|---------------|
| **Word Error Rate** | 8.4%        | 13.1%      | 12%           |
| **Processing**      | 29.8s/h     | 150s/h     | 120s/h        |
| **Price**           | $0.0043/min | $0.006/min | $0.01/min     |
| **Languages**       | 36          | 99         | varies        |

Streaming Integration

// src/services/deepgram.ts
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';

interface TranscriptionConfig {
  model: 'nova-2' | 'nova-2-meeting' | 'nova-2-phonecall';
  language: string;
  smart_format: boolean;
  interim_results: boolean;
  endpointing: number;
}

export class DeepgramStreamer {
  private client = createClient(process.env.DEEPGRAM_API_KEY!);
  private connection: any = null;

  async startStream(
    config: TranscriptionConfig,
    onTranscript: (text: string, isFinal: boolean) => void
  ) {
    this.connection = this.client.listen.live({
      model: config.model,
      language: config.language,
      smart_format: config.smart_format,
      interim_results: config.interim_results,
      endpointing: config.endpointing, // ms of silence that ends an utterance
      punctuate: true,
      diarize: false
    });

    this.connection.on(LiveTranscriptionEvents.Open, () => {
      console.log('Deepgram connection opened');
    });

    this.connection.on(LiveTranscriptionEvents.Transcript, (data: any) => {
      const transcript = data.channel.alternatives[0];
      if (transcript.transcript) {
        onTranscript(transcript.transcript, data.is_final);
      }
    });

    this.connection.on(LiveTranscriptionEvents.Error, (err: Error) => {
      console.error('Deepgram error:', err);
    });

    return this.connection;
  }

  sendAudio(audioChunk: Buffer) {
    if (this.connection) {
      this.connection.send(audioChunk);
    }
  }

  async close() {
    if (this.connection) {
      await this.connection.finish();
    }
  }
}

Optimized Configuration for German

const germanConfig: TranscriptionConfig = {
  model: 'nova-2',
  language: 'de',
  smart_format: true,
  interim_results: true,  // for live feedback
  endpointing: 300        // 300ms for fast turn-taking
};

ElevenLabs Turbo v2.5: Text-to-Speech

Model Comparison

| Model               | Latency | Quality   | Use case             |
|---------------------|---------|-----------|----------------------|
| **Flash v2.5**      | ~75ms   | Good      | Real-time agents     |
| **Turbo v2.5**      | ~300ms  | Very good | Conversational AI    |
| **Multilingual v2** | ~900ms  | Excellent | Pre-produced content |

Streaming TTS Implementation

// src/services/elevenlabs.ts
import { ElevenLabsClient } from 'elevenlabs';

interface TTSConfig {
  voiceId: string;
  modelId: 'eleven_turbo_v2_5' | 'eleven_flash_v2_5';
  stability: number;
  similarityBoost: number;
  latencyOptimization: 0 | 1 | 2 | 3 | 4;
}

export class ElevenLabsStreamer {
  private client = new ElevenLabsClient({
    apiKey: process.env.ELEVENLABS_API_KEY!
  });

  async *streamSpeech(
    text: string,
    config: TTSConfig
  ): AsyncGenerator<Buffer> {
    const audioStream = await this.client.textToSpeech.convertAsStream(
      config.voiceId,
      {
        text,
        model_id: config.modelId,
        voice_settings: {
          stability: config.stability,
          similarity_boost: config.similarityBoost
        },
        optimize_streaming_latency: config.latencyOptimization
      }
    );

    for await (const chunk of audioStream) {
      yield Buffer.from(chunk);
    }
  }

  // For sentence-by-sentence streaming
  async streamBySentence(
    sentences: string[],
    config: TTSConfig,
    onChunk: (audio: Buffer) => void
  ) {
    for (const sentence of sentences) {
      for await (const chunk of this.streamSpeech(sentence, config)) {
        onChunk(chunk);
      }
    }
  }
}
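
`streamBySentence` above expects text that is already split into sentences. A minimal splitter could look like the following (`splitSentences` is a hypothetical helper for illustration, not part of the ElevenLabs SDK):

```typescript
// Hypothetical helper: split a block of text into sentences so each one
// can be fed to streamBySentence as soon as it is complete.
export function splitSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/) // split after ., ! or ?, keeping the punctuation
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

A real system would also need to handle abbreviations and decimal numbers; for conversational agent output, this simple split is usually good enough.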

Latency Optimization

// Maximum latency optimization
const lowLatencyConfig: TTSConfig = {
  voiceId: 'pNInz6obpgDQGcFmaJgB', // Adam
  modelId: 'eleven_flash_v2_5',    // fastest model
  stability: 0.5,
  similarityBoost: 0.75,
  latencyOptimization: 4           // max optimization
};

// Quality-focused
const qualityConfig: TTSConfig = {
  voiceId: 'pNInz6obpgDQGcFmaJgB',
  modelId: 'eleven_turbo_v2_5',
  stability: 0.7,
  similarityBoost: 0.9,
  latencyOptimization: 0           // no optimization
};

Complete Voice Agent Architecture

// src/voice-agent.ts
import { DeepgramStreamer } from './services/deepgram';
import { ElevenLabsStreamer } from './services/elevenlabs';
import Anthropic from '@anthropic-ai/sdk';

interface VoiceAgentConfig {
  systemPrompt: string;
  voiceId: string;
  language: string;
}

export class VoiceAgent {
  private deepgram = new DeepgramStreamer();
  private elevenlabs = new ElevenLabsStreamer();
  private anthropic = new Anthropic();
  private conversationHistory: Anthropic.MessageParam[] = [];

  constructor(private config: VoiceAgentConfig) {}

  async start(
    audioInput: AsyncIterable<Buffer>,
    onAudioOutput: (chunk: Buffer) => void
  ) {
    // Start the STT stream
    await this.deepgram.startStream(
      {
        model: 'nova-2',
        language: this.config.language,
        smart_format: true,
        interim_results: true,
        endpointing: 500
      },
      async (text, isFinal) => {
        if (isFinal && text.trim()) {
          // The user has finished speaking
          await this.processUserInput(text, onAudioOutput);
        }
      }
    );

    // Forward audio chunks to Deepgram
    for await (const chunk of audioInput) {
      this.deepgram.sendAudio(chunk);
    }
  }

  private async processUserInput(
    userText: string,
    onAudioOutput: (chunk: Buffer) => void
  ) {
    // Update the conversation history
    this.conversationHistory.push({
      role: 'user',
      content: userText
    });

    // Generate the LLM response (streaming)
    const stream = await this.anthropic.messages.stream({
      model: 'claude-3-haiku-20240307',
      max_tokens: 500,
      system: this.config.systemPrompt,
      messages: this.conversationHistory
    });

    let fullResponse = '';
    let sentenceBuffer = '';

    // Sentence-by-sentence TTS
    for await (const event of stream) {
      if (
        event.type === 'content_block_delta' &&
        event.delta.type === 'text_delta'
      ) {
        const text = event.delta.text;
        fullResponse += text;
        sentenceBuffer += text;

        // Check for the end of a sentence
        const sentenceEnd = sentenceBuffer.match(/[.!?]\s/);
        if (sentenceEnd) {
          const sentence = sentenceBuffer.substring(
            0,
            sentenceEnd.index! + 1
          );
          sentenceBuffer = sentenceBuffer.substring(
            sentenceEnd.index! + 2
          );

          // Start TTS for this sentence
          for await (const audioChunk of this.elevenlabs.streamSpeech(
            sentence,
            {
              voiceId: this.config.voiceId,
              modelId: 'eleven_turbo_v2_5',
              stability: 0.5,
              similarityBoost: 0.75,
              latencyOptimization: 2
            }
          )) {
            onAudioOutput(audioChunk);
          }
        }
      }
    }

    // Speak any remaining buffered text
    if (sentenceBuffer.trim()) {
      for await (const chunk of this.elevenlabs.streamSpeech(
        sentenceBuffer,
        {
          voiceId: this.config.voiceId,
          modelId: 'eleven_turbo_v2_5',
          stability: 0.5,
          similarityBoost: 0.75,
          latencyOptimization: 2
        }
      )) {
        onAudioOutput(chunk);
      }
    }

    // Update the conversation history
    this.conversationHistory.push({
      role: 'assistant',
      content: fullResponse
    });
  }

  async stop() {
    await this.deepgram.close();
  }
}

Latency Breakdown

┌─────────────────────────────────────────────────────────────┐
│                    LATENCY BREAKDOWN                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Component              │ Latency    │ Cumulative          │
│  ───────────────────────│────────────│──────────────────── │
│  Audio Capture          │ ~20ms      │ 20ms                │
│  Network (Upload)       │ ~30ms      │ 50ms                │
│  Deepgram ASR           │ ~150ms     │ 200ms               │
│  LLM (First Token)      │ ~100ms     │ 300ms               │
│  ElevenLabs TTS         │ ~75ms      │ 375ms               │
│  Network (Download)     │ ~30ms      │ 405ms               │
│  Audio Playback         │ ~20ms      │ 425ms               │
│                                                             │
│  TOTAL: ~425ms (under the 500ms target)                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
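
A budget like this only holds if you actually measure it in production. A minimal per-stage tracker is enough to get started (this is an illustrative sketch, not a library API):

```typescript
// Minimal latency tracker: record a timestamp for each pipeline stage
// and report the delta between any two recorded stages.
export class LatencyTracker {
  private marks = new Map<string, number>();

  // Record a stage; defaults to the current wall-clock time.
  mark(stage: string, now: number = Date.now()): void {
    this.marks.set(stage, now);
  }

  // Milliseconds between two recorded stages, or null if either is missing.
  between(from: string, to: string): number | null {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    return a !== undefined && b !== undefined ? b - a : null;
  }
}
```

Calling `mark('capture')`, `mark('asr_final')`, and `mark('tts_first_chunk')` at the corresponding points in the agent gives you exactly the rows of the table above per conversation.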

Hybrid Architecture: Edge + Cloud

// For the lowest latency: local VAD + cloud processing
interface HybridConfig {
  localVAD: boolean;        // voice activity detection on-device
  localWakeWord: boolean;   // detect "Hey Agent" locally
  cloudASR: boolean;        // transcription in the cloud
  cloudLLM: boolean;        // LLM in the cloud
  cloudTTS: boolean;        // TTS in the cloud
}

// ~80% of simple commands can be handled locally
const hybridArchitecture: HybridConfig = {
  localVAD: true,           // saves bandwidth & latency
  localWakeWord: true,      // instant response
  cloudASR: true,           // Deepgram has better quality
  cloudLLM: true,           // no local GPU resources needed
  cloudTTS: true            // ElevenLabs quality
};
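
Local VAD in its simplest form is just an energy threshold on raw PCM frames. A production system would use a trained model (e.g. Silero VAD or WebRTC VAD), so treat this as a sketch of the pre-filtering idea only:

```typescript
// Illustrative energy-based VAD for 16-bit PCM frames. Frames below the
// RMS threshold are treated as silence and never leave the device, which
// saves upload bandwidth before the stream reaches Deepgram.
export function isSpeech(frame: Int16Array, rmsThreshold = 500): boolean {
  if (frame.length === 0) return false;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  const rms = Math.sqrt(sumSquares / frame.length);
  return rms > rmsThreshold;
}
```

The threshold of 500 is an assumed starting point; it needs calibration against the actual microphone and noise floor.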

Production Checklist

  • [ ] WebSocket keep-alive implemented
  • [ ] Audio codec optimized (Opus/G.711)
  • [ ] Graceful degradation on network failures
  • [ ] Retry logic for API failures
  • [ ] Audio buffering for jitter compensation
  • [ ] Monitoring for latency metrics
  • [ ] Fallback voices configured
  • [ ] Rate limits respected
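
The retry-logic item on the checklist can be covered by a generic exponential-backoff wrapper around any provider call. This is an illustrative sketch; attempt counts and delays should be tuned per API and combined with each provider's rate-limit headers:

```typescript
// Retry an async operation with exponential backoff.
// Delays double each attempt: baseDelayMs, 2x, 4x, ...
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Wrap the Deepgram and ElevenLabs connection setup in `withRetry(...)` so transient failures do not drop an active conversation.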

Cost Calculation

| Component        | Price            | 1000 conversations (3 min)  |
|------------------|------------------|-----------------------------|
| Deepgram Nova-2  | $0.0043/min      | $12.90                      |
| ElevenLabs Turbo | $0.30/1000 chars | ~$45.00                     |
| Claude Haiku     | $0.25/1M tokens  | ~$7.50                      |
| **Total**        |                  | **~$65/1000 conversations** |
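
Only the Deepgram line is a pure per-minute product; the ElevenLabs and Claude figures depend on how many characters and tokens a conversation actually produces. A small helper makes the per-minute part reproducible (the interface and function are illustrative, not from any SDK):

```typescript
// Illustrative cost helper. Deepgram bills per audio minute, so its batch
// cost is a straight multiplication. Character- and token-based costs
// (ElevenLabs, Claude) need per-conversation usage assumptions and are
// deliberately left out here.
interface DeepgramCostInput {
  conversations: number;
  minutesPerConversation: number;
  pricePerMinuteUsd: number;
}

export function deepgramCost(input: DeepgramCostInput): number {
  return (
    input.conversations *
    input.minutesPerConversation *
    input.pricePerMinuteUsd
  );
}
```

With 1000 conversations of 3 minutes at $0.0043/min this reproduces the $12.90 in the table above.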

Conclusion

Production-grade voice AI requires:

  1. Streaming first: no batch processing
  2. Sentence-by-sentence TTS: start speaking as early as possible
  3. Optimized models: Flash/Turbo instead of the high-quality tiers
  4. Edge processing: VAD and wake-word detection on-device

The 500ms threshold is within reach, given the right architecture.


