Voice AI Infrastructure: Real-Time Voice Agents with Deepgram & ElevenLabs

Meta-Description: Architecture for production-grade voice AI systems. Streaming ASR with Deepgram Nova-2, TTS with ElevenLabs Turbo v2.5, WebSocket integration, and latency optimization.
Keywords: Voice AI, Deepgram, ElevenLabs, Speech-to-Text, Text-to-Speech, Real-Time Voice, ASR, TTS, Voice Agent Architecture
Introduction
The 500-millisecond threshold separates natural from artificial voice interaction. In 2026 we have the tools to get below that threshold, but only with the right architecture.
The Voice AI Pipeline
┌──────────────────────────────────────────────────────────────┐
│                  VOICE AI STREAMING PIPELINE                  │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  [Microphone] ──WebSocket──→ [Deepgram Nova-2] ──Text──┐      │
│                                    ASR                 │      │
│                                                        ▼      │
│                                                  [LLM Agent]  │
│                                                 (Claude/GPT)  │
│                                                        │      │
│                                                        ▼      │
│  [Speaker] ←──Audio Stream──← [ElevenLabs] ←──Text─────┘      │
│                                Turbo v2.5                     │
│                                                               │
│  Target latency: < 500 ms end-to-end                          │
│                                                               │
└──────────────────────────────────────────────────────────────┘
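The pipeline's first hop is raw microphone audio pushed over a WebSocket. As a minimal sketch of that client side, assuming a browser environment: `WS_URL` is a placeholder for your own relay server, not part of any SDK.

```typescript
// Hypothetical client-side capture: stream raw mic audio as 16 kHz linear16
// PCM to our backend, which relays it to Deepgram. WS_URL is a placeholder.
const WS_URL = 'wss://example.com/voice';

async function streamMicrophone(): Promise<void> {
  const ws = new WebSocket(WS_URL);
  ws.binaryType = 'arraybuffer';

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const source = audioContext.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated but widely supported; an AudioWorklet
  // would be the production-grade choice.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0); // Float32 in [-1, 1]
    const pcm = new Int16Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, samples[i] * 32767));
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(pcm.buffer);
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}
```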
Deepgram Nova-2: Speech-to-Text
Why Nova-2?
| Metric | Nova-2 | Whisper | Industry avg. |
|---|---|---|---|
| **Word Error Rate** | 8.4% | 13.1% | 12% |
| **Processing** | 29.8 s per audio hour | 150 s per audio hour | 120 s per audio hour |
| **Price** | $0.0043/min | $0.006/min | $0.01/min |
| **Languages** | 36 | 99 | varies |
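For contrast with the streaming integration below, a one-shot transcription of a pre-recorded file is a single SDK call. A minimal sketch, assuming the same `@deepgram/sdk` client and a local WAV file:

```typescript
// One-shot transcription of a pre-recorded file (contrast with streaming).
import { createClient } from '@deepgram/sdk';
import { readFileSync } from 'fs';

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

async function transcribeFile(path: string): Promise<string> {
  const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
    readFileSync(path),
    { model: 'nova-2', language: 'de', smart_format: true }
  );
  if (error) throw error;
  // First channel, best alternative
  return result.results.channels[0].alternatives[0].transcript;
}
```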
Streaming Integration
```typescript
// src/services/deepgram.ts
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';

interface TranscriptionConfig {
  model: 'nova-2' | 'nova-2-meeting' | 'nova-2-phonecall';
  language: string;
  smart_format: boolean;
  interim_results: boolean;
  endpointing: number;
}

export class DeepgramStreamer {
  private client = createClient(process.env.DEEPGRAM_API_KEY!);
  private connection: any = null;

  async startStream(
    config: TranscriptionConfig,
    onTranscript: (text: string, isFinal: boolean) => void
  ) {
    this.connection = this.client.listen.live({
      model: config.model,
      language: config.language,
      smart_format: config.smart_format,
      interim_results: config.interim_results,
      endpointing: config.endpointing, // ms of silence that ends an utterance
      punctuate: true,
      diarize: false
    });

    this.connection.on(LiveTranscriptionEvents.Open, () => {
      console.log('Deepgram connection opened');
    });

    this.connection.on(LiveTranscriptionEvents.Transcript, (data: any) => {
      const transcript = data.channel.alternatives[0];
      if (transcript.transcript) {
        onTranscript(transcript.transcript, data.is_final);
      }
    });

    this.connection.on(LiveTranscriptionEvents.Error, (err: Error) => {
      console.error('Deepgram error:', err);
    });

    return this.connection;
  }

  sendAudio(audioChunk: Buffer) {
    if (this.connection) {
      this.connection.send(audioChunk);
    }
  }

  async close() {
    if (this.connection) {
      await this.connection.finish();
    }
  }
}
```
Optimized Configuration for German
```typescript
const germanConfig: TranscriptionConfig = {
  model: 'nova-2',
  language: 'de',
  smart_format: true,
  interim_results: true, // for live feedback while the user speaks
  endpointing: 300       // 300 ms of silence for fast turn-taking
};
```
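Wiring this together takes a few lines. In the sketch below, `audioSource` is a hypothetical async iterable of PCM chunks; how you obtain it depends on your transport:

```typescript
// Hypothetical usage: feed an audio source into the streamer and log finals.
// `audioSource` stands in for whatever yields Buffer chunks in your setup.
async function run(audioSource: AsyncIterable<Buffer>) {
  const streamer = new DeepgramStreamer();
  await streamer.startStream(germanConfig, (text, isFinal) => {
    if (isFinal) console.log('Final:', text);
    else process.stdout.write(`\rInterim: ${text}`);
  });
  for await (const chunk of audioSource) {
    streamer.sendAudio(chunk);
  }
  await streamer.close();
}
```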
ElevenLabs Turbo v2.5: Text-to-Speech
Model Comparison
| Model | Latency | Quality | Use Case |
|---|---|---|---|
| **Flash v2.5** | ~75 ms | Good | Real-time agents |
| **Turbo v2.5** | ~300 ms | Very good | Conversational AI |
| **Multilingual v2** | ~900 ms | Excellent | Pre-produced content |
Streaming TTS Implementation
```typescript
// src/services/elevenlabs.ts
import { ElevenLabsClient } from 'elevenlabs';

interface TTSConfig {
  voiceId: string;
  modelId: 'eleven_turbo_v2_5' | 'eleven_flash_v2_5';
  stability: number;
  similarityBoost: number;
  latencyOptimization: 0 | 1 | 2 | 3 | 4;
}

export class ElevenLabsStreamer {
  private client = new ElevenLabsClient({
    apiKey: process.env.ELEVENLABS_API_KEY!
  });

  async *streamSpeech(
    text: string,
    config: TTSConfig
  ): AsyncGenerator<Buffer> {
    const audioStream = await this.client.textToSpeech.convertAsStream(
      config.voiceId,
      {
        text,
        model_id: config.modelId,
        voice_settings: {
          stability: config.stability,
          similarity_boost: config.similarityBoost
        },
        optimize_streaming_latency: config.latencyOptimization
      }
    );

    for await (const chunk of audioStream) {
      yield Buffer.from(chunk);
    }
  }

  // For sentence-by-sentence streaming
  async streamBySentence(
    sentences: string[],
    config: TTSConfig,
    onChunk: (audio: Buffer) => void
  ) {
    for (const sentence of sentences) {
      for await (const chunk of this.streamSpeech(sentence, config)) {
        onChunk(chunk);
      }
    }
  }
}
```
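As a quick smoke test, the generator can be drained into a file. A sketch, assuming `VOICE_ID` as a placeholder and MP3 output (ElevenLabs' default streaming format):

```typescript
import { createWriteStream } from 'fs';

// Hypothetical smoke test: synthesize one sentence and write the audio to disk.
async function testSpeech() {
  const tts = new ElevenLabsStreamer();
  const out = createWriteStream('hello.mp3');
  const config: TTSConfig = {
    voiceId: 'VOICE_ID', // placeholder: any voice from your ElevenLabs library
    modelId: 'eleven_turbo_v2_5',
    stability: 0.5,
    similarityBoost: 0.75,
    latencyOptimization: 2
  };
  for await (const chunk of tts.streamSpeech('Hello from the voice agent.', config)) {
    out.write(chunk);
  }
  out.end();
}
```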
Latency Optimization
```typescript
// Maximum latency optimization
const lowLatencyConfig: TTSConfig = {
  voiceId: 'pNInz6obpgDQGcFmaJgB', // Adam
  modelId: 'eleven_flash_v2_5',    // fastest model
  stability: 0.5,
  similarityBoost: 0.75,
  latencyOptimization: 4           // maximum optimization
};

// Quality-focused
const qualityConfig: TTSConfig = {
  voiceId: 'pNInz6obpgDQGcFmaJgB',
  modelId: 'eleven_turbo_v2_5',
  stability: 0.7,
  similarityBoost: 0.9,
  latencyOptimization: 0           // no optimization
};
```
Complete Voice Agent Architecture
```typescript
// src/voice-agent.ts
import { DeepgramStreamer } from './services/deepgram';
import { ElevenLabsStreamer } from './services/elevenlabs';
import Anthropic from '@anthropic-ai/sdk';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

interface VoiceAgentConfig {
  systemPrompt: string;
  voiceId: string;
  language: string;
}

export class VoiceAgent {
  private deepgram = new DeepgramStreamer();
  private elevenlabs = new ElevenLabsStreamer();
  private anthropic = new Anthropic();
  private conversationHistory: Message[] = [];

  constructor(private config: VoiceAgentConfig) {}

  async start(
    audioInput: AsyncIterable<Buffer>,
    onAudioOutput: (chunk: Buffer) => void
  ) {
    // Start the STT stream
    await this.deepgram.startStream(
      {
        model: 'nova-2',
        language: this.config.language,
        smart_format: true,
        interim_results: true,
        endpointing: 500
      },
      async (text, isFinal) => {
        if (isFinal && text.trim()) {
          // The user has finished speaking
          await this.processUserInput(text, onAudioOutput);
        }
      }
    );

    // Forward audio chunks to Deepgram
    for await (const chunk of audioInput) {
      this.deepgram.sendAudio(chunk);
    }
  }

  private async processUserInput(
    userText: string,
    onAudioOutput: (chunk: Buffer) => void
  ) {
    // Update history
    this.conversationHistory.push({
      role: 'user',
      content: userText
    });

    // Generate the LLM response (streaming)
    const stream = await this.anthropic.messages.stream({
      model: 'claude-3-haiku-20240307',
      max_tokens: 500,
      system: this.config.systemPrompt,
      messages: this.conversationHistory
    });

    const ttsConfig = {
      voiceId: this.config.voiceId,
      modelId: 'eleven_turbo_v2_5' as const,
      stability: 0.5,
      similarityBoost: 0.75,
      latencyOptimization: 2 as const
    };

    let fullResponse = '';
    let sentenceBuffer = '';

    // Sentence-by-sentence TTS
    for await (const event of stream) {
      if (
        event.type === 'content_block_delta' &&
        event.delta.type === 'text_delta'
      ) {
        const text = event.delta.text;
        fullResponse += text;
        sentenceBuffer += text;

        // Check for end of sentence
        const sentenceEnd = sentenceBuffer.match(/[.!?]\s/);
        if (sentenceEnd) {
          const sentence = sentenceBuffer.substring(0, sentenceEnd.index! + 1);
          sentenceBuffer = sentenceBuffer.substring(sentenceEnd.index! + 2);

          // Start TTS for this sentence
          for await (const audioChunk of this.elevenlabs.streamSpeech(
            sentence,
            ttsConfig
          )) {
            onAudioOutput(audioChunk);
          }
        }
      }
    }

    // Speak whatever remains in the buffer
    if (sentenceBuffer.trim()) {
      for await (const chunk of this.elevenlabs.streamSpeech(
        sentenceBuffer,
        ttsConfig
      )) {
        onAudioOutput(chunk);
      }
    }

    // Update history
    this.conversationHistory.push({
      role: 'assistant',
      content: fullResponse
    });
  }

  async stop() {
    await this.deepgram.close();
  }
}
```
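A minimal entry point might look like this; `getAudioInput` and `playAudio` are hypothetical stand-ins for your transport and playback layers:

```typescript
// Stand-ins for your audio plumbing (hypothetical):
declare function getAudioInput(): AsyncIterable<Buffer>;
declare function playAudio(chunk: Buffer): void;

async function main() {
  const agent = new VoiceAgent({
    systemPrompt: 'You are a concise, friendly phone assistant.',
    voiceId: 'pNInz6obpgDQGcFmaJgB',
    language: 'de'
  });

  // Audio in from the client, synthesized audio back out
  await agent.start(getAudioInput(), (chunk) => playAudio(chunk));
}
```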
Latency Breakdown
┌─────────────────────────────────────────────────────────────┐
│                      LATENCY BREAKDOWN                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Component              │ Latency    │ Cumulative           │
│  ───────────────────────│────────────│───────────────────── │
│  Audio Capture          │ ~20 ms     │ 20 ms                │
│  Network (Upload)       │ ~30 ms     │ 50 ms                │
│  Deepgram ASR           │ ~150 ms    │ 200 ms               │
│  LLM (First Token)      │ ~100 ms    │ 300 ms               │
│  ElevenLabs TTS         │ ~75 ms     │ 375 ms               │
│  Network (Download)     │ ~30 ms     │ 405 ms               │
│  Audio Playback         │ ~20 ms     │ 425 ms               │
│                                                             │
│  TOTAL: ~425 ms (under the 500 ms target)                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
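To verify these numbers in production (the checklist below also calls for latency monitoring), per-stage timestamps are enough. A minimal sketch using Node's `perf_hooks` clock; the stage names are illustrative:

```typescript
import { performance } from 'perf_hooks';

// Minimal per-stage latency tracker; stage names are illustrative.
class LatencyTracker {
  private marks = new Map<string, number>();

  mark(stage: string) {
    this.marks.set(stage, performance.now());
  }

  // Elapsed time between two recorded stages, in ms
  between(from: string, to: string): number {
    return (this.marks.get(to) ?? 0) - (this.marks.get(from) ?? 0);
  }
}

// Usage: mark('speech_end') when Deepgram reports is_final, and
// mark('first_audio') when the first TTS chunk arrives, then report:
// tracker.between('speech_end', 'first_audio') // end-to-end response latency
```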
Hybrid Architecture: Edge + Cloud
```typescript
// For lowest latency: local VAD + cloud processing
interface HybridConfig {
  localVAD: boolean;       // voice activity detection on-device
  localWakeWord: boolean;  // detect "Hey Agent" locally
  cloudASR: boolean;       // transcription in the cloud
  cloudLLM: boolean;       // LLM in the cloud
  cloudTTS: boolean;       // TTS in the cloud
}

// 80% of simple commands can be handled locally
const hybridArchitecture: HybridConfig = {
  localVAD: true,      // saves bandwidth & latency
  localWakeWord: true, // instant response
  cloudASR: true,      // Deepgram delivers higher quality
  cloudLLM: true,      // no local GPU resources required
  cloudTTS: true       // ElevenLabs quality
};
```
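Local VAD does not have to mean a neural model; a simple energy gate already keeps silent audio off the wire. A sketch, where the threshold value is an assumption you would tune per device:

```typescript
// Energy-based voice activity detection on 16-bit PCM frames.
// THRESHOLD is an assumed starting point; tune it for your microphones.
const THRESHOLD = 500;

function isSpeech(frame: Int16Array): boolean {
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  const rms = Math.sqrt(sumSquares / frame.length);
  return rms > THRESHOLD; // only frames above the noise floor count as speech
}

// Gate before sending: skip the upload entirely for silent frames.
function maybeSend(frame: Int16Array, send: (f: Int16Array) => void) {
  if (isSpeech(frame)) send(frame);
}
```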
Production Checklist
- [ ] WebSocket keep-alive implemented (see the sketch below)
- [ ] Audio codec optimized (Opus/G.711)
- [ ] Graceful degradation on network problems
- [ ] Retry logic for API failures
- [ ] Audio buffer for jitter compensation
- [ ] Monitoring for latency metrics
- [ ] Fallback voices configured
- [ ] Rate limits respected
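For the keep-alive item: Deepgram's live socket accepts a JSON `KeepAlive` message to prevent idle disconnects. A sketch; the 5-second interval is an assumption, not a documented requirement:

```typescript
// Periodic KeepAlive for the Deepgram live socket. Deepgram closes idle
// connections; a {"type": "KeepAlive"} text message keeps them open.
function startKeepAlive(connection: { send: (data: string) => void }) {
  const timer = setInterval(() => {
    connection.send(JSON.stringify({ type: 'KeepAlive' }));
  }, 5000); // assumed interval; tune to your silence patterns
  return () => clearInterval(timer); // call to stop pinging
}
```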
Cost Calculation
| Component | Price | 1,000 conversations (3 min each) |
|---|---|---|
| Deepgram Nova-2 | $0.0043/min | $12.90 |
| ElevenLabs Turbo | $0.30/1,000 chars | ~$45.00 |
| Claude Haiku | $0.25/1M tokens | ~$7.50 |
| **Total** | | **~$65.40** |
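The per-conversation figure follows directly from the table. As a worked check, where the character and token counts per conversation are assumed averages implied by the table's estimates:

```typescript
// Worked check of the table's totals for 1,000 three-minute conversations.
// The per-conversation character and token averages are assumptions.
const conversations = 1000;
const minutes = 3;

const asr = conversations * minutes * 0.0043;            // $12.90
const tts = conversations * (150 / 1000) * 0.30;         // ~$45.00 at ~150 chars each
const llm = conversations * (30_000 / 1_000_000) * 0.25; // ~$7.50 at ~30k tokens each

console.log((asr + tts + llm).toFixed(2)); // ≈ 65.40 → ~$0.065 per conversation
```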
Conclusion
Production-grade voice AI requires:
- Streaming-first: no batch processing
- Sentence-by-sentence TTS: start speaking early
- Optimized models: Flash/Turbo instead of the highest-quality tiers
- Edge processing: VAD and wake word on-device
The 500 ms threshold is achievable with the right architecture.
Image Prompts
- "Sound waves flowing through neural network, real-time audio visualization, blue and purple gradients"
- "Voice assistant architecture diagram with microphone, cloud, and speaker, technical blueprint style"
- "Stopwatch showing 500ms with sound wave in background, latency concept, clean tech illustration"