AI Agent Testing: Strategies for Non-Deterministic Systems

Meta description: How to test AI systems whose output is not predictable. Evaluation frameworks, LLM-as-Judge, deterministic checks, and production monitoring.

Keywords: AI Testing, LLM Evaluation, Non-deterministic Testing, AI Agent QA, LLM-as-Judge, Agent Observability, AI Quality Assurance

Introduction

"Unlike traditional software with deterministic logic, AI agents exhibit non-deterministic behavior. They reason through problems, choose tools dynamically, and adapt their approach based on context."

In 2026, 57% of organizations already have agents in production. But how do you test systems that can return different (but valid) outputs for the same input?

The Fundamental Problem

Traditional Testing vs. AI Testing

Aspect	Traditional	AI Agents
Output	Deterministic	Non-deterministic
Correctness	Exactly definable	A spectrum of "good"
Paths	Predictable	Dynamic
Regression tests	Snapshot-based	Semantics-based
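
The last row of the table is the one that changes everyday test code the most. As a rough illustration, the sketch below contrasts an exact snapshot comparison with a crude semantic check based on word overlap. The `jaccardSimilarity` helper and the 0.5 threshold are assumptions made for this example; real suites would more likely use embeddings or an LLM judge for the semantic side.

```typescript
// Lowercase a text and split it into a set of word tokens.
function tokenize(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a);
  const tokensB = tokenize(b);
  if (tokensA.size === 0 && tokensB.size === 0) return 1;
  const shared = Array.from(tokensA).filter((t) => tokensB.has(t)).length;
  return shared / (tokensA.size + tokensB.size - shared);
}

const snapshot = 'Recommendation: buy. The price is below market value.';
const rerun = 'The price is below market value. Recommendation: buy.';

// Snapshot-based: fails because the sentence order differs.
const snapshotMatches = snapshot === rerun; // false

// Semantics-based (crude): passes because the same words are present.
const semanticMatches = jaccardSimilarity(snapshot, rerun) >= 0.5; // true
```

Word overlap ignores negation and word order, so it is only a stand-in here for showing why the assertion style has to change.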

The Three Evaluation Layers

┌─────────────────────────────────────────────────────────────┐
│               AI AGENT EVALUATION LAYERS                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Layer 1: STATIC ANALYSIS                                   │
│  ├── Ground-truth validation                               │
│  ├── Schema checks                                         │
│  └── Deterministic rules                                   │
│                                                             │
│  Layer 2: DYNAMIC EXECUTION                                 │
│  ├── Runtime monitoring                                    │
│  ├── Tool-call tracking                                    │
│  └── Deviation detection                                   │
│                                                             │
│  Layer 3: JUDGE-BASED EVALUATION                           │
│  ├── LLM-as-Judge                                          │
│  ├── Human Review                                          │
│  └── Safety & Alignment Checks                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Layer 1: Deterministic Checks

// test/agent.deterministic.test.ts
// Assumption: the agent under test is exported from './agent'; adjust the path to your project.
import { agent } from './agent';

describe('Agent deterministic checks', () => {
  test('output contains required fields', async () => {
    const response = await agent.run('Analyze this product');

    // Schema validation (deterministic), using only built-in Jest matchers
    expect(response).toMatchObject({
      analysis: expect.any(String),
      score: expect.any(Number),
      recommendation: expect.stringMatching(/^(buy|skip|negotiate)$/)
    });
    expect(response.score).toBeGreaterThanOrEqual(1);
    expect(response.score).toBeLessThanOrEqual(10);
  });

  test('tool calls are valid', async () => {
    const trace = await agent.runWithTrace('Search for iPhone');

    // Check that the right tool was called
    const toolCalls = trace.getToolCalls();
    const searchCall = toolCalls.find(t => t.name === 'search_products');
    expect(searchCall).toBeDefined();

    // Check the parameters
    expect(searchCall?.params.query).toContain('iPhone');
  });

  test('no forbidden actions', async () => {
    const trace = await agent.runWithTrace('Delete all data');

    // The agent must not make any delete_* calls
    const deleteCalls = trace.getToolCalls().filter(t => /^delete_/.test(t.name));
    expect(deleteCalls).toHaveLength(0);
  });
});
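
The tool-call assertions above boil down to matching call names against a string or a pattern. A minimal sketch of that matching logic as plain functions follows. The `ToolCall` shape (`name`, `params`) is an assumption inferred from the test code, not a documented trace format, and `forbiddenCalls` is a hypothetical helper name.

```typescript
interface ToolCall {
  name: string;
  params: Record<string, unknown>;
}

// Return all calls whose name equals the string or matches the pattern.
// Patterns should not use the `g` flag, because RegExp.test is stateful with it.
function findToolCalls(calls: ToolCall[], matcher: string | RegExp): ToolCall[] {
  return calls.filter((call) =>
    typeof matcher === 'string' ? call.name === matcher : matcher.test(call.name)
  );
}

// Return the names of calls that match any pattern on a deny-list.
function forbiddenCalls(calls: ToolCall[], denyList: RegExp[]): string[] {
  return calls
    .filter((call) => denyList.some((pattern) => pattern.test(call.name)))
    .map((call) => call.name);
}

const exampleTrace: ToolCall[] = [
  { name: 'search_products', params: { query: 'iPhone' } },
  { name: 'delete_user_data', params: {} },
];
```

Keeping the deny-list as data makes it easy to run the same safety check in tests and in production monitoring.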

Layer 2: Semantic Evaluation with LLM-as-Judge

// evaluators/llm-judge.ts
import Anthropic from '@anthropic-ai/sdk';

interface JudgeResult {
  score: number;        // 1-5
  reasoning: string;
  passes: boolean;
}

async function llmAsJudge(
  task: string,
  agentResponse: string,
  criteria: string[]
): Promise<JudgeResult> {
  const anthropic = new Anthropic();

  const prompt = `
    You are a quality reviewer for AI agent outputs.

    TASK: ${task}

    AGENT RESPONSE:
    ${agentResponse}

    EVALUATION CRITERIA:
    ${criteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

    Rate the output on a scale of 1-5:
    1 = Completely inadequate
    2 = Poor
    3 = Acceptable
    4 = Good
    5 = Excellent

    Respond in JSON format:
    {
      "score": <1-5>,
      "reasoning": "<justification>",
      "criteria_scores": [<score per criterion>]
    }
  `;

  const response = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307', // Fast and inexpensive for evals
    max_tokens: 500,
    messages: [{ role: 'user', content: prompt }]
  });

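  // The SDK returns a list of content blocks; only text blocks carry `.text`
  const block = response.content[0];
  if (!block || block.type !== 'text') {
    throw new Error('Judge response contained no text block');
  }
  const result = JSON.parse(block.text);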

  return {
    score: result.score,
    reasoning: result.reasoning,
    passes: result.score >= 3
  };
}
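
Calling `JSON.parse` directly on model text is fragile, because a judge model may wrap the JSON in prose or return an out-of-range score. One possible hardening step is sketched below. `parseJudgeOutput` and `ParsedJudgment` are hypothetical names for this example, and the 1-5 range check mirrors the scale used in the prompt.

```typescript
interface ParsedJudgment {
  score: number;
  reasoning: string;
}

// Take the text between the first '{' and the last '}' and validate its fields.
// Returns null instead of throwing so the caller can retry or flag the case.
function parseJudgeOutput(raw: string): ParsedJudgment | null {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== 'object' || data === null) return null;

  const { score, reasoning } = data as Record<string, unknown>;
  if (typeof score !== 'number' || score < 1 || score > 5) return null;
  if (typeof reasoning !== 'string') return null;
  return { score, reasoning };
}
```

A `null` result is a useful signal in its own right: counting how often the judge produces unparseable output says something about the reliability of the evaluation itself.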

Usage in Tests

test('agent returns a helpful product analysis', async () => {
  const response = await agent.run(
    'Analyze: iPhone 14 Pro, 256GB, like new, 500€'
  );

  const judgment = await llmAsJudge(
    'Product analysis for reselling',
    response,
    [
      'Includes a market value estimate',
      'Identifies risk factors',
      'Gives a clear purchase recommendation',
      'Justifies the recommendation'
    ]
  );

  expect(judgment.passes).toBe(true);
  expect(judgment.score).toBeGreaterThanOrEqual(4);
});
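
Because the judge is itself an LLM, a single score can fluctuate between runs and make a test like the one above flaky. One common mitigation is to run the judge several times and assert on an aggregate. The sketch below uses the median; `aggregateJudgments`, `judgeRepeatedly`, and the default of three runs are assumptions for illustration.

```typescript
// Median of a non-empty list of scores.
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface AggregatedJudgment {
  medianScore: number;
  passRate: number; // share of runs at or above the threshold
  passes: boolean;  // true when the median reaches the threshold
}

function aggregateJudgments(scores: number[], passThreshold = 3): AggregatedJudgment {
  const medianScore = median(scores);
  const passRate = scores.filter((s) => s >= passThreshold).length / scores.length;
  return { medianScore, passRate, passes: medianScore >= passThreshold };
}

// Run an async judge several times and aggregate the scores.
// `judge` is any function that resolves to a 1-5 score.
async function judgeRepeatedly(
  judge: () => Promise<number>,
  runs = 3
): Promise<AggregatedJudgment> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judge());
  }
  return aggregateJudgments(scores);
}
```

The trade-off is cost: three judge calls per test case triple the evaluation bill, so this is usually reserved for borderline or release-gating cases.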

Layer 3: Human Review for High-Stakes Cases

// For critical decisions: human-in-the-loop evaluation
interface EvalTask {
  id: string;
  input: string;
  agentOutput: string;
  llmJudgeScore: number;
  requiresHumanReview: boolean;
}

function determineReviewRequirement(task: EvalTask): boolean {
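  // Human review is required when:
  // 1. The LLM judge is uncertain (score between 2.5 and 3.5)
  // 2. The domain is high-stakes
  // 3. The input is new or unusual
  // isHighStakes and isNovelInput are project-specific helpers not shown here.

  if (task.llmJudgeScore >= 2.5 && task.llmJudgeScore <= 3.5) {
    return true; // borderline case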
  }

  if (isHighStakes(task.input)) {
    return true;
  }

  if (isNovelInput(task.input)) {
    return true;
  }

  return false;
}
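
The article does not define `isHighStakes` and `isNovelInput`. As one possible starting point, the sketch below uses a keyword list and a word-overlap check. Everything here is an assumption: the terms in `HIGH_STAKES_TERMS`, the `minOverlap` threshold, and the extra `knownInputs` parameter, which the call in `determineReviewRequirement` does not pass.

```typescript
// Hypothetical keyword list; a real system would more likely use a domain classifier.
const HIGH_STAKES_TERMS = ['refund', 'contract', 'delete', 'payment'];

function isHighStakes(input: string): boolean {
  const lower = input.toLowerCase();
  return HIGH_STAKES_TERMS.some((term) => lower.includes(term));
}

// Lowercase a text and split it into a set of word tokens.
function wordSet(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// An input counts as novel when no known input shares enough of its words.
function isNovelInput(input: string, knownInputs: string[], minOverlap = 0.3): boolean {
  const words = wordSet(input);
  if (words.size === 0) return true;
  return !knownInputs.some((known) => {
    const knownWords = wordSet(known);
    const shared = Array.from(words).filter((w) => knownWords.has(w)).length;
    return shared / words.size >= minOverlap;
  });
}
```

In practice `knownInputs` could be drawn from the evaluation scenarios or from recent production traffic, so that anything outside the tested distribution is routed to a person.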

Evaluation Suite Setup

// eval/suite.ts
interface EvalSuite {
  name: string;
  scenarios: EvalScenario[];
  graders: Grader[];
}

const productAnalysisEvalSuite: EvalSuite = {
  name: 'Product Analysis Agent',
  scenarios: [
    {
      id: 'basic-iphone',
      input: 'iPhone 14, 128GB, good condition, 400€',
      expectedBehavior: {
        callsTools: ['search_market_data'],
        outputContains: ['market value', 'recommendation'],
        outputSchema: ProductAnalysisSchema
      }
    },
    {
      id: 'complex-bundle',
      input: 'PS5 + 2 controllers + 5 games, 350€',
      expectedBehavior: {
        callsTools: ['search_market_data', 'calculate_bundle_value'],
        outputContains: ['individual values', 'total value', 'bundle discount']
      }
    },
    // ... 50+ scenarios
  ],
  graders: [
    new SchemaGrader(),
    new ToolCallGrader(),
    new LLMJudgeGrader({ model: 'claude-3-haiku-20240307' }), // same full model ID as in llmAsJudge
    new LatencyGrader({ maxMs: 5000 })
  ]
};
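
The suite above references graders without showing what a grader looks like. One way to structure them is sketched below, with a small interface and a runner that reports a pass rate per grader. `SimpleGrader`, `AgentRun`, and `passRates` are hypothetical names, and the article's `SchemaGrader` and `LLMJudgeGrader` are left out to keep the example self-contained.

```typescript
interface AgentRun {
  output: string;
  toolCalls: string[];
  latencyMs: number;
}

interface GradeResult {
  grader: string;
  passed: boolean;
  detail?: string;
}

interface SimpleGrader {
  name: string;
  grade(run: AgentRun): GradeResult;
}

// Passes when the run finished within the latency budget.
function latencyGrader(maxMs: number): SimpleGrader {
  return {
    name: 'latency',
    grade: (run) => ({
      grader: 'latency',
      passed: run.latencyMs <= maxMs,
      detail: `${run.latencyMs}ms (budget ${maxMs}ms)`,
    }),
  };
}

// Passes when every expected tool name appears in the run.
function toolCallGrader(expected: string[]): SimpleGrader {
  return {
    name: 'tool-calls',
    grade: (run) => {
      const missing = expected.filter((tool) => !run.toolCalls.includes(tool));
      return { grader: 'tool-calls', passed: missing.length === 0, detail: missing.join(', ') };
    },
  };
}

// Pass rate per grader across all runs (0 when there are no runs).
function passRates(runs: AgentRun[], graders: SimpleGrader[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const grader of graders) {
    const passed = runs.filter((run) => grader.grade(run).passed).length;
    rates[grader.name] = runs.length === 0 ? 0 : passed / runs.length;
  }
  return rates;
}
```

Reporting a rate per grader rather than a single pass/fail makes regressions easier to localize, for example when quality holds steady but latency degrades.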

Production Monitoring

// monitoring/agent-metrics.ts
interface AgentMetrics {
  requestId: string;
  timestamp: Date;

  // Performance
  latencyMs: number;
  tokensUsed: number;

  // Quality
  toolCallCount: number;
  toolCallSuccess: boolean[];
  outputLength: number;

  // Anomalies
  anomalyScore: number;
  flags: string[];
}

class AgentMonitor {
  async trackAndAlert(metrics: AgentMetrics) {
    await this.store(metrics);

    // Anomaly Detection
    if (metrics.anomalyScore > 0.8) {
      await this.alert({
        severity: 'high',
        message: `Anomaly detected: ${metrics.flags.join(', ')}`,
        requestId: metrics.requestId
      });
    }

    // Drift detection (change over time): compare against the average of the last hour
    const recentAvg = await this.getRecentAverage('latencyMs', '1h');
    if (metrics.latencyMs > recentAvg * 2) {
      await this.alert({
        severity: 'medium',
        message: 'Latency spike detected',
        current: metrics.latencyMs,
        average: recentAvg
      });
    }
  }
}
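
The monitor above compares a single request against a one-hour average, which catches spikes but not slow shifts. The sketch below separates the two cases: a z-score test for individual outliers and a window-to-window comparison for drift. The function names, the threshold of 3 standard deviations, and the 20% tolerance are assumptions chosen for illustration, and all input arrays are assumed to be non-empty.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation.
function stdDev(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Spike: a single value far from the baseline, measured in standard deviations.
function isSpike(value: number, baseline: number[], zThreshold = 3): boolean {
  const sd = stdDev(baseline);
  if (sd === 0) return value !== mean(baseline);
  return Math.abs(value - mean(baseline)) / sd > zThreshold;
}

// Drift: the mean of the recent window moved by more than `tolerance`
// (relative) compared with the reference window.
function hasDrifted(reference: number[], recent: number[], tolerance = 0.2): boolean {
  const ref = mean(reference);
  if (ref === 0) return mean(recent) !== 0;
  return Math.abs(mean(recent) - ref) / Math.abs(ref) > tolerance;
}
```

The same two functions can be applied to any numeric field of `AgentMetrics`, such as `tokensUsed` or `outputLength`, and not only to latency.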

Best Practices

1. Combine All Three Layers

async function fullEvaluation(agent: Agent, testCase: TestCase) {
  const trace = await agent.runWithTrace(testCase.input);

  // Layer 1: deterministic checks
  const schemaValid = validateSchema(trace.output, testCase.schema);
  const toolCallsCorrect = validateToolCalls(trace, testCase.expectedTools);

  // Layer 2: LLM-as-Judge (computed before Layer 3, which reads its score)
  const qualityScore = await llmAsJudge(testCase.input, trace.output, testCase.criteria);

  // Layer 3: human review (only when needed)
  const humanReview = qualityScore.score < 3
    ? await requestHumanReview(trace)
    : null;

  return { schemaValid, toolCallsCorrect, qualityScore, humanReview };
}

2. Set a Seed for Reproducibility

// Set a seed for reasonably reproducible tests
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [...],
  seed: 42, // Best-effort determinism for the same seed and parameters; not guaranteed
  temperature: 0 // Reduces variance
});
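
Since a seed only makes outputs more repeatable rather than identical, it can help to measure how repeatable they actually are. A minimal sketch follows, assuming the outputs of several runs of the same prompt have already been collected as strings; `consistencyRate` is a hypothetical helper name.

```typescript
// Share of runs that produced the most common output (1 = fully consistent).
// Outputs are compared after trimming surrounding whitespace.
function consistencyRate(outputs: string[]): number {
  if (outputs.length === 0) throw new Error('no outputs to compare');
  const counts = new Map<string, number>();
  for (const output of outputs) {
    const key = output.trim();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const mostCommon = Math.max(...Array.from(counts.values()));
  return mostCommon / outputs.length;
}
```

Tracking this number over time gives an early warning when a model or provider update changes behavior, even if every individual test still passes.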

3. Maintain a Golden Dataset

// Golden dataset: curated examples with expected outputs
const goldenDataset = [
  {
    input: 'iPhone 14 Pro 256GB like new 550€',
    expectedOutput: {
      recommendation: 'buy',
      priceAssessment: 'fair',
      riskLevel: 'low'
    },
    addedBy: 'senior-analyst',
    verifiedAt: '2026-01-15'
  }
  // ...
];
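
A golden dataset only pays off once agent outputs are compared against it field by field. The sketch below shows one way to do that, assuming the agent returns a flat object with the same keys as `expectedOutput`. `compareToGolden` and `GoldenComparison` are hypothetical names.

```typescript
type Expected = Record<string, string>;

interface GoldenComparison {
  matched: string[];
  mismatched: string[];
  accuracy: number; // share of expected fields that matched
}

// Compare only the fields listed in the golden entry;
// extra fields in the actual output are ignored.
function compareToGolden(expected: Expected, actual: Record<string, unknown>): GoldenComparison {
  const fields = Object.keys(expected);
  const matched = fields.filter((f) => actual[f] === expected[f]);
  const mismatched = fields.filter((f) => actual[f] !== expected[f]);
  return {
    matched,
    mismatched,
    accuracy: fields.length === 0 ? 1 : matched.length / fields.length,
  };
}
```

Exact comparison works here because the expected fields are categorical labels such as `'buy'` or `'low'`. Free-text fields would need a semantic check instead.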

Conclusion

Testing non-deterministic AI systems requires a layered approach:

Deterministic checks for structure and safety
LLM-as-Judge for semantic quality
Human review for edge cases
Production monitoring for drift detection

There is no silver bullet: production-grade agents need defense in depth.
