🔥 Advanced · 2026-05-08 · 5-7 min read

Cutting LLM Spend by 50% with the Batch API and a Cost-Observability Pipeline

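This post covers an operational pattern that pairs the Anthropic Batch API with a real-time cost-observability loop to halve the unit cost of large-scale inference jobs and to stop budget overruns before they happen.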

batch-api · cost-observability · production

Why the Batch API: The Trade-off in Numbers

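The Anthropic Message Batches API cuts per-token cost by 50% relative to the synchronous Messages API, in exchange for accepting a processing delay of up to 24 hours. The decision rule is therefore simple: use it only for workloads where response latency does not affect a business SLA (offline evaluation, bulk document classification, overnight report generation, and so on). Conversely, never use it for real-time chat where a user is waiting on the answer.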
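The decision rule above can be written down as a small routing check. This is a minimal sketch, not part of any SDK: `Workload`, `should_use_batch`, and `batch_cost` are hypothetical names, and the only numbers it relies on are the 24-hour worst case and the 50% discount quoted in the text.

```python
from dataclasses import dataclass

BATCH_MAX_DELAY_HOURS = 24  # worst-case turnaround quoted in the text
BATCH_DISCOUNT = 0.5        # 50% per-token discount quoted in the text


@dataclass(frozen=True)
class Workload:
    name: str
    latency_budget_hours: float  # how long the consumer can wait for a result
    user_is_waiting: bool        # True for interactive traffic such as live chat


def should_use_batch(w: Workload) -> bool:
    """Route to the Batch API only if nobody is waiting and the SLA
    tolerates the worst-case delay."""
    return (not w.user_is_waiting) and w.latency_budget_hours >= BATCH_MAX_DELAY_HOURS


def batch_cost(sync_cost_usd: float) -> float:
    """Cost of the same token volume after the batch discount."""
    return sync_cost_usd * (1 - BATCH_DISCOUNT)


nightly = Workload("nightly-report", latency_budget_hours=36, user_is_waiting=False)
chat = Workload("live-chat", latency_budget_hours=0.001, user_is_waiting=True)

print(should_use_batch(nightly), should_use_batch(chat))
print(batch_cost(120.0))
```

Comparing the latency budget against the worst case rather than a typical turnaround is deliberately conservative: a job that only works when batches finish early will eventually miss its SLA.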

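Failure mode: a batch is still reported as ended even when individual requests inside it fail. Entries in the results file whose result.type is errored must be handled separately (canceled and expired entries likewise carry no message). Miss this and the failure rate accumulates silently.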
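One way to keep failures from slipping through is to split every result set into successes and a follow-up queue before doing anything else. The sketch below uses `SimpleNamespace` stand-ins shaped like the result items (a `custom_id` plus a `result.type`) so it runs without an API key; `partition_results` is a hypothetical helper, not an SDK function.

```python
from types import SimpleNamespace


def partition_results(results):
    """Split batch results into succeeded items and a follow-up queue
    of (custom_id, result_type) pairs."""
    succeeded, retry_queue = [], []
    for item in results:
        if item.result.type == "succeeded":
            succeeded.append(item)
        else:
            # errored, canceled, and expired entries all need follow-up
            retry_queue.append((item.custom_id, item.result.type))
    return succeeded, retry_queue


# Stand-in objects shaped like the items yielded when iterating batch results
fake_results = [
    SimpleNamespace(custom_id="doc-0", result=SimpleNamespace(type="succeeded")),
    SimpleNamespace(custom_id="doc-1", result=SimpleNamespace(type="errored")),
    SimpleNamespace(custom_id="doc-2", result=SimpleNamespace(type="expired")),
]

ok, retry = partition_results(fake_results)
failure_rate = len(retry) / len(fake_results)
print(f"succeeded={len(ok)} retry={retry} failure_rate={failure_rate:.2f}")
```

Whether a queued item is worth resubmitting depends on why it failed; a request rejected as invalid will fail again unchanged, so the queue is better treated as a triage list than as an automatic retry loop.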

Implementing the Cost-Observability Loop

import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 1. Create the batch
requests = [
    {
        "custom_id": f"doc-{i}",
        "params": {
            "model": "claude-opus-4-5",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": f"Classify sentiment: {text}"}]
        }
    }
    for i, text in enumerate(["Great product!", "Terrible service.", "Average experience."])
]

batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id} | Status: {batch.processing_status}")

# 2. Poll for completion, then aggregate cost

BUDGET_INPUT_TOKENS = 500_000   # daily input-token budget
total_input, total_output = 0, 0

while True:
    result = client.messages.batches.retrieve(batch.id)
    if result.processing_status == "ended":
        break
    time.sleep(60)

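# 3. Iterate over the results and tally token usage
for item in client.messages.batches.results(batch.id):
    if item.result.type == "succeeded":
        usage = item.result.message.usage
        total_input += usage.input_tokens
        total_output += usage.output_tokens
    else:
        # Only "errored" results carry an error payload; "canceled" and "expired" do not
        detail = getattr(item.result, "error", None)
        print(f"[{item.result.type.upper()}] {item.custom_id}: {detail}")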

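# claude-opus-4-5 batch rates in USD per million tokens: list price $5 input / $25 output
# with the 50% batch discount applied. Verify against the current pricing page.
cost_usd = (total_input / 1_000_000 * 2.5) + (total_output / 1_000_000 * 12.5)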
print(f"Total cost: ${cost_usd:.4f} | Input: {total_input} | Output: {total_output}")

if total_input > BUDGET_INPUT_TOKENS:
    raise RuntimeError(f"Budget exceeded: {total_input} > {BUDGET_INPUT_TOKENS}")

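The key point in this pattern is that post-completion token aggregation is enforced in code. Pushing cost_usd and total_input as metrics to an external monitoring tool (Datadog, Grafana) lets you visualize spend trends per day and per batch.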
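As one possible shape for the metric push, the sketch below formats the two values as DogStatsD-style gauge lines and sends them over UDP. It assumes a StatsD-compatible agent listening on 127.0.0.1:8125; the metric names `llm.batch.cost_usd` and `llm.batch.input_tokens`, the `batch_id` tag, and both helper functions are illustrative choices, not something the article or any vendor prescribes.

```python
import socket


def format_gauge(name: str, value: float, tags: dict) -> str:
    """Render one gauge in DogStatsD line format: name:value|g|#k:v,k:v"""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{name}:{value}|g|#{tag_str}"


def push_metrics(cost_usd: float, total_input: int, batch_id: str,
                 host: str = "127.0.0.1", port: int = 8125) -> list:
    """Send cost and input-token gauges to a StatsD-compatible agent (assumed address)."""
    lines = [
        format_gauge("llm.batch.cost_usd", round(cost_usd, 4), {"batch_id": batch_id}),
        format_gauge("llm.batch.input_tokens", total_input, {"batch_id": batch_id}),
    ]
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))  # fire-and-forget UDP
    return lines


print(format_gauge("llm.batch.cost_usd", 0.0123, {"batch_id": "b1"}))
```

Tagging by batch ID gives per-batch drill-down but creates one tag value per batch; if the monitoring backend charges by tag cardinality, a coarser tag such as the job name may be the better trade.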

Operations Checklist

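[ ] Make custom_id globally unique (UUID recommended) to prevent result-mapping failures
[ ] Keep each batch at ≤ 10,000 requests, and confirm every single request fits within the model's context limit
[ ] Automatically push items with result.type == "errored" onto a retry queue
[ ] Set a daily token budget and wire up a Slack alert for when it is exceeded
[ ] Monitor P95 batch processing time; if a batch runs past 24 hours, check the Anthropic status page
[ ] Pin the model version (claude-opus-4-5, etc.) to avoid price changes caused by model updates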
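The first two checklist items can be enforced at submission time. The following is a minimal sketch under the checklist's own 10,000-request ceiling; `with_unique_ids` and `chunked` are hypothetical helpers, and the `doc-` prefix is just an example.

```python
import uuid
from itertools import islice

MAX_REQUESTS_PER_BATCH = 10_000  # ceiling taken from the checklist


def with_unique_ids(params_list):
    """Attach a UUID-based custom_id to each request and keep a map
    from custom_id back to the source index for result matching."""
    requests, id_to_index = [], {}
    for idx, params in enumerate(params_list):
        custom_id = f"doc-{uuid.uuid4()}"
        id_to_index[custom_id] = idx
        requests.append({"custom_id": custom_id, "params": params})
    return requests, id_to_index


def chunked(items, size=MAX_REQUESTS_PER_BATCH):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# 25,000 placeholder request bodies split into three batches
params_list = [{"max_tokens": 512} for _ in range(25_000)]
requests, id_to_index = with_unique_ids(params_list)
batches = list(chunked(requests))
print([len(b) for b in batches])
```

Each chunk would then be passed to its own batch-creation call, and `id_to_index` persisted alongside the batch IDs so results can be joined back to their source documents even across batches.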
