Claude API Prompt Caching Explained: Caching Techniques That Cut the Cost of Repeated Content by 90% (2026)

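If your application sends the same system prompt, document content, or tool definitions on every request, those tokens are billed again each time. Prompt caching lets Claude cache that content: a cache hit is billed at only 10% of the normal input price, and a cache write costs 125%. Once roughly two dozen calls have hit the same cache entry, the net saving on the cached portion exceeds 85%.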

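How it works

Regular API call: the full prompt is sent every time, and all of it is billed as input tokens.

With caching enabled:

First call (cache miss): the prompt is written to the cache and billed at 1.25x the input price
Subsequent calls (cache hit): the cached portion is billed at only 0.1x the input price, a 90% saving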
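To see how the two multipliers above combine, here is a minimal sketch of the average cost of the cached portion over a run of calls. It assumes one cache write followed by nothing but cache hits; the function name `relative_cost` is made up for illustration.

```python
def relative_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Average per-call cost of the cached portion, relative to uncached input (1.0).

    Assumes the first call writes the cache and every later call hits it.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    total = write_mult + read_mult * (calls - 1)
    return total / calls


if __name__ == "__main__":
    for n in (1, 2, 10, 24, 1000):
        print(f"{n:>5} calls -> {relative_cost(n):.4f}x of the uncached price")
```

Under these assumptions caching already pays off on the second call, since 1.25 + 0.1 is less than the 2.0 that two uncached calls would cost.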

Cache TTL: 5 minutes by default, up to 1 hour. Every cache hit resets the timer.
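The longer TTL is requested per cache marker. The sketch below builds a content block with an optional `ttl` field, assuming the `"5m"` and `"1h"` values described in Anthropic's documentation. The helper name `cached_text_block` is hypothetical, and the 1-hour option may be priced differently from the 1.25x write rate quoted above, so check the current pricing before relying on it.

```python
from typing import Optional


def cached_text_block(text: str, ttl: Optional[str] = None) -> dict:
    """Build a text content block carrying a cache_control marker.

    ttl=None keeps the default lifetime; "5m" and "1h" are the values
    assumed here (verify them against the official documentation).
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        if ttl not in ("5m", "1h"):
            raise ValueError("ttl must be '5m' or '1h'")
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


if __name__ == "__main__":
    print(cached_text_block("long reference document...", ttl="1h"))
```

The returned dict can be placed in the `system` list or in a message's `content` list, in the same position as the hand-written blocks in the examples that follow.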

Supported models

Claude Opus 4.5 and later
Claude Sonnet 4.5 and later
Claude Haiku 3.5 and later

How to enable it

Add cache_control to the end of the content you want to cache:

python

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...(very long system prompt)",
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=[{"role": "user", "content": "Review this PR..."}]
)
print(response.usage)  # inspect cache_creation_input_tokens / cache_read_input_tokens

Three common caching scenarios

1. Caching a very long system prompt

python

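# Reuses the `client` created in the previous example; `user_question` is the caller's input.
SYSTEM_PROMPT = """You are a senior software engineer...
# Coding Standards
...(several thousand words of coding standards)
"""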

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_question}]
)

2. Caching a long document (RAG scenario)

python

def analyze_document(document_text, question):
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Document:\n{document_text}",
                    "cache_control": {"type": "ephemeral"}  # cache the document
                },
                {
                    "type": "text",
                    "text": f"Question: {question}"  # not cached (changes every call)
                }
            ]
        }]
    )
    return response

# First call: writes the cache
r1 = analyze_document(long_doc, 'Summarize the main points')
# Second call: hits the cache, so the document portion costs only 10%
r2 = analyze_document(long_doc, 'What are the risks mentioned?')

3. Caching conversation history

python

messages = []

def chat_with_cache(user_input):
    messages.append({"role": "user", "content": user_input})

    # Add a cache marker once the history is long enough
    cached_messages = []
    for i, msg in enumerate(messages):
        if i == len(messages) - 3 and len(messages) > 3:
            # Put cache_control on the third most recent message
            cached_messages.append({
                "role": msg["role"],
                "content": [{
                    "type": "text",
                    "text": msg["content"],
                    "cache_control": {"type": "ephemeral"}
                }]
            })
        else:
            cached_messages.append(msg)

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=cached_messages
    )
    reply = response.content[0].text
    messages.append({"role": "assistant", "content": reply})
    return reply
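
A variation on the approach above is to mark the end of the conversation instead of the third most recent message, so the next turn can reuse everything sent so far. The following is a minimal sketch of that idea, not the article's method; `with_cache_breakpoint` is a hypothetical helper that only prepares the `messages` list and makes no API call.

```python
import copy


def with_cache_breakpoint(messages: list) -> list:
    """Return a copy of `messages` with cache_control on the last block of the last message.

    String content is converted to a single text block first, because
    cache_control can only be attached to a content block.
    """
    if not messages:
        return []
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    last = marked[-1]
    content = last["content"]
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return marked


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi, how can I help?"},
        {"role": "user", "content": "Explain prompt caching."},
    ]
    print(with_cache_breakpoint(history)[-1])
```

The result would be passed as `messages=with_cache_breakpoint(messages)` in place of `cached_messages`. Because the stored history is never modified, only one marker exists per request.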

Checking the cache effect

python

usage = response.usage  # `response` comes from any of the calls above
print(f'Uncached input tokens: {usage.input_tokens}')
print(f'Cache write tokens: {usage.cache_creation_input_tokens}')
print(f'Cache read tokens: {usage.cache_read_input_tokens}')
print(f'Output tokens: {usage.output_tokens}')

# Cost calculation example (Sonnet 4.5)
# Regular input: $3.00 / 1M tokens
# Cache write: $3.75 / 1M tokens (1.25x)
# Cache read: $0.30 / 1M tokens (0.1x)
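
The prices in the comments above can be turned into a per-response estimate. This is a sketch that uses only the Sonnet 4.5 input-side prices quoted in this article and a stand-in `Usage` dataclass in place of the SDK's usage object. It assumes `input_tokens` counts only the uncached part of the prompt, so the total input is the sum of the three fields. Output tokens are left out because the article gives no output price.

```python
from dataclasses import dataclass

# USD per million tokens, taken from the comments above (Sonnet 4.5).
PRICE_INPUT = 3.00
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30


@dataclass
class Usage:
    """Stand-in for response.usage with the three input-side fields."""
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int


def input_cost_usd(u) -> float:
    """Input-side cost of one response in USD."""
    return (
        (u.input_tokens or 0) * PRICE_INPUT
        + (u.cache_creation_input_tokens or 0) * PRICE_CACHE_WRITE
        + (u.cache_read_input_tokens or 0) * PRICE_CACHE_READ
    ) / 1_000_000


def cache_hit_rate(u) -> float:
    """Share of all input tokens that were served from the cache."""
    read = u.cache_read_input_tokens or 0
    total = (u.input_tokens or 0) + (u.cache_creation_input_tokens or 0) + read
    return read / total if total else 0.0


if __name__ == "__main__":
    hit = Usage(input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=10_000)
    print(f"cost: ${input_cost_usd(hit):.5f}, hit rate: {cache_hit_rate(hit):.1%}")
```

Both functions read attributes by name, so they should also accept a real `response.usage` object, provided it exposes the same three fields.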

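Cost comparison example

Scenario: a 10,000-token system prompt, called 1,000 times per day. The figures cover only the system prompt tokens and assume calls arrive often enough that the cache never expires.

Approach	Daily cost (Sonnet)
No caching	$30.00
With caching (one write, then cache hits)	$3.03
Savings	89.9%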
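
The daily figures can be reproduced, and re-run for other traffic patterns, with a short calculation. This sketch uses the Sonnet prices quoted earlier in the article. `daily_prefix_cost` is a hypothetical helper, and `cache_writes` is the number of calls per day that miss the cache (1 if the cache never expires, more if traffic has gaps longer than the TTL).

```python
from typing import Optional

# USD per million tokens (Sonnet prices quoted in this article).
PRICE_INPUT = 3.00
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30


def daily_prefix_cost(prefix_tokens: int, calls: int, cache_writes: Optional[int] = None) -> float:
    """Daily cost in USD of the repeated prompt prefix only.

    cache_writes=None means caching is disabled; otherwise it is the number
    of calls that write the cache, and the remaining calls are cache hits.
    """
    if cache_writes is None:
        return prefix_tokens * calls * PRICE_INPUT / 1_000_000
    if not 1 <= cache_writes <= calls:
        raise ValueError("cache_writes must be between 1 and calls")
    reads = calls - cache_writes
    return prefix_tokens * (cache_writes * PRICE_CACHE_WRITE + reads * PRICE_CACHE_READ) / 1_000_000


if __name__ == "__main__":
    baseline = daily_prefix_cost(10_000, 1_000)
    cached = daily_prefix_cost(10_000, 1_000, cache_writes=1)
    print(f"no cache: ${baseline:.2f}, cached: ${cached:.2f}, saved: {1 - cached / baseline:.1%}")
```

Raising `cache_writes` shows how quickly the saving shrinks when the cache expires between bursts of traffic.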

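Key conditions for a cache hit

Identical prompt: the cached content must match character for character, including spaces and line breaks
Same model: different models do not share a cache
Within the TTL: 5 minutes by default, refreshed on every hit
Minimum token count: Sonnet/Opus need at least 1,024 tokens and Haiku needs 2,048; the exact minimum depends on the model version, so confirm it in the official documentation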
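
Because a single changed character defeats the cache, it can help to fingerprint the content you intend to cache and log it with each request. The sketch below is a local debugging aid only. It says nothing about how the API derives its own cache keys, and `prefix_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def prefix_fingerprint(model: str, system_blocks: list) -> str:
    """Hash the model name and system blocks so accidental changes become visible.

    Two requests with different fingerprints cannot share a cache entry,
    since either the model or the cached text differs.
    """
    payload = json.dumps(
        {"model": model, "system": system_blocks},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    blocks = [{"type": "text", "text": "You are an expert code reviewer."}]
    changed = [{"type": "text", "text": "You are an expert code reviewer. "}]  # trailing space
    print(prefix_fingerprint("claude-sonnet-4-5", blocks)[:12])
    print(prefix_fingerprint("claude-sonnet-4-5", changed)[:12])
```

A fingerprint that changes between requests usually points to dynamic text, such as a timestamp or user name, that has leaked into the part meant to stay static.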

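Best practices

Put static content first: system prompts, tool definitions, and documents go at the start of the request, dynamic content at the end
Avoid changing cached content frequently: every change triggers a new cache write (billed at 1.25x)
Prioritize high-traffic applications: the more often you call the API, the greater the benefit from caching
Monitor the hit rate: divide cache_read_input_tokens by the total input tokens (input_tokens + cache_creation_input_tokens + cache_read_input_tokens)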
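
The "static content first" rule can be sketched as a request builder that keeps tool definitions and the system prompt fixed and puts the per-call text last. It assumes, as the introduction suggests, that tool definitions can carry `cache_control` like other blocks. `build_request` and the `get_weather` tool are made up for illustration, and no API call is made here.

```python
def build_request(tools: list, system_text: str, user_text: str,
                  model: str = "claude-sonnet-4-5") -> dict:
    """Assemble keyword arguments for client.messages.create, static parts first."""
    tools = [dict(t) for t in tools]  # shallow copies so the caller's list is not modified
    if tools:
        # A marker on the last tool covers the whole block of tool definitions.
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,
        "system": [{
            "type": "text",
            "text": system_text,
            "cache_control": {"type": "ephemeral"},  # static system prompt
        }],
        # Dynamic content goes last and carries no marker.
        "messages": [{"role": "user", "content": user_text}],
    }


if __name__ == "__main__":
    weather_tool = {  # hypothetical tool definition
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
    request = build_request([weather_tool], "You are a helpful assistant...", "Weather in Paris?")
    print(request["tools"][-1]["cache_control"])
```

With a configured client, the result would be sent as `client.messages.create(**request)`. Only `user_text` changes between calls, so everything before it remains eligible for cache hits.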

Source: Prompt Caching - Anthropic official documentation
