
Claude Multimodal Capabilities in Practice: Analyzing Images, Converting Screenshots to Code, and Extracting Text (OCR) with the Vision API

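A hands-on guide to the Claude Vision multimodal API: image upload methods (base64/URL), converting screenshots directly into React component code, OCR text extraction, data chart analysis, design review, PDF page handling, and the workflow for supplying images in the Claude Code terminal.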

2026/3/15 · 4 min read · ClaudeEagle

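Claude's Vision capability lets it understand image content directly: analyzing screenshots, recognizing text, interpreting charts, and turning design mockups into code. This article walks through the most common practical scenarios.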

Supported image formats

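  • JPEG, PNG, GIF, WebP
  • Maximum size per image: 5MB, supplied either as base64 data or as a URL reference
  • Up to 20 images per message on claude.ai; the API accepts more per request (up to 100 according to the official documentation)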
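
The limits above can be checked locally before an image is ever uploaded. The following is a minimal sketch, not part of the Anthropic SDK: `check_image`, `MAX_IMAGE_BYTES`, and `MEDIA_TYPES` are hypothetical names introduced here, and the check uses the file size on disk as a first-pass filter (base64 encoding adds roughly a third on top of that).

```python
import os

# Limits taken from the list above; adjust them if the official documentation changes.
MAX_IMAGE_BYTES = 5 * 1024 * 1024
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def check_image(path):
    """Return the media type for `path`, or raise ValueError if it is unsupported or too large."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext or '(none)'}")
    size = os.path.getsize(path)  # size on disk, before base64 encoding
    if size > MAX_IMAGE_BYTES:
        raise ValueError(f"Image is {size} bytes; the limit is {MAX_IMAGE_BYTES}")
    return MEDIA_TYPES[ext]
```

Calling `check_image('screenshot.png')` before building the request would give an early, readable error instead of a rejected API call.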

Basic API usage

Method 1: Local image (base64)

python
import anthropic, base64

client = anthropic.Anthropic()

with open('screenshot.png', 'rb') as f:
    image_data = base64.standard_b64encode(f.read()).decode('utf-8')

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {"type": "text", "text": "Describe what you see in this screenshot."}
        ]
    }]
)
print(response.content[0].text)

Method 2: Image URL

python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Analyze this chart and extract the key data points."}
        ]
    }]
)
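
Since the two methods differ only in the `source` field, one small helper can build the image block for either. This is a sketch under assumptions: `image_block` is a hypothetical name, not an SDK function, and it relies on the standard library's `mimetypes` to infer the media type from the file extension (older Python versions may not recognize `.webp`).

```python
import base64
import mimetypes


def image_block(source):
    """Build an image content block from an http(s) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": source}}
    # Local file: infer the media type from the extension, then base64-encode the bytes.
    media_type, _ = mimetypes.guess_type(source)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {source!r}")
    with open(source, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The result can be placed directly in a message's `content` list, for example `[image_block('screenshot.png'), {"type": "text", "text": "Describe this."}]`.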

Scenario 1: Screenshot to React code

python
def screenshot_to_react(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
    prompt = """
    Convert this UI screenshot to a React component.
    Requirements:
    - TypeScript
    - Tailwind CSS for styling
    - Match the layout and colors as closely as possible
    - Make it responsive (mobile-first)
    - Use semantic HTML
    Output only the component code.
    """
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": data}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text

code = screenshot_to_react('figma-design.png')
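
Even with "Output only the component code" in the prompt, a reply may still arrive wrapped in a markdown code fence. The sketch below is one way to unwrap it before saving the result to a `.tsx` file; `strip_code_fence` is a hypothetical helper, and it only handles the first fenced block in the text.

```python
import re

FENCE = "`" * 3  # a markdown code fence delimiter, built indirectly to keep this example readable


def strip_code_fence(text):
    """Return the body of the first fenced block in `text`, or the stripped text if there is none."""
    pattern = re.escape(FENCE) + r"[\w+-]*[ \t]*\n(.*?)\n?" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```

For example, `strip_code_fence(screenshot_to_react('figma-design.png'))` would return just the component source whether or not the reply was fenced.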

Scenario 2: OCR text extraction

python
def extract_text(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
    ext = image_path.split('.')[-1].lower()
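    # Map the file extension to a media type; fall back to PNG for unknown extensions
    media_type = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'gif': 'image/gif',
                  'png': 'image/png', 'webp': 'image/webp'}.get(ext, 'image/png')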
    
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": "Extract all text from this image. Preserve formatting (tables, lists). Output only the extracted text."}
            ]
        }]
    )
    return response.content[0].text

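# Batch-process scanned documents
import glob
for img in glob.glob('scanned/*.png'):
    text = extract_text(img)
    # Write UTF-8 explicitly so non-ASCII text is saved correctly on any platform
    with open(img.replace('.png', '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)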
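
The batch loop above only picks up `.png` files and reprocesses everything on each run. A slightly more general variant is sketched below, assuming every supported format should be covered and finished files skipped; `batch_extract` and `IMAGE_SUFFIXES` are hypothetical names, and the extraction function is passed in so the loop can be tried without calling the API. Note that `a.png` and `a.jpg` would map to the same `a.txt`.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def batch_extract(folder, extract, overwrite=False):
    """Run `extract(path) -> str` on each image in `folder` and write sibling .txt files.

    Returns the list of output paths written during this call.
    """
    written = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # not an image format listed in this article
        out = path.with_suffix(".txt")
        if out.exists() and not overwrite:
            continue  # already processed on an earlier run
        out.write_text(extract(str(path)), encoding="utf-8")
        written.append(out)
    return written
```

With the function from this section it would be called as `batch_extract('scanned', extract_text)`.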

Scenario 3: Data chart analysis

python
chart_url = "https://example.com/chart.png"  # placeholder; replace with your chart's URL
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": chart_url}},
            {"type": "text", "text": """
Analyze this chart:
1. What type of chart is this?
2. Extract all data points as JSON
3. Identify the trend (increasing/decreasing/stable)
4. What's the highest and lowest value?
5. Key insight in one sentence
            """}
        ]
    }]
)
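
Because the prompt asks for the data points as JSON alongside prose answers, the reply is usually mixed text. One way to pull the JSON out is sketched below; `first_json_value` is a hypothetical helper that scans for the first parseable object or array, so a bracketed fragment in the prose (such as `[1]`) that happens to be valid JSON would be returned first.

```python
import json


def first_json_value(text):
    """Return the first JSON object or array embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i and ignores what follows
                value, _ = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from this position; keep scanning
    return None
```

Applied to the response above, that would be `first_json_value(response.content[0].text)`.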

Scenario 4: Design review

python
def review_design(design_img, spec_img=None):
    content = []
    with open(design_img, 'rb') as f:
        d = base64.standard_b64encode(f.read()).decode()
    content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    if spec_img:
        with open(spec_img, 'rb') as f:
            d2 = base64.standard_b64encode(f.read()).decode()
        content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d2}})
        content.append({"type": "text", "text": "First image is the implementation, second is the spec. Find differences."})
    else:
        content.append({"type": "text", "text": "Review this UI for: accessibility issues, spacing inconsistencies, color contrast, missing hover states."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

Using images in the Claude Code terminal

bash
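# Paste a screenshot directly in interactive mode
claude
# Then press Ctrl+V to paste the screenshot (supported on macOS/Linux)
# Or drag and drop an image file into the terminal

# Non-interactive mode: reference the image by its path in the prompt
claude -p "Convert the design in ./design.png to a React component"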

Multi-image comparison

python
# Compare two versions of a UI
def compare_screenshots(before_path, after_path):
    images = []
    for path in [before_path, after_path]:
        with open(path, 'rb') as f:
            d = base64.standard_b64encode(f.read()).decode()
        images.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    images.append({"type": "text", "text": "Compare these two screenshots. List all visual differences."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": images}]
    )
    return response.content[0].text
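
When several images go into one request, putting a short text label before each one makes it easier to refer to them in the prompt and in the answer. The sketch below shows one way to interleave such labels; `labeled_image_content` is a hypothetical helper, and the default "Image 1:", "Image 2:" labels are an assumption that can be replaced with names like "Before:" and "After:".

```python
def labeled_image_content(image_blocks, question, labels=None):
    """Interleave a text label before each image block, then append the question."""
    if labels is None:
        labels = [f"Image {i}:" for i in range(1, len(image_blocks) + 1)]
    if len(labels) != len(image_blocks):
        raise ValueError("labels and image_blocks must have the same length")
    content = []
    for label, block in zip(labels, image_blocks):
        content.append({"type": "text", "text": label})
        content.append(block)
    content.append({"type": "text", "text": question})
    return content
```

In `compare_screenshots`, the image blocks collected in the loop could be passed as `labeled_image_content(blocks, "List all visual differences.", labels=["Before:", "After:"])` and the result used as the message `content`.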

Source: Vision API, Anthropic official documentation
