
Claude Multimodal Capabilities in Practice: Analyzing Images, Converting Screenshots to Code, and Extracting Text (OCR) with the Vision API

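A hands-on guide to the Claude Vision multimodal API: ways to supply images (base64 or URL), converting screenshots directly into React component code, OCR text extraction, chart analysis, design review, PDF page handling, and the workflow for passing images to Claude Code in the terminal.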

2026/3/15 · 4 min read · ClaudeEagle

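Claude's vision capability lets it understand image content directly: analyzing screenshots, recognizing text, reading charts, and turning design mockups into code. This article walks through the main practical scenarios.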

Supported image formats

  • JPEG, PNG, GIF, WebP
  • Maximum size per image: 5 MB when sent as base64; images can also be referenced by URL
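  • Up to 20 images per turn on claude.ai; API requests accept more (the official documentation has listed 100 per request), so check the current limit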
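
The limits above can be checked locally before any request is sent. The following is a minimal sketch, not part of the SDK: `check_image`, `SUPPORTED_MEDIA_TYPES`, and `MAX_IMAGE_BYTES` are hypothetical names, and the size check assumes the 5 MB limit is applied to the file as stored on disk.

```python
from pathlib import Path

# Formats and size limit taken from the list above; adjust if the official limits change.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # 5 MB per image (assumed to apply to the file on disk)


def check_image(path: str) -> str:
    """Return the media type for `path`, or raise ValueError if it is unusable."""
    p = Path(path)
    media_type = SUPPORTED_MEDIA_TYPES.get(p.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image format: {p.suffix or '(none)'}")
    size = p.stat().st_size
    if size > MAX_IMAGE_BYTES:
        raise ValueError(f"{p.name} is {size} bytes, over the {MAX_IMAGE_BYTES}-byte limit")
    return media_type
```

The returned string can be used as the `media_type` value of a base64 image block, which avoids hard-coding `image/png` for files that are actually JPEG or WebP.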

Basic API usage

Method 1: Local image (base64)

python
import anthropic, base64

client = anthropic.Anthropic()

with open('screenshot.png', 'rb') as f:
    image_data = base64.standard_b64encode(f.read()).decode('utf-8')

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {"type": "text", "text": "Describe what you see in this screenshot."}
        ]
    }]
)
print(response.content[0].text)

Method 2: Image by URL

python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Analyze this chart and extract the key data points."}
        ]
    }]
)
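
Both methods produce the same kind of content block and differ only in the `source` field, so one helper can cover them. The sketch below assumes a hypothetical function name, `image_block`, and treats any string starting with `http://` or `https://` as a URL; everything else is read as a local file.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(ref: str) -> dict:
    """Build an image content block from an http(s) URL or a local file path."""
    if ref.startswith(("http://", "https://")):
        return {"type": "image", "source": {"type": "url", "url": ref}}

    # Guess the media type from the file extension (older Python versions
    # may not know every image extension and will return None here).
    media_type, _ = mimetypes.guess_type(ref)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image media type for {ref!r}")

    data = base64.standard_b64encode(Path(ref).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The result can be placed directly in the `content` list of a user message, followed by a text block with the prompt.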

Scenario 1: Screenshot to React code

python
def screenshot_to_react(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
    prompt = """
    Convert this UI screenshot to a React component.
    Requirements:
    - TypeScript
    - Tailwind CSS for styling
    - Match the layout and colors as closely as possible
    - Make it responsive (mobile-first)
    - Use semantic HTML
    Output only the component code.
    """
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": data}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text

code = screenshot_to_react('figma-design.png')
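
The prompt asks for the component code only, but a reply may still arrive wrapped in a Markdown code fence. As a minimal sketch (the helper name `strip_code_fence` is hypothetical), the wrapper can be removed before the text is written to a `.tsx` file. It only handles the case where the whole reply is a single fenced block; anything else is returned unchanged apart from trimmed whitespace.

```python
import re

# A Markdown fence marker, built this way so this listing stays readable.
TICKS = "`" * 3

# Matches a reply that consists of exactly one fenced block,
# with an optional language tag such as "tsx" after the opening fence.
_FENCE = re.compile(
    rf"^{TICKS}[\w+-]*[ \t]*\n(.*?)\n{TICKS}[ \t]*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Return the code inside a single surrounding Markdown fence, if present."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    return match.group(1) if match else stripped
```

A typical use would be `Path("Component.tsx").write_text(strip_code_fence(code), encoding="utf-8")`, where the output file name is an example.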

Scenario 2: OCR text extraction

python
def extract_text(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
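    # Map the file extension to a media type (falls back to PNG if unknown)
    ext = image_path.split('.')[-1].lower()
    media_type = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg', 'gif': 'image/gif',
                  'png': 'image/png', 'webp': 'image/webp'}.get(ext, 'image/png')
    
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Claude 3.5 Haiku model ID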
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": "Extract all text from this image. Preserve formatting (tables, lists). Output only the extracted text."}
            ]
        }]
    )
    return response.content[0].text

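# Batch-process scanned documents
import glob
import os

for img in glob.glob('scanned/*.png'):
    text = extract_text(img)
    # Write the result next to the image, e.g. scanned/page1.txt
    with open(os.path.splitext(img)[0] + '.txt', 'w', encoding='utf-8') as f:
        f.write(text)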
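
Because every image costs an API call, a batch run is easier to resume if files that already have output are skipped. The sketch below is one possible approach under stated assumptions: `pending_images` and `IMAGE_SUFFIXES` are hypothetical names, and "already processed" is assumed to mean that a `.txt` file with the same base name exists next to the image.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".gif"}


def pending_images(folder: str) -> list[Path]:
    """List images in `folder` that do not yet have a sibling .txt file."""
    todo = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # not an image format listed in this article
        if path.with_suffix(".txt").exists():
            continue  # already extracted on a previous run
        todo.append(path)
    return todo
```

Each returned path can then be passed to the extraction function, for example `img.with_suffix(".txt").write_text(extract_text(str(img)), encoding="utf-8")` inside a loop over `pending_images("scanned")`.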

Scenario 3: Chart analysis

python
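# Placeholder URL for illustration; point this at your own chart image
chart_url = "https://example.com/chart.png"

response = client.messages.create(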
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": chart_url}},
            {"type": "text", "text": """
Analyze this chart:
1. What type of chart is this?
2. Extract all data points as JSON
3. Identify the trend (increasing/decreasing/stable)
4. What's the highest and lowest value?
5. Key insight in one sentence
            """}
        ]
    }]
)
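
The prompt asks for the data points as JSON, but the reply also contains prose for the other numbered questions, so the JSON has to be located inside the text. A minimal sketch, assuming a hypothetical helper name `extract_first_json` and that the first parseable object or array in the reply is the one wanted:

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object or array embedded in `text`, else None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    return None
```

Calling `extract_first_json(response.content[0].text)` would return the parsed data, or `None` when the reply contains no valid JSON. Values read off a chart are estimates, so they are worth spot-checking against the source data.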

Scenario 4: Design review

python
def review_design(design_img, spec_img=None):
    content = []
    with open(design_img, 'rb') as f:
        d = base64.standard_b64encode(f.read()).decode()
    content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    if spec_img:
        with open(spec_img, 'rb') as f:
            d2 = base64.standard_b64encode(f.read()).decode()
        content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d2}})
        content.append({"type": "text", "text": "First image is the implementation, second is the spec. Find differences."})
    else:
        content.append({"type": "text", "text": "Review this UI for: accessibility issues, spacing inconsistencies, color contrast, missing hover states."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

Using images in the Claude Code terminal

bash
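# Paste a screenshot directly in interactive mode
claude
# Then press Ctrl+V to paste the screenshot (supported on macOS/Linux)
# or drag and drop an image file into the terminal

# Non-interactive mode: reference the image by its path in the prompt
claude -p "Convert the design in ./design.png to a React component"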

Multi-image comparison

python
# Compare two versions of a UI
def compare_screenshots(before_path, after_path):
    images = []
    for path in [before_path, after_path]:
        with open(path, 'rb') as f:
            d = base64.standard_b64encode(f.read()).decode()
        images.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    images.append({"type": "text", "text": "Compare these two screenshots. List all visual differences."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": images}]
    )
    return response.content[0].text
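
When several images are sent together, the model has to infer which one is "before" and which is "after" from their order alone. One way to make this explicit is to put a short text label in front of each image. The sketch below assumes a hypothetical helper, `labeled_image_content`, that takes raw image bytes so it can be built and inspected without calling the API.

```python
import base64


def labeled_image_content(images, prompt):
    """Interleave "Image N (label):" text blocks with base64 image blocks.

    `images` is a list of (label, media_type, raw_bytes) tuples.
    """
    content = []
    for index, (label, media_type, raw) in enumerate(images, start=1):
        # The label goes immediately before the image it describes.
        content.append({"type": "text", "text": f"Image {index} ({label}):"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(raw).decode("utf-8"),
            },
        })
    # The actual instruction comes last, after all images.
    content.append({"type": "text", "text": prompt})
    return content
```

The returned list can be passed as the `content` of a single user message, and the prompt can then refer to the images by label, for example "List what changed from Image 1 (before) to Image 2 (after)."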

Source: Vision API, Anthropic official documentation
