Deep Dive

Claude Multimodal in Practice: Analyzing Images with the Vision API, Screenshot-to-Code, and OCR Extraction

A complete hands-on guide to the Claude Vision multimodal API: image upload methods (base64/URL), converting screenshots directly into React component code, OCR text extraction, data chart analysis, design review, PDF page handling, and the full workflow for uploading images in the Claude Code terminal.

2026/3/15 · 4 min read · Claude · Eagle

Claude's Vision capability lets it understand image content directly: analyzing screenshots, recognizing text, interpreting charts, and turning design mockups into code. This article walks through the practical scenarios.

Supported image formats

  • JPEG, PNG, GIF, WebP
  • Maximum size per image: 5 MB (base64) or by URL reference
  • Up to 20 images per request
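These limits are worth validating locally before uploading. A minimal sketch (the extension-to-media-type table and the 5 MB cap mirror the list above; `check_image` is my own helper name, not part of the SDK):

```python
import os

# Media types the Vision API accepts, keyed by file extension.
MEDIA_TYPES = {
    'jpg': 'image/jpeg', 'jpeg': 'image/jpeg',
    'png': 'image/png', 'gif': 'image/gif', 'webp': 'image/webp',
}
MAX_BYTES = 5 * 1024 * 1024  # 5 MB limit for base64 uploads

def check_image(path):
    """Return the media type for `path`, or raise if it violates a limit."""
    ext = path.rsplit('.', 1)[-1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: .{ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"{path} exceeds the 5 MB base64 limit")
    return MEDIA_TYPES[ext]
```

Calling this before the API request turns an opaque 400 error into a clear local one.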

Basic API usage

Method 1: Local image (base64)

python
import anthropic, base64

client = anthropic.Anthropic()

with open('screenshot.png', 'rb') as f:
    image_data = base64.standard_b64encode(f.read()).decode('utf-8')

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {"type": "text", "text": "Describe what you see in this screenshot."}
        ]
    }]
)
print(response.content[0].text)

Method 2: Image by URL

python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text", "text": "Analyze this chart and extract the key data points."}
        ]
    }]
)
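The two source types can be unified in a small helper that builds the right content block from either an http(s) URL or a local file path (a convenience wrapper of my own, not an SDK function):

```python
import base64

def image_block(src, media_type='image/png'):
    """Build a Vision API image content block from a URL or a local path."""
    if src.startswith(('http://', 'https://')):
        return {"type": "image", "source": {"type": "url", "url": src}}
    # Local file: read and base64-encode it.
    with open(src, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    return {"type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data}}
```

A message's content list can then mix sources freely, e.g. `[image_block('shot.png'), image_block('https://example.com/chart.png'), {"type": "text", "text": "Compare these."}]`.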

Scenario 1: Screenshot to React code

python
def screenshot_to_react(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
    prompt = """
    Convert this UI screenshot to a React component.
    Requirements:
    - TypeScript
    - Tailwind CSS for styling
    - Match the layout and colors as closely as possible
    - Make it responsive (mobile-first)
    - Use semantic HTML
    Output only the component code.
    """
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": data}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text

code = screenshot_to_react('figma-design.png')
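Despite the "Output only the component code" instruction, models sometimes still wrap the answer in a markdown fence, so it is safer to strip one before writing the file (my own post-processing helper, not part of the API):

```python
def strip_fence(text):
    """Remove a surrounding markdown code fence (e.g. ```tsx ... ```), if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith('```'):
        lines = lines[1:]             # drop the opening fence line
    if lines and lines[-1].strip() == '```':
        lines = lines[:-1]            # drop the closing fence line
    return '\n'.join(lines)
```

Then `with open('Component.tsx', 'w') as f: f.write(strip_fence(code))` saves a clean file either way.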

Scenario 2: OCR text extraction

python
def extract_text(image_path):
    with open(image_path, 'rb') as f:
        data = base64.standard_b64encode(f.read()).decode('utf-8')
    
    ext = image_path.split('.')[-1].lower()
    media_type = {'jpg': 'image/jpeg', 'jpeg': 'image/jpeg',
                  'png': 'image/png', 'webp': 'image/webp'}.get(ext, 'image/png')
    
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": "Extract all text from this image. Preserve formatting (tables, lists). Output only the extracted text."}
            ]
        }]
    )
    return response.content[0].text

# Batch-process scanned documents
import glob
for img in glob.glob('scanned/*.png'):
    text = extract_text(img)
    with open(img.replace('.png', '.txt'), 'w') as f:
        f.write(text)
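For large batches the sequential loop above can be parallelized with the stdlib; passing the extractor in as a parameter keeps the helper generic (a sketch of my own, assuming `extract_text` as defined above):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_extract(paths, extractor, workers=4):
    """Apply `extractor` (e.g. extract_text) to each path concurrently.

    Returns a dict mapping each path to its extracted text, in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(extractor, paths)))
```

Usage: `results = batch_extract(glob.glob('scanned/*.png'), extract_text)`. Keep `workers` modest to stay within API rate limits.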

Scenario 3: Data chart analysis

python
chart_url = "https://example.com/chart.png"  # placeholder; any hosted chart image

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": chart_url}},
            {"type": "text", "text": """
Analyze this chart:
1. What type of chart is this?
2. Extract all data points as JSON
3. Identify the trend (increasing/decreasing/stable)
4. What's the highest and lowest value?
5. Key insight in one sentence
            """}
        ]
    }]
)
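The reply mixes prose with the JSON requested in step 2; a small scanner built on `json.JSONDecoder.raw_decode` can pull out the first embedded JSON value (a helper of my own, not part of the SDK):

```python
import json

def first_json(text):
    """Return the first JSON object or array embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in '{[':
            try:
                value, _ = decoder.raw_decode(text[i:])
                return value
            except json.JSONDecodeError:
                continue  # false start; keep scanning
    return None
```

Applied to `response.content[0].text`, this yields the data points as a Python structure without asking the model for machine-only output.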

Scenario 4: Design review

python
def review_design(design_img, spec_img=None):
    content = []
    with open(design_img, 'rb') as f:
        d = base64.standard_b64encode(f.read()).decode()
    content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    if spec_img:
        with open(spec_img, 'rb') as f:
            d2 = base64.standard_b64encode(f.read()).decode()
        content.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d2}})
        content.append({"type": "text", "text": "First image is the implementation, second is the spec. Find differences."})
    else:
        content.append({"type": "text", "text": "Review this UI for: accessibility issues, spacing inconsistencies, color contrast, missing hover states."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text

Using images in the Claude Code terminal

bash
# Paste a screenshot directly in interactive mode
claude
# then press Ctrl+V to paste the screenshot (supported on macOS/Linux)
# or drag an image file onto the terminal

# Non-interactive mode
claude -p "Convert this design to React component" --image design.png

Multi-image comparison

python
# Compare two versions of a UI
def compare_screenshots(before_path, after_path):
    images = []
    for path in [before_path, after_path]:
        with open(path, 'rb') as f:
            d = base64.standard_b64encode(f.read()).decode()
        images.append({"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": d}})
    
    images.append({"type": "text", "text": "Compare these two screenshots. List all visual differences."})
    
    response = client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024,
        messages=[{"role": "user", "content": images}]
    )
    return response.content[0].text

Source: Vision API - Anthropic official documentation
