OS Use

201 installs · 2 stars

Summary

This is a comprehensive cross-platform automation toolkit that gives Claude direct control over your desktop. It handles screenshots, visual recognition with OpenCV, mouse and keyboard control via pyautogui, and window management through AppleScript on macOS or pywinauto on Windows. You'd reach for this when you need to automate repetitive UI tasks, run automated testing without Selenium, or extract data from applications without APIs. The implementation is mature with proper error handling patterns and retry logic built in. Fair warning: you'll need to grant Accessibility and Screen Recording permissions on macOS, and some operations require admin rights on Windows. It's genuinely useful for desktop automation but expect to do some permission wrangling on first setup.

Install to Claude Code

npx -y skills add zrong/skills --skill os-use --agent claude-code

Installs into .claude/skills of the current project.

Files

SKILL.md

OS Use - Cross-Platform OS Automation

A comprehensive cross-platform toolkit for OS automation, screenshot capture, visual recognition, mouse/keyboard control, and window management. Supports macOS 12+ and Windows 10+.

Platform Support Matrix

Feature	macOS Implementation	Windows Implementation
Screenshot	`pyautogui` + `PIL`	`pyautogui` + `PIL`
Visual Recognition	`opencv-python` + `pyautogui`	`opencv-python` + `pyautogui`
Mouse/Keyboard	`pyautogui`	`pyautogui`
Window Management	`AppleScript` (native)	`pywinauto` / `pygetwindow`
Application Control	`AppleScript` / `subprocess`	`subprocess` / `pywinauto`
Browser Automation	Chrome DevTools MCP	Chrome DevTools MCP

Capabilities

1. Screenshot Capture 📸

Universal (macOS & Windows):

Full screen capture
Region capture (specified coordinates)
Window capture (specific application window)
Clipboard screenshot access

Implementation: pyautogui.screenshot() + PIL.Image

2. Visual Recognition 👁️

Universal (macOS & Windows):

Image matching/locating on screen
Template matching with confidence threshold
Multi-scale matching (handle different resolutions)
Color detection and region extraction

Optional OCR:

Text recognition from screenshots (requires pytesseract + Tesseract OCR engine)

Implementation: opencv-python + pyautogui.locateOnScreen()
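The confidence threshold is a cut-off on a similarity score computed between the template and each candidate patch of the screen. As an illustration only, here is a pure-Python sketch of a normalized correlation score, the kind of measure OpenCV exposes as `cv2.TM_CCOEFF_NORMED`. The names `ncc` and `best_match` are hypothetical and not part of the skill, and this loop is far too slow for real screenshots, where OpenCV should do the work.

```python
import math


def ncc(patch, template):
    """Normalized correlation of two equal-sized grayscale grids, in [-1, 1]."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A flat patch or template has no variance; treat it as "no match"
    return num / den if den else 0.0


def best_match(image, template, confidence=0.8):
    """Return (x, y, score) of the best position, or None below the threshold."""
    th, tw = len(template), len(template[0])
    best = None
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if best is None or score > best[2]:
                best = (x, y, score)
    if best is None or best[2] < confidence:
        return None
    return best
```

Raising `confidence` rejects weaker matches, while lowering it tolerates small rendering differences at the cost of more false positives.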

3. Mouse & Keyboard Control 🖱️⌨️

Universal (macOS & Windows):

Mouse movement (absolute and relative coordinates)
Mouse clicking (left, right, middle, double-click)
Mouse dragging and dropping
Scroll wheel operations
Keyboard text input
Keyboard shortcuts and hotkeys
Special key combinations

Implementation: pyautogui

4. Window Management 🪟

macOS Implementation:

List all application windows
Get window position, size, title
Activate/minimize/close windows
Move and resize windows
Launch/quit applications

Implementation: AppleScript via subprocess
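A minimal sketch of how AppleScript can be driven from Python through the `osascript` command-line tool. The helper names below are hypothetical, and the script assumes the System Events process name matches the application name, which is not true for every app.

```python
import subprocess


def applescript_string(text):
    """Quote text as an AppleScript string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def activate_script(app_name):
    """AppleScript source that brings an application to the front."""
    return f"tell application {applescript_string(app_name)} to activate"


def window_bounds_script(app_name):
    """AppleScript source that reads the first window's position and size."""
    return (
        'tell application "System Events" to tell process '
        f"{applescript_string(app_name)} to get {{position, size}} of window 1"
    )


def run_applescript(source, timeout=10):
    """Run AppleScript through osascript and return its stdout (macOS only)."""
    result = subprocess.run(
        ["osascript", "-e", source],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout.strip()
```

On macOS, `run_applescript(activate_script("Safari"))` would bring Safari to the front. Scripts that go through System Events need the Accessibility permission.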

Windows Implementation:

Same capabilities as macOS
Additional: Get window handle (HWND), process information
Better integration with Windows window manager

Implementation: pywinauto or pygetwindow

5. Browser Automation 🌐

Universal (macOS & Windows):

Webpage screenshots
Element screenshots
Page navigation
Form filling and clicking
Network monitoring
Performance analysis

Implementation: Chrome DevTools MCP (separate tool)

6. System Integration 🔧

Clipboard Operations:

Read/write clipboard content
Support images and text

Implementation: pyperclip + pyautogui
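One practical use of the clipboard is entering text that `pyautogui.typewrite()` cannot type reliably, such as non-ASCII characters: copy the text, then send the paste shortcut. A minimal sketch, assuming a hypothetical helper named `paste_text`; the `copy` and `hotkey` parameters exist only so the logic can be exercised without a display.

```python
import sys


def paste_text(text, copy=None, hotkey=None, platform=sys.platform):
    """Put text on the clipboard, then send the platform's paste shortcut."""
    if copy is None:
        import pyperclip  # lazy import: only needed for real use
        copy = pyperclip.copy
    if hotkey is None:
        import pyautogui
        hotkey = pyautogui.hotkey
    # Command+V on macOS, Ctrl+V elsewhere
    modifier = "command" if platform == "darwin" else "ctrl"
    copy(text)
    hotkey(modifier, "v")
    return modifier
```

This overwrites whatever the user had on the clipboard, so a caller that cares can read the previous content with `pyperclip.paste()` first and restore it afterwards.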

Technical Implementation Details

Python Environment Setup

# Create virtual environment
python3 -m venv ~/.nanobot/workspace/macos-automation/.venv

# Activate
source ~/.nanobot/workspace/macos-automation/.venv/bin/activate

# Install dependencies
pip install pyautogui opencv-python-headless numpy Pillow pyperclip

# macOS specific
# (AppleScript is built-in, no installation needed)

# Windows specific
pip install pywinauto pygetwindow

Key Libraries Reference

Library	Version	Purpose
`pyautogui`	0.9.54+	Screenshot, mouse/keyboard control
`opencv-python-headless`	4.11.0.84+	Image recognition, computer vision
`numpy`	2.4.2+	Numerical operations for OpenCV
`Pillow`	12.1.1+	Image processing
`pyperclip`	Latest	Clipboard operations
`pywinauto`	Latest	Windows window management
`pygetwindow`	Latest	Window control (primarily Windows; macOS support is limited)

Platform-Specific Notes

macOS Specifics

Permissions Required:

Accessibility: System Settings > Privacy & Security > Accessibility
Screen Recording: System Settings > Privacy & Security > Screen Recording

AppleScript Quirks:

Some modern apps (e.g., Chrome) may have limited AppleScript support
Window titles may be truncated or localized
Some operations require app to be frontmost

Coordinate System:

Origin (0, 0) at top-left
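Retina displays: screenshots are captured at the display's pixel resolution (typically 2x), while mouse coordinates use logical points, so positions taken from a screenshot may need to be divided by the scale factor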
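A minimal sketch of converting a position measured in screenshot pixels into mouse coordinates on a scaled display. The helper names are hypothetical; the scale is derived by comparing the screenshot size with the reported screen size rather than assuming a fixed factor of 2.

```python
def screenshot_scale(screenshot_size, screen_size):
    """Ratio of screenshot pixels to logical screen units, per axis."""
    return (screenshot_size[0] / screen_size[0],
            screenshot_size[1] / screen_size[1])


def to_screen_point(pixel_xy, screenshot_size, screen_size):
    """Convert a pixel position in a screenshot to mouse coordinates."""
    sx, sy = screenshot_scale(screenshot_size, screen_size)
    return (round(pixel_xy[0] / sx), round(pixel_xy[1] / sy))
```

With PyAutoGUI, `screenshot_size` would come from `pyautogui.screenshot().size` and `screen_size` from `pyautogui.size()`. On a display without scaling both are equal and the conversion leaves coordinates unchanged.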

Windows Specifics

Administrator Privileges:

Some operations (e.g., interacting with elevated windows) may require admin rights

High DPI Displays:

Windows scaling may affect coordinate accuracy
Use pyautogui.size() to get actual screen dimensions

Window Handle (HWND):

Windows provides low-level window handles for precise control
pywinauto provides both high-level and low-level access

Error Handling Patterns

import pyautogui
import time

# Pattern 1: Retry with backoff
def retry_with_backoff(func, max_retries=3, base_delay=1):
    for i in range(max_retries):
        try:
            return func()
        except Exception as e:
            if i == max_retries - 1:
                raise
            delay = base_delay * (2 ** i)
            print(f"Retry {i+1}/{max_retries} after {delay}s: {e}")
            time.sleep(delay)

# Pattern 2: Safe operations with fallback
def safe_screenshot(output_path):
    try:
        screenshot = pyautogui.screenshot()
        screenshot.save(output_path)
        return output_path
    except Exception as e:
        print(f"Screenshot failed: {e}")
        return None

# Pattern 3: Coordinate boundary checking
def safe_click(x, y, max_x=None, max_y=None):
    """Click safely, clamping the coordinates to the screen bounds."""
    if max_x is None or max_y is None:
        max_x, max_y = pyautogui.size()
    
    x = max(0, min(x, max_x - 1))
    y = max(0, min(y, max_y - 1))
    
    pyautogui.click(x, y)
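A related pattern, sketched here as an assumption rather than as part of the skill, is polling for a condition instead of pausing for a fixed `time.sleep()`. The hypothetical `wait_until` helper returns as soon as the condition holds and gives up after a timeout; the `clock` and `sleep` parameters are there only to make it easy to test.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a truthy value or the timeout passes.

    Returns the truthy value, or None on timeout.
    """
    deadline = clock() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if clock() >= deadline:
            return None
        sleep(interval)
```

A caller could pass a function that looks for a button on screen and returns its location, so the script continues the moment the button appears.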

Usage Examples by Scenario

Scenario 1: Automated Testing

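"""
Automated UI test example.
Tests a hypothetical login page.
"""
import pyautogui
import time

def find_on_screen(image, confidence):
    """Return the matched box, or None when the image is not on screen."""
    try:
        return pyautogui.locateOnScreen(image, confidence=confidence)
    except pyautogui.ImageNotFoundException:
        # Recent PyAutoGUI releases raise instead of returning None
        return None

def test_login_flow():
    # 1. Capture the initial state
    initial_screenshot = pyautogui.screenshot()
    initial_screenshot.save("test_01_initial.png")
    
    # 2. Find and click the login button
    button_location = find_on_screen("login_button.png", confidence=0.9)
    if button_location is None:
        print("❌ Test failed: login button not found")
        return False
    center = pyautogui.center(button_location)
    pyautogui.click(center.x, center.y)
    time.sleep(1)
    
    # 3. Type the username
    pyautogui.typewrite("testuser@example.com", interval=0.01)
    pyautogui.press('tab')
    
    # 4. Type the password
    pyautogui.typewrite("TestPassword123", interval=0.01)
    
    # 5. Submit
    pyautogui.press('return')
    time.sleep(2)
    
    # 6. Verify the result
    result_screenshot = pyautogui.screenshot()
    result_screenshot.save("test_02_result.png")
    
    # Check whether the success message appeared
    success_indicator = find_on_screen("success_message.png", confidence=0.8)
    
    if success_indicator:
        print("✅ Test passed: login succeeded")
        return True
    else:
        print("❌ Test failed: success message not found")
        return False

# Run the test
if __name__ == "__main__":
    test_login_flow()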

Scenario 2: Data Entry Automation

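"""
Data entry automation example.
Fills a web form automatically from Excel data.
"""
import sys
import time

import pandas as pd  # reading .xlsx files also requires openpyxl
import pyautogui

# Select-all uses Command on macOS and Ctrl elsewhere
SELECT_ALL_KEY = 'command' if sys.platform == 'darwin' else 'ctrl'

def find_on_screen(image, confidence=0.8):
    """Return the matched box, or None when the image is not on screen."""
    try:
        return pyautogui.locateOnScreen(image, confidence=confidence)
    except pyautogui.ImageNotFoundException:
        # Recent PyAutoGUI releases raise instead of returning None
        return None

def automate_data_entry(excel_file, form_template):
    """
    Read data from Excel and fill it into a form.
    
    Args:
        excel_file: path to the Excel file
        form_template: mapping of form field names to Excel column names
    """
    # 1. Read the Excel data
    df = pd.read_excel(excel_file)
    print(f"Read {len(df)} records")
    
    # 2. Process each record
    for index, row in df.iterrows():
        print(f"\nProcessing record {index + 1}...")
        
        # 3. Fill in each field
        for field_name, column_name in form_template.items():
            value = row.get(column_name, '')
            
            # Locate the form field (requires a prepared screenshot of each field)
            field_location = find_on_screen(f"form_field_{field_name}.png")
            
            if field_location:
                # Click the field
                center = pyautogui.center(field_location)
                pyautogui.click(center.x, center.y)
                time.sleep(0.2)
                
                # Replace the current content with the new value
                # Note: typewrite() only types characters available on the keyboard
                pyautogui.hotkey(SELECT_ALL_KEY, 'a')  # select all
                pyautogui.typewrite(str(value), interval=0.01)
                time.sleep(0.2)
            else:
                print(f"  ⚠️ Field not found: {field_name}")
        
        # 4. Submit the form
        submit_btn = find_on_screen("submit_button.png")
        if submit_btn:
            center = pyautogui.center(submit_btn)
            pyautogui.click(center.x, center.y)
            print("  ✅ Submitted")
            time.sleep(2)  # wait for the submission to finish
        else:
            print("  ⚠️ Submit button not found")
        
        # 5. Prepare for the next record
        # This may require clicking "Add new record" or returning to the list
        time.sleep(1)
    
    print("\n🎉 All records processed!")

# Usage example
if __name__ == "__main__":
    # Form template: field name -> Excel column name
    form_template = {
        "name": "Name",
        "email": "Email",
        "phone": "Phone",
        "address": "Address"
    }
    
    automate_data_entry("data.xlsx", form_template)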

Scenario 3: Screen Monitoring & Alerting

"""
Screen monitoring and alerting example.
Watches a screen region and sends a notification when it changes.
"""
import pyautogui
import cv2
import numpy as np
import time
from datetime import datetime

def monitor_screen_region(region, template_image=None, check_interval=5, callback=None):
    """
    Monitor a region of the screen for changes.
    
    Args:
        region: (left, top, width, height) of the monitored region
        template_image: path of a template image to look for (optional)
        check_interval: seconds between checks
        callback: function called when a change is detected
    
    Returns:
        The monitoring session object. The call blocks until the loop is
        interrupted with Ctrl+C or session.stop() is called from another thread.
    """
    class MonitorSession:
        def __init__(self):
            self.running = True
            self.baseline = None
        
        def stop(self):
            self.running = False
    
    session = MonitorSession()
    
    print(f"🔍 Monitoring region: {region}")
    print(f"⏱️  Check interval: {check_interval}s")
    print("Press Ctrl+C to stop\n")
    
    try:
        while session.running:
            # Capture the current region (drop any alpha channel)
            current = pyautogui.screenshot(region=region).convert("RGB")
            current_array = np.array(current)
            
            if template_image:
                # Mode 1: look for the template image inside the region
                try:
                    template_location = pyautogui.locateOnScreen(
                        template_image,
                        region=region,
                        confidence=0.8
                    )
                except pyautogui.ImageNotFoundException:
                    # Recent PyAutoGUI releases raise instead of returning None
                    template_location = None
                
                if template_location:
                    print(f"✅ [{datetime.now()}] Template found: {template_location}")
                    if callback:
                        callback('template_found', {
                            'location': template_location,
                            'screenshot': current
                        })
            else:
                # Mode 2: detect changes
                if session.baseline is None:
                    session.baseline = current_array
                    print(f"📸 [{datetime.now()}] Baseline image captured")
                else:
                    # Compute the difference
                    diff = cv2.absdiff(session.baseline, current_array)
                    diff_gray = cv2.cvtColor(diff, cv2.COLOR_RGB2GRAY)
                    diff_score = np.mean(diff_gray)
                    
                    if diff_score > 10:  # adjustable threshold
                        print(f"⚠️  [{datetime.now()}] Change detected! Difference score: {diff_score:.2f}")
                        if callback:
                            callback('change_detected', {
                                'diff_score': diff_score,
                                'screenshot': current,
                                'baseline': session.baseline
                            })
                        # Update the baseline
                        session.baseline = current_array
            
            time.sleep(check_interval)
    
    except KeyboardInterrupt:
        print("\n🛑 Monitoring stopped")
    
    return session

# Usage example
def alert_callback(event_type, data):
    """Example alert callback."""
    if event_type == 'template_found':
        print(f"🎯 Template appeared at: {data['location']}")
        # Send a notification, send an email, or run an action here
    elif event_type == 'change_detected':
        print(f"📊 Change intensity: {data['diff_score']}")
        # Save the changed screenshot
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        data['screenshot'].save(f"change_{timestamp}.png")

if __name__ == "__main__":
    # Example 1: monitor the screen for changes
    print("=== Monitoring screen changes ===")
    monitor = monitor_screen_region(
        region=(0, 0, 1920, 1080),  # full screen
        check_interval=5,           # check every 5 seconds
        callback=alert_callback
    )
    
    # Stop after 10 minutes from another thread (in real use it can run indefinitely)
    # time.sleep(600)
    # monitor.stop()
    
    # Example 2: look for a specific image
    # monitor = monitor_screen_region(
    #     region=(0, 0, 1920, 1080),
    #     template_image="target_button.png",  # image to look for
    #     check_interval=2,
    #     callback=alert_callback
    # )

Advanced Techniques

Handling Multiple Monitors

import pyautogui

def get_all_screen_sizes():
    """Get the sizes of all displays (per-monitor details are shown for Windows only)."""
    # On macOS this returns the primary screen size
    # On Windows, pygetwindow or win32api can report multi-monitor information
    
    primary = pyautogui.size()
    print(f"Primary screen size: {primary}")
    
    # Windows example (requires pywin32)
    try:
        import win32api
        monitors = win32api.EnumDisplayMonitors()
        for i, monitor in enumerate(monitors):
            print(f"Monitor {i+1}: {monitor[2]}")
    except ImportError:
        pass
    
    return primary

def screenshot_specific_monitor(monitor_num=0):
    """Capture a specific monitor (experimental)."""
    # pyautogui currently targets the primary display
    # Multi-monitor support needs platform-specific code
    pass
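One possible way to fill in the per-monitor capture is the third-party `mss` library, which the roadmap below lists for multi-monitor support. This is a sketch under that assumption, not the skill's implementation; `pick_monitor` and `screenshot_monitor` are hypothetical names.

```python
def pick_monitor(monitors, monitor_num):
    """Select one physical monitor from an mss-style monitor list.

    mss puts the combined virtual screen at index 0 and the individual
    monitors after it, so monitor_num=0 maps to monitors[1].
    """
    physical = monitors[1:]
    if not 0 <= monitor_num < len(physical):
        raise IndexError(
            f"monitor {monitor_num} not found ({len(physical)} available)"
        )
    return physical[monitor_num]


def screenshot_monitor(monitor_num=0):
    """Capture one monitor and return it as a PIL image."""
    import mss  # third-party: pip install mss
    from PIL import Image

    with mss.mss() as sct:
        shot = sct.grab(pick_monitor(sct.monitors, monitor_num))
        # mss returns BGRA bytes; convert them to an RGB PIL image
        return Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
```

The returned image can be saved or passed to OpenCV in the same way as the result of `pyautogui.screenshot()`.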

Performance Optimization

import cv2
import numpy as np
import pyautogui
import time

class ScreenCache:
    """Screenshot cache."""
    
    def __init__(self, cache_duration=0.5):
        self.cache_duration = cache_duration
        self.last_capture = None
        self.last_capture_time = 0
    
    def get_screenshot(self, region=None):
        """Get a screenshot (cached for full-screen captures)."""
        current_time = time.time()
        
        # Check whether the cached capture is still valid
        if (self.last_capture is not None and 
            current_time - self.last_capture_time < self.cache_duration and
            region is None):
            return self.last_capture
        
        # Capture a new screenshot
        screenshot = pyautogui.screenshot(region=region)
        
        if region is None:
            self.last_capture = screenshot
            self.last_capture_time = current_time
        
        return screenshot
    
    def clear_cache(self):
        """Clear the cache."""
        self.last_capture = None
        self.last_capture_time = 0

class FastImageFinder:
    """Fast image finder (tries the template at several scales)."""
    
    def __init__(self, scales=(0.8, 0.9, 1.0, 1.1, 1.2)):
        # A tuple default avoids sharing one mutable list between instances
        self.scales = scales
    
    def find_multi_scale(self, template_path, screenshot=None, confidence=0.8):
        """
        Multi-scale image search.
        
        Returns:
            (x, y, scale) for the first scale that reaches the confidence, or None
        """
        if screenshot is None:
            screenshot = pyautogui.screenshot()
        
        template = cv2.imread(template_path)
        if template is None:
            return None
        
        # Drop any alpha channel, then convert to OpenCV's BGR channel order
        screenshot_cv = cv2.cvtColor(
            np.array(screenshot.convert("RGB")), cv2.COLOR_RGB2BGR
        )
        
        for scale in self.scales:
            # Scale the template (INTER_AREA suits shrinking, INTER_LINEAR enlarging)
            scaled_template = cv2.resize(
                template,
                None,
                fx=scale,
                fy=scale,
                interpolation=cv2.INTER_AREA if scale < 1 else cv2.INTER_LINEAR
            )
            
            # Skip scales where the template no longer fits inside the screenshot
            h, w = scaled_template.shape[:2]
            if h > screenshot_cv.shape[0] or w > screenshot_cv.shape[1]:
                continue
            
            # Template matching
            result = cv2.matchTemplate(
                screenshot_cv,
                scaled_template,
                cv2.TM_CCOEFF_NORMED
            )
            
            _, max_val, _, max_loc = cv2.minMaxLoc(result)
            
            if max_val >= confidence:
                center_x = max_loc[0] + w // 2
                center_y = max_loc[1] + h // 2
                return (center_x, center_y, scale)
        
        return None

# Usage example
cache = ScreenCache()
finder = FastImageFinder()

# Fast screenshot (cached)
screenshot = cache.get_screenshot()

# Multi-scale image search
result = finder.find_multi_scale("button.png", screenshot)
if result:
    x, y, scale = result
    print(f"Image found: ({x}, {y}), scale: {scale}")

Security Considerations

"""
Security best practices
"""

import pyautogui
import hashlib
import time

class SecureAutomation:
    """Secure automation wrapper."""
    
    def __init__(self):
        self.action_log = []
        self.max_retries = 3
        self.rate_limit_delay = 0.1  # delay between actions
    
    def log_action(self, action, details):
        """Record an action in the log."""
        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log_entry = {
            'timestamp': timestamp,
            'action': action,
            'details': details,
            # Short ID for correlating log lines; MD5 is not a security measure here
            'hash': hashlib.md5(f"{timestamp}{action}{details}".encode()).hexdigest()[:8]
        }
        self.action_log.append(log_entry)
    
    def safe_click(self, x, y, description=""):
        """Click safely (with validation)."""
        try:
            # Verify that the coordinates are within the screen bounds
            screen_width, screen_height = pyautogui.size()
            if not (0 <= x < screen_width and 0 <= y < screen_height):
                raise ValueError(f"Coordinates ({x}, {y}) are outside the screen")
            
            # Perform the click
            pyautogui.moveTo(x, y, duration=0.2)
            time.sleep(self.rate_limit_delay)
            pyautogui.click()
            
            # Log the action
            self.log_action('click', f"({x}, {y}) - {description}")
            
            return True
            
        except Exception as e:
            self.log_action('click_failed', f"({x}, {y}) - Error: {str(e)}")
            return False
    
    def safe_typewrite(self, text, interval=0.01):
        """Type safely (sensitive content is not logged)."""
        try:
            pyautogui.typewrite(text, interval=interval)
            self.log_action('typewrite', f"typed {len(text)} characters [content hidden]")
            return True
        except Exception as e:
            self.log_action('typewrite_failed', f"Error: {str(e)}")
            return False
    
    def get_action_report(self):
        """Generate an action report."""
        total = len(self.action_log)
        successful = sum(1 for log in self.action_log if 'failed' not in log['action'])
        failed = total - successful
        # Avoid dividing by zero when nothing has been logged yet
        success_rate = (successful / total * 100) if total else 0.0
        
        report = f"""
=== Automation Action Report ===
Total actions: {total}
Successful: {successful}
Failed: {failed}
Success rate: {success_rate:.1f}%

Detailed log:
"""
        for log in self.action_log:
            report += f"[{log['timestamp']}] [{log['hash']}] {log['action']}: {log['details']}\n"
        
        return report

# Usage example
secure = SecureAutomation()

# Perform safe operations
secure.safe_click(500, 400, "login button")
secure.safe_typewrite("username@example.com")
secure.safe_click(500, 450, "password field")
secure.safe_typewrite("********")
secure.safe_click(500, 500, "submit button")

# Generate the report
print(secure.get_action_report())

Troubleshooting Guide

Common Issues and Solutions

1. Permission Errors

Symptom: pyautogui fails with permission errors or captures black screenshots.

macOS Solution:

Open System Settings > Privacy & Security > Accessibility
Add your terminal application (e.g., Terminal.app, iTerm.app, or the Python executable)
Repeat for Screen Recording permission

Windows Solution:

Run as Administrator if needed
Check Windows Defender or antivirus isn't blocking

2. Coordinate Inaccuracy

Symptom: Clicks or screenshots miss the intended target.

Possible Causes:

High DPI / Retina display scaling
Multiple monitors with different resolutions
Window decorations or taskbar affecting coordinates

Solution:

import sys
import pyautogui

# Debug: Print screen info
print(f"Screen size: {pyautogui.size()}")
print(f"Mouse position: {pyautogui.position()}")

# Handle high DPI (Windows only; ctypes.windll does not exist on other platforms)
if sys.platform == "win32":
    import ctypes
    ctypes.windll.user32.SetProcessDPIAware()

3. Image Recognition Failures

Symptom: locateOnScreen returns None (or, in recent PyAutoGUI versions, raises ImageNotFoundException) even when the image is visible.

Common Causes:

Resolution mismatch (captured image at different scale)
Color depth differences
Transparency or alpha channel issues
Confidence threshold too high

Solutions:

import pyautogui
import cv2
import numpy as np

# Solution 1: Lower confidence
location = pyautogui.locateOnScreen('button.png', confidence=0.7)  # Default is 0.999; requires OpenCV

# Solution 2: Multi-scale matching (see FastImageFinder class in Performance section)
finder = FastImageFinder(scales=[0.5, 0.75, 1.0, 1.25, 1.5])
result = finder.find_multi_scale('button.png')

# Solution 3: Convert to grayscale for matching
screenshot = pyautogui.screenshot()
screenshot_cv = cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2GRAY)
template = cv2.imread('button.png', cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(screenshot_cv, template, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)

if max_val >= 0.8:
    print(f"Match found, confidence: {max_val}")
    h, w = template.shape
    center_x = max_loc[0] + w // 2
    center_y = max_loc[1] + h // 2
    pyautogui.click(center_x, center_y)

4. Slow Performance

Symptom: Operations are slow, high CPU usage, or noticeable delays.

Optimization Strategies:

Reduce Screenshot Frequency
- Cache screenshots when possible
- Use region-specific captures instead of full screen
Optimize Image Matching
- Resize large images before matching
- Use grayscale matching when color isn't important
- Set appropriate confidence levels
Batch Operations
- Group multiple actions together
- Minimize unnecessary delays

See the "Performance Optimization" section for detailed code examples.
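For region-specific captures, one approach is to search only around the place where a target was last seen. A minimal sketch, assuming a hypothetical helper named `region_around` that expands a previous match box by a margin and clips it to the screen:

```python
def region_around(box, margin, screen_size):
    """Expand a (left, top, width, height) box by margin, clipped to the screen."""
    left, top, width, height = box
    screen_w, screen_h = screen_size
    new_left = max(left - margin, 0)
    new_top = max(top - margin, 0)
    new_right = min(left + width + margin, screen_w)
    new_bottom = min(top + height + margin, screen_h)
    return (new_left, new_top, new_right - new_left, new_bottom - new_top)
```

The result can be passed as the `region` argument of `pyautogui.locateOnScreen()` or `pyautogui.screenshot()`. If the target is not found in the smaller region, falling back to a full-screen search keeps the behavior correct when the target moves.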

5. Application-Specific Issues

Browser Automation:

Modern browsers may block automation
Use Chrome DevTools Protocol instead of pyautogui for web
Consider Playwright or Selenium for complex web automation

Game/Graphics Applications:

DirectX/OpenGL apps may not be capturable by standard screenshot
May require specialized tools (e.g., OBS Studio's capture API)

Protected Content:

DRM-protected content (Netflix, etc.) cannot be screenshotted
This is a system-level restriction

Integration with Other Tools

With ChatGPT/AI Assistants

This skill is designed to work with AI assistants like nanobot. Here's how to integrate:

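# Example: AI assistant using this skill
# parse_intent() and extract_button_name() are placeholders for the
# assistant's own language-understanding logic; they are not defined here.

from datetime import datetime

import pyautogui

def ai_assisted_automation(user_request):
    """
    An AI assistant using the automation skill.
    
    Args:
        user_request: the user's natural-language request
    """
    # 1. The AI parses the user's intent
    intent = parse_intent(user_request)
    
    if intent == 'screenshot':
        # 2. Take the screenshot
        screenshot = pyautogui.screenshot()
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"screenshot_{timestamp}.png"
        screenshot.save(path)
        return f"Screenshot saved to: {path}"
    
    elif intent == 'click_button':
        # 2. Find and click the button
        button_name = extract_button_name(user_request)
        try:
            location = pyautogui.locateOnScreen(f"{button_name}.png")
        except pyautogui.ImageNotFoundException:
            # Recent PyAutoGUI releases raise instead of returning None
            location = None
        if location:
            pyautogui.click(pyautogui.center(location))
            return f"Clicked button: {button_name}"
        else:
            return f"Button not found: {button_name}"
    
    # ... handle other intents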

With CI/CD Pipelines

# Example: GitHub Actions using this skill for visual testing

name: Visual Regression Tests

on: [push, pull_request]

jobs:
  visual-test:
    runs-on: macos-latest  # or windows-latest
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: |
        pip install pyautogui opencv-python-headless numpy Pillow
    
    - name: Run visual tests
      run: python tests/visual_regression.py
    
    - name: Upload screenshots
      uses: actions/upload-artifact@v4
      with:
        name: screenshots
        path: screenshots/

With Monitoring Systems

# Example: Integration with Prometheus/Grafana for screen monitoring

from prometheus_client import Gauge, start_http_server
import cv2
import numpy as np
import pyautogui
import time

# Define metrics
screen_change_gauge = Gauge('screen_change_score', 'Screen change detection score')
template_match_gauge = Gauge('template_match_confidence', 'Template matching confidence')

start_http_server(8000)

def monitoring_loop():
    baseline = None
    
    while True:
        # Capture screen
        current = pyautogui.screenshot()
        current_array = np.array(current)
        
        if baseline is not None:
            # Calculate change
            diff = cv2.absdiff(baseline, current_array)
            diff_score = np.mean(diff)
            screen_change_gauge.set(diff_score)
        
        baseline = current_array
        
        # Check for template
        try:
            location = pyautogui.locateOnScreen('alert_icon.png', confidence=0.8)
            template_match_gauge.set(1.0 if location else 0.0)
        except pyautogui.ImageNotFoundException:
            # Raised by recent PyAutoGUI versions when there is no match
            template_match_gauge.set(0.0)
        
        time.sleep(5)

monitoring_loop()

Future Roadmap

Planned Features

Linux Support
- X11 and Wayland compatibility
- xdotool and scrot integration
- mss for multi-monitor support
AI-Powered Recognition
- Integration with OpenAI GPT-4V or Google Gemini for visual understanding
- Natural language element finding ("click the blue submit button")
- OCR-free text extraction using vision models
Mobile Device Support
- Android: ADB (Android Debug Bridge) integration
- iOS: WebDriverAgent via Appium
- Screenshot and touch simulation
Cloud Integration
- AWS Lambda support for serverless automation
- Azure Functions and GCP Cloud Functions compatibility
- Distributed screenshot processing
Advanced Analytics
- Built-in A/B testing framework for UI changes
- Heatmap generation from user interactions
- Performance regression detection

Contributing

We welcome contributions! Please see the Contributing Guide for details on:

Code style and formatting
Testing requirements
Documentation standards
Pull request process

License

This skill is licensed under the MIT License. See LICENSE for details.

Last Updated: 2026-03-06
Version: 1.0.0
Maintainer: nanobot skills team

First Seen: Jun 3, 2026

View on GitHub

OS Use - Cross-Platform OS Automation

A comprehensive cross-platform toolkit for OS automation, screenshot capture, visual recognition, mouse/keyboard control, and window management. Supports macOS 12+ and Windows 10+.

Platform Support Matrix

Feature	macOS Implementation	Windows Implementation
Screenshot	`pyautogui` + `PIL`	`pyautogui` + `PIL`
Visual Recognition	`opencv-python` + `pyautogui`	`opencv-python` + `pyautogui`
Mouse/Keyboard	`pyautogui`	`pyautogui`
Window Management	`AppleScript` (native)	`pywinauto` / `pygetwindow`
Application Control	`AppleScript` / `subprocess`	`subprocess` / `pywinauto`
Browser Automation	Chrome DevTools MCP	Chrome DevTools MCP

Capabilities

1. Screenshot Capture 📸

Universal (macOS & Windows):

Full screen capture
Region capture (specified coordinates)
Window capture (specific application window)
Clipboard screenshot access

Implementation: pyautogui.screenshot() + PIL.Image

2. Visual Recognition 👁️

Universal (macOS & Windows):

Image matching/locating on screen
Template matching with confidence threshold
Multi-scale matching (handle different resolutions)
Color detection and region extraction

Optional OCR:

Text recognition from screenshots (requires pytesseract + Tesseract OCR engine)

Implementation: opencv-python + pyautogui.locateOnScreen()

3. Mouse & Keyboard Control 🖱️⌨️

Universal (macOS & Windows):

Mouse movement (absolute and relative coordinates)
Mouse clicking (left, right, middle, double-click)
Mouse dragging and dropping
Scroll wheel operations
Keyboard text input
Keyboard shortcuts and hotkeys
Special key combinations

Implementation: pyautogui

4. Window Management 🪟

macOS Implementation:

List all application windows
Get window position, size, title
Activate/minimize/close windows
Move and resize windows
Launch/quit applications

Implementation: AppleScript via subprocess

Windows Implementation:

Same capabilities as macOS
Additional: Get window handle (HWND), process information
Better integration with Windows window manager

Implementation: pywinauto or pygetwindow

5. Browser Automation 🌐

Universal (macOS & Windows):

Webpage screenshots
Element screenshots
Page navigation
Form filling and clicking
Network monitoring
Performance analysis

Implementation: Chrome DevTools MCP (separate tool)

6. System Integration 🔧

Clipboard Operations:

Read/write clipboard content
Support images and text

Implementation: pyperclip + pyautogui

Technical Implementation Details

Python Environment Setup

# Create virtual environment
python3 -m venv ~/.nanobot/workspace/macos-automation/.venv

# Activate
source ~/.nanobot/workspace/macos-automation/.venv/bin/activate

# Install dependencies
pip install pyautogui opencv-python-headless numpy Pillow pyperclip

# macOS specific
# (AppleScript is built-in, no installation needed)

# Windows specific
pip install pywinauto pygetwindow

Key Libraries Reference

Library	Version	Purpose
`pyautogui`	0.9.54+	Screenshot, mouse/keyboard control
`opencv-python-headless`	4.11.0.84+	Image recognition, computer vision
`numpy`	2.4.2+	Numerical operations for OpenCV
`Pillow`	12.1.1+	Image processing
`pyperclip`	Latest	Clipboard operations
`pywinauto`	Latest	Windows window management
`pygetwindow`	Latest	Cross-platform window control

Platform-Specific Notes

macOS Specifics

Permissions Required:

Accessibility: System Settings > Privacy & Security > Accessibility
Screen Recording: System Settings > Privacy & Security > Screen Recording

AppleScript Quirks:

Some modern apps (e.g., Chrome) may have limited AppleScript support
Window titles may be truncated or localized
Some operations require app to be frontmost

Coordinate System:

Origin (0, 0) at top-left
Retina displays: pyautogui automatically handles scaling

Windows Specifics

Administrator Privileges:

Some operations (e.g., interacting with elevated windows) may require admin rights

High DPI Displays:

Windows scaling may affect coordinate accuracy
Use pyautogui.size() to get actual screen dimensions

Window Handle (HWND):

Windows provides low-level window handles for precise control
pywinauto provides both high-level and low-level access

Error Handling Patterns

import pyautogui
import time

# Pattern 1: Retry with backoff
def retry_with_backoff(func, max_retries=3, base_delay=1):
    for i in range(max_retries):
        try:
            return func()
        except Exception as e:
            if i == max_retries - 1:
                raise
            delay = base_delay * (2 ** i)
            print(f"Retry {i+1}/{max_retries} after {delay}s: {e}")
            time.sleep(delay)

# Pattern 2: Safe operations with fallback
def safe_screenshot(output_path):
    try:
        screenshot = pyautogui.screenshot()
        screenshot.save(output_path)
        return output_path
    except Exception as e:
        print(f"Screenshot failed: {e}")
        return None

# Pattern 3: Coordinate boundary checking
def safe_click(x, y, max_x=None, max_y=None):
    """Click safely, clamping the coordinates to the screen bounds."""
    if max_x is None or max_y is None:
        max_x, max_y = pyautogui.size()
    
    x = max(0, min(x, max_x - 1))
    y = max(0, min(y, max_y - 1))
    
    pyautogui.click(x, y)
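
One more pattern that may be worth having: depending on the installed version, pyautogui's locate functions either return None or raise `ImageNotFoundException` when nothing matches. The wrapper below is a sketch (the name `locate_or_none` is an assumption) that normalizes both behaviors to None. It matches the exception by class name so that it works without importing pyautogui up front.

```python
def locate_or_none(locate, *args, **kwargs):
    """Call a locate function and return None when the image is not found."""
    try:
        return locate(*args, **kwargs)
    except Exception as exc:
        # pyautogui and pyscreeze each define an ImageNotFoundException class,
        # so compare by name instead of importing either one here.
        if type(exc).__name__ == "ImageNotFoundException":
            return None
        raise  # anything else is a real error and should surface


# Usage:
#   box = locate_or_none(pyautogui.locateOnScreen, "button.png", confidence=0.8)
#   if box is not None:
#       pyautogui.click(pyautogui.center(box))
```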

Usage Examples by Scenario

Scenario 1: Automated Testing

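"""
Automated UI test example.
Tests a hypothetical login page.
"""
import pyautogui
import time

def locate(image, confidence):
    """Return the match box, or None if the image is not on screen."""
    # Recent pyautogui versions raise ImageNotFoundException instead of
    # returning None, so both cases are handled here.
    try:
        return pyautogui.locateOnScreen(image, confidence=confidence)
    except pyautogui.ImageNotFoundException:
        return None

def test_login_flow():
    # 1. Capture the initial state
    initial_screenshot = pyautogui.screenshot()
    initial_screenshot.save("test_01_initial.png")
    
    # 2. Find and click the login button
    button_location = locate("login_button.png", confidence=0.9)
    if button_location is None:
        print("❌ Test failed: login button not found")
        return False
    center = pyautogui.center(button_location)
    pyautogui.click(center.x, center.y)
    time.sleep(1)
    
    # 3. Enter the username (assumes the username field now has focus)
    pyautogui.typewrite("testuser@example.com", interval=0.01)
    pyautogui.press('tab')
    
    # 4. Enter the password
    pyautogui.typewrite("TestPassword123", interval=0.01)
    
    # 5. Submit the form
    pyautogui.press('return')
    time.sleep(2)
    
    # 6. Verify the result
    result_screenshot = pyautogui.screenshot()
    result_screenshot.save("test_02_result.png")
    
    # Check whether the success message appeared
    success_indicator = locate("success_message.png", confidence=0.8)
    
    if success_indicator:
        print("✅ Test passed: login succeeded")
        return True
    else:
        print("❌ Test failed: success message not found")
        return False

# Run the test
if __name__ == "__main__":
    test_login_flow()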

Scenario 2: Data Entry Automation

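"""
Data entry automation example.
Fills a web form automatically from Excel data.
"""
import sys
import time

import pandas as pd  # reading .xlsx files also requires openpyxl
import pyautogui

# "Select all" uses Command on macOS and Ctrl elsewhere
SELECT_ALL_KEY = 'command' if sys.platform == 'darwin' else 'ctrl'

def locate(image, confidence=0.8):
    """Return the match box, or None if the image is not on screen."""
    # Recent pyautogui versions raise ImageNotFoundException instead of returning None.
    try:
        return pyautogui.locateOnScreen(image, confidence=confidence)
    except pyautogui.ImageNotFoundException:
        return None

def automate_data_entry(excel_file, form_template):
    """
    Read rows from an Excel file and type them into a form.
    
    Args:
        excel_file: Path to the Excel file
        form_template: Mapping of form field names to Excel column names
    """
    # 1. Read the Excel data
    df = pd.read_excel(excel_file)
    print(f"Loaded {len(df)} records")
    
    # 2. Process each record
    for index, row in df.iterrows():
        print(f"\nProcessing record {index + 1}...")
        
        # 3. Fill in each field
        for field_name, column_name in form_template.items():
            value = row.get(column_name, '')
            
            # Locate the form field (a screenshot of each field must be prepared in advance)
            field_location = locate(f"form_field_{field_name}.png")
            
            if field_location:
                # Click the field
                center = pyautogui.center(field_location)
                pyautogui.click(center.x, center.y)
                time.sleep(0.2)
                
                # Replace the existing content with the new value.
                # typewrite() only types characters available on the keyboard;
                # paste via the clipboard for non-ASCII values.
                pyautogui.hotkey(SELECT_ALL_KEY, 'a')  # select all
                pyautogui.typewrite(str(value), interval=0.01)
                time.sleep(0.2)
            else:
                print(f"  ⚠️ Field not found: {field_name}")
        
        # 4. Submit the form
        submit_btn = locate("submit_button.png")
        if submit_btn:
            center = pyautogui.center(submit_btn)
            pyautogui.click(center.x, center.y)
            print("  ✅ Submitted")
            time.sleep(2)  # wait for the submission to finish
        else:
            print("  ⚠️ Submit button not found")
        
        # 5. Prepare for the next record
        # This may require clicking "Add new record" or returning to the list
        time.sleep(1)
    
    print("\n🎉 All records processed!")

# Usage example
if __name__ == "__main__":
    # Form template: field name -> Excel column name
    form_template = {
        "name": "Name",
        "email": "Email",
        "phone": "Phone",
        "address": "Address"
    }
    
    automate_data_entry("data.xlsx", form_template)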

Scenario 3: Screen Monitoring & Alerting

"""
Screen monitoring and alerting example.
Watches a screen region and sends a notification when it changes.
"""
import pyautogui
import cv2
import numpy as np
import time
from datetime import datetime

def monitor_screen_region(region, template_image=None, check_interval=5, callback=None):
    """
    Monitor a specific screen region for changes.
    
    This call blocks until the loop ends (for example on Ctrl+C).
    
    Args:
        region: (left, top, width, height) region to monitor
        template_image: Path of a template image to look for (optional)
        check_interval: Seconds between checks
        callback: Function called when a change or a match is detected
    
    Returns:
        The monitoring session object, returned after the loop has ended
    """
    class MonitorSession:
        def __init__(self):
            self.running = True
            self.baseline = None
        
        def stop(self):
            self.running = False
    
    session = MonitorSession()
    
    print(f"🔍 Monitoring region: {region}")
    print(f"⏱️  Check interval: {check_interval}s")
    print("Press Ctrl+C to stop monitoring\n")
    
    try:
        while session.running:
            # Capture the current region
            current = pyautogui.screenshot(region=region)
            current_array = np.array(current)
            
            if template_image:
                # Mode 1: look for the template image inside the region
                # (recent pyautogui versions raise instead of returning None)
                try:
                    template_location = pyautogui.locateOnScreen(
                        template_image,
                        region=region,
                        confidence=0.8
                    )
                except pyautogui.ImageNotFoundException:
                    template_location = None
                
                if template_location:
                    print(f"✅ [{datetime.now()}] Template found: {template_location}")
                    if callback:
                        callback('template_found', {
                            'location': template_location,
                            'screenshot': current
                        })
            else:
                # Mode 2: detect changes
                if session.baseline is None:
                    session.baseline = current_array
                    print(f"📸 [{datetime.now()}] Baseline image captured")
                else:
                    # Compute the difference against the baseline
                    diff = cv2.absdiff(session.baseline, current_array)
                    diff_gray = cv2.cvtColor(diff, cv2.COLOR_RGB2GRAY)
                    diff_score = np.mean(diff_gray)
                    
                    if diff_score > 10:  # adjustable threshold
                        print(f"⚠️  [{datetime.now()}] Change detected! Diff score: {diff_score:.2f}")
                        if callback:
                            callback('change_detected', {
                                'diff_score': diff_score,
                                'screenshot': current,
                                'baseline': session.baseline
                            })
                        # Update the baseline
                        session.baseline = current_array
            
            time.sleep(check_interval)
    
    except KeyboardInterrupt:
        print("\n🛑 Monitoring stopped")
    
    return session

# Usage example
def alert_callback(event_type, data):
    """Example alert callback"""
    if event_type == 'template_found':
        print(f"🎯 Template appeared at: {data['location']}")
        # Send a notification or an email, or trigger an action here
    elif event_type == 'change_detected':
        print(f"📊 Change intensity: {data['diff_score']}")
        # Save the screenshot that triggered the alert
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        data['screenshot'].save(f"change_{timestamp}.png")

if __name__ == "__main__":
    # Example 1: watch for screen changes
    print("=== Monitoring screen changes ===")
    monitor = monitor_screen_region(
        region=(0, 0, 1920, 1080),  # full screen
        check_interval=5,           # check every 5 seconds
        callback=alert_callback
    )
    
    # The call above blocks until Ctrl+C is pressed, so the session is
    # only available here after monitoring has already stopped.
    
    # Example 2: look for a specific image
    # monitor = monitor_screen_region(
    #     region=(0, 0, 1920, 1080),
    #     template_image="target_button.png",  # image to look for
    #     check_interval=2,
    #     callback=alert_callback
    # )
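
If the monitor needs to be stopped from other code rather than with Ctrl+C, one option is to run the periodic check on a worker thread. The class below is a sketch under that assumption; `BackgroundMonitor` and the `check_once` callable are hypothetical names, and `check_once` would wrap one iteration of the capture-and-compare logic shown above.

```python
import threading


class BackgroundMonitor:
    """Run a check function periodically on a worker thread."""

    def __init__(self, check_once, interval=5.0):
        self._check_once = check_once
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop_event.is_set():
            self._check_once()
            # wait() returns early when stop() is called, unlike time.sleep().
            self._stop_event.wait(self._interval)

    def start(self):
        self._thread.start()
        return self

    def stop(self, timeout=None):
        """Signal the loop to end and wait for the worker thread to finish."""
        self._stop_event.set()
        self._thread.join(timeout)

    @property
    def running(self):
        return self._thread.is_alive()


# Usage:
#   monitor = BackgroundMonitor(lambda: print("checking..."), interval=5).start()
#   ...do other work...
#   monitor.stop()
```

Using `threading.Event.wait` for the delay means a stop request takes effect immediately instead of after the remaining sleep time.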

Advanced Techniques

Handling Multiple Monitors

import pyautogui

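def get_all_screen_sizes():
    """Report display sizes (per-monitor details are only gathered on Windows)."""
    # macOS returns the primary screen size
    # On Windows, pygetwindow or win32api can provide multi-monitor information
    
    primary = pyautogui.size()
    print(f"Primary screen size: {primary}")
    
    # Windows example (requires pywin32)
    try:
        import win32api
        monitors = win32api.EnumDisplayMonitors()
        for i, monitor in enumerate(monitors):
            # monitor[2] is the monitor rectangle (left, top, right, bottom)
            print(f"Monitor {i+1}: {monitor[2]}")
    except ImportError:
        pass
    
    return primary

def screenshot_specific_monitor(monitor_num=0):
    """Capture a specific monitor (experimental, not implemented yet)."""
    # pyautogui currently targets the primary display
    # Multi-monitor support needs platform-specific code
    pass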
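
One possible way to fill in the per-monitor capture is the third-party `mss` package, which the roadmap also mentions. The sketch below assumes `mss` is installed (`pip install mss`); the function names are illustrative. In `mss`, `monitors[0]` describes all monitors combined and individual monitors start at index 1.

```python
def monitor_to_global(monitor, x, y):
    """Translate monitor-relative (x, y) into virtual-desktop coordinates."""
    if not (0 <= x < monitor["width"] and 0 <= y < monitor["height"]):
        raise ValueError(f"({x}, {y}) is outside the monitor")
    return monitor["left"] + x, monitor["top"] + y


def screenshot_monitor(monitor_num, output_path):
    """Save a PNG of one monitor and return its geometry dict."""
    # Imported lazily so the coordinate helper works without mss installed.
    import mss
    import mss.tools

    with mss.mss() as sct:
        monitor = sct.monitors[monitor_num]  # keys: left, top, width, height
        shot = sct.grab(monitor)
        mss.tools.to_png(shot.rgb, shot.size, output=output_path)
    return monitor


# Usage:
#   geometry = screenshot_monitor(2, "second_monitor.png")
#   x, y = monitor_to_global(geometry, 100, 200)
```

Whether clicks at the resulting coordinates land on a secondary monitor depends on the platform, so this should be verified on the target setup.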

Performance Optimization

import cv2
import numpy as np
import pyautogui
import time

class ScreenCache:
    """Short-lived cache for full-screen captures"""
    
    def __init__(self, cache_duration=0.5):
        self.cache_duration = cache_duration
        self.last_capture = None
        self.last_capture_time = 0
    
    def get_screenshot(self, region=None):
        """Return a screenshot, reusing the cached one when it is fresh enough"""
        current_time = time.time()
        
        # Check whether the cache is still valid (only full-screen captures are cached)
        if (self.last_capture is not None and 
            current_time - self.last_capture_time < self.cache_duration and
            region is None):
            return self.last_capture
        
        # Capture a new screenshot
        screenshot = pyautogui.screenshot(region=region)
        
        if region is None:
            self.last_capture = screenshot
            self.last_capture_time = current_time
        
        return screenshot
    
    def clear_cache(self):
        """Clear the cache"""
        self.last_capture = None
        self.last_capture_time = 0

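class FastImageFinder:
    """Image finder that tries the template at several scales"""
    
    def __init__(self, scales=(0.8, 0.9, 1.0, 1.1, 1.2)):
        # A tuple default avoids sharing one mutable list between instances
        self.scales = scales
    
    def find_multi_scale(self, template_path, screenshot=None, confidence=0.8):
        """
        Multi-scale image search
        
        Returns:
            (x, y, scale) for the first scale that matches, or None
        """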
        if screenshot is None:
            screenshot = pyautogui.screenshot()
        
        template = cv2.imread(template_path)
        if template is None:
            return None
        
        screenshot_cv = cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2BGR)
        
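        for scale in self.scales:
            # Scale the template
            scaled_template = cv2.resize(
                template,
                None,
                fx=scale,
                fy=scale,
                interpolation=cv2.INTER_AREA
            )
            
            # Skip scales where the template no longer fits inside the
            # screenshot (cv2.matchTemplate raises an error in that case)
            if (scaled_template.shape[0] > screenshot_cv.shape[0] or
                    scaled_template.shape[1] > screenshot_cv.shape[1]):
                continue
            
            # Template matching
            result = cv2.matchTemplate(
                screenshot_cv,
                scaled_template,
                cv2.TM_CCOEFF_NORMED
            )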
            
            _, max_val, _, max_loc = cv2.minMaxLoc(result)
            
            if max_val >= confidence:
                h, w = scaled_template.shape[:2]
                center_x = max_loc[0] + w // 2
                center_y = max_loc[1] + h // 2
                return (center_x, center_y, scale)
        
        return None

# Usage example
cache = ScreenCache()
finder = FastImageFinder()

# Fast screenshot (cached)
screenshot = cache.get_screenshot()

# Multi-scale image search
result = finder.find_multi_scale("button.png", screenshot)
if result:
    x, y, scale = result
    print(f"Image found at ({x}, {y}), scale: {scale}")

Security Considerations

"""
Security best practices
"""

import pyautogui
import hashlib
import time

class SecureAutomation:
    """Safety wrapper around automation actions"""
    
    def __init__(self):
        self.action_log = []
        self.max_retries = 3
        self.rate_limit_delay = 0.1  # delay between actions (seconds)
    
    def log_action(self, action, details):
        """Record an action in the log"""
        timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log_entry = {
            'timestamp': timestamp,
            'action': action,
            'details': details,
            # Short ID for referring to log entries (not a security measure)
            'hash': hashlib.md5(f"{timestamp}{action}{details}".encode()).hexdigest()[:8]
        }
        self.action_log.append(log_entry)
    
    def safe_click(self, x, y, description=""):
        """Click with coordinate validation"""
        try:
            # Verify that the coordinates are within the screen bounds
            screen_width, screen_height = pyautogui.size()
            if not (0 <= x < screen_width and 0 <= y < screen_height):
                raise ValueError(f"Coordinates ({x}, {y}) are outside the screen")
            
            # Perform the click
            pyautogui.moveTo(x, y, duration=0.2)
            time.sleep(self.rate_limit_delay)
            pyautogui.click()
            
            # Log the action
            self.log_action('click', f"({x}, {y}) - {description}")
            
            return True
            
        except Exception as e:
            self.log_action('click_failed', f"({x}, {y}) - Error: {str(e)}")
            return False
    
    def safe_typewrite(self, text, interval=0.01):
        """Type text without logging its content (it may be sensitive)"""
        try:
            pyautogui.typewrite(text, interval=interval)
            self.log_action('typewrite', f"Typed {len(text)} characters [content hidden]")
            return True
        except Exception as e:
            self.log_action('typewrite_failed', f"Error: {str(e)}")
            return False
    
    def get_action_report(self):
        """Generate a report of all logged actions"""
        total = len(self.action_log)
        successful = sum(1 for log in self.action_log if 'failed' not in log['action'])
        failed = total - successful
        # Avoid dividing by zero when nothing has been logged yet
        success_rate = (successful / total * 100) if total else 0.0
        
        report = f"""
=== Automation Action Report ===
Total actions: {total}
Successful: {successful}
Failed: {failed}
Success rate: {success_rate:.1f}%

Detailed log:
"""
        for log in self.action_log:
            report += f"[{log['timestamp']}] [{log['hash']}] {log['action']}: {log['details']}\n"
        
        return report

# Usage example
secure = SecureAutomation()

# Perform the actions through the wrapper
secure.safe_click(500, 400, "login button")
secure.safe_typewrite("username@example.com")
secure.safe_click(500, 450, "password field")
secure.safe_typewrite("********")
secure.safe_click(500, 500, "submit button")

# Generate the report
print(secure.get_action_report())
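
pyautogui also ships two module-level safety settings, `FAILSAFE` and `PAUSE`, that complement a wrapper like the one above. The helper below is a small sketch (the name `configure_safety` is an assumption) that applies them to whatever module object is passed in, which keeps it easy to exercise without a display.

```python
def configure_safety(gui, pause=0.1):
    """Enable pyautogui's fail-safe and add a pause after every call."""
    gui.FAILSAFE = True  # moving the mouse into a screen corner raises FailSafeException
    gui.PAUSE = pause    # seconds to wait after each pyautogui call
    return gui


# Usage:
#   import pyautogui
#   configure_safety(pyautogui, pause=0.2)
```

With the fail-safe enabled, a runaway script can be interrupted by pushing the mouse pointer into a corner of the screen.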

Troubleshooting Guide

Common Issues and Solutions

1. Permission Errors

Symptom: pyautogui fails with permission errors or captures black screenshots.

macOS Solution:

Open System Settings > Privacy & Security > Accessibility
Add your terminal application (e.g., Terminal.app, iTerm.app, or the Python executable)
Repeat for Screen Recording permission

Windows Solution:

Run as Administrator if needed
Check Windows Defender or antivirus isn't blocking

2. Coordinate Inaccuracy

Symptom: Clicks or screenshots miss the intended target.

Possible Causes:

High DPI / Retina display scaling
Multiple monitors with different resolutions
Window decorations or taskbar affecting coordinates

Solution:

import pyautogui

# Debug: Print screen info
print(f"Screen size: {pyautogui.size()}")
print(f"Mouse position: {pyautogui.position()}")

# Handle high DPI (Windows only; ctypes.windll does not exist on other platforms)
import ctypes
import sys

if sys.platform == "win32":
    ctypes.windll.user32.SetProcessDPIAware()

3. Image Recognition Failures

Symptom: locateOnScreen returns None (or, in recent pyautogui versions, raises ImageNotFoundException) even when the image is visible.

Common Causes:

Resolution mismatch (captured image at different scale)
Color depth differences
Transparency or alpha channel issues
Confidence threshold too high

Solutions:

import pyautogui
import cv2
import numpy as np

# Solution 1: Lower confidence
location = pyautogui.locateOnScreen('button.png', confidence=0.7)  # Default is 0.999

# Solution 2: Multi-scale matching (see FastImageFinder class in Performance section)
finder = FastImageFinder(scales=[0.5, 0.75, 1.0, 1.25, 1.5])
result = finder.find_multi_scale('button.png')

# Solution 3: Convert to grayscale for matching
screenshot = pyautogui.screenshot()
screenshot_cv = cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2GRAY)
template = cv2.imread('button.png', cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(screenshot_cv, template, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)

if max_val >= 0.8:
    print(f"Match found, confidence: {max_val}")
    h, w = template.shape
    center_x = max_loc[0] + w // 2
    center_y = max_loc[1] + h // 2
    pyautogui.click(center_x, center_y)

4. Slow Performance

Symptom: Operations are slow, high CPU usage, or noticeable delays.

Optimization Strategies:

Reduce Screenshot Frequency
- Cache screenshots when possible
- Use region-specific captures instead of full screen
Optimize Image Matching
- Resize large images before matching
- Use grayscale matching when color isn't important
- Set appropriate confidence levels
Batch Operations
- Group multiple actions together
- Minimize unnecessary delays

See the "Performance Optimization" section for detailed code examples.
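
As one illustration of region-specific captures, the search can be limited to a box around the place where an element was last seen. This is a sketch; `search_region_around` is a hypothetical helper whose result can be passed as the `region` argument of `pyautogui.screenshot` or `pyautogui.locateOnScreen`.

```python
def search_region_around(x, y, pad, screen_size):
    """Return a (left, top, width, height) box around (x, y), clipped to the screen."""
    screen_w, screen_h = screen_size
    left = max(0, x - pad)
    top = max(0, y - pad)
    right = min(screen_w, x + pad)
    bottom = min(screen_h, y + pad)
    return left, top, right - left, bottom - top


# Usage:
#   region = search_region_around(500, 400, 150, pyautogui.size())
#   box = pyautogui.locateOnScreen("button.png", region=region, confidence=0.8)
```

A smaller region means fewer pixels to capture and match, at the cost of missing the element if it has moved outside the box, so a full-screen fallback may still be needed.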

5. Application-Specific Issues

Browser Automation:

Modern browsers may block automation
Use Chrome DevTools Protocol instead of pyautogui for web
Consider Playwright or Selenium for complex web automation

Game/Graphics Applications:

DirectX/OpenGL apps may not be capturable by standard screenshot
May require specialized tools (e.g., OBS Studio's capture API)

Protected Content:

DRM-protected content (Netflix, etc.) cannot be screenshotted
This is a system-level restriction

Integration with Other Tools

With ChatGPT/AI Assistants

This skill is designed to work with AI assistants like nanobot. Here's how to integrate:

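# Example: AI assistant using this skill
# parse_intent() and extract_button_name() are placeholders that the
# integrating assistant has to provide; they are not defined by this skill.

from datetime import datetime

import pyautogui

def ai_assisted_automation(user_request):
    """
    Let an AI assistant drive the automation skill
    
    Args:
        user_request: The user's natural-language request
    """
    # 1. Parse the user's intent
    intent = parse_intent(user_request)
    
    if intent == 'screenshot':
        # 2. Take the screenshot
        screenshot = pyautogui.screenshot()
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"screenshot_{timestamp}.png"
        screenshot.save(path)
        return f"Screenshot saved to: {path}"
    
    elif intent == 'click_button':
        # 2. Find and click the button
        button_name = extract_button_name(user_request)
        try:
            location = pyautogui.locateOnScreen(f"{button_name}.png")
        except pyautogui.ImageNotFoundException:
            location = None  # newer pyautogui versions raise instead of returning None
        if location:
            pyautogui.click(pyautogui.center(location))
            return f"Clicked button: {button_name}"
        else:
            return f"Button not found: {button_name}"
    
    # ... handle other intents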
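
The two helpers used above are left undefined in the example. Purely as a hypothetical stand-in, they could start out as simple keyword and regex matching, as sketched below; a real assistant would more likely let the language model classify the request.

```python
import re

# Hypothetical keyword table mapping intents to trigger phrases.
INTENT_KEYWORDS = {
    "screenshot": ("screenshot", "capture the screen"),
    "click_button": ("click", "press the"),
}


def parse_intent(user_request):
    """Return the first intent whose keyword appears in the request, else None."""
    text = user_request.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def extract_button_name(user_request):
    """Pull a button name out of phrases like 'click the Submit button'."""
    match = re.search(
        r"(?:click|press)\s+(?:on\s+)?(?:the\s+)?(.+?)\s+button",
        user_request,
        re.IGNORECASE,
    )
    if not match:
        return None
    # Normalize to a file-name-friendly form, e.g. "Sign In" -> "sign_in".
    return re.sub(r"\s+", "_", match.group(1).strip().lower())
```

The normalized name lines up with the `f"{button_name}.png"` lookup in the example, assuming the template images are named the same way.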

With CI/CD Pipelines

# Example: GitHub Actions using this skill for visual testing

name: Visual Regression Tests

on: [push, pull_request]

jobs:
  visual-test:
    runs-on: macos-latest  # or windows-latest
    
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: |
        pip install pyautogui opencv-python-headless numpy Pillow
    
    - name: Run visual tests
      run: python tests/visual_regression.py
    
    - name: Upload screenshots
      uses: actions/upload-artifact@v4
      with:
        name: screenshots
        path: screenshots/
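
The workflow refers to `tests/visual_regression.py` without showing it. A minimal sketch of what such a script's comparison step might look like is given below; the function names and the threshold are assumptions, and the pure-Python diff favors clarity over speed (NumPy would be considerably faster on full-screen images).

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized byte sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def images_match(baseline_path, current_path, threshold=2.0):
    """Return True when two images differ by at most threshold on average."""
    # Imported lazily so mean_abs_diff can be used without Pillow installed.
    from PIL import Image

    with Image.open(baseline_path) as a, Image.open(current_path) as b:
        if a.size != b.size:
            return False
        diff = mean_abs_diff(a.convert("RGB").tobytes(), b.convert("RGB").tobytes())
    return diff <= threshold


# Usage:
#   assert images_match("baseline/home.png", "screenshots/home.png")
```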

With Monitoring Systems

# Example: Integration with Prometheus/Grafana for screen monitoring

from prometheus_client import Gauge, start_http_server
import cv2
import numpy as np
import pyautogui
import time

# Define metrics
screen_change_gauge = Gauge('screen_change_score', 'Screen change detection score')
template_match_gauge = Gauge('template_match_confidence', 'Template matching confidence')

start_http_server(8000)

def monitoring_loop():
    baseline = None
    
    while True:
        # Capture screen
        current = pyautogui.screenshot()
        current_array = np.array(current)
        
        if baseline is not None:
            # Calculate change
            diff = cv2.absdiff(baseline, current_array)
            diff_score = np.mean(diff)
            screen_change_gauge.set(diff_score)
        
        baseline = current_array
        
        # Check for template (newer pyautogui versions raise instead of returning None)
        try:
            location = pyautogui.locateOnScreen('alert_icon.png', confidence=0.8)
        except pyautogui.ImageNotFoundException:
            location = None
        template_match_gauge.set(1.0 if location else 0.0)
        
        time.sleep(5)

monitoring_loop()

Future Roadmap

Planned Features

Linux Support
- X11 and Wayland compatibility
- xdotool and scrot integration
- mss for multi-monitor support
AI-Powered Recognition
- Integration with OpenAI GPT-4V or Google Gemini for visual understanding
- Natural language element finding ("click the blue submit button")
- OCR-free text extraction using vision models
Mobile Device Support
- Android: ADB (Android Debug Bridge) integration
- iOS: WebDriverAgent via Appium
- Screenshot and touch simulation
Cloud Integration
- AWS Lambda support for serverless automation
- Azure Functions and GCP Cloud Functions compatibility
- Distributed screenshot processing
Advanced Analytics
- Built-in A/B testing framework for UI changes
- Heatmap generation from user interactions
- Performance regression detection

Contributing

We welcome contributions! Please see the Contributing Guide for details on:

Code style and formatting
Testing requirements
Documentation standards
Pull request process

License

This skill is licensed under the MIT License. See LICENSE for details.

Last Updated: 2026-03-06
Version: 1.0.0
Maintainer: nanobot skills team