This is a PySpark migration and optimization toolkit built on SQLGlot. It converts SQL between dialects (PostgreSQL, Oracle, Redshift, MySQL, Snowflake) and generates PySpark DataFrame API code from SQL queries. The AWS Glue integration generates complete job templates, handles DynamicFrame conversions, and analyzes S3 partitioning strategies. The code review tools scan existing PySpark for performance issues, suggest join strategies, and detect duplication across hundreds of files using concurrent batch processing. Reach for this when migrating legacy SQL workloads to Spark or when you need to generate Glue jobs without writing boilerplate. Recursive CTEs are not converted natively; for those, the toolkit provides Spark SQL equivalents and guidance for edge cases.
SQL migration assistance, AWS Glue job generation, and Spark code optimization — as an MCP server.
pip install -e .
pyspark-mcp # starts the MCP server
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "pyspark": {
      "command": "pyspark-mcp",
      "args": []
    }
  }
}
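If the config file already lists other servers, editing it by hand risks overwriting them. The following is a minimal sketch of merging the entry above into an existing file with the Python standard library. The helper name `register_server` is hypothetical and not part of this package; the path in the usage comment is the macOS location given above.

```python
import json
from pathlib import Path


def register_server(config_path: Path, name: str = "pyspark",
                    command: str = "pyspark-mcp") -> dict:
    """Hypothetical helper: merge one MCP server entry into a config file."""
    # Start from the existing config (if any) so other servers are preserved.
    text = config_path.read_text().strip() if config_path.exists() else ""
    config = json.loads(text) if text else {}

    # Add or overwrite only the entry for this server.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": []}

    # Create the parent directory if needed, then write the merged config.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Usage (macOS path from the instructions above):
# register_server(Path.home() / "Library" / "Application Support"
#                 / "Claude" / "claude_desktop_config.json")
```

Restart Claude Desktop after changing the file so the new server entry is picked up.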
Add to ~/.hermes/config.yaml:
mcp:
  servers:
    pyspark:
      command: pyspark-mcp
      enabled_tools: all
docker compose up -d
convert_sql_to_pyspark: Convert SQL to PySpark with dialect detection
analyze_sql_context: Analyze SQL complexity and suggest approach
generate_aws_glue_job_template: Generate complete Glue job scripts
convert_dataframe_to_dynamic_frame: DataFrame ↔ DynamicFrame conversion
generate_data_catalog_table_definition: Data Catalog table definitions
generate_incremental_processing_job: Incremental/CDC job generation
analyze_s3_optimization_opportunities: S3 layout and partitioning analysis
review_pyspark_code: Code review with performance recommendations
optimize_pyspark_code: Suggest optimizations for existing code
recommend_join_strategy: Broadcast vs shuffle join recommendations
suggest_partitioning_strategy: Partitioning recommendations
batch_process_files: Process multiple SQL files concurrently
batch_process_directory: Convert entire directories
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Test
pytest tests/ -v --cov=pyspark_tools
# Format
black pyspark_tools tests
isort pyspark_tools tests
# Lint
flake8 pyspark_tools tests
pyspark_tools/
├── server.py # FastMCP server + tool definitions
├── sql_converter.py # SQLGlot-based transpilation + DataFrame API generation
├── aws_glue_integration.py # Glue job templates, DynamicFrame, Data Catalog
├── advanced_optimizer.py # Performance analysis + optimization suggestions
├── batch_processor.py # Concurrent file processing
├── code_reviewer.py # PySpark code review patterns
├── duplicate_detector.py # Code deduplication
├── data_source_analyzer.py # Data source analysis
└── file_utils.py # File I/O utilities
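To illustrate the kind of concurrent file processing that batch_processor.py is described as handling, here is a minimal sketch using only the standard library. It is not the module's actual implementation: `process_files` and `convert_stub` are hypothetical names, and the stub simply wraps the SQL text in a `spark.sql(...)` call instead of doing a real conversion.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def convert_stub(sql: str) -> str:
    """Hypothetical stand-in for a real SQL-to-PySpark converter."""
    return f"spark.sql({sql.strip()!r})"


def process_files(paths: Iterable[Path],
                  convert: Callable[[str], str] = convert_stub,
                  max_workers: int = 4) -> Dict[str, str]:
    """Convert files concurrently; a failing file is recorded, not raised."""

    def run_one(path: Path) -> str:
        try:
            return convert(path.read_text())
        except Exception as exc:  # keep the batch going on per-file errors
            return f"ERROR: {exc}"

    path_list = list(paths)
    # Threads are enough here because reading files is I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, path_list))
    return {str(p): r for p, r in zip(path_list, results)}
```

Recording errors per file rather than raising means one unreadable or unparsable file does not abort a run over a large directory.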
MIT — see LICENSE.
mcp-name: io.github.AnnasMazhar/pyspark-mcp