Pronunciation & Voice Coach

STDIO · registry active

Summary

Turns your MCP client into a voice-driven English coach. You speak into your mic, faster-whisper transcribes locally, and you get pronunciation, grammar, and fluency feedback inline with the assistant's reply. The converse tool handles natural back-and-forth with light corrections. Use the practice tool when you want phoneme-level drill feedback with IPA alignment, minimal pairs, and prosody checks for stress and intonation. The optional phoneme extra adds wav2vec2 forced alignment to catch cases where Whisper rewrites rare words. Useful if you're learning English and want coaching without leaving your coding environment. Everything runs on-device, and no audio leaves your machine.

mcp-server-pronunciation

Accuracy and safety notice

This project is a local language-learning practice tool. It may contain bugs, runtime errors, inaccurate transcripts, inaccurate pronunciation feedback, or platform-specific recording issues. Pronunciation feedback is a coaching signal, not a standardized-test, clinical, employment, or high-stakes assessment. Review outputs carefully before relying on them. See DISCLAIMER.md.

An MCP (Model Context Protocol) server that lets you talk to your MCP assistant by voice while getting English pronunciation, grammar, and fluency feedback in the same turn. Use it for casual voice chat with light coaching, or switch to drill mode when you want to practice a specific sentence.

Built for Codex CLI, Claude Desktop, Claude Code, Cursor, VS Code, and other MCP clients. Everything runs locally — audio is captured with your mic, transcribed by faster-whisper on-device, and never leaves your machine.

mcp-name: io.github.JuhongPark/pronunciation

Why

Voice MCP servers today treat speech as a typing replacement. English tutor MCP servers are text-only. This one combines the two: you speak freely, your assistant replies, and feedback on what you just said (pronunciation, grammar, fluency) surfaces inside the same tool call so the assistant can weave it into a natural reply — or stay out of the way when you're just chatting.

Features

Voice conversation with your MCP assistant. Speak, auto-stop on silence, then let the assistant read your transcript and respond.
Phoneme-level drill feedback (when a reference sentence is given): Needleman-Wunsch word alignment, per-word expected vs produced IPA, learner-profile hints, minimal-pair drills, and prosody checks (word stress, final-rise intonation, intra-clause pauses).
Extensible learner-profile support: the current rule pack includes Korean-L1 pronunciation-pattern hints and Korean-language tips. Contributions for additional L1 profiles are welcome.
Whisper-bias mitigation via optional [phoneme] extra: wav2vec2 CTC forced alignment verifies whether the user actually produced each reference word, so rare proper nouns and domain-specific terms that Whisper rewrites toward more common alternatives no longer surface as mispronunciations.
Inline English feedback in conversation: pronunciation, grammar (common irregular-verb errors), and fluency (pace + long pauses).
Drill mode (practice, quick_practice, retry) for focused sentence practice.
Local-only: the Whisper model runs on your machine, and audio never leaves it.
Cross-platform: macOS, Linux, Windows, and WSL2 (recording auto-routes through Windows).
Fast startup: lazy imports + background model pre-load keep the MCP handshake under a second.
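
The Needleman-Wunsch word alignment named in the feature list can be sketched as follows. This is a minimal illustration, not the project's implementation: the scoring constants and the `align_words` helper are assumptions chosen for readability, and words are compared by lowercase equality only.

```python
from typing import List, Tuple

# Scores are illustrative assumptions, not the project's actual values.
MATCH, MISMATCH, GAP = 1, -1, -1


def align_words(reference: List[str], hypothesis: List[str]) -> List[Tuple[str, str, str]]:
    """Globally align two word lists and label each column.

    Returns (op, ref_word, hyp_word) tuples where op is one of
    "match", "sub", "ins" (extra spoken word) or "del" (missing word).
    """
    n, m = len(reference), len(hypothesis)
    # score[i][j] = best score aligning reference[:i] with hypothesis[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = reference[i - 1].lower() == hypothesis[j - 1].lower()
            diag = score[i - 1][j - 1] + (MATCH if same else MISMATCH)
            score[i][j] = max(diag, score[i - 1][j] + GAP, score[i][j - 1] + GAP)

    # Trace back from the bottom-right corner to recover the alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = reference[i - 1].lower() == hypothesis[j - 1].lower()
            if score[i][j] == score[i - 1][j - 1] + (MATCH if same else MISMATCH):
                ops.append(("match" if same else "sub", reference[i - 1], hypothesis[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + GAP:
            ops.append(("del", reference[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hypothesis[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    ref = "the three brothers thought thoroughly".split()
    hyp = "the tree brothers thought".split()
    for op, r, h in align_words(ref, hyp):
        print(f"{op:5} {r:12} {h}")
```

A global alignment keeps every reference word in the output, so a skipped word shows up as a `del` row instead of shifting all later words out of place.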

Requirements

Python 3.11+
A working microphone
~150 MB disk space for the default Whisper model (base.en)
Additional ~360 MB if you install the optional [phoneme] extra (wav2vec2 weights for forced alignment)
MCP spec: targets 2025-06-18 via the official Python SDK (mcp>=1.2)

Installation

Stable release

Install the latest stable release:

uvx mcp-server-pronunciation

For pip users:

pip install mcp-server-pronunciation

To pin this release explicitly:

uvx mcp-server-pronunciation@0.3.0

Run doctor before relying on the server in a live session:

mcp-server-pronunciation doctor

General install commands

# Recommended: uvx (no global install, cached between runs)
uvx mcp-server-pronunciation

# Or install as a uv tool
uv tool install mcp-server-pronunciation

# Or pip
pip install mcp-server-pronunciation

# Optional: forced-alignment upgrade for Whisper-bias mitigation + tighter
# phoneme-level feedback. Adds ~200 MB of torch CPU wheels.
pip install 'mcp-server-pronunciation[phoneme]'

Linux: install PortAudio first

sounddevice ships PortAudio inside the wheel on macOS and Windows, but on Linux you need the system library:

# Debian / Ubuntu
sudo apt-get install libportaudio2

# Fedora / RHEL
sudo dnf install portaudio

# Arch
sudo pacman -S portaudio

# PipeWire-only systems may also need
sudo apt-get install pipewire-alsa

First-time check

Before wiring the server into an MCP client, run the preflight:

uvx mcp-server-pronunciation doctor

Optional — pre-download the Whisper model (~150 MB) so the first call is instant:

uvx mcp-server-pronunciation pull-model base.en

Add to your MCP client

Codex CLI

codex mcp add pronunciation -- uvx mcp-server-pronunciation

Claude Code

claude mcp add pronunciation -- uvx mcp-server-pronunciation

Claude Desktop

Edit claude_desktop_config.json:

{
  "mcpServers": {
    "pronunciation": {
      "command": "uvx",
      "args": ["mcp-server-pronunciation"]
    }
  }
}

On macOS, if Claude Desktop can't find uvx (`spawn uvx ENOENT`), use an absolute path. Find it by running `which uvx` in your terminal.
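
If you prefer to script the config edit, the sketch below merges the same entry into an existing config file and resolves uvx to an absolute path. The helper name `add_pronunciation_server` is hypothetical, and the config file location is left to the caller because it differs between operating systems and clients.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def add_pronunciation_server(config_path: Path, command: Optional[str] = None) -> dict:
    """Merge the server entry into an MCP client config file and return the result."""
    # Resolve uvx to an absolute path; fall back to the bare name if it is not found.
    command = command or shutil.which("uvx") or "uvx"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["pronunciation"] = {"command": command, "args": ["mcp-server-pronunciation"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Existing entries under `mcpServers` are preserved; only the `pronunciation` key is added or overwritten.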

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "pronunciation": {
      "command": "uvx",
      "args": ["mcp-server-pronunciation"]
    }
  }
}

VS Code (with MCP support)

Add to .vscode/mcp.json or your user settings:

{
  "servers": {
    "pronunciation": {
      "type": "stdio",
      "command": "uvx",
      "args": ["mcp-server-pronunciation"]
    }
  }
}

Usage Examples

1. Voice chat with feedback

You: "Let's have a voice chat. Ask me about my weekend. Use the converse tool."

Assistant (calls converse): records your speech, transcribes it, notes that you said "buyed" instead of "bought"

Assistant: "Oh nice — what kind of apples did you buy? And by the way, the past tense of 'buy' is 'bought' — small thing, but I noticed it."
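
The kind of irregular-verb check that could flag "buyed" in this exchange can be sketched as a simple lookup. The word table and the `find_irregular_verb_errors` helper are assumptions made for illustration; the server's actual rule set is not shown in this README.

```python
import re
from typing import Dict, List, Tuple

# Small illustrative table; the server's actual rule set is not shown here.
OVERREGULARIZED: Dict[str, str] = {
    "buyed": "bought",
    "goed": "went",
    "eated": "ate",
    "teached": "taught",
}


def find_irregular_verb_errors(transcript: str) -> List[Tuple[str, str]]:
    """Return (said, suggestion) pairs for known over-regularized past forms."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return [(w, OVERREGULARIZED[w]) for w in words if w in OVERREGULARIZED]
```

A lookup like this only catches forms that are in the table, which matches the README's description of the grammar feedback as covering common irregular-verb errors.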

2. Drill a specific sentence

You: "Give me a sentence to practice with 'th' sounds."

Assistant (calls suggest_sentence with focus=th): "Try this: The three brothers thought thoroughly about their future."

You: "Record me reading it."

Assistant (calls practice with that reference): returns an alignment table (match / sub / ins / del) with per-word acoustic confidence when the [phoneme] extra is installed, phoneme-level issues with expected vs produced IPA, learner-profile hints when applicable, minimal-pair drills, and prosody notes (word stress, final-rise intonation, intra-clause pauses).

3. Retry after feedback

You: "Let me try again."

Assistant (calls retry): re-records the same target sentence and compares

Tools

Tool	Purpose
`converse`	Primary. Record + transcribe + quick feedback + assistant guidance for natural voice-chat-with-coaching.
`practice`	Drill mode: record user reading a specific reference sentence, return detailed assessment.
`quick_practice`	Pick a random sentence (by phoneme focus + difficulty) and drill it.
`retry`	Re-record the last sentence and compare the new attempt against the previous one.
`open_voice_panel`	Open the MCP Apps voice panel when the client supports embedded UI.
`analyze_uploaded_audio`	Analyze WAV audio uploaded by the voice panel and store it as the latest voice capture.
`start_voice_capture`	Start recording in the background and return a session id immediately.
`voice_capture_status`	Check whether a background capture is recording, analyzing, done, cancelled, or failed.
`wait_for_voice_capture`	Wait for a background capture to finish and return transcript + feedback.
`latest_voice_capture`	Return the most recent background voice capture result.
`cancel_voice_capture`	Mark a background capture as cancelled before analysis starts.
`suggest_sentence`	Return a practice sentence without recording.
`record`	Record audio and save a WAV file (raw, no analysis).
`assess`	Assess the last recording (or a specified WAV) without re-recording. When given a reference, runs the full drill pipeline (alignment, phoneme diff, learner-profile hints, prosody).
`check_mic`	List available audio input devices.

Tools that assess speech also return structured MCP output with transcript, clarity_pct, speaking_rate_wpm, top_issue, next_action, retry_comparison, the full machine-readable assessment, and the rendered report_markdown. MCP clients can use the structured result to offer a retry, surface the top issue, or build a richer practice UI without parsing Markdown.
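
A client could consume the structured result along these lines. The field names come from the paragraph above; their types (numeric `clarity_pct` and `speaking_rate_wpm`, optional string `top_issue`), the retry threshold, and both helper names are assumptions.

```python
from typing import Any, Mapping

# Threshold is an arbitrary illustration, not a value defined by the server.
RETRY_BELOW_CLARITY = 80.0


def summarize_result(result: Mapping[str, Any]) -> str:
    """Build a one-line summary from a structured assessment result."""
    parts = [f'Heard: "{result.get("transcript", "")}"']
    clarity = result.get("clarity_pct")
    if clarity is not None:
        parts.append(f"clarity {clarity:.0f}%")
    rate = result.get("speaking_rate_wpm")
    if rate is not None:
        parts.append(f"{rate:.0f} wpm")
    if result.get("top_issue"):
        parts.append(f"top issue: {result['top_issue']}")
    return " | ".join(parts)


def should_offer_retry(result: Mapping[str, Any]) -> bool:
    """Offer a retry when clarity is low or an issue was flagged."""
    clarity = result.get("clarity_pct")
    low = clarity is not None and clarity < RETRY_BELOW_CLARITY
    return low or bool(result.get("top_issue"))
```

Reading the fields directly avoids parsing `report_markdown`, which can then be shown to the user unchanged.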

Visible voice-capture workflow

For MCP clients without an embedded voice UI, use the background capture tools to keep the user informed:

start_voice_capture(duration=8, mode="conversation")
voice_capture_status(session_id)
wait_for_voice_capture(session_id, timeout=30)
latest_voice_capture()

The status response includes recording, analyzing, done, error, or cancelled, plus elapsed time, transcript, clarity, speaking rate, feedback markdown, and the full structured assessment when available. On WSL2, keep duration short because PowerShell recording may wait for the full requested duration before analysis begins.
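
The workflow above amounts to a start-then-poll loop. The sketch below assumes a hypothetical `call_tool(name, arguments)` callable supplied by your MCP client library, and assumes the responses expose `session_id` and `status` keys; neither key name is confirmed by this README.

```python
import time
from typing import Any, Callable, Mapping

# Hypothetical signature: call_tool(tool_name, arguments) -> structured result dict.
ToolCaller = Callable[[str, Mapping[str, Any]], Mapping[str, Any]]

# Covers both spellings of the failure state that appear in this README.
TERMINAL_STATES = {"done", "error", "failed", "cancelled"}


def capture_and_wait(call_tool: ToolCaller, duration: int = 8,
                     poll_interval: float = 0.5, max_polls: int = 60) -> Mapping[str, Any]:
    """Start a background capture and poll until it reaches a terminal state."""
    started = call_tool("start_voice_capture", {"duration": duration, "mode": "conversation"})
    session_id = started["session_id"]  # assumed key name
    status: Mapping[str, Any] = started
    for _ in range(max_polls):
        status = call_tool("voice_capture_status", {"session_id": session_id})
        state = status.get("status")  # assumed key name
        print(f"capture {session_id}: {state}")
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_interval)
    return status
```

Polling `voice_capture_status` lets the client show progress between states; calling `wait_for_voice_capture` once is simpler when no progress display is needed.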

MCP Apps voice panel

Clients that support MCP Apps can call open_voice_panel to render the ui://pronunciation/voice-panel resource. The panel requests browser microphone access, records locally in the browser, uploads a WAV clip through analyze_uploaded_audio, and displays the returned transcript and feedback.

The uploaded clip is stored in the same voice session registry as MCP-only recordings, so assistants can call latest_voice_capture after the panel finishes and respond to both the development note and the pronunciation feedback. Clients without MCP Apps support should use the visible voice-capture workflow above.

Prompt Shortcuts

MCP clients that expose server prompts can start common workflows directly:

Prompt	Purpose
`start_voice_chat`	Start a local voice conversation with light feedback.
`daily_practice`	Run a short suggested-sentence practice loop.
`practice_focus`	Start a drill for a chosen focus and difficulty.
`troubleshoot_mic`	Inspect microphone devices and recording settings.

Configuration

Whisper model

Set MCP_PRONUNCIATION_MODEL to pick a different model size:

# Default — fast, English-only (~150 MB)
export MCP_PRONUNCIATION_MODEL=base.en

# Smaller / faster (~75 MB)
export MCP_PRONUNCIATION_MODEL=tiny.en

# More accurate (~470 MB)
export MCP_PRONUNCIATION_MODEL=small.en

# Multilingual options (larger)
export MCP_PRONUNCIATION_MODEL=small
export MCP_PRONUNCIATION_MODEL=medium

Available: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v3, large-v3-turbo. For English-only use, the .en variants are faster and more accurate at a given size.
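
A launcher script can validate the variable before starting the server. This is a sketch under stated assumptions: the model names are copied from the list above, but raising an error on an unknown name is a choice made here, since the README does not say how the server itself handles one. The `resolve_model` helper is hypothetical.

```python
import os
from typing import Mapping, Optional

AVAILABLE_MODELS = (
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v3", "large-v3-turbo",
)
DEFAULT_MODEL = "base.en"


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured model name, falling back to the default."""
    env = os.environ if env is None else env
    name = env.get("MCP_PRONUNCIATION_MODEL", "").strip() or DEFAULT_MODEL
    if name not in AVAILABLE_MODELS:
        raise ValueError(f"Unknown Whisper model {name!r}; expected one of {AVAILABLE_MODELS}")
    return name
```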

GPU (CUDA 12 + cuDNN 9) is auto-detected when available; otherwise the model runs on CPU with int8 quantization.

Cache location

By default Whisper weights are cached in ~/.cache/huggingface/hub/. Override with HF_HUB_CACHE:

export HF_HUB_CACHE=/path/to/cache

Startup preload

By default the server preloads the Whisper model in the background after the MCP handshake starts. Set MCP_PRONUNCIATION_PRELOAD=0 for registry inspection, Docker smoke tests, or other environments that only need tool discovery and should avoid model downloads:

export MCP_PRONUNCIATION_PRELOAD=0
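
The configuration summary for this package also lists false, no, and off as values that disable preload. A flag parser along those lines might look like the sketch below; the helper name and the case-insensitive comparison are assumptions.

```python
import os
from typing import Mapping, Optional

# Values documented as disabling preload.
DISABLE_VALUES = {"0", "false", "no", "off"}


def preload_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when the variable is set to a disabling value."""
    env = os.environ if env is None else env
    value = env.get("MCP_PRONUNCIATION_PRELOAD", "1")
    # Case-insensitive matching is an assumption, not documented behavior.
    return value.strip().lower() not in DISABLE_VALUES
```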

Temporary recordings

Recordings are written as temporary WAV files so assess can inspect the last recording. By default they are removed when the server process exits:

export MCP_PRONUNCIATION_AUDIO_RETENTION=session

Set MCP_PRONUNCIATION_AUDIO_RETENTION=keep if you want temporary recordings to remain on disk for manual inspection.
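
One common way to get the session behavior described above is to register temporary files for deletion at interpreter exit. The sketch below shows that pattern with the standard library only; the function names and the file prefix are hypothetical, and the server may implement retention differently.

```python
import atexit
import os
import tempfile
from pathlib import Path
from typing import List

_session_files: List[Path] = []


def new_recording_path(retention: str = "session") -> Path:
    """Create an empty temp WAV file and register it for cleanup if needed."""
    fd, name = tempfile.mkstemp(suffix=".wav", prefix="pronunciation-")
    os.close(fd)
    path = Path(name)
    if retention != "keep":
        _session_files.append(path)
    return path


def cleanup_session_files() -> int:
    """Delete registered recordings and return how many were removed."""
    removed = 0
    while _session_files:
        path = _session_files.pop()
        if path.exists():
            path.unlink()
            removed += 1
    return removed


# Run the cleanup when the interpreter exits normally.
atexit.register(cleanup_session_files)
```

Cleanup registered with `atexit` does not run if the process is killed abruptly, so files can remain in the temp directory after a crash.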

Microphone and auto-stop controls

By default the server uses your system default microphone. Native sounddevice recording stops after 1.5 seconds of detected silence. WSL2 records through Windows PowerShell and may wait for the full requested duration, so use a short duration value for quick voice checks. You can override native recording behavior:

# Use a specific input device index or name from the `check_mic` tool
export MCP_PRONUNCIATION_INPUT_DEVICE=1

# Options: low, normal, high
# high helps soft speakers; low is better in noisy rooms
export MCP_PRONUNCIATION_VAD_SENSITIVITY=high

# Seconds of silence before auto-stop, clamped to 0.3-5.0
export MCP_PRONUNCIATION_SILENCE_DURATION=2.0

Run check_mic to see the default input device, available device indexes, and the active VAD settings.
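
The three variables above can be read and normalized as shown in this sketch. The clamp range and the 1.5 second default come from this section; the `normal` default for sensitivity, the fallback for unrecognized values, and the `RecordingSettings` type are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass(frozen=True)
class RecordingSettings:
    input_device: Union[int, str, None]
    vad_sensitivity: str
    silence_duration: float


def load_recording_settings(env: Optional[Mapping[str, str]] = None) -> RecordingSettings:
    """Read the microphone and auto-stop variables into one settings object."""
    env = os.environ if env is None else env

    raw_device = env.get("MCP_PRONUNCIATION_INPUT_DEVICE", "").strip()
    # Digits are treated as a device index, anything else as a device name.
    device: Union[int, str, None]
    if not raw_device:
        device = None
    elif raw_device.isdigit():
        device = int(raw_device)
    else:
        device = raw_device

    sensitivity = env.get("MCP_PRONUNCIATION_VAD_SENSITIVITY", "normal").strip().lower()
    if sensitivity not in {"low", "normal", "high"}:
        sensitivity = "normal"  # assumed fallback for unknown values

    try:
        silence = float(env.get("MCP_PRONUNCIATION_SILENCE_DURATION", "1.5"))
    except ValueError:
        silence = 1.5  # documented default
    silence = min(5.0, max(0.3, silence))  # clamp to the documented range

    return RecordingSettings(device, sensitivity, silence)
```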

Model override in MCP clients

# Codex CLI
codex mcp add --env MCP_PRONUNCIATION_MODEL=small.en pronunciation -- uvx mcp-server-pronunciation

# Claude Code
claude mcp add pronunciation -e MCP_PRONUNCIATION_MODEL=small.en -- uvx mcp-server-pronunciation

Phoneme analysis extras

Installing mcp-server-pronunciation[phoneme] enables wav2vec2-based CTC forced alignment. It verifies which reference words the user acoustically produced, regardless of how Whisper's language-model-weighted decoder rewrote them — so rare proper nouns and domain terms no longer surface as false mispronunciations. On first run the extra downloads ~360 MB of weights into ~/.cache/torch/hub/ (override via TORCH_HOME). Inference is CPU-only by default and runtime-quantized to int8 (~95 MB RAM).

Without the extra, assess / practice still run the full pipeline except for the forced-alignment step: you get Needleman-Wunsch word alignment against the Whisper hypothesis, CMUdict phoneme-sequence diff, learner-profile hints, and prosody.
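
To illustrate what a phoneme-sequence diff produces, the sketch below compares two ARPAbet sequences with the standard library's `difflib.SequenceMatcher`. This is a stand-in technique, not the project's diff: the four-word pronunciation table is hand-written in place of a CMUdict lookup, stress digits are omitted, and the `phoneme_diff` helper is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Tiny hand-written ARPAbet table standing in for a CMUdict lookup.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "three": ["TH", "R", "IY"],
    "tree": ["T", "R", "IY"],
    "think": ["TH", "IH", "NG", "K"],
    "sink": ["S", "IH", "NG", "K"],
}


def phoneme_diff(expected_word: str, produced_word: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return the non-matching (op, expected, produced) phoneme spans."""
    expected = PRONUNCIATIONS[expected_word.lower()]
    produced = PRONUNCIATIONS[produced_word.lower()]
    matcher = SequenceMatcher(a=expected, b=produced, autojunk=False)
    return [
        (tag, expected[i1:i2], produced[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For the pair "three" and "tree" the only differing span is the initial consonant, which is the kind of result that can feed a minimal-pair drill.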

Platform Support

Platform	Recording method	Status
macOS	sounddevice (bundled PortAudio)	Supported
Linux	sounddevice (needs `libportaudio2`)	Supported
Windows	sounddevice (bundled PortAudio)	Supported
WSL2	PowerShell MCI (winmm.dll)	Supported

WSL2 note: WSLg's PulseAudio does not forward microphone audio from the Windows host. This server detects WSL2 automatically and records through PowerShell on the Windows side instead. WSL2 recording may wait for the full requested duration instead of auto-stopping on silence.
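
A widely used heuristic for detecting WSL is to look for "microsoft" in the kernel version string. The sketch below shows that heuristic; it is an assumption about how detection could work, not a description of this server's check, and it does not distinguish WSL1 from WSL2.

```python
from pathlib import Path


def looks_like_wsl(kernel_version: str) -> bool:
    """Heuristic: WSL kernels report 'microsoft' in their version string."""
    return "microsoft" in kernel_version.lower()


def running_under_wsl(proc_version: Path = Path("/proc/version")) -> bool:
    """Read the kernel version file if present; return False when it is unreadable."""
    try:
        return looks_like_wsl(proc_version.read_text())
    except OSError:
        return False
```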

Troubleshooting

`uvx mcp-server-pronunciation doctor` is your first stop

It reports on PortAudio, input devices, Whisper model cache, pronunciation resources, optional forced-alignment dependencies, free disk space, and Python version. Run it whenever something feels off.

`sounddevice` import fails on Linux

You're missing libportaudio2. See the install section above. After installing:

uvx mcp-server-pronunciation doctor

No audio captured / empty recording

macOS: System Settings → Privacy & Security → Microphone. Grant access to the app that launched your MCP client: your terminal app if you run Codex CLI or Claude Code from a terminal, or Claude Desktop itself.
Linux: Check pavucontrol (PulseAudio) or pw-cli list-objects (PipeWire) for input levels. On PipeWire-only systems, install pipewire-alsa.
WSL2: Test your mic in Windows Settings → Sound → Input. The server records through Windows, not through WSLg.

First run is slow

The Whisper model downloads on first use (~150 MB for base.en). Pre-download it once:

uvx mcp-server-pronunciation pull-model base.en

Subsequent runs reuse the cached weights. If startup still feels slow, try MCP_PRONUNCIATION_MODEL=tiny.en.

Claude Desktop on macOS: `spawn uvx ENOENT`

Claude Desktop launches MCP servers from a GUI-only environment without ~/.local/bin on PATH. Use the absolute path to uvx in your config (/Users/YOU/.local/bin/uvx or wherever `which uvx` reports).

Known Limitations

This is a stable package release, but the pronunciation and prosody feedback remain experimental coaching signals. Bugs, runtime errors, inaccurate feedback, and platform-specific recording issues can still occur.
Pronunciation scores are coaching signals, not standardized-test, clinical, or native-speaker-equivalence judgments.
Whisper can still mishear rare names, domain terms, short clips, quiet audio, or heavily accented speech. The optional [phoneme] extra reduces some reference-sentence false positives but does not eliminate them.
Prosody feedback is heuristic. Pitch tracking can be unreliable with noisy audio, very short utterances, vocal fry, overlapping speech, or clipped recordings.
Learner-profile hints are intentionally rule-based. The current package includes Korean-L1 hints, but they can miss errors, over-trigger on ASR mistakes, and should be treated as targeted practice aids. Contributions for additional L1 profiles are welcome.
First-time setup may download model or pronunciation resources. Run doctor and pull-model before relying on the server in a live session.
Temporary WAV recordings are written under the system temp directory so that the last recording can be assessed. By default they are removed when the server exits. Set MCP_PRONUNCIATION_AUDIO_RETENTION=keep if you want to inspect them later.

Benchmark Status

This project is moving toward benchmark-backed scoring. Planned public benchmark work is tracked in ROADMAP.md, the testing methodology lives in docs/TESTING.md, and the current benchmark helper docs live in docs/BENCHMARKS.md. The primary candidate is Speechocean762 because it has a permissive CC BY 4.0 license and multi-level expert pronunciation scores. L2-ARCTIC is useful for phone-error and learner-profile research checks, including Korean-L1 subset review, but its non-commercial license means it should remain optional and separate from default release claims.

Publication Status

The source repository is public. PyPI, GitHub Release, and MCP Registry publication steps are tracked in docs/PUBLICATION.md.

Privacy

All audio processing happens locally on your machine.
Recordings are temporary .wav files under your system temp directory ($TMPDIR) and are removed when the server exits unless MCP_PRONUNCIATION_AUDIO_RETENTION=keep is set.
The Whisper model runs locally — no audio data is sent to any external service.
When the optional [phoneme] extra is installed, the wav2vec2 forced aligner also runs locally. Weights are downloaded once from the PyTorch Hub.
No telemetry. No analytics. No network calls except the one-time model weight downloads (Whisper from Hugging Face, wav2vec2 from PyTorch Hub).

Development

git clone https://github.com/JuhongPark/mcp-server-pronunciation.git
cd mcp-server-pronunciation
uv sync --extra dev
uv run pytest -v
uv run ruff check .
uv run ruff format --check .

To work on the optional wav2vec2 forced-alignment path, install the phoneme extra as well:

uv sync --extra dev --extra phoneme

Support

Issues: https://github.com/JuhongPark/mcp-server-pronunciation/issues

License

MIT. See LICENSE.

Third-party components (all MIT / permissive):

faster-whisper — MIT
OpenAI Whisper models — MIT
CTranslate2 — MIT
sounddevice — MIT
PortAudio — MIT
cmudict — BSD
g2p-en — Apache 2.0
librosa — ISC
Optional ([phoneme] extra): PyTorch — BSD, torchaudio — BSD, wav2vec2 weights — MIT

Configuration

MCP_PRONUNCIATION_MODEL

faster-whisper model size. Default: base.en. Options: tiny.en, base.en, small.en, medium.en, large-v3, large-v3-turbo.

HF_HUB_CACHE

Override Hugging Face Hub cache directory where Whisper weights are stored.

MCP_PRONUNCIATION_PRELOAD

Set to 0, false, no, or off to disable background Whisper model preload during registry inspection or Docker smoke tests.

MCP_PRONUNCIATION_AUDIO_RETENTION

Temporary WAV retention policy. Use session to delete recordings when the server exits, or keep for manual inspection.

TORCH_HOME

Override PyTorch cache directory for optional wav2vec2 forced-alignment weights.

Registry: active

Package: mcp-server-pronunciation

Transport: STDIO

Updated: May 3, 2026


Privacy

All audio processing happens locally on your machine.
Recordings are temporary .wav files under your system temp directory ($TMPDIR) and are removed when the server exits unless MCP_PRONUNCIATION_AUDIO_RETENTION=keep is set.
The Whisper model runs locally — no audio data is sent to any external service.
When the optional [phoneme] extra is installed, the wav2vec2 forced aligner also runs locally. Weights are downloaded once from the PyTorch Hub.
No telemetry. No analytics. No network calls except the one-time model weight downloads (Whisper from Hugging Face, wav2vec2 from PyTorch Hub).

Development

git clone https://github.com/JuhongPark/mcp-server-pronunciation.git
cd mcp-server-pronunciation
uv sync --extra dev
uv run pytest -v
uv run ruff check .
uv run ruff format --check .

To work on the optional wav2vec2 forced-alignment path, install the phoneme extra as well:

uv sync --extra dev --extra phoneme

Support

Issues: https://github.com/JuhongPark/mcp-server-pronunciation/issues

License

MIT. See LICENSE.

Third-party components (all MIT / permissive):

faster-whisper — MIT
OpenAI Whisper models — MIT
CTranslate2 — MIT
sounddevice — MIT
PortAudio — MIT
cmudict — BSD
g2p-en — Apache 2.0
librosa — ISC
Optional ([phoneme] extra): PyTorch — BSD, torchaudio — BSD, wav2vec2 weights — MIT
