Generate natural speech from text using VoxCPM2, a 2B-parameter multilingual TTS model running on our self-hosted GPU infrastructure. Design a voice from a text description, or clone one from a reference audio sample. Supports 30 languages with 48kHz output.
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "text=(A warm elderly man, gentle voice) Hello, welcome to our store." \
-F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": false }
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "text=This is my cloned voice speaking." \
-F "voice_ref=@speaker.wav" \
-F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": true }
curl https://api.pixelapi.dev/v1/tts/status/YOUR_JOB_ID \
-H "Authorization: Bearer YOUR_API_KEY"
# Response: { "id": "uuid", "status": "completed", "output_url": "https://..." }
Use parentheses in the text field to describe the voice you want. VoxCPM2 generates speech matching your description.
| Description | Voice Type |
|---|---|
| (A warm elderly man) | Deep, wise, kind old man |
| (A cheerful young girl) | High-pitched, energetic child |
| (A professional news anchor) | Clear, measured, authoritative |
| (A dramatic movie trailer narrator) | Deep, intense, cinematic |
| (A British gentleman, refined) | RP British accent, cultured |
| (A whisper, secretive) | Quiet, conspiratorial |
Combine descriptions freely: (A wise elderly professor, warm and thoughtful)
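As a minimal sketch of the convention above, the helper below joins several description fragments into one parenthesized prefix and prepends it to the script. The function name `build_tts_text` is hypothetical and not part of the API; only the "(description) text" layout comes from this page.

```python
def build_tts_text(script, *traits):
    """Prefix `script` with a parenthesized voice description built from `traits`."""
    # Drop empty fragments so stray commas never reach the description.
    cleaned = [t.strip() for t in traits if t and t.strip()]
    if not cleaned:
        # No description given: send the script unchanged.
        return script.strip()
    return f"({', '.join(cleaned)}) {script.strip()}"


print(build_tts_text("Hello, welcome to our store.",
                     "A warm elderly man", "gentle voice"))
```

The returned string can be passed directly as the text field of a generate request.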
POST /v1/tts/generate: generate speech from text. Returns a job ID; poll /v1/tts/status/{id} for completion.
| Parameter | Type | Description |
|---|---|---|
| text | string | Text to synthesize. Include voice description in parentheses, e.g. (A warm elderly man) Hello everyone |
| language | string | Language hint: auto, en, zh, hi, es, fr, de, ja, ko, ru, ar, etc. |
| voice_ref | file | Optional WAV/MP3 reference audio for voice cloning. Min 5 seconds, 16kHz+. If omitted, uses Voice Design (text description). |
| style | string | Optional style modifier: (cheerful), (sad slow), (whispering), (happy) |
| cfg_value | float | Cloning strength 0.5–5.0 (default: 2.0). Higher = closer to reference voice. |
| inference_timesteps | int | Quality: 4=fast, 10=balanced, 20=best quality (default: 10) |
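Checking parameters locally before sending a request can catch mistakes early. The sketch below is one way to do that, not the server's actual validation: `build_generate_fields` is a hypothetical helper, the `cfg_value` range and the defaults come from the table above, and the rule that `inference_timesteps` must be a positive integer is an assumption.

```python
def build_generate_fields(text, language="auto", cfg_value=2.0,
                          inference_timesteps=10, style=None):
    """Validate arguments and return form fields for POST /v1/tts/generate."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    # Range taken from the parameter table: 0.5 to 5.0 inclusive.
    if not 0.5 <= cfg_value <= 5.0:
        raise ValueError("cfg_value must be between 0.5 and 5.0")
    # Assumption: any positive integer is accepted; the table lists 4, 10 and 20.
    if not isinstance(inference_timesteps, int) or inference_timesteps < 1:
        raise ValueError("inference_timesteps must be a positive integer")
    fields = {
        "text": text,
        "language": language,  # "auto" is a documented hint, not a confirmed default
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
    }
    if style:
        fields["style"] = style
    return fields


print(build_generate_fields("(A whisper, secretive) Follow me.", language="en"))
```

The resulting dictionary can be passed as `data=` to `requests.post`, with the reference audio supplied separately through `files=` when cloning.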
GET /v1/tts/status/{id}: poll for generation status. When the job completes, the response includes output_url, the URL of the audio file.
List supported languages.
VoxCPM2 supports 30 languages including: English, Hindi, Chinese, Spanish, French, German, Japanese, Korean, Russian, Arabic, Portuguese, Italian, Dutch, Polish, Turkish, Vietnamese, Thai, Indonesian, Malay, and more.
| Operation | Credits | USD | Notes |
|---|---|---|---|
| Voice Design (text → speech) | 50 credits/min | $0.050/min | No reference audio needed |
| Voice Cloning (reference → speech) | 100 credits/min | $0.100/min | Requires 16kHz+ reference audio (WAV/MP3) |
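The rates above imply 1 credit = $0.001, so a rough cost estimate is a single multiplication. The sketch below assumes billing is proportional to the length of the generated audio with no rounding to whole minutes; this page does not state the billing granularity, so treat the result as an estimate. `estimate_cost` is a hypothetical helper.

```python
# Rates from the pricing table, in credits per minute of audio.
CREDITS_PER_MINUTE = {"design": 50, "clone": 100}
USD_PER_CREDIT = 0.001  # implied by 50 credits/min = $0.050/min


def estimate_cost(audio_seconds, mode="design"):
    """Return (credits, usd) for `audio_seconds` of generated audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Assumption: pro-rata billing, no rounding up to whole minutes.
    credits = CREDITS_PER_MINUTE[mode] * audio_seconds / 60
    return credits, round(credits * USD_PER_CREDIT, 4)


credits, usd = estimate_cost(90, "design")
print(f"{credits:.1f} credits, ${usd:.3f}")
```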
import requests
import time

API_KEY = "your_api_key"
BASE = "https://api.pixelapi.dev"

# 1. Generate speech with Voice Design
resp = requests.post(
    f"{BASE}/v1/tts/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    data={
        "text": "(A warm elderly man, gentle voice) Hello, welcome to our store.",
        "language": "en",
        "inference_timesteps": 10,
    },
)
resp.raise_for_status()  # stop early on HTTP errors such as an invalid API key
job = resp.json()
# credits_used is read with .get() because the sample responses above do not show it
print(f"Job: {job['id']}, credits: {job.get('credits_used')}")

# 2. Poll until complete
while True:
    status = requests.get(
        f"{BASE}/v1/tts/status/{job['id']}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()
    print(f"Status: {status['status']}")
    if status["status"] == "completed":
        print(f"Audio URL: {status['output_url']}")
        break
    elif status["status"] == "failed":
        print(f"Error: {status.get('error')}")
        break
    time.sleep(2)

# Clone a voice from reference audio
with open("speaker.wav", "rb") as ref_audio:  # closes the file after upload
    resp = requests.post(
        f"{BASE}/v1/tts/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        data={
            "text": "This is my cloned voice speaking.",
            "language": "en",
            "cfg_value": 2.0,  # cloning strength
        },
        files={"voice_ref": ref_audio},
    )
resp.raise_for_status()
job = resp.json()
print(f"Clone job: {job['id']}")
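Since reference audio must be at least 5 seconds long and sampled at 16kHz or higher, it can be worth checking a clip locally before uploading it. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files (not MP3). `check_reference_wav` is a hypothetical helper, and the server may apply additional checks that are not documented here.

```python
import wave

MIN_SECONDS = 5.0         # documented minimum reference length
MIN_SAMPLE_RATE = 16_000  # documented minimum sample rate, in Hz


def check_reference_wav(path):
    """Return (ok, reason) for a PCM WAV file against the documented minimums."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if rate < MIN_SAMPLE_RATE:
        return False, f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz"
    if duration < MIN_SECONDS:
        return False, f"duration {duration:.2f}s is below {MIN_SECONDS}s"
    return True, "ok"


# Example usage (requires a local file named speaker.wav):
# ok, reason = check_reference_wav("speaker.wav")
# print(ok, reason)
```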
| Feature | PixelAPI VoxCPM | ElevenLabs | edge-tts |
|---|---|---|---|
| Voice Design (text → voice) | ✅ $0.05/min (50 credits) | $0.30/min | ❌ |
| Voice Cloning | ✅ $0.10/min (100 credits) | $0.30/min + sub | ❌ |
| Languages | 30 | 29 | 100+ |
| Self-hosted | ✅ Own GPUs | ❌ API | ✅ Microsoft |
| Commercial License | ✅ Apache 2.0 | ✅ | ✅ |
| Real-time on consumer GPU | ✅ RTX 6000 Ada | N/A | ✅ |
Voice Design and Voice Cloning share the same concurrent job limits as other tools. Pro plan users get higher concurrency. Generation time is typically 3–15 seconds depending on text length.
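Because generation usually finishes within seconds but jobs may wait in a queue when concurrency limits are reached, a polling loop benefits from an overall deadline. The following is a sketch under stated assumptions: `wait_for_job` and its parameters are hypothetical, the 60-second default timeout is an arbitrary choice, and `fetch_status` is any callable that returns the parsed JSON from /v1/tts/status/{id}.

```python
import time


def wait_for_job(fetch_status, job_id, timeout=60.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status(job_id)` until the job completes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        # "completed" and "failed" are the terminal states shown on this page.
        if status.get("status") in ("completed", "failed"):
            return status
        # Give up if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(
                f"job {job_id} still {status.get('status')!r} after {timeout}s")
        sleep(interval)


# Wiring it to the API with requests (not executed here):
# def fetch_status(job_id):
#     return requests.get(f"{BASE}/v1/tts/status/{job_id}",
#                         headers={"Authorization": f"Bearer {API_KEY}"}).json()
# result = wait_for_job(fetch_status, job["id"])
```

Passing `sleep` and `clock` as arguments keeps the helper easy to test without real delays or network calls.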