🎙️ Voice Design & TTS API

Generate natural speech from text using VoxCPM2, a 2B-parameter multilingual TTS model running on our self-hosted GPU infrastructure. Design a voice from a text description, or clone one from a reference audio sample. Supports 30 languages with 48 kHz output.

Voice Design generates speech from text descriptions — no reference audio needed. Voice Cloning requires a reference audio file. Both run on RTX 6000 Ada GPUs for real-time generation.

Quick Start

Voice Design (text → speech)

curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=(A warm elderly man, gentle voice) Hello, welcome to our store." \
  -F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": false }

Voice Cloning (reference → speech)

curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=This is my cloned voice speaking." \
  -F "voice_ref=@speaker.wav" \
  -F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": true }

Check Status

curl https://api.pixelapi.dev/v1/tts/status/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
# Response: { "id": "uuid", "status": "completed", "output_url": "https://..." }

Voice Description Guide

Use parentheses in the text field to describe the voice you want. VoxCPM2 generates speech matching your description.

| Description | Voice Type |
| --- | --- |
| (A warm elderly man) | Deep, wise, kind old man |
| (A cheerful young girl) | High-pitched, energetic child |
| (A professional news anchor) | Clear, measured, authoritative |
| (A dramatic movie trailer narrator) | Deep, intense, cinematic |
| (A British gentleman, refined) | RP British accent, cultured |
| (A whisper, secretive) | Quiet, conspiratorial |

Combine descriptions freely: (A wise elderly professor, warm and thoughtful)
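
Because the description is just a parenthesized prefix in the text field, it can be assembled programmatically. The following is a minimal sketch; `with_voice` is a hypothetical helper name used for illustration and is not part of the API.

```python
def with_voice(text: str, *traits: str) -> str:
    """Prefix `text` with a parenthesized voice description built from `traits`.

    Hypothetical helper: it only formats the string that goes into the
    `text` field; it does not call the API.
    """
    cleaned = [t.strip() for t in traits if t.strip()]
    if not cleaned:
        # No description given: send the text unchanged.
        return text
    return f"({', '.join(cleaned)}) {text}"


print(with_voice("Hello, welcome to our store.", "A warm elderly man", "gentle voice"))
# (A warm elderly man, gentle voice) Hello, welcome to our store.
```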

API Endpoints

POST /v1/tts/generate

Generate speech from text. Returns a job ID — poll /v1/tts/status/{id} for completion.

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | Text to synthesize. Include voice description in parentheses, e.g. (A warm elderly man) Hello everyone |
| language | string | Language hint: auto, en, zh, hi, es, fr, de, ja, ko, ru, ar, etc. |
| voice_ref | file | Optional WAV/MP3 reference audio for voice cloning. Min 5 seconds, 16kHz+. If omitted, uses Voice Design (text description). |
| style | string | Optional style modifier: (cheerful), (sad slow), (whispering), (happy) |
| cfg_value | float | Cloning strength 0.5–5.0 (default: 2.0). Higher = closer to reference voice. |
| inference_timesteps | int | Quality: 4=fast, 10=balanced, 20=best quality (default: 10) |
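
It can help to check these parameters on the client before submitting a job. The sketch below validates against the ranges documented above and returns the form fields for the request. `build_tts_fields` is a hypothetical helper, the `"auto"` default for `language` is an assumption (the default is not stated here), and how the server itself treats out-of-range values is not documented.

```python
from typing import Optional


def build_tts_fields(
    text: str,
    language: str = "auto",          # assumption: "auto" as the client-side default
    cfg_value: float = 2.0,          # documented default
    inference_timesteps: int = 10,   # documented default
    style: Optional[str] = None,
) -> dict:
    """Validate inputs against the documented ranges and return form fields."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0.5 <= cfg_value <= 5.0:
        raise ValueError("cfg_value must be between 0.5 and 5.0")
    if inference_timesteps < 1:
        raise ValueError("inference_timesteps must be a positive integer")
    fields = {
        "text": text,
        "language": language,
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
    }
    if style:
        # Only send the optional style modifier when one was given.
        fields["style"] = style
    return fields


fields = build_tts_fields(
    "(A professional news anchor) Good evening.",
    language="en",
    inference_timesteps=20,
)
print(fields)
```

The returned dictionary can be passed as `data=` to `requests.post` for `/v1/tts/generate`.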

GET /v1/tts/status/{generation_id}

Poll for generation status. When complete, returns output_url with the audio file.
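
A polling loop is safer with an upper bound on how long it waits. The sketch below takes a caller-supplied function that fetches the status, so the same loop works with any HTTP client. `wait_for_job`, its parameter names, and the default timeout are illustrative choices, not part of the API; the `completed` and `failed` status values are the ones used elsewhere on this page.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout: float = 120.0,   # assumption: an arbitrary client-side limit
    interval: float = 2.0,
) -> dict:
    """Call `fetch_status` until the job completes or fails, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status.get('status')!r} after {timeout}s")
        time.sleep(interval)


# Usage with requests (not executed here):
#   import requests
#   url = f"https://api.pixelapi.dev/v1/tts/status/{job_id}"
#   headers = {"Authorization": f"Bearer {API_KEY}"}
#   result = wait_for_job(lambda: requests.get(url, headers=headers).json())
```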

GET /v1/tts/languages

List supported languages.

Supported Languages

VoxCPM2 supports 30 languages including: English, Hindi, Chinese, Spanish, French, German, Japanese, Korean, Russian, Arabic, Portuguese, Italian, Dutch, Polish, Turkish, Vietnamese, Thai, Indonesian, Malay, and more.

Pricing

| Operation | Credits | USD | Notes |
| --- | --- | --- | --- |
| Voice Design (text → speech) | 50 credits/min | $0.050/min | No reference audio needed |
| Voice Cloning (reference → speech) | 100 credits/min | $0.100/min | Requires 16kHz+ reference audio |

Cheaper than ElevenLabs: ElevenLabs charges $0.30/min for TTS. At $0.050/min, Voice Design is 6× cheaper; at $0.100/min, Voice Cloning is 3× cheaper.
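
The per-minute rates translate into a simple cost estimate. The sketch below assumes billing is proportional to audio duration with no rounding or minimum charge, which this page does not confirm; the rate of $0.001 per credit is implied by 50 credits = $0.050. `estimate_cost` is a hypothetical helper.

```python
CREDITS_PER_MINUTE = {"design": 50, "clone": 100}
USD_PER_CREDIT = 0.001  # implied by 50 credits/min = $0.050/min


def estimate_cost(seconds: float, mode: str = "design") -> tuple:
    """Return (credits, usd) for `seconds` of audio.

    Assumption: linear, per-second proration; actual billing granularity
    is not documented here.
    """
    credits = CREDITS_PER_MINUTE[mode] * seconds / 60
    return credits, credits * USD_PER_CREDIT


credits, usd = estimate_cost(90, "clone")
print(f"{credits:.0f} credits, ${usd:.3f}")
```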

Python SDK Example

import requests
import time

API_KEY = "your_api_key"
BASE = "https://api.pixelapi.dev"

# 1. Generate speech with Voice Design
resp = requests.post(f"{BASE}/v1/tts/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    data={
        "text": "(A warm elderly man, gentle voice) Hello, welcome to our store.",
        "language": "en",
        "inference_timesteps": 10,
    }
)
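resp.raise_for_status()  # surface HTTP errors (bad key, invalid parameters) early
job = resp.json()
# credits_used does not appear in the sample responses above, so read it defensively
print(f"Job: {job['id']}, credits: {job.get('credits_used')}")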

# 2. Poll until complete
while True:
    status = requests.get(f"{BASE}/v1/tts/status/{job['id']}",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()
    print(f"Status: {status['status']}")
    if status["status"] == "completed":
        print(f"Audio URL: {status['output_url']}")
        break
    elif status["status"] == "failed":
        print(f"Error: {status.get('error')}")
        break
    time.sleep(2)

Voice Cloning Example

# Clone a voice from reference audio
# (reuses requests, BASE and API_KEY from the example above)
with open("speaker.wav", "rb") as ref_audio:  # close the file once the upload finishes
    resp = requests.post(f"{BASE}/v1/tts/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        data={
            "text": "This is my cloned voice speaking.",
            "language": "en",
            "cfg_value": 2.0,  # cloning strength
        },
        files={"voice_ref": ref_audio},
    )
resp.raise_for_status()
job = resp.json()
print(f"Clone job: {job['id']}")

Legal Notice: Voice Cloning requires you to have rights to the reference audio. Do not clone voices without consent.

Comparison with Alternatives

| Feature | PixelAPI VoxCPM | ElevenLabs | edge-tts |
| --- | --- | --- | --- |
| Voice Design (text → voice) | ✅ 50 credits/min | $0.30/min | |
| Voice Cloning | ✅ 100 credits/min | $0.30/min + sub | |
| Languages | 30 | 29 | 100+ |
| Self-hosted | ✅ Own GPUs | ❌ API | ✅ Microsoft |
| Commercial License | ✅ Apache 2.0 | | |
| Real-time on consumer GPU | ✅ RTX 6000 Ada | N/A | |

Rate Limits

Voice Design and Voice Cloning share the same concurrent job limits as other tools. Pro plan users get higher concurrency. Generation time is typically 3–15 seconds depending on text length.
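
When submitting many texts, capping the number of in-flight jobs on the client keeps a batch within the concurrency limit. The sketch below uses a thread pool for this; `run_bounded` is a hypothetical helper, `synthesize` stands for any caller-supplied function that submits one job and waits for its result, and the default of 2 workers is a placeholder because the actual limit per plan is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_bounded(
    texts: Iterable[str],
    synthesize: Callable[[str], str],
    max_concurrent: int = 2,  # placeholder: set this to your plan's limit
) -> List[str]:
    """Run `synthesize` over `texts` with at most `max_concurrent` in flight.

    Results come back in the same order as the input texts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(synthesize, texts))


# Example with a stand-in function instead of a real API call:
print(run_bounded(["first line", "second line"], str.upper))
```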

Need help? Email support@pixelapi.dev or check the full API reference.