🎙️ Voice Design & TTS API

Generate natural speech from text using VoxCPM2, a 2B-parameter multilingual TTS model running on our self-hosted GPU infrastructure. Design a voice from a text description, or clone one from a reference audio sample. The model supports 30 languages and produces 48 kHz output.

Voice Design needs no reference audio, while Voice Cloning requires a reference audio file. Both run on RTX 6000 Ada GPUs for real-time generation.

Quick Start

Voice Design (text → speech)

curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=(A warm elderly man, gentle voice) Hello, welcome to our store." \
  -F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": false }

Voice Cloning (reference → speech)

curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=This is my cloned voice speaking." \
  -F "voice_ref=@speaker.wav" \
  -F "language=en"
# Response: { "id": "uuid", "status": "queued", "is_clone": true }

Check Status

curl https://api.pixelapi.dev/v1/tts/status/YOUR_JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
# Response: { "id": "uuid", "status": "completed", "output_url": "https://..." }

Voice Description Guide

Use parentheses in the text field to describe the voice you want. VoxCPM2 generates speech matching your description.

Description	Voice Type
`(A warm elderly man)`	Deep, wise, kind old man
`(A cheerful young girl)`	High-pitched, energetic child
`(A professional news anchor)`	Clear, measured, authoritative
`(A dramatic movie trailer narrator)`	Deep, intense, cinematic
`(A British gentleman, refined)`	RP British accent, cultured
`(A whisper, secretive)`	Quiet, conspiratorial

Combine descriptions freely: (A wise elderly professor, warm and thoughtful)
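
As a minimal sketch of the convention above, the helper below joins several description fragments into one parenthesized prefix and prepends it to the spoken text. The function name `build_tts_text` is hypothetical and not part of the API; it only formats the string that goes into the `text` field.

```python
def build_tts_text(description_parts, spoken_text):
    """Prefix spoken_text with a parenthesized voice description.

    description_parts: iterable of fragments such as "A warm elderly man".
    Empty or whitespace-only fragments are ignored.
    """
    parts = [p.strip() for p in description_parts if p and p.strip()]
    if not parts:
        # No description given: send the spoken text unchanged.
        return spoken_text.strip()
    return f"({', '.join(parts)}) {spoken_text.strip()}"


if __name__ == "__main__":
    print(build_tts_text(
        ["A wise elderly professor", "warm and thoughtful"],
        "Welcome to today's lecture.",
    ))
```

The result can be passed directly as the `text` parameter of the generate request.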

API Endpoints

POST /v1/tts/generate

Generate speech from text. Returns a job ID — poll /v1/tts/status/{id} for completion.

Parameter	Type	Description
`text`	string	Text to synthesize. Include voice description in parentheses, e.g. `(A warm elderly man) Hello everyone`
`language`	string	Language hint: `auto`, `en`, `zh`, `hi`, `es`, `fr`, `de`, `ja`, `ko`, `ru`, `ar`, etc.
`voice_ref`	file	Optional WAV/MP3 reference audio for voice cloning. Min 5 seconds, 16kHz+. If omitted, uses Voice Design (text description).
`style`	string	Optional style modifier: `(cheerful)`, `(sad slow)`, `(whispering)`, `(happy)`
`cfg_value`	float	Cloning strength 0.5–5.0 (default: 2.0). Higher = closer to reference voice.
`inference_timesteps`	int	Quality: `4`=fast, `10`=balanced, `20`=best quality (default: 10)
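
The parameter table above can be turned into a small client-side check before a request is sent. This is a sketch under assumptions: the helper name `build_generate_payload` is hypothetical, the range for `cfg_value` is taken from the table, and since the table lists only 4, 10, and 20 as `inference_timesteps` presets without saying whether other values are accepted, the code only rejects non-positive values.

```python
def build_generate_payload(text, language="auto", cfg_value=2.0,
                           inference_timesteps=10, style=None):
    """Build the form fields for POST /v1/tts/generate, validating locally."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    if not 0.5 <= cfg_value <= 5.0:
        # Range documented in the parameter table.
        raise ValueError("cfg_value must be between 0.5 and 5.0")
    if inference_timesteps < 1:
        # Assumption: any positive integer is allowed; 4/10/20 are presets.
        raise ValueError("inference_timesteps must be a positive integer")

    payload = {
        "text": text,
        "language": language,
        "cfg_value": cfg_value,
        "inference_timesteps": inference_timesteps,
    }
    if style:
        payload["style"] = style  # optional, e.g. "(cheerful)"
    return payload


if __name__ == "__main__":
    print(build_generate_payload("(A warm elderly man) Hello everyone", language="en"))
```

The returned dictionary is suitable as the `data=` argument of `requests.post`; the server remains the authority on which values are valid.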

GET /v1/tts/status/{generation_id}

Poll for generation status. When complete, returns output_url with the audio file.

GET /v1/tts/languages

List supported languages.

Supported Languages

VoxCPM2 supports 30 languages including: English, Hindi, Chinese, Spanish, French, German, Japanese, Korean, Russian, Arabic, Portuguese, Italian, Dutch, Polish, Turkish, Vietnamese, Thai, Indonesian, Malay, and more.

Pricing

Operation	Credits	USD	Notes
Voice Design (text → speech)	50 credits/min	$0.050/min	No reference audio needed
Voice Cloning (reference → speech)	100 credits/min	$0.100/min	Requires 16kHz+ reference audio (WAV/MP3)

3–6× cheaper than ElevenLabs: ElevenLabs charges $0.30/min for TTS. Our Voice Design is 6× cheaper at $0.050/min, and Voice Cloning is 3× cheaper at $0.100/min.
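
To make the pricing arithmetic concrete, here is a rough estimator. It is a sketch, not the billing logic: the per-credit price of $0.001 is derived from the table (50 credits = $0.050), and it assumes usage is billed proportionally to audio length, which this page does not state. The name `estimate_cost` is hypothetical.

```python
# Rates from the pricing table above.
CREDITS_PER_MINUTE = {"design": 50, "clone": 100}
USD_PER_CREDIT = 0.001  # derived: 50 credits/min == $0.050/min


def estimate_cost(duration_seconds, mode="design"):
    """Return (credits, usd) for an audio clip of the given length.

    Assumption: cost scales linearly with duration; actual rounding
    rules (per second, per minute, minimum charge) are not documented here.
    """
    if mode not in CREDITS_PER_MINUTE:
        raise ValueError(f"unknown mode: {mode!r}")
    credits = CREDITS_PER_MINUTE[mode] * duration_seconds / 60
    return credits, credits * USD_PER_CREDIT


if __name__ == "__main__":
    credits, usd = estimate_cost(90, "design")
    print(f"{credits:.0f} credits, ${usd:.3f}")
```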

Python Example

import requests
import time

API_KEY = "your_api_key"
BASE = "https://api.pixelapi.dev"

# 1. Generate speech with Voice Design
resp = requests.post(f"{BASE}/v1/tts/generate",
    headers={"Authorization": f"Bearer {API_KEY}"},
    data={
        "text": "(A warm elderly man, gentle voice) Hello, welcome to our store.",
        "language": "en",
        "inference_timesteps": 10,
    }
)
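resp.raise_for_status()  # stop early on HTTP errors (bad key, invalid parameters)
job = resp.json()
# credits_used is not part of the response shown in Quick Start, so read it defensively
print(f"Job: {job['id']}, credits: {job.get('credits_used', 'n/a')}")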

# 2. Poll until complete
while True:
    status = requests.get(f"{BASE}/v1/tts/status/{job['id']}",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()
    print(f"Status: {status['status']}")
    if status["status"] == "completed":
        print(f"Audio URL: {status['output_url']}")
        break
    elif status["status"] == "failed":
        print(f"Error: {status.get('error')}")
        break
    time.sleep(2)
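
The loop above polls forever if a job never leaves the queue. A possible variant with a deadline is sketched below. The helper `wait_for_job` and its parameters are hypothetical; it takes any zero-argument callable that returns the parsed status JSON, so it does not depend on a particular HTTP client. The 120-second default is an arbitrary choice, not a documented limit.

```python
import time


def wait_for_job(fetch_status, timeout=120.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job is completed or failed.

    fetch_status: callable returning a dict with a "status" key, e.g.
        lambda: requests.get(f"{BASE}/v1/tts/status/{job_id}",
                             headers={"Authorization": f"Bearer {API_KEY}"}).json()
    Raises TimeoutError if the job is still pending after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"job still {status.get('status')!r} after {timeout} seconds"
            )
        sleep(interval)
```

Passing `sleep` and `clock` as arguments keeps the helper easy to exercise in tests without real waiting.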

Voice Cloning Example

# Clone a voice from reference audio
# (reuses requests, API_KEY, and BASE from the example above)
with open("speaker.wav", "rb") as ref_audio:  # file is closed once the upload finishes
    resp = requests.post(f"{BASE}/v1/tts/generate",
        headers={"Authorization": f"Bearer {API_KEY}"},
        data={
            "text": "This is my cloned voice speaking.",
            "language": "en",
            "cfg_value": 2.0,  # cloning strength
        },
        files={"voice_ref": ref_audio},
    )
resp.raise_for_status()
job = resp.json()
print(f"Clone job: {job['id']}")

Legal Notice: Voice Cloning requires you to have rights to the reference audio. Do not clone voices without consent.
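
Because the reference audio must be at least 5 seconds long and sampled at 16 kHz or higher, it can be worth checking a WAV file locally before uploading it. The sketch below uses only Python's standard `wave` module, so it covers uncompressed WAV files and not MP3. The function name `check_reference_wav` is hypothetical; the thresholds come from the `voice_ref` parameter description.

```python
import wave

MIN_SECONDS = 5.0        # minimum reference length from the parameter table
MIN_SAMPLE_RATE = 16000  # "16kHz+" from the parameter table


def check_reference_wav(source):
    """Return a list of problems found in a WAV reference (empty if none).

    source: a filename string or a binary file object containing WAV data.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if seconds < MIN_SECONDS:
        problems.append(f"too short: {seconds:.1f}s, need at least {MIN_SECONDS:.0f}s")
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
    return problems
```

A call such as `check_reference_wav("speaker.wav")` returning an empty list means the file meets both stated requirements; the server may still apply checks not described here.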

Comparison with Alternatives

Feature	PixelAPI VoxCPM2	ElevenLabs	edge-tts
Voice Design (text → voice)	✅ 50 credits/min ($0.050/min)	$0.30/min	❌
Voice Cloning	✅ 100 credits/min ($0.100/min)	$0.30/min + sub	❌
Languages	30	29	100+
Self-hosted	✅ Own GPUs	❌ API	✅ Microsoft
Commercial License	✅ Apache 2.0	✅	✅
Real-time on a single workstation GPU	✅ RTX 6000 Ada	N/A	✅

Rate Limits

Voice Design and Voice Cloning share the same concurrent job limits as other PixelAPI tools. Pro plan users get higher concurrency. Generation time is typically 3–15 seconds, depending on text length.
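
When generating many clips, one way to stay under a concurrency limit is to cap the number of in-flight jobs on the client. The sketch below does this with a thread pool. The limit of 2 is a placeholder assumption, since the actual per-plan limits are not listed on this page, and `run_jobs` and `submit_and_wait` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(texts, submit_and_wait, max_concurrent=2):
    """Process texts with at most max_concurrent jobs running at once.

    submit_and_wait: callable that takes one text, submits it to
    /v1/tts/generate, waits for completion, and returns the result.
    Results are returned in the same order as the input texts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(submit_and_wait, texts))
```

Set `max_concurrent` to whatever your plan allows; if the first call to `submit_and_wait` that fails raises an exception, `run_jobs` re-raises it when its result is collected.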

Need help? Email support@pixelapi.dev or check the full API reference.