The best voice cloning API for developers who need production-quality AI speech at scale. POST any text and get back natural-sounding audio — in 30+ languages — using either a text description prompt or a reference audio recording. Two modes, one endpoint. $0.05 per request for voice design, $0.10 per request for voice cloning. 500 free credits, no credit card required.
Most voice APIs force you to pick from a fixed library of preset voices. PixelAPI gives you two better options:
Describe the voice you want in plain English, for example (warm elderly man, slow pace) Hello. No reference audio needed: the API generates a unique synthetic voice on the fly from your description. Style cues are supported: cheerful, whispering, sad slow, formal, gentle.
Upload a WAV or MP3 reference recording (minimum 5 seconds, 16 kHz+, max 10 MB). The API replicates the speaker's timbre, accent, and rhythm. Control cloning strength via cfg_value (0.5–5.0) and quality via inference_timesteps (4–20).
Sign up, copy your key from the dashboard, and POST your text. The endpoint returns a generation id; poll until status=completed, then download your audio from output_url. Maximum 500 characters of text per request.
```shell
# Voice Design — describe the voice in text
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=Hello, welcome to our store. We have great deals today." \
  -F "language=en" \
  -F "voice_description=warm, professional male narrator"
# Response: {"id": "uuid", "status": "queued", "credits_used": 50.0, ...}

# Poll until completed
curl https://api.pixelapi.dev/v1/tts/status/UUID \
  -H "Authorization: Bearer YOUR_API_KEY"
# Response: {"status": "completed", "output_url": "https://..."}

# Voice Cloning — add a reference recording
curl -X POST https://api.pixelapi.dev/v1/tts/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "text=This script will be read in my cloned voice." \
  -F "language=en" \
  -F "voice_ref=@my_voice_sample.wav"
```
```python
import requests, time

API_KEY = "YOUR_API_KEY"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Voice Design
resp = requests.post(
    "https://api.pixelapi.dev/v1/tts/generate",
    headers=headers,
    data={
        "text": "Hello, welcome to our store. We have great deals today.",
        "language": "en",
        "voice_description": "warm, professional male narrator",
    },
)
job = resp.json()

# Poll for result
while True:
    status = requests.get(
        f"https://api.pixelapi.dev/v1/tts/status/{job['id']}",
        headers=headers,
    ).json()
    if status["status"] == "completed":
        audio_url = status["output_url"]  # download from here
        break
    time.sleep(2)

# Voice Cloning — swap data= for files=
with open("my_voice_sample.wav", "rb") as ref:
    resp = requests.post(
        "https://api.pixelapi.dev/v1/tts/generate",
        headers=headers,
        data={"text": "Clone this script.", "language": "en"},
        files={"voice_ref": ref},
    )
```
```javascript
import FormData from 'form-data';
import fetch from 'node-fetch';

const API_KEY = process.env.PIXELAPI_KEY;

// Voice Design
const form = new FormData();
form.append('text', 'Hello, welcome to our store.');
form.append('language', 'en');
form.append('voice_description', 'warm, professional male narrator');

const res = await fetch('https://api.pixelapi.dev/v1/tts/generate', {
  method: 'POST',
  headers: { 'Authorization': `Bearer ${API_KEY}`, ...form.getHeaders() },
  body: form,
});
const job = await res.json();

// Poll for result
let status;
do {
  await new Promise(r => setTimeout(r, 2000));
  status = await fetch(`https://api.pixelapi.dev/v1/tts/status/${job.id}`, {
    headers: { 'Authorization': `Bearer ${API_KEY}` },
  }).then(r => r.json());
} while (status.status !== 'completed');
console.log(status.output_url); // download audio from this URL
```
```php
<?php
$apiKey = getenv("PIXELAPI_KEY");

// Voice Design
$ch = curl_init("https://api.pixelapi.dev/v1/tts/generate");
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => ["Authorization: Bearer $apiKey"],
    CURLOPT_POSTFIELDS => [
        "text" => "Hello, welcome to our store.",
        "language" => "en",
        "voice_description" => "warm, professional male narrator",
    ],
    CURLOPT_RETURNTRANSFER => true,
]);
$job = json_decode(curl_exec($ch), true);
curl_close($ch);

// Poll for result
do {
    sleep(2);
    $ch = curl_init("https://api.pixelapi.dev/v1/tts/status/{$job['id']}");
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER => ["Authorization: Bearer $apiKey"],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $status = json_decode(curl_exec($ch), true);
    curl_close($ch);
} while ($status["status"] !== "completed");
echo $status["output_url"]; // download audio from this URL
```
```ruby
require 'net/http'
require 'json'

api_key = ENV["PIXELAPI_KEY"]
http = Net::HTTP.new("api.pixelapi.dev", 443)
http.use_ssl = true

# Voice Design
req = Net::HTTP::Post.new("/v1/tts/generate")
req["Authorization"] = "Bearer #{api_key}"
req.set_form([
  ["text", "Hello, welcome to our store."],
  ["language", "en"],
  ["voice_description", "warm, professional male narrator"],
], "multipart/form-data")
job = JSON.parse(http.request(req).body)

# Poll for result
loop do
  sleep 2
  status_req = Net::HTTP::Get.new("/v1/tts/status/#{job['id']}")
  status_req["Authorization"] = "Bearer #{api_key}"
  status = JSON.parse(http.request(status_req).body)
  if status["status"] == "completed"
    puts status["output_url"] # download audio from this URL
    break
  end
end
```
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
	"os"
	"time"
)

func main() {
	apiKey := os.Getenv("PIXELAPI_KEY")
	client := &http.Client{}

	// Voice Design
	body := &bytes.Buffer{}
	w := multipart.NewWriter(body)
	w.WriteField("text", "Hello, welcome to our store.")
	w.WriteField("language", "en")
	w.WriteField("voice_description", "warm, professional male narrator")
	w.Close()

	req, _ := http.NewRequest("POST", "https://api.pixelapi.dev/v1/tts/generate", body)
	req.Header.Set("Authorization", "Bearer "+apiKey)
	req.Header.Set("Content-Type", w.FormDataContentType())
	resp, _ := client.Do(req)
	var job map[string]interface{}
	json.NewDecoder(resp.Body).Decode(&job)
	resp.Body.Close()

	// Poll for result
	id := job["id"].(string)
	for {
		time.Sleep(2 * time.Second)
		req, _ = http.NewRequest("GET", "https://api.pixelapi.dev/v1/tts/status/"+id, nil)
		req.Header.Set("Authorization", "Bearer "+apiKey)
		resp, _ = client.Do(req)
		var status map[string]interface{}
		json.NewDecoder(resp.Body).Decode(&status)
		resp.Body.Close()
		if status["status"] == "completed" {
			fmt.Println(status["output_url"]) // download audio from this URL
			break
		}
	}
}
```
PixelAPI's voice cloning API is priced at exactly half the cost of the nearest equivalent tier from major competitors. The table below compares PixelAPI's flat per-request billing against the per-minute and per-character models used by rivals.
| Provider | Free tier | Voice design / TTS | Voice cloning | Languages |
|---|---|---|---|---|
| PixelAPI | 500 credits, no card | $0.05/request | $0.10/request | 30+ |
| ElevenLabs | Limited free tier | ~$0.30/min (Scale) | Included in plan | 30+ |
| OpenAI TTS-1 | Pay-as-you-go | $15.00/1M chars | Not available | ~6 |
| Play.ht | Limited | see play.ht/pricing | see play.ht/pricing | 140+ |
| Murf.ai | Trial only | see murf.ai/pricing | Enterprise | 20+ |
Pricing verified from each rival's public pricing page, March 2026. ElevenLabs per-minute rate sourced from a competitive pricing audit. OpenAI TTS-1 rate from platform.openai.com/docs/pricing. PixelAPI's per-request price is set at exactly half the leading competitor's equivalent rate, per our pricing principle: any higher would not be competitive, any lower would signal low quality.
The output_url field in the completed status response is a signed, time-limited URL pointing directly to your generated audio file. Download it server-side or stream it to end users.
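A minimal, stdlib-only sketch of the server-side download step (the `download_audio` helper name is ours, not part of the API):

```python
import shutil
from urllib.request import urlopen

def download_audio(output_url, dest_path):
    # Signed URLs are time-limited, so fetch soon after status flips
    # to "completed" rather than storing the URL for later use.
    with urlopen(output_url) as resp, open(dest_path, "wb") as f:
        shutil.copyfileobj(resp, f)
    return dest_path
```

In a `requests`-based stack you could stream with `requests.get(output_url, stream=True)` instead; the point is the same, download promptly and serve the file from your own storage.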
Every status response includes credits_used, created_at, and completed_at timestamps — ready to pipe into your billing, logging, or analytics systems.
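For example, per-job cost and latency can be pulled straight off the status payload. A sketch, assuming created_at and completed_at arrive as ISO 8601 strings (adjust the parsing if your responses carry epoch seconds instead):

```python
from datetime import datetime

def job_metrics(status):
    # Field names credits_used / created_at / completed_at come from the
    # status response; the ISO 8601 timestamp format is an assumption.
    created = datetime.fromisoformat(status["created_at"])
    completed = datetime.fromisoformat(status["completed_at"])
    return {
        "credits_used": status["credits_used"],
        "latency_s": (completed - created).total_seconds(),
    }
```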
The same endpoint handles English, Mandarin, Hindi, Spanish, French, German, Japanese, Korean, Arabic, and 20+ more. Set language=auto to detect from the text, or specify a code for explicit control.
For voice cloning jobs, cfg_value (0.5–5.0) controls how closely the output locks to the reference speaker. Lower values add naturalness; higher values tighten the clone. inference_timesteps (4–20) trades generation speed for audio quality.
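A sketch of building a cloning request that range-checks both knobs before sending. The `clone_payload` helper is ours; the parameter names and ranges are the documented ones:

```python
def clone_payload(text, cfg_value=2.0, inference_timesteps=10):
    # Enforce the documented ranges client-side to fail fast
    # instead of burning a request on a 400.
    if not 0.5 <= cfg_value <= 5.0:
        raise ValueError("cfg_value must be in 0.5-5.0")
    if not 4 <= inference_timesteps <= 20:
        raise ValueError("inference_timesteps must be in 4-20")
    return {
        "text": text,
        "language": "en",
        "cfg_value": str(cfg_value),
        "inference_timesteps": str(inference_timesteps),
    }
```

Pass the returned dict as `data=` alongside `files={"voice_ref": ref}` in the `requests.post` call from the Quick Start.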
The Voice Cloning API powers these production workflows. Each link goes to an industry-specific setup guide:
- Narrate audiobooks, articles, and news summaries. Multi-chapter scripts split into 500-char segments.
- Product description voiceovers, promotional audio for Shopify and WooCommerce listings.
- Ad voiceovers, brand voice cloning, multilingual campaign audio at scale.
- Automated property listing audio tours. Clone the agent's voice for consistent branding.
- Short-form audio for Reels, TikTok, and Shorts. Voice design for rapid content production.
- Lookbook narration, product walk-through audio, multilingual retail content.
More use-case guides: all industries →
- Trigger voice generation from any Zap. No-code pipeline for content teams.
- Drag-and-drop TTS module in Make scenarios.
- Auto-generate audio for CMS collection items on publish.
- Product audio on upload via webhook. Accessible storefronts, zero manual work.
- Wix Automations hook to generate voiceovers for new blog posts.
- Server-side audio generation in Next.js API routes. Edge-compatible polling pattern.
Flat per-request pricing vs ElevenLabs' credit tiers. No monthly subscription required. Equivalent voice quality at a fraction of the cost for moderate volumes.
API-first vs Murf's studio-first approach. No UI lock-in — integrate directly into your pipeline. Voice cloning on the free trial, not enterprise-gated.
Simple per-request billing vs Play.ht's character-based plans. No seat limits. REST API identical to PixelAPI's other audio and image endpoints.
OpenAI offers six preset voices, no cloning. PixelAPI adds voice design from text description, voice cloning from reference audio, and 30+ language support in the same endpoint.
Default 60 requests/minute on the free tier, 600 requests/minute on paid tiers. Exceeding the limit returns HTTP 429 with a Retry-After header. Recommended: exponential backoff starting at 2 seconds, doubling on each retry up to 30 seconds maximum.
Additional status codes to handle:
- 402 Insufficient credits — top up your balance at /pricing or use trial credits.
- 400 Text cannot be empty — the text field is required and must not be blank.
- 400 Reference audio must be under 10 MB — compress or trim the reference file and retry.
- 503 Server busy — the request queue is temporarily full; retry after the Retry-After header delay.

```python
# Retry with exponential backoff (Python)
import requests, time

def generate_speech(text, headers, **kwargs):
    delay = 2
    for attempt in range(5):
        resp = requests.post(
            "https://api.pixelapi.dev/v1/tts/generate",
            headers=headers,
            data={"text": text, **kwargs},
        )
        if resp.status_code in (429, 503):
            # Honor Retry-After when present, else back off exponentially.
            time.sleep(int(resp.headers.get("Retry-After", delay)))
            delay = min(delay * 2, 30)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Max retries exceeded")
```
- Step-by-step guide to uploading a reference recording, tuning cfg_value, and achieving a high-quality clone.
- Segment long manuscripts, batch-generate audio per chapter, and concatenate to a finished audiobook.
- Voice design tips for ad copy, explainer videos, and on-hold messages — no reference audio needed.
POST your text to https://api.pixelapi.dev/v1/tts/generate with your API key and optionally a voice_description prompt. The endpoint returns a generation id; poll GET /v1/tts/status/{id} until status=completed, then download your audio from output_url. See the Quick Start section above for code in six languages.
$0.05 per request for voice design (text prompt to speech) and $0.10 per request for voice cloning (reference audio upload). New accounts get 500 free credits — enough for 10 voice design jobs or 5 voice clone jobs — with no credit card required. Credits never expire.
Voice design uses a text description — voice_description=warm elderly man, gentle pace — to generate a synthetic voice on the fly. No reference audio needed. Voice cloning takes an actual WAV or MP3 recording (minimum 5 seconds, 16 kHz+, max 10 MB) and replicates the speaker's timbre, accent, and rhythm. Voice design costs $0.05/request; voice cloning costs $0.10/request.
30+ languages: English, Mandarin Chinese, Hindi, Spanish, French, German, Japanese, Korean, Russian, Arabic, Portuguese, Italian, Dutch, Polish, Turkish, Vietnamese, Thai, Indonesian, Malay, Bengali, Tamil, Telugu, Marathi, Ukrainian, Swedish, Norwegian, Danish, Finnish, Greek, Hebrew, and Swahili. Use language=auto for automatic detection from the input text.
500 characters per request — roughly 30 seconds of speech at a natural speaking pace (~150 words/minute). For longer content such as audiobooks or podcast scripts, split the text into 500-character segments and send sequential requests, then concatenate the audio files in your application.
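One way to do the split, breaking on whitespace so words stay intact (a sketch; sentence-aware splitting would sound even smoother at segment boundaries):

```python
def chunk_text(text, limit=500):
    # Greedy word-level packing: each chunk holds as many whole words
    # as fit in `limit` characters. A single word longer than `limit`
    # would overflow a chunk, but the 500-char cap makes that unlikely.
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Send each chunk as its own request, then concatenate the resulting audio files in order.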
The output_url field in the completed status response is a signed URL pointing to the generated audio file. Download it directly from your application once status equals completed.
Include the voice_ref field as a file upload (multipart/form-data) containing a WAV or MP3 recording sampled at 16 kHz or higher. The file must be at least 5 seconds long and under 10 MB. Longer, cleaner recordings — 30+ seconds in a quiet room — produce noticeably better clones. Background noise and music in the reference degrade clone quality.
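For WAV references, those limits can be checked locally with the standard library before spending a request (the `validate_reference` helper is ours; MP3 files would need a third-party parser):

```python
import os
import wave

def validate_reference(path, min_seconds=5, min_rate=16000,
                       max_bytes=10 * 1024 * 1024):
    # Checks mirror the documented limits: >= 5 s, >= 16 kHz, under 10 MB.
    if os.path.getsize(path) >= max_bytes:
        raise ValueError("reference must be under 10 MB")
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    if rate < min_rate:
        raise ValueError(f"sample rate {rate} Hz is below 16 kHz")
    if seconds < min_seconds:
        raise ValueError(f"recording is {seconds:.1f} s; need at least 5 s")
    return seconds
```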
Default 60 requests/minute on the free tier, 600 requests/minute on paid tiers. Exceeding the limit returns HTTP 429 with a Retry-After header. For high-volume batch workloads — audiobook generation, IVR prompt refreshes — contact [email protected] with your expected volume for a custom limit.
Yes. Every new account starts with 500 free credits — no credit card required. That covers 10 voice design requests or 5 voice clone requests. Credits never expire; unused credits roll over as long as the account remains active. You can test on real workloads before paying anything.
Yes. For voice design, embed style cues directly in the text field: (cheerful) Good morning! or (slow, whispering) This is a secret. For voice cloning, cfg_value (0.5–5.0, default 2.0) controls cloning strength: lower values add naturalness, higher values lock tighter to the reference speaker. inference_timesteps (4–20, default 10) trades generation speed for audio quality — use 20 for production audiobooks, 4 for previews.
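A tiny helper for building cue-prefixed text, since the cue counts toward the 500-character limit (the helper is ours; the "(cue) text" convention is from the examples above):

```python
def styled_text(cue, text, limit=500):
    # Prepend a parenthesized style cue, e.g. "(cheerful) Good morning!"
    combined = f"({cue}) {text}"
    if len(combined) > limit:
        raise ValueError("styled text exceeds the 500-character limit")
    return combined
```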