I built a reel maker that actually understands the story — and it costs 3 cents

By Om Prakash · 2026-06-11

I have spent the last six months watching customers use every AI reel maker on the market — OpusClip, Vizard, Klap, Munch, Pictory, Submagic — and the same thing keeps happening. They upload a 30-minute podcast, the tool spits out five "high-engagement" 60-second clips, and not one of them tells a coherent story. They are just the loudest moments stitched into a vertical box.

That is fine if your goal is volume. It is useless if your goal is to make someone watch your reel, understand what your thing does, and click the link.

So I built Reel Director. It is live now on PixelAPI. Drop any video, get back one 60-second reel that follows the actual structure of a trailer: a hook that names a problem, a moment that shows what your thing does, a moment that shows how to use it, and a closer that tells the viewer what to do next. The model picks where to cut. The cuts always land on full-stop sentence boundaries — no mid-word edits, no "Mr Bea—" splices.

POST /v1/video/auto-reel is 30 credits per reel. That is $0.03. OpusClip's cheapest tier works out to about $0.10 per reel. Vizard is around $0.05 for a two-minute source. Munch is closer to $0.46 for the same source. I am not exaggerating those numbers — they come from each provider's published pricing pages, divided by what their plans actually include.

What is different about it

Most reel makers work like this: run an "engagement score" model over the transcript, find the spikes, cut around them. Maybe sprinkle in a face-tracking crop. The result is a montage of intense moments with no through-line.

Reel Director works like this:

1. whisper-large-v3 transcribes your source, producing sentence-level cues with millisecond timestamps. It is the same model that many production transcription pipelines run. It handles Hindi, Marathi, Tamil, Mandarin, Arabic, Spanish, and the rest of the 99 languages it was trained on. Pass language=hi to hint the language, or leave it blank and it auto-detects.

2. The transcript is split into every contiguous window that runs 8 to 16 seconds. Each window is scored for editorial fit per quartile of the video timeline. The opening 25% of the video is scored as a HOOK candidate — windows that contain problem-statement language ("look at any empty room…", "we wait weeks for…") rank higher than windows that start with a product name. The closing 25% is scored as a CTA candidate — windows that contain "three ways", "first", "try it free", or that hug the end of the video, rank higher than windows that talk about features.

3. The best candidates per quartile, plus 8 sampled frames spread across the source, are handed to Qwen2.5-VL-7B — a vision-language model that reads both the text and the pictures at once. The VLM picks one window per quartile. The bucket structure means the four picks cannot overlap, cannot come out of order, and cannot all clump into the first 30 seconds. You always get a HOOK from Q1, a WHAT from Q2, a HOW from Q3, a CTA from Q4.

4. ffmpeg renders the four windows in order with crossfades, gently EQs the voice for clarity (-2 dB at 200 Hz, +2 dB at 2.8 kHz), and layers a sidechain-ducked background music bed. Sidechain compression is the broadcast-standard trick: the music gets quieter exactly when someone speaks, then comes back up between phrases. It is what makes a podcast trailer sound like a podcast trailer instead of someone shouting over a song.

5. Optional PixelAPI logo bumper. Output at 16:9 (landscape — YouTube), 9:16 (vertical — Reels, Shorts, TikTok), or 1:1 (square — Instagram feed). Your pick at submission time.
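Step 2 above can be sketched in a few lines. This is a minimal illustration, not the production code: the Cue type, the function names, and the choice to bucket a window by its midpoint are all my assumptions for the example. It enumerates every contiguous run of sentence cues lasting 8 to 16 seconds and assigns each run to a quartile of the timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    """One sentence-level transcript cue (hypothetical shape)."""
    start: float  # seconds from the start of the source
    end: float
    text: str


def candidate_windows(cues, min_len=8.0, max_len=16.0):
    """Return (first_idx, last_idx) for every contiguous run of cues
    whose total duration falls between min_len and max_len seconds."""
    windows = []
    for i in range(len(cues)):
        for j in range(i, len(cues)):
            length = cues[j].end - cues[i].start
            if length > max_len:
                break  # adding more cues only makes the window longer
            if length >= min_len:
                windows.append((i, j))
    return windows


def quartile(cues, window, duration):
    """Bucket a window into quartile 0..3 of the timeline.
    Bucketing by midpoint is an assumption made for this sketch."""
    first, last = window
    midpoint = (cues[first].start + cues[last].end) / 2
    return min(3, int(4 * midpoint / duration))
```

Because every window starts and ends on a cue boundary, any cut built from these windows lands on a sentence boundary by construction.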

What I learned the hard way

I tried the obvious thing first. Hand 8 frames to the VLM, ask it to "pick the best 4 windows for a viral reel". It hallucinated timestamps past the end of the video. I gave it a numbered transcript and asked for cue index ranges. It made up indices that did not exist. I gave it a list of candidate windows tagged Q1 / Q2 / Q3 / Q4 with example output, and the 7B model blindly copied the literal string "Q.1" from my example instead of picking real CIDs.

What finally worked was: integer CIDs from a small pre-filtered set, broken into four buckets, with the example showing concrete numbers that actually exist in the bucket lists. And a per-bucket editorial heuristic score baked into the prompt — so the VLM is biased toward the editorially-strongest candidate unless the frame samples tell it otherwise.

The 7B is not big enough to do this reasoning from scratch. But it is plenty smart enough to choose between 6 well-framed candidates per bucket. That is the unlock — you constrain the problem until it fits the model.
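The "constrain the problem" idea also applies to what you do with the model's answer. Here is a hedged sketch of the validation side: accept a pick only if it is an integer CID that really exists in that bucket, and otherwise fall back to the bucket's top heuristic candidate. The JSON reply shape, the role keys, and the fallback rule are assumptions for illustration; the post does not describe how invalid picks are handled in production.

```python
import json

ROLES = ("HOOK", "WHAT", "HOW", "CTA")


def resolve_picks(reply_text, buckets):
    """Map a VLM reply to one valid CID per role.

    buckets: role -> list of integer candidate IDs, best heuristic score first.
    reply_text: model output, assumed here to be JSON like {"HOOK": 3, ...}.
    """
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        reply = {}
    if not isinstance(reply, dict):
        reply = {}

    picks = {}
    for role in ROLES:
        cid = reply.get(role)
        is_real_int = isinstance(cid, int) and not isinstance(cid, bool)
        if is_real_int and cid in buckets[role]:
            picks[role] = cid  # the model chose a candidate that exists
        else:
            picks[role] = buckets[role][0]  # fall back to the top-scored one
    return picks
```

With this shape, a copied placeholder such as "Q.1" or an index that is not in the bucket can never reach the renderer. The worst case is the heuristic's own first choice.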

What it costs me to run this

I want to be honest about the unit economics because Indian-startup margins are thin and I do not want to bullshit anyone.

- Whisper-large-v3 on an RTX 3060: ~50 seconds for a 2-minute source. ~170 W average draw. Call it 2.4 watt-hours per reel.
- Qwen2.5-VL-7B on an RTX 4070 Ti SUPER: ~1.5 seconds warm, ~50 seconds cold. ~150 W during inference. Call it 0.1 watt-hours per reel.
- ffmpeg render: ~40 seconds CPU on a Ryzen 5800X. ~50 W. Call it 0.6 watt-hours.

Total: about 3 watt-hours per reel. At Indian commercial electricity rates that is ₹0.03 per reel in raw power. Add ~13 MB of egress traffic, a tiny gateway round-trip, and the 18% GST on what I charge — and 30 credits ($0.03) is comfortably profitable while still being two-to-fifteen times cheaper than the rest of the field.
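If you want to check the power arithmetic, here it is as a small script. The wattages and durations are the ones listed above. The tariff of ₹10 per kWh is an assumption I am making for the example; it is the round figure that 3 watt-hours and ₹0.03 imply.

```python
# (average watts, seconds per reel) for each stage, from the figures above
STAGES = {
    "whisper-large-v3 on RTX 3060": (170, 50),
    "Qwen2.5-VL-7B warm on RTX 4070 Ti SUPER": (150, 1.5),
    "ffmpeg render on Ryzen 5800X": (50, 40),
}

TARIFF_INR_PER_KWH = 10.0  # assumed tariff, not a quoted rate


def watt_hours(watts, seconds):
    """Energy in watt-hours for a constant draw over a duration."""
    return watts * seconds / 3600


total_wh = sum(watt_hours(w, s) for w, s in STAGES.values())
cost_inr = total_wh / 1000 * TARIFF_INR_PER_KWH

for name, (w, s) in STAGES.items():
    print(f"{name}: {watt_hours(w, s):.2f} Wh")
print(f"total: {total_wh:.2f} Wh, about ₹{cost_inr:.2f} per reel")
```

The unrounded total is a little under 3 Wh, which is where "about 3 watt-hours" comes from.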

There is no subscription. You pay per reel. You stop paying when you stop generating. That is the only honest pricing model when your costs are entirely variable.

How it compares

| Provider | Cheapest paid tier | Cost per reel (~2-min source) | Story-arc cuts? | Sentence-bounded? |
|---|---|---|---|---|
| OpusClip Starter | $15/mo for 150 credits | ~$0.10 | No (engagement clips) | No |
| Munch Candy Bar | $23/mo for 100 min | ~$0.46 | No | Approx |
| Vizard Creator | $20/mo for 800 min | ~$0.05 | No | No |
| Pictory Standard | $25/mo for 200 min | ~$0.25 | No | No |
| Klap Starter | $29/mo | | No | No |
| Submagic Starter | $16/mo | | No | Yes (captions only) |
| PixelAPI Reel Director | Pay-as-you-go | $0.03 | Yes (HOOK→WHAT→HOW→CTA) | Yes (whisper-large-v3) |
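The per-reel figures for the minute-based plans are just the monthly price divided by the included minutes, times a two-minute source. A small sketch of that arithmetic, using only the plan numbers in the comparison above (the function name is mine). OpusClip is credit-based, so its ~$0.10 is $15 divided by 150 credits and is left out of the helper.

```python
# (monthly price in USD, included source minutes) from the comparison above
MINUTE_PLANS = {
    "Munch Candy Bar": (23, 100),
    "Vizard Creator": (20, 800),
    "Pictory Standard": (25, 200),
}

REEL_DIRECTOR_USD = 0.03  # 30 credits per reel


def cost_per_reel(monthly_usd, included_minutes, source_minutes=2):
    """Effective cost of processing one source of the given length."""
    return monthly_usd / included_minutes * source_minutes


for name, (price, minutes) in MINUTE_PLANS.items():
    cost = cost_per_reel(price, minutes)
    ratio = cost / REEL_DIRECTOR_USD
    print(f"{name}: ${cost:.2f} per reel, {ratio:.1f}x Reel Director")
```

This assumes you use every included minute each month. If you use fewer, the effective per-reel cost of a subscription goes up.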

How to use it

Three ways, same backend:

curl

curl -X POST https://api.pixelapi.dev/v1/video/auto-reel \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "source_url=https://youtu.be/abcdef12345" \
  -F "aspect=16:9" \
  -F "bgm=calm" \
  -F "include_logo=true"

You get back a generation_id. Poll GET /v1/video/auto-reel/{id} every few seconds. When status flips to done, hit /download and you have a 60-second mp4.

Web — drop a video at pixelapi.dev/tools/reel-director.html. Click Enter Tool, upload, watch, download.

Android — install Lensora. The Reel Director tile is coming in the next build — for now use the web flow from your phone browser, which works fine.
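For the curl flow, the polling step looks roughly like this in Python, using only the standard library. Treat it as a sketch under assumptions: the JSON field name "status", the exact download path, and the helper names are my guesses from the description above, so check the API docs before relying on them.

```python
import json
import time
import urllib.request

API = "https://api.pixelapi.dev/v1/video/auto-reel"


def get_status(generation_id, api_key):
    """GET the job and return its status string (field name assumed)."""
    req = urllib.request.Request(
        f"{API}/{generation_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["status"]


def download_url(generation_id):
    """Assumed location of the finished mp4."""
    return f"{API}/{generation_id}/download"


def wait_until_done(fetch_status, interval_s=3.0, timeout_s=600.0, sleep=time.sleep):
    """Call fetch_status() every interval_s seconds until it returns "done".
    Raises TimeoutError once roughly timeout_s seconds have been spent waiting."""
    waited = 0.0
    while fetch_status() != "done":
        if waited >= timeout_s:
            raise TimeoutError("reel was not ready in time")
        sleep(interval_s)
        waited += interval_s
    return True
```

Typical use would be wait_until_done(lambda: get_status(gen_id, key)) followed by fetching download_url(gen_id) with the same Authorization header. The status fetcher is passed in as a callable so the loop can be tested without touching the network.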

Where this falls short

This is a launch, not a finished product. Here is what I know is rough:

Vertical 9:16 from a 16:9 source letterboxes right now, with black bars above and below the picture. I have a YuNet face-track smart-crop ready to wire in for the next release. Until then, if you need a clean 9:16, shoot in 9:16.

Background music library is one track per style. "calm" is the default and it works for talk-heavy sources. "epic" is the dramatic option. "none" gives you voice-only. I will expand this once I have a sense of which sources gravitate to which scores.

The editorial-score heuristic favours sources that look like product launches. If your video is a fiction reading, a music performance, or a stand-up set, the HOOK→WHAT→HOW→CTA arc may not be the right shape. I will add other arcs (joke setup → punchline, problem → twist → resolution) once I see real-world usage patterns.

Whisper-large-v3 adds real latency. A 10-minute source takes about 4 minutes of compute. Not bad, but it is not instant. If you need sub-30-second turnaround, the next release will let you skip Whisper if you supply your own SRT.

What I am asking for

Try it on something you actually want a reel from. Tell me where it picked weird windows. Tell me what kind of source you ran it on. Tell me what you wish the BGM library had. I am at [email protected] and @pixelapidev — and every PixelAPI account gets free credits on signup so you can run a few reels without paying anything.

The internet does not need another tool that turns long videos into engagement-score montages. It needs one that picks four moments that tell a story. That is the wager. I will let the output speak for itself.

— Om