May 8, 2026 · 5 min read
A single photo goes in. An MP4 of the subject as a real 3D object — turning, dollying, or sweeping past the camera on a brand-new background — comes out. Two HTTP calls. Eighty credits. About four minutes of wall-clock.
That is the brief for Lensora Studio, the newest endpoint on PixelAPI. This post walks through what it does, the design choices behind it, and the slab-shaped detour we took to get the 3D step right.
A real Rolleiflex photo went in. This is one frame of the turntable MP4 that came out — the kitchen background was generated from a one-line prompt.
You hand the API a photo. It does four things back to back: crops and cuts the subject out of the frame, generates a new background and composites the subject onto it, lifts the cutout into a full 3D model, and renders a camera move over that model as an MP4. You choose the move from three presets: turntable (full 360°), dolly (straight zoom-in), or cinematic (180° arc with depth-of-field).

You get back four artifacts every time: the hero MP4, a downloadable GLB you can drop into Blender / Unity / Three.js, a static composited still, and the alpha-cutout PNG.
Step one is a multipart upload that returns object proposals plus a session_id:
```bash
curl -X POST https://api.pixelapi.dev/v1/studio/init \
  -H "Authorization: Bearer $PIXELAPI_KEY" \
  -F "image=@photo.jpg"
```
```json
{
  "session_id": "2a91884c-...",
  "objects": [
    {
      "label": "vintage twin-lens reflex camera",
      "category": "product",
      "bbox": [0.18, 0.12, 0.79, 0.93]
    },
    {
      "label": "entire image (no crop)",
      "category": "full_frame",
      "bbox": [0.0, 0.0, 1.0, 1.0]
    }
  ],
  "credits_used": 5
}
```
Step two picks an object, picks a background, picks a camera, and returns a job_id immediately while the pipeline runs in the background:
```bash
curl -X POST https://api.pixelapi.dev/v1/studio/transform \
  -H "Authorization: Bearer $PIXELAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "2a91884c-...",
    "object_index": 0,
    "background": {
      "type": "prompt",
      "prompt": "on a marble countertop with soft natural light"
    },
    "camera_preset": "cinematic"
  }'
```
You poll /v1/studio/result/{job_id} every few seconds. The step field walks through cropping → removing-bg → generating-bg → compositing → generating-3d → rendering-video → done so you can show real progress in your UI.
The full Python example is in the docs — sub-fifty lines including the polling loop and the GLB download.
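If you just want the shape of it, here is a minimal sketch with requests. The job_id and step fields are exactly as above; the result payload's artifact key (glb_url here) is my assumption, so check the docs example for the real names:

```python
import os
import time

import requests

API = "https://api.pixelapi.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PIXELAPI_KEY']}"}

# Step two: kick off the pipeline (session_id comes from /studio/init).
job = requests.post(f"{API}/studio/transform", headers=HEADERS, json={
    "session_id": session_id,
    "object_index": 0,
    "background": {"type": "prompt", "prompt": "on a marble countertop"},
    "camera_preset": "cinematic",
}).json()

# Poll until the step field reaches "done".
while True:
    result = requests.get(f"{API}/studio/result/{job['job_id']}", headers=HEADERS).json()
    print("step:", result["step"])  # cropping -> ... -> rendering-video -> done
    if result["step"] == "done":
        break
    time.sleep(3)

# Download the GLB ("glb_url" is an assumed key name).
with open("model.glb", "wb") as f:
    f.write(requests.get(result["glb_url"]).content)
```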
Here is the part that ate two days.
The first version of the 3D step worked fine on the simple smoke tests we had. The output looked solid in catalog-style shots — clean object, isolated against a backdrop. So we shipped the canary and ran an end-to-end test on a Rolleiflex camera photo we had been using as a reference image for half a year.
The turntable opened on the front of the camera. Beautiful. Then it rotated 90°. And we saw a sliver. A thin sliver — barely visible at this angle. The model was a slab.
We measured. The thin axis of the bounding box was 15.7% of the longest axis. For a Rolleiflex (a deep, boxy camera whose shortest side is on the order of two-thirds of its longest), that is a flat pancake.
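The measurement itself is nothing exotic. Here is a sketch of the same check with trimesh; illustrative, not the pipeline's actual code:

```python
import trimesh

def thin_axis_ratio(path: str) -> float:
    """Shortest axis-aligned bounding-box extent over longest:
    ~1.0 is a cube, values near 0 mean a slab."""
    mesh = trimesh.load(path, force="mesh")
    thin, _, longest = sorted(mesh.extents)  # AABB extents, ascending
    return thin / longest

print(f"{thin_axis_ratio('model.glb'):.1%}")  # the slab canary scored 15.7%
```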
The catch: from the front it looked perfect. The model had taken the input photo and built something that was mostly an extruded postcard. Texture was sharp on the front face, geometry was almost zero on the others. Three of our camera presets — turntable, dolly, cinematic — would all eventually expose the slab back or edge-on. We were going to ship a beautiful product gallery for thirty seconds, then a five-minute argument with a customer.
So we did the thing we did not want to do. We swapped the 3D engine for one that uses sparse-structure flow over a 3D occupancy grid instead of single-image-from-front extrusion. Re-validated on the same canary, the thin-axis ratio went from 15.7% to just under 60%: a 3.8× depth recovery, comfortably above 40%, our internal threshold for "this is a 3D shape, not a 3D image." And visually unmistakable:
Same camera, side-on profile. Real depth, side controls visible, no slab artifact. This is the cinematic preset mid-arc.
We added one more thing for safety. Before each render, the renderer now measures the bounding-box extents of the mesh and rotates it so the largest face points square to the camera at angle 0. This means the MP4 always opens on the subject's most detailed face, no matter how the mesh happens to come out of the 3D engine.
And as a belt-and-suspenders fallback: if any future input ever produces a thin-axis ratio under 30% despite the new 3D engine, the renderer drops from a full sweep to a ±60° rocking arc, so a slab, if one ever sneaks back in, can never be seen from a bad angle.
The auto-orient pass costs us nothing — it's a single 4×4 transform on the mesh — and it papers over a class of bugs we'd otherwise have to debug per-input.
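Sketched with trimesh and numpy under assumed conventions (camera on +Z at angle 0), the orient-plus-fallback logic is only a few lines; again illustrative, not the renderer's actual code:

```python
import numpy as np
import trimesh

def orient_and_pick_arc(mesh: trimesh.Trimesh) -> tuple[trimesh.Trimesh, tuple[float, float]]:
    """Face the mesh's largest bounding-box side toward the camera, then
    choose a full sweep or a rocking arc based on the thin-axis ratio."""
    extents = mesh.extents                   # AABB extents per axis
    thin_axis = int(np.argmin(extents))      # normal of the largest AABB face
    axis_dir = np.eye(3)[thin_axis]
    # The single 4x4 transform: rotate the largest face square to a +Z camera.
    mesh.apply_transform(trimesh.geometry.align_vectors(axis_dir, [0.0, 0.0, 1.0]))

    ratio = extents.min() / extents.max()
    if ratio < 0.30:                         # slab sneaked through: rock, don't sweep
        return mesh, (-60.0, 60.0)
    return mesh, (0.0, 360.0)                # healthy shape: full turntable
```

Note the fallback trips at 30% while the pass bar is 40%, so borderline meshes still get the full sweep; only clear regressions trigger the rocking arc.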
| Step | Credits | USD |
|---|---|---|
| /v1/studio/init (detect) | 5 | $0.005 |
| /v1/studio/transform (full pipeline) | 75 | $0.075 |
| End-to-end | 80 | $0.08 |
No subscription. The transform credits are auto-refunded on any failure or timeout in the pipeline.
Lensora Studio is a fit when:

- you have one photo of a single, clearly separable subject and want a product-style hero video on a new background;
- you want the 3D asset too: the GLB drops straight into Blender / Unity / Three.js;
- a four-minute turnaround at $0.08 a run beats booking a product shoot or a 3D artist.

It is not a fit when:

- the unseen sides of the object must be faithful to the real thing: everything the input photo doesn't show is plausibly reconstructed, not captured;
- you need a camera path beyond the three presets;
- the frame has no clear subject to lift out.
If you hit edge cases (interesting geometry, unusual subjects, a corner case our slab-detector missed), I'd love to hear about them. The canary that exposed the original slab was a Rolleiflex sitting on a desk for unrelated reasons — sometimes the bug only shows up on the photo you weren't expecting to test against.