Building a Low-Cost Markerless Mocap Pipeline with MediaPipe and Blender


Been experimenting over the past few months with a fully markerless mocap pipeline using Google's MediaPipe Pose and I'm genuinely surprised how far you can push it for indie work. Here's the rough setup:

The Pipeline

  1. Record with any decent webcam or phone (I use a Sony ZV-E10 at 60fps)
  2. Process video through MediaPipe's BlazePose model to extract 33-landmark skeletal data per frame
  3. Convert landmark streams to BVH using a small Python script (I'm using mediapipe-pose-to-bvh as a base, heavily modified)
  4. Import BVH into Blender, retarget to Rigify or Mixamo skeleton
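For reference, step 2 looks roughly like this with the legacy `mp.solutions.pose` Python API (a sketch, not my exact script — the video path and the dropout handling are placeholders):

```python
# Sketch of step 2: pull 33 world landmarks per frame out of a video.
# Assumes the legacy mp.solutions.pose API.
import numpy as np

def extract_world_landmarks(video_path):
    # heavy optional deps imported lazily so the sketch reads standalone
    import cv2
    import mediapipe as mp

    pose = mp.solutions.pose.Pose(model_complexity=2)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV decodes to BGR
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_world_landmarks:
            frames.append([(lm.x, lm.y, lm.z)
                           for lm in result.pose_world_landmarks.landmark])
        else:
            # dropped frame -- mark with NaNs and interpolate later
            frames.append([(np.nan, np.nan, np.nan)] * 33)
    cap.release()
    pose.close()
    return np.asarray(frames)  # shape (n_frames, 33, 3)
```

Note this reads `pose_world_landmarks` (metric 3-D, hip-centered), not `pose_landmarks` (normalized screen coordinates) — the world stream is the one you want for animation.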

What Actually Works

  • Upper body locomotion, idle cycles, talking animations — surprisingly clean
  • Fast iteration — record, process, import in under 10 minutes
  • Works in a regular room with decent lighting, no markers, no suit

Where It Falls Apart

  • Hand and finger tracking is a separate model and stitching them together is painful
  • Occlusion (arms crossing the body) causes landmark flipping artifacts that need manual cleanup
  • Foot contact is unreliable — you'll still want IK correction in-engine

# Quick snippet for smoothing landmark jitter before BVH export
from scipy.signal import savgol_filter
import numpy as np

def smooth_landmarks(sequence, window=11, poly=3):
    # sequence: array of shape (n_frames, ...); filter along the time axis.
    # window must be odd and no larger than n_frames, and poly < window.
    return savgol_filter(sequence, window, poly, axis=0)

The Savitzky-Golay filter made a huge difference on wrist and elbow noise. Without it the animations look like the character is having a medical episode.

Curious whether anyone's tried combining this with 4D Gaussian Splatting captures for environment-matched lighting reference, or if anyone's found a better landmark smoother than SG for high-frequency motion like punches or jumps?

Great writeup — I went down this same rabbit hole last year. One thing I'd add: raw MediaPipe output is pretty jittery, especially on fast movements, and feeding that directly into Blender produces unpleasant noise in the curves. I had good results running a simple one-euro filter over the joint positions before export — it's a low-latency smoothing filter designed exactly for this use case, much better than a plain moving average because it adapts to velocity.

There's a Python implementation that drops in easily if you're scripting the pipeline. The parameters need tuning per joint type — fingers need different settings than hips — but once dialed in the difference is dramatic.
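For anyone scripting it from scratch, here's my sketch of the Casiez et al. one-euro filter — the `min_cutoff` and `beta` defaults are just starting points, not recommendations:

```python
import math

class OneEuroFilter:
    """One-euro filter: adaptive low-pass smoothing (Casiez et al. 2012).
    min_cutoff controls smoothing at low speed; beta controls how quickly
    the cutoff opens up as velocity increases (less lag on fast motion)."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.02, d_cutoff=1.0):
        self.freq = freq              # sample rate in Hz (e.g. 60 for 60fps)
        self.min_cutoff = min_cutoff
        self.beta = beta
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, freq):
        # exponential smoothing factor for a given cutoff frequency
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # smoothed estimate of the signal's velocity
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # cutoff adapts to speed: heavy smoothing when slow, low lag when fast
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev = x_hat
        self.dx_prev = dx_hat
        return x_hat
```

Run one filter instance per coordinate per joint; raising `min_cutoff` for hips and lowering it for wrists is where the per-joint tuning comes in.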

Also worth noting: MediaPipe's world landmark coordinates drift pretty badly during lateral movement since there's no absolute positional tracking. If you need the character to actually travel across the scene rather than staying rooted, you'll want to integrate something like optical flow from the video to estimate translation, or just accept that you'll be manually adjusting root motion in Blender afterward.
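One cheap stopgap before reaching for optical flow: the screen-space landmarks do encode lateral motion, so you can track the hip midpoint there and use its displacement as a rough root-translation curve. A sketch (units are normalized image coordinates, so you still have to scale to scene units and it only covers in-plane movement):

```python
import numpy as np

# MediaPipe pose landmark indices for the hips
LEFT_HIP, RIGHT_HIP = 23, 24

def root_track_from_image_landmarks(image_landmarks):
    """Per-frame root displacement from the hip midpoint of the
    screen-space landmarks.
    image_landmarks: (n_frames, 33, 2) normalized (x, y) coordinates.
    Returns (n_frames, 2) displacement relative to the first frame."""
    hips = (image_landmarks[:, LEFT_HIP] + image_landmarks[:, RIGHT_HIP]) / 2.0
    return hips - hips[0]
```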

Replying to AuroraGale: MediaPipe's wrist and finger tracking is where it falls apart for me. Upper body...

I just gave up on hands entirely lol. masked them out of the pipeline, upper body only. for game characters, who's actually scrutinizing the fingers? just hand-key the close-up stuff that matters. sometimes the pragmatic answer is "don't use that feature" and I've made peace with that

Great writeup. One thing that dramatically improved my MediaPipe results was adding a temporal smoothing pass before sending joint data into Blender — raw MediaPipe pose output jitters badly on fast motion because it's processing frames independently. A simple one-euro filter on each joint's position and rotation killed about 80% of the noise that would otherwise require manual cleanup in the NLA editor.

If you're on Python already for the pipeline, the filterpy library has a clean Kalman filter implementation that's easy to drop in. Alternatively, the one-euro filter is about 30 lines to implement from scratch and tunable with just two parameters (mincutoff and beta) which makes it easy to tweak per joint type — I use more aggressive smoothing on wrists than on hips, for example.
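filterpy's KalmanFilter class does the job, but the constant-velocity case is small enough to sketch directly in NumPy — here's my version for a single joint coordinate (the `q` and `r` defaults are placeholders you'd tune per joint):

```python
import numpy as np

def kalman_smooth_1d(z, dt=1 / 60, q=1.0, r=1e-3):
    """Minimal constant-velocity Kalman filter for one joint coordinate.
    z: (n,) noisy positions; q: process noise (trust in the motion model);
    r: measurement noise (trust in MediaPipe). Returns filtered positions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition: [pos, vel]
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.array([[dt**4 / 4, dt**3 / 2],
                      [dt**3 / 2, dt**2]])  # white-accel process noise
    R = np.array([[r]])
    x = np.array([[z[0]], [0.0]])
    P = np.eye(2)
    out = np.empty(len(z))
    for i, zi in enumerate(z):
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new measurement
        y = zi - (H @ x)[0, 0]
        S = (H @ P @ H.T + R)[0, 0]
        K = P @ H.T / S
        x = x + K * y
        P = (np.eye(2) - K @ H) @ P
        out[i] = x[0, 0]
    return out
```

Lower `r` relative to `q` to follow the measurements more tightly (wrists on fast motion); raise it for joints you want heavily damped (hips).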

Also worth noting: if you're capturing in a space where you can control lighting, even cheap LED panel lights at 45-degree angles dramatically improve MediaPipe's skeleton confidence scores and reduce the dropout frames you'd otherwise need to interpolate through.

MediaPipe's wrist and finger tracking is where it falls apart for me. Upper body and spine are surprisingly solid but anything below the elbow gets noisy fast, especially when hands cross the body midline. I ended up masking out hand data entirely and keying fingers manually, not ideal but honestly faster than trying to clean that noise in post.

Also worth mentioning: lighting consistency matters way more than I expected. Same setup, different time of day through a window = noticeably worse tracking. Dedicated lighting rig made a real difference.

Replying to LunaWolf: I just gave up on hands entirely lol. masked them out of the pipeline, upper bod...

lmao same decision, same reasoning. nobody's scrutinizing finger curl on a third-person character running around a dungeon. I spent like two days trying to get MediaPipe hand landmarks to not look like the character was having a seizure and eventually just... didn't. hand-keying the handful of shots where hands actually matter took less time than fixing the pipeline.

Replying to ObsidianBloom: lmao same decision, same reasoning. nobody's scrutinizing finger curl on a third...

lmao two days on finger tracking sounds exactly right. the frustrating part is MediaPipe's hand landmark model is actually decent in isolation. it's when you try to chain it with the pose model and composite into a single skeleton that everything falls apart. occlusion handling, scale assumptions, mismatched coordinate spaces... it's a mess that's probably not worth fixing for most game use cases. upper body only is the correct answer and I wish I'd gotten there faster.


Nice writeup — the filtering step is where I spent most of my time too. One thing that helped me significantly: instead of smoothing raw landmark coordinates, I switched to smoothing the joint angles derived from those coordinates. Smoothing positions directly causes subtle but distracting limb length variation (the bones appear to stretch), whereas angle smoothing keeps the skeleton rigid and only softens the motion.
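As a concrete example of the idea, here's a sketch of smoothing a derived elbow angle instead of the raw elbow/wrist positions (Savitzky-Golay over the angle series; you'd apply the smoothed angles back through the rig's FK, which is what keeps bone lengths fixed):

```python
import numpy as np
from scipy.signal import savgol_filter

def joint_angle(a, b, c):
    """Angle at joint b (radians) given 3-D positions a-b-c,
    e.g. shoulder-elbow-wrist for elbow flexion."""
    u = a - b
    v = c - b
    cos_ang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_ang, -1.0, 1.0))

def smooth_elbow_angles(shoulder, elbow, wrist, window=11, poly=3):
    """Smooth the derived angle series rather than raw landmark positions,
    so the skeleton stays rigid and only the motion is softened.
    Each input: (n_frames, 3) positions. Returns (n_frames,) angles."""
    angles = np.array([joint_angle(s, e, w)
                       for s, e, w in zip(shoulder, elbow, wrist)])
    return savgol_filter(angles, window, poly)
```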

For the Blender import side, I wrote a small script that converts the MediaPipe world landmarks to a BVH file instead of going through a CSV intermediary — BVH import in Blender is rock solid and preserves hierarchy cleanly. Happy to share it if useful.
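To give a feel for the format, here's a toy writer for a degenerate one-joint BVH — a sketch only, showing the HIERARCHY/MOTION layout; the real script emits the full joint tree and rotation channels:

```python
import numpy as np

def write_minimal_bvh(path, hip_positions, frame_time=1 / 60):
    """Toy BVH writer: a single Hips root with translation channels only.
    hip_positions: (n_frames, 3) root translation per frame."""
    lines = [
        "HIERARCHY",
        "ROOT Hips",
        "{",
        "  OFFSET 0.0 0.0 0.0",
        "  CHANNELS 3 Xposition Yposition Zposition",
        "  End Site",
        "  {",
        "    OFFSET 0.0 10.0 0.0",
        "  }",
        "}",
        "MOTION",
        f"Frames: {len(hip_positions)}",
        f"Frame Time: {frame_time:.6f}",
    ]
    # one line of channel values per frame, in declaration order
    for frame in hip_positions:
        lines.append(" ".join(f"{v:.4f}" for v in frame))
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```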

Also worth mentioning: MediaPipe's POSE_WORLD_LANDMARKS output (the 3D coordinates, not the 2D image coords) is much more stable for animation use than the screen-space landmarks. Took me an embarrassingly long time to notice I was using the wrong output stream.
