DeepSight

An emotional response simulation engine for video: upload a clip, run it through Meta's TribeV2 multimodal brain encoder on Modal, and get back timestamp-level scene splits, speech alignment, predicted fMRI heatmaps, and a five-class emotion timeline — afraid, calm, delighted, depressed, excited.

Video preprocessing pipeline

Before TribeV2 runs, Modal orchestrates parallel workers. PySceneDetect segments the footage into visually distinct scenes with start and end timestamps, using four threshold sweeps so that at least one sensitivity setting catches the cuts on varied footage. In parallel, faster-whisper extracts speech with word-level timestamps, so scene boundaries never split mid-sentence. Frames land in GCS; aligned scene packets persist in Supabase.
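The alignment step can be illustrated with a minimal standard-library sketch. It assumes each threshold sweep yields a list of cut times in seconds and that the transcript is a list of (start, end, text) word tuples; the function names, the 0.5-second merge tolerance, and the punctuation-based sentence grouping are all hypothetical choices for illustration, not the project's actual code.

```python
def merge_sweeps(sweeps, tolerance=0.5):
    """Union cut times from several detector runs, collapsing near-duplicates."""
    merged = []
    for t in sorted(t for cuts in sweeps for t in cuts):
        # Keep a cut only if it is not within `tolerance` of the last kept cut.
        if not merged or t - merged[-1] > tolerance:
            merged.append(t)
    return merged


def sentence_spans(words):
    """Group (start, end, text) words into sentence spans via end punctuation."""
    spans, start, last_end = [], None, None
    for w_start, w_end, text in words:
        if start is None:
            start = w_start
        last_end = w_end
        if text.rstrip().endswith((".", "?", "!")):
            spans.append((start, w_end))
            start = None
    if start is not None:  # trailing words without closing punctuation
        spans.append((start, last_end))
    return spans


def snap_cuts(cuts, spans):
    """Move any cut that lands inside a sentence to that sentence's nearer edge."""
    snapped = []
    for t in cuts:
        for s, e in spans:
            if s < t < e:
                t = s if (t - s) <= (e - t) else e
                break
        snapped.append(t)
    return sorted(set(snapped))


sweeps = [[4.0, 12.1], [4.2, 12.0, 20.0], [], [12.3]]  # four sweeps, one empty
words = [(3.0, 3.4, "We"), (3.5, 4.5, "left."),
         (11.0, 11.5, "Then"), (11.6, 13.0, "silence.")]
cuts = merge_sweeps(sweeps)
aligned = snap_cuts(cuts, sentence_spans(words))
```

Taking the union of all sweeps is one way to read "at least one sensitivity catches cuts"; a stricter variant could require agreement between two or more sweeps before accepting a cut.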

TribeV2 brain encoding

Meta's facebook/tribev2 model takes multimodal video input and outputs predicted fMRI vertex activity: a heatmap of how the brain might respond while watching each segment. It is deployed on Modal on an H100, with persistent volume caching and aggressive ffmpeg preprocessing (256px, 15fps). The VJEPA2 video encoding step inside TribeV2 is the dominant bottleneck: roughly 5 minutes on an H100 for a 2-minute clip.
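As a rough sketch of the preprocessing step, the helper below builds an ffmpeg argument list that downscales and resamples a clip. It assumes "256px" refers to the output width (it could equally mean the short side), and the function names and the choice to copy the audio stream unchanged are illustrative assumptions rather than the deployed settings.

```python
def build_preprocess_cmd(src, dst, width=256, fps=15):
    """Build an ffmpeg argv that downscales and resamples a clip.

    Assumption: 256px is the target width. `scale=W:-2` preserves the
    aspect ratio and rounds the height to an even number.
    """
    video_filter = f"scale={width}:-2,fps={fps}"
    return ["ffmpeg", "-y", "-i", src, "-vf", video_filter, "-c:a", "copy", dst]


def frame_count(duration_s, fps=15):
    """Frames the encoder would see after resampling to `fps`."""
    return int(duration_s * fps)


cmd = build_preprocess_cmd("clip.mp4", "clip_256.mp4")
frames = frame_count(120)  # a 2-minute clip at 15 fps
```

The list can be passed to `subprocess.run(cmd, check=True)`. At 15 fps a 2-minute clip yields 1800 frames, so cutting resolution and frame rate before encoding directly reduces the work done by the video encoder.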

NeuroEmo emotion classifier

A custom TinyMLP, preceded by StandardScaler and PCA, sits on top of TribeV2's 20k+ vertex predictions. The vertices are first reduced to ROI features using the Harvard-Oxford and Schaefer atlases (296 dims → PCA 128 → 5 classes). The classifier is trained on NeuroEmo (ds005700), real fMRI collected while participants watched emotional film clips. MLP v2 work is underway to align training features with production TribeV2 outputs.
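The two ends of that pipeline can be sketched in plain Python: pooling vertex predictions into ROI features, and turning five output logits into class probabilities. This is a sketch under assumptions: it supposes every vertex carries an integer atlas label with 0 meaning "unlabeled", and that pooling is a simple mean. The scaler, PCA, and MLP layers are omitted, and the function names are hypothetical.

```python
import math
from collections import defaultdict

EMOTIONS = ["afraid", "calm", "delighted", "depressed", "excited"]


def roi_features(vertex_values, vertex_labels):
    """Average vertex predictions within each atlas label: one feature per ROI."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, label in zip(vertex_values, vertex_labels):
        if label == 0:  # assumption: 0 marks vertices outside every ROI
            continue
        sums[label] += value
        counts[label] += 1
    return [sums[k] / counts[k] for k in sorted(sums)]


def softmax(logits):
    """Numerically stable softmax over the classifier's output logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(logits):
    """Return the most probable emotion label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


features = roi_features([1.0, 3.0, 5.0, 7.0, 9.0], [1, 1, 2, 0, 2])
label, prob = top_emotion([0.1, 2.0, -1.0, 0.3, 0.5])
```

In the described setup the ROI vector would have 296 entries and pass through standardization and a 128-component PCA before reaching the MLP; the toy input here only demonstrates the pooling logic.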

Analytics workspace

Next.js frontend with a resizable mission-control layout: video stage, interactive timeline with per-scene emotion bars and transcript splits, and a GLB-based 3D brain viewer with sectional activation overlay. Scrub the timeline and watch fMRI heatmaps and emotion probabilities update per scene.
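The per-scene update on scrub reduces to finding which scene contains the playhead. The lookup is sketched here in Python using binary search; `scene_at` and the sample scene start times are hypothetical, and the same logic would carry over to the frontend code.

```python
from bisect import bisect_right


def scene_at(scene_starts, t):
    """Index of the scene containing playhead time `t`.

    `scene_starts` must be sorted ascending; times before the first
    scene are clamped to scene 0.
    """
    index = bisect_right(scene_starts, t) - 1
    return max(index, 0)


scene_starts = [0.0, 4.5, 11.0, 20.0]  # seconds, one entry per scene
current = scene_at(scene_starts, 12.3)
```

With the scene index in hand, the viewer can look up that scene's heatmap and emotion probabilities without scanning the whole timeline on every scrub event.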

Infrastructure and deployment

Next.js on Vercel, Supabase for auth and a Postgres-backed state machine, GCS for video and frame storage, and Modal for all ML workers (scene detection, Whisper on L4, TribeV2 on H100). Built in a ~17-day sprint, with a migration from GCP VMs to Modal for faster ML deploy cycles. Modal credit limits on H100 were the main iteration bottleneck during development.
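A job state machine of this kind can be sketched as a table of allowed transitions. The state names below (uploaded, preprocessing, encoding, classifying, done, failed) are hypothetical placeholders chosen to mirror the pipeline stages described above; the actual schema and states are not given here.

```python
# Hypothetical job states; each maps to the states it may move to next.
ALLOWED = {
    "uploaded": {"preprocessing", "failed"},
    "preprocessing": {"encoding", "failed"},
    "encoding": {"classifying", "failed"},
    "classifying": {"done", "failed"},
    "done": set(),
    "failed": {"uploaded"},  # assumption: a failed job can be retried
}


def advance(current, target):
    """Return `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "uploaded"
for step in ("preprocessing", "encoding", "classifying", "done"):
    state = advance(state, step)
```

In a Postgres-backed version, the same rule is often enforced with a conditional update that only succeeds when the row's current status matches the expected one, so that two workers cannot advance the same job at once.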