Computer Vision / ML · 2023 – 2024 · Built at Tessact (6-engineer team, $2M funded)
Face Tracking Service
Replaced AWS Rekognition at 40× lower cost: $6 → $0.15 per video hour
Sole engineer for this service on a 6-person team. Owned architecture, ML model selection, production deployment, and integration across all Tessact products. This work directly led to promotion to SDE II in January 2026.
Cost per video hour: $6 → $0.15 (40×)
Monthly savings: $2,500/month
Throughput improvement: 10× faster
Accuracy: 95%+ on film/TV
Clients: JioHotstar, SunTV, Jeevanvidya
Overview
Tessact was using AWS Rekognition for face detection and identification in long-form video content, paying $6 per video hour. I architected a custom face analysis service using InsightFace that cut the cost to $0.15 per video hour (40× cheaper) while delivering 10× faster throughput. At 100 hours/week, this saved $2,500/month with headroom for 10× growth. The service is now used by JioHotstar, Jeevanvidya, and SunTV.
Architecture & Design
- Architected a face analysis pipeline covering detection, identification, tracking, and clustering. Each stage passes results to the next via a shared data store, so individual components can be replaced independently.
- Productized as a standalone REST API service on GCP Cloud Run. The clean interface lets any Tessact product query faces for any video without knowing the internals. Used across the video editing pipeline, multimodal AI system, and the newer AI repurposing platform.
- Used ANN (Approximate Nearest Neighbor) indexing for embedding lookup, enabling sub-second identification across large face databases even for long-form content with hundreds of unique speakers.
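As a rough illustration of the embedding lookup described above, the sketch below runs an exact (brute-force) cosine-similarity search over a tiny in-memory gallery. It is not the service's code: the gallery contents, the `identify` helper, and the 0.5 threshold are all assumptions made for the example. A production ANN index would replace the linear scan with an approximate structure that returns the same kind of (identity, similarity) result.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query, gallery, threshold=0.5):
    """Return (name, score) of the closest gallery face, or (None, score) if too far.

    `gallery` maps a hypothetical identity name to its reference embedding.
    The threshold is an illustrative value, not a tuned one.
    """
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy 3-dimensional embeddings; real face embeddings have hundreds of dimensions.
gallery = {"actor_a": [1.0, 0.0, 0.0], "actor_b": [0.0, 1.0, 0.0]}
match, score = identify([0.9, 0.1, 0.0], gallery)
unknown, _ = identify([0.0, 0.0, 1.0], gallery)
```

The linear scan costs one comparison per gallery entry, which is the cost an approximate index avoids as the face database grows.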
ML Model Selection
- Evaluated multiple face detection models for production accuracy vs speed: SCRFD-10G-GNKPS (InsightFace), YOLO variants, and RetinaFace. Selected InsightFace buffalo_l (SCRFD-10G-GNKPS detector, GLINTR100 embeddings) for best accuracy/speed tradeoff on media-quality video.
- Benchmarked face clustering algorithms for speaker consolidation across long videos: HDBSCAN, Chinese Whispers, and Agglomerative clustering. Used PCA visualization to validate cluster quality on real film/TV datasets. Chinese Whispers clustering gave the most stable identities for media content.
- Validated tracking robustness by benchmarking YOLO+BoT-SORT/ByteTrack+ReID against the InsightFace pipeline across test cases with occlusion, profile views, and lighting shifts. The InsightFace pipeline won on consistency.
- Achieved 95%+ identity accuracy across film/TV content including multi-actor scenes, cross-cutting, and varied lighting conditions.
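To make the clustering step concrete, here is a minimal sketch of Chinese Whispers label propagation over face embeddings. It illustrates the general algorithm only. The similarity threshold, iteration count, seed, and toy vectors are assumptions, not the values used in the service.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chinese_whispers(embeddings, threshold=0.5, iterations=20, seed=0):
    """Cluster embeddings; returns one integer label per input embedding."""
    n = len(embeddings)
    # Build a weighted graph: connect faces whose similarity clears the threshold.
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            weight = cosine(embeddings[i], embeddings[j])
            if weight >= threshold:
                neighbors[i].append((j, weight))
                neighbors[j].append((i, weight))

    labels = list(range(n))  # every face starts in its own cluster
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(iterations):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not neighbors[node]:
                continue  # isolated faces keep their own label
            # Each neighbor votes for its current label, weighted by similarity.
            votes = {}
            for other, weight in neighbors[node]:
                votes[labels[other]] = votes.get(labels[other], 0.0) + weight
            best = max(votes, key=votes.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable
    return labels


# Two detections each of two hypothetical actors, plus one unrelated face.
faces = [
    [1.0, 0.0, 0.0], [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0], [0.05, 0.95, 0.0],
    [0.0, 0.0, 1.0],
]
labels = chinese_whispers(faces)
```

The algorithm needs no preset cluster count, which suits long videos where the number of distinct people is not known in advance.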
Cost Analysis vs AWS Rekognition
- AWS Rekognition: $0.10 per minute of video analyzed × 60 minutes = $6.00/video hour. No batching discounts at Tessact's scale.
- Custom InsightFace service: GPU inference on GCP Cloud Run with NVIDIA L4 GPUs. Amortized cost at Tessact's volume: ~$0.15/video hour including compute, storage, and networking.
- 40× cost reduction. At 100 hours/week processed: $600/week → $15/week, saving $585/week, or roughly $2,500/month.
- Custom service also delivers 10× faster throughput: Rekognition ran at roughly real time (1:1), while the custom service processes 1 hour of video in about 6 minutes.
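The arithmetic behind these figures can be checked with a few lines. The rates and volume come from the bullets above. The weeks-per-month conversion (52/12) is an assumption, and a 4.2-week month gives about $2,457 instead, so both land near the rounded $2,500/month figure.

```python
REKOGNITION_RATE = 6.00   # USD per video hour (from the comparison above)
CUSTOM_RATE = 0.15        # USD per video hour (from the comparison above)
HOURS_PER_WEEK = 100      # processing volume (from the comparison above)
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length in weeks
SPEEDUP = 10              # custom service throughput relative to real time

rekognition_weekly = REKOGNITION_RATE * HOURS_PER_WEEK  # about $600/week
custom_weekly = CUSTOM_RATE * HOURS_PER_WEEK            # about $15/week
weekly_savings = rekognition_weekly - custom_weekly     # about $585/week
monthly_savings = weekly_savings * WEEKS_PER_MONTH      # about $2,535/month
cost_ratio = REKOGNITION_RATE / CUSTOM_RATE             # about 40x
minutes_per_video_hour = 60 / SPEEDUP                   # 6 minutes

print(f"Weekly savings:  ${weekly_savings:,.0f}")
print(f"Monthly savings: ${monthly_savings:,.0f}")
print(f"Cost ratio:      {cost_ratio:.0f}x")
```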
Production Integration
- Integrated as a shared microservice across all Tessact pipeline products. Both the legacy multimodal AI pipeline and the newer AI video repurposing platform use it.
- The face tracking output feeds directly into speaker-aware reframing: it identifies which speaker is active in each frame for dynamic crop centering.
- Service handles concurrent requests across multiple client videos with GCP Cloud Run auto-scaling.
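A minimal sketch of the crop-centering step in speaker-aware reframing, assuming the active speaker's face box is already known for the frame. The `crop_window` helper, the 9:16 target aspect ratio, and the box format `(x1, y1, x2, y2)` are hypothetical choices for illustration, not the platform's actual interface.

```python
def crop_window(face_box, frame_w, frame_h, aspect_w=9, aspect_h=16):
    """Return (left, top, width, height) of a full-height crop centered on a face.

    The crop keeps the full frame height, takes its width from the target
    aspect ratio, and is clamped so it never extends past the frame edges.
    """
    x1, _, x2, _ = face_box
    crop_h = frame_h
    crop_w = min(frame_w, int(frame_h * aspect_w / aspect_h))
    center_x = (x1 + x2) / 2
    left = int(center_x - crop_w / 2)
    left = max(0, min(left, frame_w - crop_w))  # clamp to the frame
    return left, 0, crop_w, crop_h


# 1920x1080 source frame, vertical 9:16 output.
centered = crop_window((900, 200, 1020, 400), 1920, 1080)
near_left_edge = crop_window((0, 100, 100, 300), 1920, 1080)
near_right_edge = crop_window((1850, 100, 1910, 300), 1920, 1080)
```

In practice the per-frame crop position would likely be smoothed over time to avoid jitter, which this sketch leaves out.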
Tech Stack
Python
InsightFace
SCRFD-10G-GNKPS
GLINTR100
Chinese Whispers Clustering
ANN Indexing
FastAPI
Docker
GCP Cloud Run
PostgreSQL
NumPy
OpenCV