MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction
Unsupervised Learning-Based Approach for 3D Face Reconstruction
In this video, the speaker presents a model-based face autoencoder that integrates both optimization-based and learning-based approaches within a single framework to obtain 3D models of faces from monocular input images.
Problem Statement
- The problem is to obtain a 3D model of the face from a monocular input image.
- The 3D reconstruction should define the geometry and skin reflectance of the face in the scene.
Optimization-Based Approaches
- Previous approaches used analysis by synthesis to optimize for the best 3D reconstruction given an input image or video.
- These approaches can obtain high-quality reconstructions but are computationally expensive due to large numbers of unknowns and constraints.
- The objective function that these methods minimize is non-convex, so the optimization can get stuck in local minima.
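The local-minimum issue can be illustrated with a minimal sketch. The one-dimensional function below is a hypothetical stand-in, not the actual face reconstruction objective; it only shows that plain gradient descent on a non-convex function ends in different minima depending on the starting point.

```python
def f(x):
    # Hypothetical non-convex objective with two minima (not the real face energy).
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=1000):
    # Plain gradient descent from the starting point x.
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


x_right = descend(2.0)   # settles in the shallower minimum on the right
x_left = descend(-2.0)   # settles in the deeper minimum on the left
print(x_right, f(x_right))
print(x_left, f(x_left))
```

Starting on the right, the descent never reaches the deeper minimum on the left, which is the same failure mode as a poorly initialized reconstruction.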
Learning-Based Approaches
- Recently, some learning-based approaches have been proposed that directly learn a regressor which gives us the 3D reconstruction from the input image.
- These approaches are typically faster than optimization-based ones but are limited by the lack of training data that pairs input images with their corresponding 3D reconstructions.
Model-Based Face Autoencoder
- This approach integrates both optimization-based and learning-based approaches within a single framework.
- It lets us train unsupervised on real images and exploit advantages of both paradigms.
Pipeline Overview
- An input image goes through a convolutional encoder which gives parameters of a low-dimensional parametric model that defines the reconstruction.
- Given these parameters, the 3D reconstruction is computed using the model's semantics, which define how each parameter influences the result.
- The reconstruction then passes through an image formation layer, which projects it onto the image plane to produce a synthetically rendered image.
- A loss function compares the rendered image with the input image, and this loss is used to train the encoder.
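The decoding and projection steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the linear model, the pinhole camera, and the names `decode_geometry` and `project` are hypothetical and are not taken from the talk.

```python
def decode_geometry(mean, basis, coeffs):
    """Linear model: vertex = mean vertex + sum over k of coeffs[k] * basis[k] offset."""
    verts = []
    for i, m in enumerate(mean):
        verts.append(tuple(
            m[axis] + sum(c * b[i][axis] for c, b in zip(coeffs, basis))
            for axis in range(3)
        ))
    return verts


def project(verts, focal=1.0):
    """Pinhole perspective projection onto the image plane (assumed camera model)."""
    return [(focal * x / z, focal * y / z) for x, y, z in verts]


# Toy model: two vertices and one coefficient that widens the shape along x.
mean = [(-1.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
basis = [[(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]

verts = decode_geometry(mean, basis, coeffs=[0.5])
points_2d = project(verts)
print(verts)
print(points_2d)
```

Because every step is a fixed, differentiable function of the parameters, a loss measured in image space can be propagated back to the encoder that predicted them.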
Network Architecture
- The convolutional encoder is not the main contribution; the experiments use VGG-Face or AlexNet with a fully connected layer at the end.
- The encoder outputs 257 parameters, from which the 3D model of the face is computed.
- These parameters define rigid pose, global shape, geometric deformations, skin reflectance, expressions of the face and scene illumination.
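One way to picture the encoder output is as a flat vector that is sliced into named groups. The group sizes below are an assumption chosen to sum to 257; this summary does not state them, and the ordering and the helper `split_code` are hypothetical.

```python
# Assumed layout (sizes sum to 257; not stated in the talk summary).
LAYOUT = [
    ("shape", 80),
    ("expression", 64),
    ("reflectance", 80),
    ("rotation", 3),
    ("translation", 3),
    ("illumination", 27),
]


def split_code(code):
    """Slice a flat parameter vector into named groups following LAYOUT."""
    expected = sum(size for _, size in LAYOUT)
    if len(code) != expected:
        raise ValueError(f"expected {expected} parameters, got {len(code)}")
    parts = {}
    start = 0
    for name, size in LAYOUT:
        parts[name] = code[start:start + size]
        start += size
    return parts


parts = split_code([0.0] * 257)
print({name: len(values) for name, values in parts.items()})
```

Giving each slice a fixed meaning is what lets the later stages interpret the encoder output as pose, geometry, reflectance, and lighting.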
Unsupervised 3D Face Reconstruction
The speaker presents an unsupervised approach to training for 3D face reconstruction, using a statistical regularizer on the parameters of the model. The method works well under in-the-wild illumination, with different expressions, and even under occlusions.
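A training loss of this kind might be sketched as a photometric term plus a weighted regularizer. The squared color difference, the assumption that coefficients are expressed in units of standard deviation, and the weight `w_reg` are all illustrative choices, not values from the talk.

```python
def photometric_term(rendered, observed):
    """Mean squared color difference over corresponding pixels."""
    total = 0.0
    for r, o in zip(rendered, observed):
        total += sum((rc - oc) ** 2 for rc, oc in zip(r, o))
    return total / len(rendered)


def statistical_regularizer(coeffs):
    """Penalize coefficients far from the model mean.

    Assumes each coefficient is already scaled by its standard deviation.
    """
    return sum(c * c for c in coeffs)


def total_loss(rendered, observed, coeffs, w_reg=0.1):
    # w_reg is a hypothetical weight balancing data fit against plausibility.
    return photometric_term(rendered, observed) + w_reg * statistical_regularizer(coeffs)


rendered = [(0.5, 0.5, 0.5)]
observed = [(0.5, 0.5, 0.5)]
loss = total_loss(rendered, observed, coeffs=[1.0, -2.0])
print(loss)
```

No ground-truth 3D data appears anywhere in the loss: the input image supervises the photometric term, and the regularizer keeps the predicted parameters within a plausible range.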
Results on Test Set
- Final reconstruction overlaid on input image
- Geometry, reflectance, and illumination channels of the reconstruction are shown
- Works well under in-the-wild illumination, with different expressions, and even under occlusions by beards or strands of hair
Comparison with Other Approaches
- Compared to the learning-based approach of Richardson et al., which is trained on synthetic images, this approach generalizes better to real images.
- Compared to Garrido et al.'s optimization-based approach, this approach obtains similar quality results while being orders of magnitude faster.
- Optimization-based techniques get stuck in local minima without keypoint-based alignment between the model and the input image, whereas this method gives good reconstructions even without such alignment.
Applicability to Other Problems
- The presented ideas could be applicable to other problems beyond 3D face reconstruction.
- It may be possible to apply this approach in other domains, even where few resources exist on, say, reflectance models.
Q&A Session
- A question was raised about how this approach handles occlusions or eyeglasses. Such occluders are not modeled in the parametric model, but the photometric loss function is robust enough to handle occlusions to a certain extent.
- Possibility of modeling reflectance using this approach for harder problems.
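The robustness mentioned in the occlusion answer can be illustrated by comparing two ways of summing per-pixel color errors. The specific norms below are an assumption for illustration; the summary only says that the photometric loss is robust to some degree.

```python
import math


def squared_l2(residuals):
    """Sum of squared per-pixel color errors (sensitive to outliers)."""
    return sum(sum(c * c for c in r) for r in residuals)


def l21(residuals):
    """Sum of per-pixel Euclidean error norms (less sensitive to outliers)."""
    return sum(math.sqrt(sum(c * c for c in r)) for r in residuals)


# Nine well-fit pixels plus one occluded pixel with a large color error.
residuals = [(0.1, 0.0, 0.0)] * 9 + [(1.0, 0.0, 0.0)]
outlier = residuals[-1:]

# Fraction of each loss contributed by the single occluded pixel.
share_squared = squared_l2(outlier) / squared_l2(residuals)
share_l21 = l21(outlier) / l21(residuals)
print(share_squared, share_l21)
```

Under the squared loss the single occluded pixel accounts for most of the total, while under the unsquared norm its share is much smaller, so it pulls less strongly on the fit.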