MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction

Unsupervised Learning-Based Approach for 3D Face Reconstruction

In this video, the speaker presents a model-based face auto-encoder that integrates both optimization-based and learning-based approaches within a single framework to obtain 3D models of faces from monocular input images.

Problem Statement

  • The problem is to obtain a 3D model of the face from a monocular input image.
  • The 3D reconstruction should define the geometry and skin reflectance of the face in the scene.

Optimization-Based Approaches

  • Previous approaches used analysis by synthesis to optimize for the best 3D reconstruction given an input image or video.
  • These approaches can obtain high-quality reconstructions but are computationally expensive due to large numbers of unknowns and constraints.
  • The objective function that these approaches minimize is non-convex, so optimization can get stuck in local minima.

Learning-Based Approaches

  • Recently, some learning-based approaches have been proposed that directly learn a regressor which gives us the 3D reconstruction from the input image.
  • These approaches are typically faster than optimization-based ones but suffer from a lack of training data, i.e., input images paired with their corresponding 3D reconstructions.

Model-Based Face Auto Encoder

  • This approach integrates both optimization-based and learning-based approaches within a single framework.
  • It enables unsupervised training on real images and exploits the advantages of both paradigms.

Pipeline Overview

  1. An input image passes through a convolutional encoder that outputs the parameters of a low-dimensional parametric model defining the reconstruction.
  2. From these parameters, the 3D reconstruction is computed using fixed semantics that define how each parameter influences it.
  3. The reconstruction passes through an image formation layer that projects it onto the image plane, producing a synthetically rendered image.
  4. A loss function compares this rendered image with the input image, and this loss is used to train the encoder.
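The four steps above can be sketched as a minimal forward pass. This is a toy illustration of the data flow only: in the paper the encoder is a CNN (VGG-Face or AlexNet) and the decoder is a differentiable model-based renderer, whereas here both are stand-in linear maps with made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

CODE_DIM = 257          # semantic code vector length from the talk
IMG_PIX = 64 * 64 * 3   # toy image resolution (assumption, not the paper's)

# Stand-ins for the real components: random linear maps that only
# demonstrate the encoder -> code -> render -> loss pipeline.
W_enc = rng.standard_normal((CODE_DIM, IMG_PIX)) * 0.01
W_dec = rng.standard_normal((IMG_PIX, CODE_DIM)) * 0.01

def encode(image):
    """CNN-encoder stand-in: image -> 257-D semantic code vector."""
    return W_enc @ image

def render(code):
    """Model-based decoder stand-in: code -> synthetically rendered image."""
    return W_dec @ code

def photometric_loss(image):
    """Unsupervised loss: compare the input with its own re-rendering."""
    rendered = render(encode(image))
    return float(np.mean((image - rendered) ** 2))

image = rng.standard_normal(IMG_PIX)
loss = photometric_loss(image)  # scalar that trains the encoder end-to-end
```

Because the loss only compares the input to its re-rendering, no ground-truth 3D data is needed, which is what makes the training unsupervised.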

Network Architecture

  • The convolutional encoder is not the main contribution; in the experiments, VGG-Face or AlexNet with a fully connected layer at the end is used.
  • The encoder outputs 257 parameters from which the 3D model of the face can be computed.
  • These parameters define the rigid pose, global shape, geometric deformations, skin reflectance, facial expressions, and scene illumination.
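One way to make the semantics of the 257-D code vector concrete is to slice it into its parameter groups. The group sizes below follow the dimensions of the parametric model used in the paper (80 shape, 64 expression, 80 reflectance, 27 spherical-harmonics illumination, 3 rotation, 3 translation), though the exact split is an assumption on my part:

```python
import numpy as np

# Semantic split of the 257-D code vector (assumed group sizes).
GROUPS = {
    "shape": 80,
    "expression": 64,
    "reflectance": 80,
    "illumination": 27,  # 9 spherical-harmonics coefficients per RGB channel
    "rotation": 3,
    "translation": 3,
}
assert sum(GROUPS.values()) == 257

def split_code(code):
    """Slice a flat 257-D code into named semantic parameter groups."""
    parts, start = {}, 0
    for name, size in GROUPS.items():
        parts[name] = code[start:start + size]
        start += size
    return parts

params = split_code(np.arange(257.0))
```

This exactly defined semantic meaning of each coordinate is what lets the hand-crafted decoder turn the code into geometry, reflectance, and illumination.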

Unsupervised 3D Face Reconstruction

The speaker presents an unsupervised approach to train for 3D face reconstruction, using a statistical regularizer on the parameters of the model. They show that the method works well under in-the-wild illumination, across different expressions, and even under occlusions.
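A statistical regularizer of this kind is commonly a Tikhonov-style penalty on the model coefficients. The sketch below assumes the coefficients are expressed in units of their PCA standard deviations, so penalizing their squared magnitude keeps the reconstruction close to the statistical mean face; the weights are illustrative, not the paper's values.

```python
import numpy as np

def statistical_regularizer(shape, expression, reflectance,
                            w_shape=1.0, w_expr=1.0, w_refl=1.0):
    """Tikhonov-style prior on the face-model coefficients.

    Assumes coefficients are normalized by their PCA standard deviations,
    so larger values mean less plausible faces under the statistical model.
    """
    return (w_shape * float(np.sum(shape ** 2))
            + w_expr * float(np.sum(expression ** 2))
            + w_refl * float(np.sum(reflectance ** 2)))

# The mean face (all-zero coefficients) incurs zero penalty.
zero = statistical_regularizer(np.zeros(80), np.zeros(64), np.zeros(80))
```

Adding this term to the photometric loss discourages degenerate reconstructions that match the pixels but are implausible as faces.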

Results on Test Set

  • Final reconstruction overlaid on input image
  • Geometry, reflectance, and illumination channels of the reconstruction are shown
  • Works well under in-the-wild illumination, different expressions, and even under occlusions by beards or strands of hair

Comparison with Other Approaches

  • Compared to Richardson et al.'s learning-based approach trained on synthetic images, this approach generalizes better to real images.
  • Compared to Garrido et al.'s optimization-based approach, this approach obtains similar quality results while being orders of magnitude faster.
  • Optimization-based techniques can get stuck in local minima without keypoint-based alignment between the model and the input image; this method produces good reconstructions even without such alignment.

Applicability to Other Problems

  • The presented ideas could be applicable to other problems beyond 3D face reconstruction.
  • It may be possible to apply this approach in other domains where prior resources, such as reflectance models, are scarce.

Q&A Session

  • Concern about how this approach handles occlusions or eyeglasses: occlusions are not modeled in the parametric model, but the photometric loss function used is robust enough to handle them to a certain extent.
  • Possibility of modeling reflectance using this approach for harder problems.
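The robustness to occlusions mentioned in the Q&A can be illustrated with a robust photometric term. The sketch below assumes an ℓ2,1-style loss (per-pixel L2 norm over RGB, summed over pixels), under which an outlier pixel contributes linearly rather than quadratically; whether this matches the paper's exact formulation is an assumption.

```python
import numpy as np

def l21_photometric_loss(diff):
    """Robust l2,1 photometric loss (assumed formulation).

    diff: (num_pixels, 3) array of per-pixel RGB differences.
    Each pixel contributes the L2 norm of its RGB error, so an
    occluded outlier pixel is penalized linearly, not quadratically.
    """
    return float(np.sum(np.linalg.norm(diff, axis=1)))

diff = np.zeros((100, 3))
diff[0] = [10.0, 0.0, 0.0]          # one badly occluded pixel

robust = l21_photometric_loss(diff)  # 10.0: outlier counted linearly
squared = float(np.sum(diff ** 2))   # 100.0: squared loss amplifies it
```

The squared loss lets a single occluded pixel dominate the objective, while the ℓ2,1 loss bounds its influence, which is why occlusions degrade the fit only gracefully.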

Video description

ICCV17 | 777 | MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction

Ayush Tewari (MPI Informatics), Michael Zollhoefer (MPI Informatics), Hyeongwoo Kim (MPI Informatics), Pablo Garrido (MPI Informatics), Florian Bernard (University of Luxembourg), Patrick Perez (Technicolor), Christian Theobalt (MPI Informatics)

In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.