Facebook uses 800 Tesla V100 GPUs to train AI to turn any 2D photo into a 3D photo

    Facebook has supported 3D photos for a couple of years now, and I have a couple of friends who use them frequently to great effect. Not all phones, though, are capable of taking 3D photos, as many lack the dual cameras necessary to derive depth information. Thanks to some new AI work by Facebook, that’s about to change.

    Photos taken with a single lens can now be converted into 3D photos using state-of-the-art machine learning. The system works on virtually any image, including decades-old photos and even paintings.

    Converting a 2D photo into a 3D photo required overcoming a variety of technical challenges, such as training a model that correctly infers the 3D positions of an extremely wide variety of subject matter, and optimizing the system so that it runs on-device on typical mobile processors in a fraction of a second.

    To overcome these challenges, Facebook trained a convolutional neural network (CNN) on millions of pairs of public 3D images and their accompanying depth maps, and leveraged a variety of mobile-optimization techniques previously developed by Facebook AI, such as FBNet and ChamNet.

    The mobile experience

    Given a standard RGB image, the 3D Photos CNN can estimate a distance from the camera for each pixel. This was accomplished through four means:

    • A network architecture built with a set of parameterizable, mobile-optimized neural building blocks.
    • Automated architecture search to find an effective configuration of these blocks, enabling the system to perform the task in under a second on a wide range of devices.
    • Quantization-aware training to leverage high-performance INT8 quantization on mobile while minimizing potential quality degradation from the quantization process.
    • Large amounts of training data derived from public 3D photos.

    Automated architecture search

    To find an effective architecture configuration, the team at Facebook AI automated the search process using ChamNet, an algorithm developed by Facebook AI. The ChamNet algorithm iteratively samples points from the search space to train an accuracy predictor.

    This accuracy predictor is used to accelerate a genetic search for a model that maximizes predicted accuracy while satisfying specified resource constraints. The search space varies the channel expansion factor and the number of output channels per block, resulting in 3.4×10^22 possible architectures.
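
    To see why such a search space gets so large, here is a minimal counting sketch. It assumes every block independently picks one expansion factor and one output-channel count; the block count and the number of choices below are made-up values for illustration, not the figures Facebook used.

```python
import math


def search_space_size(num_blocks: int, expansion_choices: int, channel_choices: int) -> int:
    """Count architectures when each block independently picks one
    expansion factor and one output-channel count."""
    per_block = expansion_choices * channel_choices
    return per_block ** num_blocks


# Hypothetical values, chosen only to show how quickly the count grows.
size = search_space_size(num_blocks=20, expansion_choices=4, channel_choices=4)
print(f"{size:.2e} candidate architectures")
print(f"roughly 10^{math.log10(size):.1f}")
```

    Because the count is exponential in the number of blocks, even a handful of options per block makes exhaustive training of every candidate impractical, which is why a predictor-guided search is attractive.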

    The search completed in approximately three days using 800 Tesla V100 GPUs, with the team setting and then adjusting a FLOP constraint on the model architecture to reach different operating points.
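
    The idea of a predictor-guided genetic search under a FLOP budget can be sketched in a few lines. This is a toy illustration, not ChamNet itself: the per-block choices, the `estimate_flops` cost model and the `predicted_accuracy` function are all hypothetical stand-ins, and the real system trains its accuracy predictor from sampled architectures rather than using a fixed formula.

```python
import random

# Hypothetical per-block choices; the real search space is not reproduced here.
EXPANSION_CHOICES = [1, 2, 3, 4, 6]
CHANNEL_CHOICES = [16, 24, 32, 48, 64]
NUM_BLOCKS = 8


def estimate_flops(arch):
    """Toy cost model: cost grows with expansion factor times channel count."""
    return sum(e * c for e, c in arch)


def predicted_accuracy(arch):
    """Stand-in for a trained accuracy predictor (diminishing returns on capacity)."""
    return sum((e * c) ** 0.5 for e, c in arch)


def mutate(arch, rng):
    """Copy the architecture and re-sample the settings of one random block."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = (rng.choice(EXPANSION_CHOICES), rng.choice(CHANNEL_CHOICES))
    return child


def genetic_search(flop_budget, generations=40, population_size=24, seed=0):
    rng = random.Random(seed)
    # Start from the cheapest architecture so the population is always feasible.
    smallest = [(min(EXPANSION_CHOICES), min(CHANNEL_CHOICES))] * NUM_BLOCKS
    if estimate_flops(smallest) > flop_budget:
        raise ValueError("no architecture fits within the FLOP budget")
    population = [smallest] * population_size
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        # Keep only candidates within budget, ranked by predicted accuracy.
        feasible = [a for a in population + children if estimate_flops(a) <= flop_budget]
        feasible.sort(key=predicted_accuracy, reverse=True)
        population = feasible[:population_size]
    return population[0]


# Changing the budget yields models at different operating points.
for budget in (400, 800):
    best = genetic_search(budget)
    print(budget, estimate_flops(best), round(predicted_accuracy(best), 2))
```

    The predictor is what makes this cheap: each candidate is scored by a function call rather than by training a network, so the expensive GPU time goes into building the predictor instead of evaluating every architecture.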

    Quantization-aware training

    By default, the model is trained using single-precision floating-point weights and activations, but Facebook found significant advantages in quantizing both weights and activations to 8 bits. In particular, int8 weights require only a quarter of the storage of float32 weights, thereby reducing the number of bytes that must be transferred to the device on first use.
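
    The "quarter of the storage" figure follows directly from the bit widths. The short calculation below uses a made-up parameter count (the article does not state the model's size) and ignores the small overhead of storing quantization scales.

```python
def weight_storage_bytes(num_weights: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights at a given precision."""
    return num_weights * bits_per_weight // 8


num_weights = 5_000_000  # hypothetical model size, for illustration only
fp32_bytes = weight_storage_bytes(num_weights, 32)
int8_bytes = weight_storage_bytes(num_weights, 8)

print(f"float32: {fp32_bytes / 1e6:.1f} MB")     # float32: 20.0 MB
print(f"int8:    {int8_bytes / 1e6:.1f} MB")     # int8:    5.0 MB
print(f"ratio:   {int8_bytes / fp32_bytes:.2f}")  # ratio:   0.25
```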

    [Image: example photos, each of which started as a regular 2D image and was transformed to 3D with Facebook’s depth estimation neural network.]

    Int8-based operators also have much higher throughput than their float32 counterparts, thanks to well-tuned libraries such as Facebook AI’s QNNPACK, which has been integrated into PyTorch. Facebook used quantization-aware training (QAT) to avoid an unacceptable drop in quality due to quantization. QAT, which is now available as part of PyTorch, simulates quantization during training and supports backpropagation, thereby eliminating the gap between training and production performance.
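
    "Simulating quantization during training" usually means rounding values to the int8 grid and immediately converting them back to floats in the forward pass, so the network learns with the rounding error it will see in production. The sketch below shows that round trip in plain Python using a simplified symmetric scheme; it is an illustration of the idea, not PyTorch's implementation, and `fake_quantize` is a hypothetical helper name.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to signed integers and dequantize them again.

    Uses a single symmetric scale derived from the largest magnitude.
    Returns the dequantized floats, the integer codes, and the scale.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [q * scale for q in quantized]
    return dequantized, quantized, scale


weights = [-1.0, -0.3, 0.0, 0.42, 0.9]
simulated, codes, scale = fake_quantize(weights)
for w, s, q in zip(weights, simulated, codes):
    print(f"{w:+.4f} -> code {q:+4d} -> {s:+.4f} (error {abs(w - s):.5f})")
```

    Rounding has zero gradient almost everywhere, so QAT implementations commonly treat it as the identity function during the backward pass (the straight-through estimator), which is what lets backpropagation continue to work.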

    Finding new ways to create 3D experiences

    In addition to refining and improving the depth estimation algorithm, Facebook engineers are working toward enabling high-quality depth estimation for videos taken with mobile devices.

    Videos pose a noteworthy challenge, since each frame’s depth must be consistent with the next. But they also offer an opportunity to improve performance, since multiple observations of the same objects can provide additional signal for highly accurate depth estimation.

    Video-length depth estimation will open up a variety of innovative content creation tools to Facebook users. As the performance of the neural network improves, Facebook is also exploring how to leverage depth estimation, surface normal estimation, and spatial reasoning in real-time applications such as augmented reality.

    Facebook says this work will help it better understand the content of 2D images more generally. Improved understanding of 3D scenes could also help robots navigate and interact with the physical world. Facebook hopes that by sharing details about its 3D Photos system, it will help the AI community make progress in these areas and create new experiences that leverage advanced 3D understanding.

    More information at

    Jason Cartwright
    Creator of techAU, Jason has spent the past dozen+ years covering technology in Australia and around the world. Bringing a background in multimedia and passion for technology to the job, Cartwright delivers detailed product reviews, event coverage and industry news on a daily basis. Disclaimer: Tesla Shareholder from 20/01/2021
