Tesla created a new Twitter account, @Tesla_AI, in May, and today it got to work (run by humans so far, we expect), posting tweets about Tesla’s AI progress across its FSD software and upcoming humanoid robot, Tesla Bot.
Perhaps the most interesting of the flurry of tweets was this one, centred on Dojo, Tesla’s custom-designed silicon optimised for AI training.
Tesla uses training compute to train machine learning models for its FSD software, and Tesla customers experience the benefit of training on increasingly large datasets through improvements to the software over time. In theory, the faster the data engine spins, the faster Tesla will achieve and deliver on its promise of Full Self-Driving, so having more efficient and faster compute with Dojo is incredibly important.
Tesla Owners Silicon Valley reposted the tweet, to which Elon Musk replied, suggesting that a Dojo 2 is coming. While all computer chips go through iterations and improvements over time, I believe this is the first confirmation we’ve had that a Dojo version 2 is in the works.
Musk first clarified that Dojo V1 is highly optimised for vast amounts of video training. We know that Tesla takes the video feeds from the cameras around the car and stitches them together to create a 3D vector-space environment of the world around the vehicle. This is much like a video game, but created dynamically, on the fly, using computer vision rather than being pre-programmed.
Musk has said previously Dojo will only be a success if the engineers want to turn off the current GPU cluster. That cluster is made up of thousands of Nvidia A100 GPUs, the current best-in-class compute for AI training. What Dojo aims to do is take that model of a dedicated architecture for a job, even further, while being more power efficient, economical and scalable.
As Tesla applies its AI smarts from the car to the robot, Tesla needs to solve what it calls General Purpose AI, and the compute required to train the model(s) is substantial, something Musk suggests will be addressed by Dojo V2.
Musk went on to confirm that Dojo has been in operation at Tesla for a few months ‘running useful tasks’. This suggests it is not yet running the full data engine required to retrain AI models for FSD builds.
The diagram shows that come July 2023, Tesla will Start Dojo Production. This indicates they have the design locked in and are ready to start manufacturing at scale. Tesla designed Dojo, but chip fabrication is highly specialised, so the chips are likely being manufactured by TSMC or Samsung.
When Tesla presented Dojo to the world back in 2021 at AI Day, they showed off the ability to plug multiple of these chips together and then stack them vertically in a Dojo cabinet. Dojo cabinets would then be connected through high-speed links, allowing a matrix of Dojo chips to work together at almost unprecedented speed. At the time this was all conceptual, but given Tesla is about to commit the capital to send these to production, it seems they’re ready to rock and roll with Dojo.
Musk went on to reply to @Darewecan’s question about which hardware revision Tesla expects to achieve safer-than-human driving with. Musk confirmed the proposed timeline was his best guess.
That timeline suggests that HW3 (the computer and cameras in most Teslas on the road today) will be capable of being 2-3 times safer than a human. The most recent revision, Hardware 4, would be around 5-6x safer than a human, and a future Hardware 5 revision could be as much as 10x safer than a human driver.
I’m actually not sure which one of these lets you go to sleep in the car.
Tesla’s timeline shows them taking a Top 5 position in the world for compute as soon as February 2024. If that does transpire, it will be an amazing ramp, and Dojo will almost certainly take the place of the current GPU cluster for AI training at Tesla.
Tesla’s timeline shared today continues out to October 2024, increasing to a capacity equivalent to 400,000 A100 GPUs, producing a combined 100 ExaFLOPS.
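As a quick sanity check on those numbers (my own back-of-envelope arithmetic, not figures from Tesla's chart), dividing the projected total by the GPU-equivalent count gives the implied throughput per A100-equivalent:

```python
# Back-of-envelope check: if 400,000 A100-equivalents deliver a combined
# 100 ExaFLOPS, what does each GPU-equivalent contribute?
total_flops = 100e18        # 100 ExaFLOPS, as stated on Tesla's timeline
gpu_equivalents = 400_000   # A100-equivalent count from the same chart

per_gpu_tflops = total_flops / gpu_equivalents / 1e12
print(f"~{per_gpu_tflops:.0f} TFLOPS per A100-equivalent")  # ~250 TFLOPS

# For reference, an A100's peak BF16 tensor throughput is 312 TFLOPS,
# so the implied ~250 TFLOPS sits plausibly below peak.
```

The figures hang together: ~250 TFLOPS per GPU-equivalent is a believable sustained number for AI training workloads on A100-class hardware.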
It was interesting to compare this to the top-performing supercomputers in the world. Before we get into the list, it is worth noting that not all are created equal, with some destined for general computational problems rather than machine learning.
According to Top500.org, here’s a summary of the capabilities and performance of the Top 10 supercomputers on the planet:
1. Frontier
This HPE Cray EX system is the first US system with a performance exceeding one Exaflop/s. It is installed at the Oak Ridge National Laboratory (ORNL) in Tennessee, USA, where it is operated for the Department of Energy (DOE). It has achieved 1.194 Eflop/s using 8,699,904 cores. The HPE Cray EX architecture combines 3rd Gen AMD EPYC™ CPUs optimised for HPC and AI with AMD Instinct™ MI250X accelerators and a Slingshot-11 interconnect.
2. Fugaku
Installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan, Fugaku has 7,630,848 cores, which allowed it to achieve an HPL benchmark score of 442 Pflop/s.
3. LUMI
Another HPE Cray EX system, installed at the EuroHPC centre at CSC in Finland, is No. 3 with a performance of 0.3091 Eflop/s. The European High-Performance Computing Joint Undertaking (EuroHPC JU) is pooling European resources to develop top-of-the-range Exascale supercomputers for processing big data. One of the pan-European pre-Exascale supercomputers, LUMI, is located in CSC’s data centre in Kajaani, Finland.
4. Leonardo
Installed at a different EuroHPC site, CINECA in Italy, this is an Atos BullSequana XH2000 system with Xeon Platinum 8358 32C 2.6GHz main processors, NVIDIA A100 SXM4 40 GB accelerators, and a quad-rail NVIDIA HDR100 InfiniBand interconnect. It achieved a Linpack performance of 238.7 Pflop/s.
5. Summit
An IBM-built system at the Oak Ridge National Laboratory (ORNL) in Tennessee, USA, Summit is again listed at the No. 5 spot worldwide with a performance of 148.8 Pflop/s on the HPL benchmark, which is used to rank the TOP500 list. Summit has 4,356 nodes, each housing two POWER9 CPUs with 22 cores and six NVIDIA Tesla V100 GPUs, each with 80 streaming multiprocessors (SMs). The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network.
6. Sierra
A system at the Lawrence Livermore National Laboratory, CA, USA, is at No. 6. Its architecture is very similar to that of the No. 5 system, Summit: it is built with 4,320 nodes, each with two POWER9 CPUs and four NVIDIA Tesla V100 GPUs. Sierra achieved 94.6 Pflop/s.
7. Sunway TaihuLight
A system developed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, which is in China’s Jiangsu province is listed at the No. 7 position with 93 Pflop/s.
8. Perlmutter
Perlmutter is based on the HPE Cray “Shasta” platform, a heterogeneous system with AMD EPYC-based nodes and 1,536 NVIDIA A100-accelerated nodes. It achieved 64.6 Pflop/s.
9. Selene
An NVIDIA DGX A100 SuperPOD installed in-house at NVIDIA in the USA. The system is based on AMD EPYC processors with NVIDIA A100s for acceleration and a Mellanox HDR InfiniBand network, and achieved 63.4 Pflop/s.
10. Tianhe-2A (Milky Way-2A)
A system developed by China’s National University of Defense Technology (NUDT) and deployed at the National Supercomputer Center in Guangzhou, China is now listed as the No. 10 system with 61.4 Pflop/s.