WritingOpenAIOpenAIpublished May 16, 2018seen 6d

AI and compute

Open original ↗

Captured source

source ↗
published May 16, 2018seen 6dcaptured 3dhttp 200method exa

AI and compute | OpenAI

May 16, 2018

Conclusion

AI and compute

Loading…

Share

We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.

AlexNet to AlphaGo Zero: 300,000x increase in compute

Loading...

  • Download charts

Overview

Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to better performance⁠, and is often complementary to algorithmic advances.

For this analysis, we believe the relevant number is not the speed of a single GPU, nor the capacity of the biggest datacenter, but the amount of compute that is used to train a single model—this is the number most likely to correlate to how powerful our best models are. Compute per model differs greatly from total bulk compute because limits on parallelism⁠(both hardware and algorithmic) have constrained how big a model can be or how much it can be usefully trained. Of course, important breakthroughs are still made with modest amounts⁠ of compute—this analysis just covers compute capability.

The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and TPUs), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.

Eras

Looking at the graph we can roughly see four distinct eras:

  • Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
  • 2012 to 2014: Infrastructure to train on many GPUs was uncommon, so most results used 1-8 GPUs rated at 1-2 TFLOPS for a total of 0.001-0.1 pfs-days.
  • 2014 to 2016: Large-scale results used 10-100 GPUs rated at 5-10 TFLOPS, resulting in 0.1-10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
  • 2016 to 2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes⁠, architecture search⁠, and expert iteration⁠, along with specialized hardware such as TPU’s and faster interconnects, have greatly increased these limits, at least for some applications.

AlphaGoZero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.

Looking forward

We see multiple reasons to believe that the trend in the graph could continue. Many hardware startups⁠ are developing AI-specific chips, some of which claim they will achieve a substantial increase in FLOPS/Watt (which is correlated to FLOPS/$) over the next 1–2 years. There may also be gains from simply reconfiguring hardware to do the same number of operations for less economic cost⁠. On the parallelism side, many of the recent algorithmic innovations described above could in principle be combined multiplicatively—for example, architecture search and massively parallel SGD.

On the other hand, cost will eventually limit the parallelism side of the trend and physics will limit the chip efficiency side. We believe the largest training runs today employ hardware that cost in the single digit millions of dollars to purchase (although the amortized cost is much lower). But the majority of neural net compute today is still spent on inference (deployment), not training, meaning companies can repurpose or afford to purchase much larger fleets of chips for training. Therefore, if sufficient economic incentive exists, we could see even more massively parallel training runs, and thus the continuation of this trend for several more years. The world’s total hardware budget is 1 trillion dollars⁠ a year, so absolute limits remain far away. Overall, given the data above, the precedent for exponential trends in computing, work on ML specific hardware, and the economic incentives at play, we think it’d be a mistake to be confident this trend won’t continue in the short term.

Past trends are not sufficient to predict how long the trend will continue into the future, or what will happen while it continues. But even the reasonable potential for rapid increases in capabilities means it is critical to start addressing both safety⁠ and malicious use of AI⁠ today. Foresight is essential to responsible policymaking⁠ and responsible technological development, and we must get out ahead of these trends rather than belatedly reacting to them.

If you’d like to help make sure that AI progress benefits all of humanity⁠, join us⁠ at OpenAI. Our research and engineering roles range from machine learning researchers⁠ to policy researchers⁠ to infrastructure engineers⁠.

Loading...

Addendum: Compute used in older headline results

We’ve updated our analysis⁠ with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute-usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train modelsB over the last half-century.

Two distinct eras of compute usage in training AI systems

Loading...

  • Download…

Excerpt shown — open the source for the full document.