Scalable Reinforcement Learning on Cloud TPU



InstaDeep used reinforcement learning on Cloud TPUs to improve DeepPCB, its AI-driven Printed Circuit Board (PCB) design product.


Armand Picard, Software Engineer

Donal Byrne, Senior Research Engineer

Alexandre Laterre, Research Lead 

Vaibhav Singh, Product Manager, Google Cloud TPU

The long-term goal of Artificial Intelligence (AI) is to solve complex real-life problems. On the path to intelligence, Reinforcement Learning (RL), a type of Machine Learning algorithm that learns optimal decision-making by interacting with an environment to maximise a reward, has made significant contributions. The first came through groundbreaking achievements in games, exemplified by RL-based systems defeating world champions in Go, StarCraft 2 and Dota 2. More recently, Reinforcement Learning has accomplished remarkable feats in the real world, such as navigating stratospheric balloons, controlling the tokamak plasma of a nuclear fusion reactor and even discovering novel algorithms with AlphaDev.

However, despite notable successes, RL is not yet widely adopted for real-world applications. One decisive factor is the engineering challenge: high-performing RL models often require large-scale simulation and training of the decision-making process. For example, OpenAI Five was trained at an unprecedented scale, using thousands of GPUs over 10 months at a rate of 1 million frames per second [1]. This ultimately enabled its remarkable performance on Dota 2. These results echo the scaling laws recently discovered in Natural Language Processing and Computer Vision, behind the success of PaLM 2 and ChatGPT. Indeed, “scale” has become an essential component for achieving success in Artificial Intelligence.

In this article, we will explore how Cloud TPUs can help scale Reinforcement Learning for both research and industry. In particular, we show how Cloud TPUs enabled a throughput increase of up to 235x for DeepPCB, an AI-based Printed Circuit Board (PCB) design product developed by InstaDeep, cutting training costs by almost 90%.

Scaling Reinforcement Learning

RL agents learn by acting in a simulated environment to discover how to solve a problem without external examples or guidance. This process of trial and error means RL training infrastructure needs both to simulate a vast number of agent-environment interactions and to update the agent (often a neural network) based on the gathered experiences.


Figure 1: Standard RL training loop. An agent takes in the latest information from the environment (observation and reward) and uses this to choose the next action to take. The action is passed to the environment, which then carries out the next step in the simulation. New information is passed back to the agent which then learns from observations and rewards. 
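The loop described in the caption above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used for the results in this post: `CoinFlipEnv` and `BanditAgent` are hypothetical toy classes invented here, and the "learning" is a simple running average rather than a neural network update.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, episode_length=10, seed=0):
        self.p_heads = p_heads
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the observation carries no information in this toy task

    def step(self, action):
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return 0, reward, done


class BanditAgent:
    """Hypothetical epsilon-greedy agent keeping a running mean reward per action."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, observation):
        if self.rng.random() < self.epsilon:  # explore
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])  # exploit

    def learn(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def train(num_episodes=200):
    env, agent = CoinFlipEnv(), BanditAgent()
    for _ in range(num_episodes):
        observation, done = env.reset(), False
        while not done:
            action = agent.act(observation)               # agent chooses an action
            observation, reward, done = env.step(action)  # environment advances one step
            agent.learn(action, reward)                   # agent learns from the reward
    return agent
```

Calling `train()` returns an agent whose value estimate for guessing heads should end up higher than for tails. Every interaction here happens sequentially on one CPU thread, which is exactly the bottleneck that the distributed architectures discussed next try to remove.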

Multiple techniques have been employed to enhance data collection throughput. A popular method involves decomposing the agent into an “Actor”, responsible for generating and collecting data through an environment, and a “Learner”, responsible for using the actors’ data to update the agent. This architecture has several variants, each with distinct pros and cons. For instance, SEED-RL focuses on a centralised inference design, sharing a hardware accelerator for batch inference and learning. This allows SEED-RL to reduce the communication overhead of model retrieval and accelerate acting inference by utilising hardware accelerators. Menger takes an alternative approach, focusing on localised inference. Actors are replicated across a pool of CPUs to generate data that is then provided to the learner through a sharded replay buffer. This removes the overhead of sending large batches of observations to a centralised server, at the cost of sending model updates to the CPU-based actors.

Overall, the efficiency of an RL system is inseparable from its hardware deployment. An effective distributed architecture must consider multiple factors such as the available hardware (e.g. accelerator type, connectivity, processor clock-speed, shared memory access), the environment complexity (e.g. simulation time), as well as the algorithm itself (e.g. on/off-policy). In order to maximise computational efficiency, the system must address the challenging task of balancing data communication, resource utilisation and algorithmic constraints. 

What are Google Cloud TPUs? 

Tensor Processing Units (TPUs) are purpose-built AI accelerators designed for large-scale, high-performance computing applications, such as training large machine learning models. TPUs stand out due to their optimised domain-specific architecture, designed to accelerate the tensor operations underpinning modern neural network computations. This includes high memory bandwidth, dedicated Matrix Multiply Units (MXUs) for dense linear algebra, and specialised cores to speed up sparse models.

TPU pods are clusters of interconnected TPU chips. They leverage high-speed interconnects that allow smooth communication and data sharing between chips, creating a system that offers immense parallelism and computational power. Google’s TPU v4 pods can combine up to 4,096 of these chips, delivering a staggering peak performance of 1.1 ExaFLOPS [2]. TPU v4 also uses optical circuit switches (OCS) to dynamically reconfigure this 4,096-chip cluster into smaller TPU slices. Additionally, thanks to Google’s integrated approach to data centre infrastructure, TPUs can be 2-6 times more energy-efficient than other Domain Specific Architectures (DSAs) run in conventional facilities, cutting carbon emissions by up to 20 times [2].

Using Cloud TPU to Scale Reinforcement Learning

Figure 1: Sebulba architecture depicting the placement of the Learner Cores (Yellow) and Actor Cores (Red) on the TPU, as well as the environments on host CPUs (Green). The environments send their observations and rewards to the Actor cores, which respond with the next actions to take. While doing so, the actors build up a buffer of trajectories. Concurrently, the learner pulls batches of trajectories from the actors’ buffer and carries out learning updates, pushing new model parameters to the actor devices.

TPU pods offer an innovative solution to the unique challenges of scaling RL systems. Their highly interconnected architecture and specialised cores allow for rapid data transfer and parallel processing, without the additional overhead present in traditional RL architectures. One such approach is the Sebulba architecture introduced in Podracer, which utilises an actor-learner decomposition to efficiently generate experience. The Sebulba architecture is designed to support arbitrary environments and co-locates acting and learning on a single TPU machine, maximising the utilisation of TPU resources.

Sebulba divides the 8 available TPU cores of a single machine into two groups: actors and learners. Acting involves interacting with batches of environments in order to generate experiences for the learning process. Sebulba uses multiple Python threads for this purpose, with each thread responsible for a batch of environments. These threads run in parallel, sending batches of observations to a TPU core, which then selects the next actions. The environments are stepped using a shared pool of C++ threads to minimise the impact of Python’s Global Interpreter Lock (GIL). To ensure the TPU cores remain active while environments are being stepped, multiple Python threads are used for each actor core.
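The acting side can be pictured with a small threading sketch. It is an illustration under loud assumptions, not the Sebulba implementation: `batched_policy` stands in for batched inference on a TPU actor core, `step_envs` stands in for the C++ environment pool, and all constants and function names are hypothetical.

```python
import queue
import threading

# Hypothetical sizes chosen only to keep the sketch small.
NUM_ACTOR_THREADS = 4
ENVS_PER_THREAD = 8
TRAJECTORY_LENGTH = 5
TRAJECTORIES_PER_THREAD = 3


def batched_policy(observations, params):
    """Stand-in for batched inference on an actor core (pure Python here)."""
    return [(obs + params["offset"]) % 2 for obs in observations]


def step_envs(observations, actions):
    """Stand-in for the C++ environment pool: steps a whole batch at once."""
    rewards = [float(action == obs % 2) for obs, action in zip(observations, actions)]
    next_observations = [obs + 1 for obs in observations]
    return next_observations, rewards


def actor_thread(thread_id, params, trajectory_queue):
    """Each thread owns a batch of environments and emits fixed-length trajectories."""
    observations = [thread_id + i for i in range(ENVS_PER_THREAD)]
    for _ in range(TRAJECTORIES_PER_THREAD):
        trajectory = []
        for _ in range(TRAJECTORY_LENGTH):
            actions = batched_policy(observations, params)
            next_observations, rewards = step_envs(observations, actions)
            trajectory.append((observations, actions, rewards))
            observations = next_observations
        trajectory_queue.put((thread_id, trajectory))  # hand over to the learner


def run_actors():
    params = {"offset": 0}
    trajectory_queue = queue.Queue()
    threads = [
        threading.Thread(target=actor_thread, args=(i, params, trajectory_queue))
        for i in range(NUM_ACTOR_THREADS)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = []
    while not trajectory_queue.empty():
        collected.append(trajectory_queue.get())
    return collected
```

Running several such threads per actor core is what keeps the accelerator busy: while one thread waits for its environments to step, another can submit its batch of observations for inference. In this pure-Python stand-in the GIL would serialise the work, which is why the real system steps environments in C++.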

The learning part of Sebulba involves processing the experience generated by the actors to improve the agent’s decision-making. Each actor thread accumulates a batch of fixed-length trajectories on the TPU core, divides it into smaller shards, and sends them to the learner cores through a fast device-to-device communication channel. A single learner thread on the host then processes the data, which is already distributed across the learner cores. Using the JAX `pmap` operation, each learner core applies the update function to its shard of experience. The parameter updates are averaged across all learner cores using JAX’s `pmean`/`psum` primitives, keeping the learner cores in sync. After each update, the new parameters are sent to the actor cores, allowing the actor threads to use the latest parameters for the next inference step.
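The learner update described above can be sketched with `jax.pmap` and `jax.lax.pmean`. This is a minimal sketch under stated assumptions rather than the released codebase: the least-squares `loss_fn` is a stand-in for an RL loss such as PPO, the data is synthetic, `LEARNING_RATE` and the axis name `"learners"` are arbitrary choices, and for simplicity every local device acts as a learner (Sebulba reserves some cores for acting).

```python
import jax
import jax.numpy as jnp

LEARNING_RATE = 0.1  # hypothetical value for this sketch


def loss_fn(params, batch):
    """Stand-in loss (least squares); a real agent would use an RL loss."""
    predictions = batch["observations"] @ params["w"]
    return jnp.mean((predictions - batch["targets"]) ** 2)


def update(params, batch):
    """Runs on every learner core with that core's shard of experience."""
    grads = jax.grad(loss_fn)(params, batch)
    # Average gradients across all learner cores so parameters stay in sync.
    grads = jax.lax.pmean(grads, axis_name="learners")
    return jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, grads)


# One program, replicated over devices; the leading array axis indexes devices.
parallel_update = jax.pmap(update, axis_name="learners")


def train_learner(num_steps=50):
    num_cores = jax.local_device_count()
    # Replicated parameters: one identical copy per learner core.
    params = {"w": jnp.zeros((num_cores, 4))}
    # Synthetic "experience", already sharded along the leading device axis.
    observations = jax.random.normal(jax.random.PRNGKey(0), (num_cores, 16, 4))
    targets = observations @ jnp.array([1.0, -2.0, 0.5, 3.0])
    batch = {"observations": observations, "targets": targets}

    def shard_loss(p):
        # Loss of the first core's parameters on the first core's shard.
        return float(loss_fn({"w": p["w"][0]}, {k: v[0] for k, v in batch.items()}))

    initial_loss = shard_loss(params)
    for _ in range(num_steps):
        params = parallel_update(params, batch)
    return initial_loss, shard_loss(params), params
```

Because every core applies the same averaged gradient to the same starting parameters, the replicas remain identical after each step without any explicit broadcast. The sketch also runs on a machine with a single device, in which case the average is taken over one shard.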

The Sebulba architecture is ideal for large-scale RL as it addresses many of the engineering bottlenecks we face when scaling.

  1. Reduced Communication Overhead: By co-locating acting and learning on the same TPU, coupled with fast device-to-device communication, Sebulba minimises common bottlenecks associated with data transfer such as parameter updates and sending collected data to the learner.
  2. High Parallelisation: The Sebulba architecture leverages the parallel processing capabilities of TPUs, complemented by the pool of C++ environments. This allows Sebulba to efficiently handle multiple environments at once. The concurrent processing of both trajectory data from the environments and learning steps significantly accelerates the overall reinforcement learning process.
  3. Scalability: The Sebulba architecture is designed to seamlessly scale. As RL tasks become more demanding, Sebulba can easily adapt to larger TPU configurations, paving the way for real-world applications.

Figure 2: The effects of scaling batch size ranging from 32-256 across TPU v2 (yellow), v3 (red) and v4 (blue).

We tested the architecture on the classic Atari benchmark, using the industry-favoured PPO algorithm. Figure 2 shows the Frames Per Second (FPS) Sebulba reaches across different Cloud TPU generations, peaking at over 300k FPS on a single TPUv4. Increasing the batch size consistently improves the agent’s throughput but yields diminishing returns as we scale. The available on-chip memory can also become a limiting factor: earlier TPU generations are unable to support the largest batch sizes, i.e. 256 per TPU device.

The simplicity of the Sebulba architecture enables seamless scalability by replicating the single-node configuration multiple times across an entire TPU pod. As before, each replica steps its own pool of environments, using its actor cores for inference and its learner cores to process the trajectories generated on its slice of the pod. The only difference is that the gradients computed to update the model parameters are synchronised, using JAX’s collective operations, across all the learner cores of the entire TPU pod rather than just within the individual TPU.

Figure 3: Sebulba on a TPUv4-64, where the cores of each of the 8 replicas are connected with high-speed interconnects that allow for linear scaling.

The experimental results confirm the scaling properties of the Sebulba architecture and the communication efficiency of the TPUs, achieving a near-perfect scaling factor of 0.998. As we increase the number of TPU cores, it reaches an impressive 3.43 million FPS on the Atari benchmark when using a TPUv4-128, as illustrated in Figure 4.

Sebulba’s ability to scale not only speeds up convergence but also increases the capacity to generate and process data, enabling a larger effective batch size for the system. Figure 5 shows this benefit in action: as we scale the number of replicas, the time to convergence (reaching a score of 400 on Breakout) drops to a few minutes and the final training score improves.

These results echo the level of scaling observed in flagship large-scale RL projects such as AlphaStar and OpenAI Five. Surpassing human-level performance requires ever greater amounts of data. By simplifying the scaling of agents, Sebulba can accelerate your research and development, reducing experimentation time and boosting overall performance.

Figure 4: Linear scaling (Factor of 0.995) of the FPS when training PPO on Atari Breakout over multiple TPU hosts.

Figure 5: Learning curves of PPO on Atari Breakout, replicated over multiple TPU hosts and the effect on convergence time. 

Solving PCB Routing with Large-Scale Reinforcement Learning

To further evaluate Sebulba and its scaling efficiency on Cloud TPUs, we applied it to the Place & Route problem for Printed Circuit Boards (PCBs). This problem can be framed as a Reinforcement Learning task whose objective is to optimally wire the components of a PCB while meeting manufacturing standards and passing Design Rule Checks (DRC). InstaDeep has developed DeepPCB, an AI-based product enabling electrical engineers to have their PCBs optimally routed without human intervention. To accomplish this, its product team developed a high-speed simulation engine to efficiently train reinforcement learning agents. However, due to the complexity and size of the problem, commonly used RL libraries fall short in terms of training throughput. When connecting the DeepPCB simulator to Sebulba, we observed a 15x-235x acceleration compared to running with its legacy distributed architecture.

Figure 6 highlights the dramatic speedup Sebulba offers over the baseline legacy system used for DeepPCB. The baseline system takes ~24 hours for a complete training run and costs approximately $260 when using a high-end GPU on Google Cloud Platform. Switching to the Sebulba architecture on Google Cloud TPUs slashes both cost and time: the best configuration cuts training time to just 6 minutes at a mere $20, an impressive 13x cost drop.
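As a quick sanity check, the ratios implied by these figures can be recomputed directly. The inputs below are the rounded numbers quoted in the paragraph above (the baseline values are approximate), so the outputs are indicative only.

```python
# Figures quoted in the text; baseline values are approximate.
baseline_hours = 24.0
baseline_cost_usd = 260.0
sebulba_minutes = 6.0
sebulba_cost_usd = 20.0

cost_ratio = baseline_cost_usd / sebulba_cost_usd                        # 13.0
time_ratio = baseline_hours * 60 / sebulba_minutes                       # 240.0
cost_saving_percent = 100 * (1 - sebulba_cost_usd / baseline_cost_usd)   # about 92.3

print(f"cost: {cost_ratio:.0f}x cheaper ({cost_saving_percent:.1f}% saved), "
      f"time: {time_ratio:.0f}x faster")
```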

In addition, thanks to Sebulba’s linear scaling on Cloud TPUs and the fixed price-per-chip, the training cost remains constant as we scale up to larger TPU pods, all while significantly reducing the time to convergence. Indeed, although doubling the system size doubles the price per hour, this is offset by cutting the time to convergence in half.
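The argument above can be written as a tiny cost model. This is a sketch under the idealised assumption of perfectly linear scaling; the function name `training_cost`, the price of $2 per chip-hour and the 16 single-chip hours are hypothetical values picked for illustration, not Cloud TPU prices.

```python
def training_cost(num_chips, price_per_chip_hour, single_chip_hours, scaling_efficiency=1.0):
    """Return (hours, cost) for one training run on `num_chips` chips.

    `scaling_efficiency` is the fraction of ideal linear speedup achieved;
    1.0 means doubling the chips exactly halves the time to convergence.
    """
    speedup = num_chips * scaling_efficiency
    hours = single_chip_hours / speedup
    cost = price_per_chip_hour * num_chips * hours
    return hours, cost


# Hypothetical prices and durations, for illustration only.
for num_chips in (8, 16, 32):
    hours, cost = training_cost(num_chips, price_per_chip_hour=2.0, single_chip_hours=16.0)
    print(f"{num_chips:>2} chips: {hours:.2f} h, ${cost:.2f}")
```

With `scaling_efficiency=1.0` the chip count cancels out of the cost, so only the wall-clock time changes as the slice grows. In this model, an efficiency below 1.0 raises the cost by the same factor at every size.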

With DeepPCB as a case study, we’ve seen how Cloud TPUs offer cost-effective solutions to real-world decision-making problems. By harnessing the full potential of TPUs, we’re boosting the team’s ability to speed up experiments and enhance system performance. This is critical for research and engineering teams, enabling them to deliver new products, services, and research breakthroughs that were previously out of reach.

Alongside this post, we are pleased to open-source the codebase that was used to generate these results. We hope it provides a great starting point for researchers and industry practitioners eager to integrate Reinforcement Learning into practical applications.


We would like to extend our thanks to Google’s TPU Research Cloud (TRC) Program for having supported this work with Cloud TPUs. We additionally would like to thank (1) Vaibhav Singh for his support and guidance in our collaboration with the Google Cloud team, (2) Imed Ben El Heni for the integration of InstaDeep’s DeepPCB simulation engine, (3) Nidhal Liouane and Aliou Kayantao for the visualisations, (4) Matteo Hessel and the DeepMind team for the original work, and (5) the authors of the EnvPool library used in this project. Finally, this work wouldn’t have been possible without the support and guidance of our co-author and InstaDeep AI Research Lead, Alex Laterre, and the emotional support of Shiro, Alex’s dog 🐕


  1. Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d.O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  2. Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, David Patterson (2023). TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. arXiv preprint arXiv:2304.01433.