InstaDeep’s in-house research team contributes to the latest advancements in AI – from the fundamentals of machine learning to robotics and deep reinforcement learning. Our published research has been presented at world-leading conferences such as NeurIPS 2018.

Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization


A. Laterre,  Y. Fu,  M. K. Jabri,  A-S. Cohen,  D. Kas,  K. Hajjar,  T. S. Dahl,  A. Kerkeni,  K. Beguir

Adversarial self-play in two-player games has delivered impressive results when used with reinforcement learning algorithms that combine deep neural networks and tree search. Algorithms like AlphaZero and Expert Iteration learn tabula-rasa, producing highly informative training data on the fly. However, the self-play training strategy is not directly applicable to single-player games. Recently, several practically important combinatorial optimisation problems, such as the travelling salesman problem and the bin packing problem, have been reformulated as reinforcement learning problems, increasing the importance of enabling the benefits of self-play beyond two-player games. We present the Ranked Reward (R2) algorithm which accomplishes this by ranking the rewards obtained by a single agent over multiple games to create a relative performance metric. Results from applying the R2 algorithm to instances of a two-dimensional and three-dimensional bin packing problems show that it outperforms generic Monte Carlo tree search, heuristic algorithms and integer programming solvers. We also present an analysis of the ranked reward mechanism, in particular, the effects of problem instances with varying difficulty and different ranking thresholds.

Explicit Sequence Proximity Models for Hidden State Identification


Anil Kota, Sharath Chandra, Parag Khanna

| Torbjørn S. Dahl

Sequence similarity is a critical concept for comparing short- and long-term memory in order to identify hidden states in partially observable Markov decision processes. While connectionist algorithms can learn a range of ad-hoc proximity functions, they do not reveal insights and generic principles that could improve overall algorithm efficiency.

Our work uses the instance-based Nearest Sequence Memory (NSM) algorithm as a basis for exploring different explicit sequence proximity models including the original NSM proximity model and two new models, temporally discounted proximity and Laplacian proximity. The models were compared using three benchmark problems, two discrete grid world problems and one continuous space navigation problem. The results show that more forgiving proximity models perform better than stricter models and that the difference between the models is more pronounced in the continuous navigation problem than in the discrete grid world problems.