InstaDeep and iCompass Announce TunBERT – The First AI-based Tunisian Dialect System

TUNIS, TUNISIA, 15 MARCH 2021: InstaDeep and iCompass today announced a collaboration on a Natural Language Processing (NLP) model for underrepresented languages. The project applies the latest advances in AI and Machine Learning (ML) to explore and strengthen research in the fast-emerging Tunisian AI tech ecosystem.

The NLP project consists of developing a language model for the Tunisian dialect, TunBERT, and evaluating it on several tasks such as sentiment analysis, dialect classification, reading comprehension, and question answering. “We’re excited to reveal TunBERT, a joint research project between iCompass and InstaDeep that redefines state-of-the-art for the Tunisian dialect. This work also highlights the positive results that are achieved when leading AI startups collaborate, benefiting the Tunisian tech ecosystem as a whole”, said Karim Beguir, CEO and Co-Founder of InstaDeep.

Empower underrepresented languages

Bidirectional Encoder Representations from Transformers (BERT) has become a state-of-the-art model for language understanding. Following its success, models have been trained for Indo-European languages such as English, French and German, but similar research for underrepresented languages remains sparse and at an early stage. Alongside jointly writing and debugging the code, iCompass and InstaDeep’s research engineers have launched multiple successful experiments. “This fruitful collaboration aims to push forward and advance the development of AI research in the emerging and prominent field of NLP and language models. Our ultimate goal is to empower Tunisian talent and foster an environment where AI innovation can grow, and together our teams are pushing boundaries”, said Dr. Hatem Haddad, CTO and Co-Founder of iCompass.

NeMo toolkit

TunBERT was developed with NVIDIA’s NeMo toolkit, which the research team used to adapt and fine-tune the neural network on relevant data and to pre-train the language model on a large-scale Tunisian corpus, taking advantage of the BERT model optimised by NVIDIA. Pre-training and fine-tuning were distributed across multiple NVIDIA V100 GPUs, which helped both steps converge faster. The implementation trained more efficiently by using Tensor Core mixed-precision capabilities together with the NeMo toolkit. Through this approach, the contextualized text representation model learned an effective embedding of the natural language, making it machine-understandable and achieving strong performance results. Compared with the original BERT implementation, the NVIDIA-optimised BERT model performs better on the different downstream tasks while using the same compute power.

NVIDIA GTC

InstaDeep, a member of NVIDIA Inception (an acceleration program designed to nurture AI startups), has been accepted to present this research at the upcoming NVIDIA GPU Technology Conference (GTC) in April, in a talk titled “Building a Pre-Trained Contextualized Text Representation Model for Underrepresented Languages: Tunisian Dialect Use Case”. The session will be jointly presented by Nourchene Ferchichi of InstaDeep and Dr. Hatem Haddad of iCompass. GTC is a free event and will take place online from 12 to 16 April 2021. Register here to attend.