2023-01-31

To preserve these 'rare' languages and ensure their online presence, researchers at Microsoft's India lab are working to create digital ecosystems for Indian languages that do not yet have an online presence.

Someone who speaks Hindi, English or any other popular language in India can easily search Google in the language they know and are comfortable with. But what about someone from the Munda community in Jharkhand or a Gondi speaker from Madhya Pradesh? Even if these people know English, the threat of their languages being absent from the digital space, and fast disappearing from the physical world, is real.

To preserve 'rare' languages, researchers at Microsoft India are working to create digital ecosystems for Indian languages that do not yet have an online presence.

Project ELLORA (Enabling Low Resource Languages), launched in 2015, has two purposes: to preserve a language for posterity, and to ensure that the users of these languages can interact in the digital world.

Before delving deeper into what Project ELLORA is all about, here is some data to show how languages are slowly being run down by today's fast-paced world.

A language is lost every 2 weeks!

It sounds unreal, but the unfortunate truth of the modern world is that a language is lost somewhere every 14 days. As many as 88 per cent of the world's languages do not have enough presence on the internet. This means that over 20 per cent of the world's population, almost 1.2 billion people, can't use their language to navigate the digital world.

India alone has many languages that are dying and have no presence in the digital world. The country is home to about 1,635 languages, of which 197 are classified as vulnerable by UNESCO. Many more have died out in previous years due to complete apathy.

Take, for instance, the case of Boa Sr. She was the last link to a 65,000-year-old pre-Neolithic culture on the Andaman Islands in the Indian Ocean. When she died in 2010, the Bo language became extinct with her.

Project ELLORA

To bring 'rare' Indian languages online, Microsoft launched Project ELLORA, or Enabling Low Resource Languages, in 2015. Under the project, researchers build digital resources for these languages. They say their goal is to preserve a language for posterity so that its users can "participate and interact in the digital world".

"The way I define my job for myself is that no person in this world should be excluded from using any technology because they speak a different language," says Kalika Bali of Microsoft Research (MSR) India, as quoted on the company's website.

Bali is an expert in Natural Language Processing, a subfield of artificial intelligence that focuses on training computer systems to understand spoken and written languages.

Creating a language database

The team at Microsoft Research Lab in India is working with local communities and native speakers to create the base datasets that will be used to build AI technologies for these languages.

"By involving the community in the data collection process, they [researchers] hope to create a dataset that is both accurate and culturally relevant," the company noted.

The first step of Project ELLORA was to map out what resources were already available for each language, such as printed material, including literature, and the extent of its digital presence.

For Mundari, it started as a simple vocabulary game

The Mundas are a community of about a million people spread across the eastern Indian states of Jharkhand, Orissa and West Bengal. While their language, Mundari, does have a written script, it has negligible digital content and no online presence.

For Mundari, the researchers collaborated with IIT Kharagpur in 2018 and sponsored a study to find out what the community needs to keep the language alive. IIT Kharagpur professors initially worked with community members to help them manually translate sentences from Hindi to Mundari.

What started off as a simple vocabulary game to get school children to learn the language soon morphed into sophisticated technology projects, according to the company. MSR researchers are currently working on a Hindi-to-Mundari text translation model as well as a speech recognition model that will give the community access to more content in Mundari.

The researchers also developed a new technology called Interactive Neural Machine Translation (INMT), which predicts the next word as someone translates between languages and so speeds up the translation process.
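To give a rough feel for what next-word prediction during translation means, here is a minimal Python sketch. It is not INMT's method: it swaps in a simple word-pair (bigram) counter where the real system would use a neural model, and the function names `build_bigram_model` and `suggest_next` and the English placeholder sentences are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def build_bigram_model(sentences):
    """Count which word follows which across already-translated sentences."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def suggest_next(followers, partial_translation, k=3):
    """Suggest up to k words that most often followed the last typed word."""
    words = partial_translation.lower().split()
    if not words:
        return []
    candidates = followers.get(words[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Placeholder sentences standing in for a translator's earlier output
# (hypothetical data, not taken from any real Mundari or Gondi corpus).
corpus = [
    "the river is wide",
    "the river is deep",
    "the forest is green",
]
model = build_bigram_model(corpus)
suggestions = suggest_next(model, "the")
```

In this toy setup a translator who has typed "the" would be offered "river" first, because that pairing appears most often in the earlier sentences. The larger the pool of past translations, the more useful such suggestions tend to become.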

Work on Gondi, Idu Mishmi under way too

Apart from Mundari, Microsoft is also working to give other native languages a digital presence.

The researchers partnered with CGNETSwara and IIIT Naya Raipur to work on the Gondi language. The team produced 60,000 parallel sentences between Gondi and Hindi, which has led to the development of a machine translation service, according to Microsoft.

Gondi, the language of the Gond people, is spoken in six states: Madhya Pradesh, Gujarat, Telangana, Maharashtra, Chhattisgarh and Andhra Pradesh.

Microsoft India's researchers are also working with the Idu Mishmi community in Arunachal Pradesh to create a framework for a digital dictionary for the Idu Mishmi language, which now has fewer than 12,000 speakers.