SITE Magazine - Enhancing Understanding: India turns to AI to capture its 121 languages

December 14, 2023

Reena Chandran

Bengaluru, India
Thomson Reuters Foundation

For a few weeks this year, villagers in the southwestern Indian state of Karnataka read dozens of sentences in their native Kannada language in an app as part of a project to build the country’s first AI-based chatbot for tuberculosis.

There are more than 40 million native speakers of Kannada in India, and it is one of the country’s 22 official languages and one of more than 121 languages spoken by 10,000 or more people in the world’s most populous country.

A man rides a bicycle in the Indian state of Karnataka. Photo: Alka Jha/Unsplash

But few of these languages are covered by natural language processing (NLP), a branch of artificial intelligence that enables computers to understand text and spoken words.

Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.

“For AI tools to work for everyone, they also need to meet the needs of people who don’t speak English, French, or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India.

“But if we had to collect as much data in Indian languages as went into a large language model like GPT, we would have to wait another 10 years. So, what we can do is create layers on top of generative AI models like ChatGPT or Llama,” Bali told Context.

Villagers in Karnataka state are among thousands of speakers of various Indian languages creating speech data for technology company Karya, which builds datasets for companies like Microsoft and Google to use in artificial intelligence models for education, healthcare and other services.

The Indian government, which aims to deliver more services digitally, is also building language datasets through Bhashini, an AI-based language translation system that creates open-source datasets in local languages for building AI tools.

The platform includes a crowdsourcing initiative for people to contribute sentences in different languages, verify the authenticity of audio or written text by others, translate texts and classify images.

Indian languages online

Tens of thousands of Indians have contributed to Bhashini.

“The government is aggressively pushing to create datasets to train large language models in Indian languages, which are already being used in translation tools for education, tourism and courts,” said Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai.

“But there are many challenges: Indian languages have mainly oral traditions, electronic records are not abundant, and there is a lot of code mixing. Also, collecting data in less common languages is difficult, and requires special effort.”

Of the more than 7,000 living languages in the world, fewer than 100 are covered by major NLP systems, with English being the most advanced.

ChatGPT, whose launch last year sparked a wave of interest in generative AI, is trained primarily on English. Google’s Bard is limited to English, and of the nine languages Amazon’s Alexa can respond to, only three are non-European: Arabic, Hindi and Japanese.

Governments and startups are trying to fill this gap.

The grassroots organization Masakhane aims to boost NLP research in African languages, while in the UAE, a new large language model called Jais could support generative AI applications in Arabic.

Bali, who was named among the 100 most influential people in AI by TIME magazine in September, said that for a country like India, crowdsourcing is an effective way to collect speech and language data.

“Crowdsourcing also helps capture linguistic, cultural, social and economic nuances,” Bali said.

“But there must be awareness of gender, racial, and socioeconomic bias,” she added, “and it must be done ethically, by educating workers, paying them, and making a specific effort to collect smaller languages. Otherwise it doesn’t fit.”

With the rapid growth of artificial intelligence, there is demand for languages “that we haven’t even heard of,” including from academics looking to preserve them, said Safiya Hussain, co-founder of Karya.

Karya works with non-profit organizations to identify workers who are below the poverty line, or whose annual income is less than $325, and pays them about $5 an hour to generate data, well above India’s minimum wage.

Hussain said workers own part of the data they produce so they can earn royalties, and there is potential to build AI products for society using that data, in areas such as healthcare and agriculture.

“We see huge potential to add economic value through speech data. An hour of Odia speech data used to cost about $3-$4, now it costs $40,” she added, referring to the language of the eastern state of Odisha.

A man uses a mobile phone in Jaipur, India. Photo: Annie Spratt/Unsplash

Less than 11 percent of India’s 1.4 billion people speak English. Much of the population is not comfortable reading and writing, so many AI models focus on speech and speech recognition.

The Google-funded Vaani, or Voice, project is collecting speech data from about a million Indians and opening it up for use in automatic speech recognition and speech-to-speech translation.

Bengaluru-based EkStep’s AI-based translation tools are being used in the Supreme Courts of India and Bangladesh, while the government-backed AI4Bharat Center has launched Jugalbandi, an AI-based chatbot that can answer questions about welfare schemes in several Indian languages.

The bot, named after a duet in which two musicians play off each other, uses language models from AI4Bharat and reasoning models from Microsoft, and can be accessed via WhatsApp, which is used by about 500 million people in India.

Gram Vaani, or Village Voice, a social enterprise that works with farmers, is also using AI-based chatbots to answer questions about welfare benefits.

“Automatic speech recognition technologies help in alleviating language barriers and providing communication at the grassroots level,” said Shobhimoy Kumar Garg, Product Manager, Gram Vaani.

“They will help empower communities that need it most.”

For Swarnalatha Nayak in Odisha’s Raghurajpur district, the growing demand for speech data in her native Odia also means much-needed additional income from her work for Karya.

She said: “I work at night when I am free. I can support my family by talking on the phone.”
