India’s AI push faces linguistic hurdle

Hindi, Bengali, Urdu, Marathi, Telugu and Tamil are among the top 20 languages of the world by number of speakers. Yet there are no robust Indian-language artificial intelligence (AI) tools around. Where and why are we lagging behind?

First things first: What is a large language model (LLM)? An LLM (such as the one behind ChatGPT) is a super-smart AI taught by ‘reading’ billions of books, papers and webpages to converse, write and answer questions nearly like a human! It is, in effect, a ‘text-predicting robot’ that uses enormous amounts of Internet data to chat, summarise or code.

Now, the second essential question: how does an LLM ‘learn’ a language? Think of an LLM as a master chef. To become an expert, a chef must have years of practice cooking thousands of dishes from many cultures, employing hundreds of spices and condiments. Similarly, an LLM must ‘read’ vast quantities of material to understand language. ‘Massive’ is the vital term. If we have a tiny training dataset, the result is much like a home cook who can only follow a few recipes. Such AI stumbles on unfamiliar topics. A large corpus with diverse topics and themes is like a globe-trotting chef trained in global cuisines. AI then becomes adept at answering a wide range of questions. A chef taught to make 10 banal meals versus one trained on 10,000 sophisticated recipes illustrates how LLMs’ efficacy scales up with data. Big data is the ‘new oil’: more text equals better understanding and fewer errors. That is why digital behemoths like Google and OpenAI train on trillions of words: it is the only way to make AI seem human. Therein lies the difficulty for Indian languages.

Imagine a cookbook written in the Indus script. However exquisite the cuisine encoded in the book, it is worthless because we do not know how to read it. Indian languages confront a similar issue. In the early digital age, several actors designed their own typefaces, encoding schemes and keyboard layouts. The same letter had different glyphs (and different underlying codes) across typefaces, and at times there was visual ambiguity due to badly designed fonts. Adding fuel to the fire, many Indian-language digital developers also came out with different keyboard layouts. All this makes it hard to harvest the early digital data, severely limiting the availability of training datasets in Indian languages.
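For the technically curious, a small Python sketch (using the standard unicodedata module) shows the kind of duplication that survives even within Unicode: the same Devanagari letter can be stored either as a single precomposed code point or as a base letter plus a combining sign, and the two do not match as text until they are normalised. Legacy pre-Unicode font encodings are messier still; this is only the simplest case.

    import unicodedata

    # Two ways of storing the Devanagari letter "qa" in Unicode text:
    # a single precomposed code point, or KA plus a combining nukta sign.
    precomposed = "\u0958"        # DEVANAGARI LETTER QA
    combined = "\u0915\u093C"     # DEVANAGARI LETTER KA + SIGN NUKTA

    print(precomposed == combined)   # False: the stored code points differ
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", combined))   # True once normalised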

Dialect differences (Bhojpuri vs Hindi, Dakhini vs Urdu) and the frequent blending of English and regional languages (e.g., Hinglish, Tanglish) exacerbate the problem.

LLMs are similar to Google’s autocomplete in creating plausible-sounding text every time. As a result, LLMs can be like a ‘know-it-all friend’ who would rather bluff than admit they are clueless. Fun for stories but perilous for facts. If you ask, “When did India land on Mars?” you would most certainly be given a date/year such as 2035 or September 24, 2014 (when Mangalyaan entered Mars’ orbit). LLMs are taught to predict the next word — not to judge truthfulness. In AI terminology, LLMs are more likely to ‘hallucinate’ and create false or manufactured information when training data is limited, low-quality or unrepresentative.

Typically, most data in Indian languages comes from news or government sources, which lack diversity. Very little healthcare, legal or casual discourse is available. Even now, a digital divide persists, and many speakers of Indian languages are not online, resulting in skewed datasets and biased representation. With minimal digital oversight, the risk of misrepresentation, such as the creation of fake material, is significant. The solution for Indian languages involves larger datasets, local grounding and human-AI collaboration.

Indian languages are highly inflectional, with complicated verb conjugations, noun cases and agglutination. Agglutinative morphology (long compound words built from many suffixes) splits into strings of near-meaningless subwords, making tokenisation, a critical step in the technology pipeline, problematic.
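A minimal sketch makes the point, assuming the openly available tiktoken library (the byte-pair-encoding tokeniser behind recent OpenAI models) and illustrative sample words: an ordinary Indic word typically splits into far more subword tokens than a comparable English word.

    import tiktoken

    # Byte-pair-encoding tokeniser used by recent OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    # Illustrative words of roughly the same meaning ("unbelievable").
    samples = {"English": "unbelievable",
               "Hindi": "अविश्वसनीय",
               "Tamil": "நம்பமுடியாதது"}

    for language, word in samples.items():
        tokens = enc.encode(word)
        # Indic words usually break into many short, near-meaningless subwords.
        print(language, len(tokens), "tokens")

Running this typically shows the Indic words costing several times as many tokens as the English word, which makes models slower, costlier and less accurate in those languages.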

As training data is limited, existing LLMs such as ChatGPT are employed as a workaround to build Indian-language models. Because English makes up more than 60 per cent of their training data, such hybrid models may fail to grasp linguistic nuances.
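For the technically inclined, here is a minimal sketch of what such adaptation can look like, assuming the Hugging Face transformers, peft and datasets libraries, a LLaMA-style base model and a placeholder Hindi text file (the model name and file path are illustrative, not a prescription). The idea is to attach small trainable ‘adapter’ weights to an existing model rather than train one from scratch.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "meta-llama/Llama-2-7b-hf"            # illustrative base model
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach small low-rank adapters instead of updating all the weights.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    # Placeholder corpus: one Hindi sentence per line.
    data = load_dataset("text", data_files={"train": "hindi_corpus.txt"})
    data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="indic-llama",
                               per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=data["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

Approaches of this kind cut the data and compute needed, but, as noted above, they cannot fully compensate for English-heavy pre-training.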

Outside the Western world, China, Korea and Japan dominate the Asian field. They avoided the typeface issues that plague Indian languages by implementing top-down standardisation and script consistency early. The Chinese government mandated and enforced GB18030 (which covers the full Unicode character set), pushing global software vendors to support Chinese text early (Windows 2000 and later versions shipped with such support). In addition, in the wake of the US CLOUD Act, 2018, China locked down data generated on its soil, securing data sovereignty. The locally stored corpus became a valuable advantage for China's Big Five, Baidu, Alibaba, Tencent, Huawei and ByteDance, in creating Chinese LLMs. Baidu's Ernie 4.0 is said to rival GPT-4 on Chinese-language tasks, though it is little seen in the West.
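To see what that standardisation buys, consider a minimal Python sketch using the standard library's built-in gb18030 codec: text stored under the Chinese national standard round-trips losslessly to and from Unicode, which is part of why China's legacy digital corpus was easy to harvest.

    # GB18030 covers the full Unicode repertoire, so Chinese text stored under
    # the national standard converts losslessly to and from Unicode strings.
    text = "人工智能"                           # "artificial intelligence"
    gb_bytes = text.encode("gb18030")           # bytes as stored under the standard
    print(gb_bytes.decode("gb18030") == text)   # True: nothing is lost either way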

Not every country responded, or could respond, the way China did; nonetheless, nations like South Korea created competitive LLMs (Naver's HyperCLOVA and LG's Exaone) without tight data-sovereignty regulations or severely controlled data flows.

Korean LLMs employ global datasets (e.g., CommonCrawl, Wikipedia) for broad knowledge and localised Korean data (Naver searches, Kakao conversations, K-pop subtitles) for linguistic and cultural subtlety. They prioritise niche applications where they can excel, such as AI-generated subtitles, fan interactions for K-pop and K-drama, Korean-English translation (where they reportedly outperform GPT-4) and enterprise automation, as with LG's Exaone.

While China built the ‘Great Wall’ around its data, Korea succeeded in AI, as in K-pop, by mixing global trends with local flair. India's desi efforts (AI4Bharat's IndicBERT from IIT Madras, the Government of India's Bhashini, translation-focused LLMs, Sarvam AI's OpenHathi, Google's MuRIL and Microsoft's Shiksha) are silver linings on the horizon. While some projects rely on crowdsourced datasets, others employ unique strategies, such as fine-tuning global models like LLaMA-2 for Indian languages using transfer learning.

As in Korea, India's domestic demand for AI is low, necessitating significant government backing. US President Trump has promised a $500-billion investment in AI. Beijing intends to invest $1.4 trillion over the next 15 years as it battles Washington for supremacy. South Korea's AI spending (2021-26) is $1.5 billion, for one language. India's commitment of $1 billion over the next five years for IndiaAI, with an ambition to cover 22 languages, pales in comparison.
