
17th April, 2024

Introducing the "Bhasha" Series 🚀: Advancements in Indic Language AI Datasets


Soket Labs is pleased to announce 🥳 the release of the "Bhasha" series, commencing with two significant datasets: "bhasha-wiki" and "bhasha-wiki-indic". These datasets are engineered to support the development of AI models that are attuned to the linguistic and cultural nuances of India 🫡, representing a crucial step forward in the diversification of linguistic resources in computational linguistics. By making these datasets openly available, we aim to foster a collaborative environment where developers and researchers across India can contribute to and benefit from inclusive and contextually aware AI technologies.

Stay tuned for exciting updates by following us on LinkedIn.

Bhasha-wiki: A Comprehensive Corpus for Indic Language Research

Available on Huggingface 🤗: soketlabs/bhasha-wiki

The "bhasha-wiki" dataset presents a comprehensive corpus of roughly 44.4 million article rows: 6.3 million English Wikipedia articles together with their translations into six major Indian languages. This corpus, encompassing over 45.1 billion tokens in total (about 39.7 billion of them in the six Indic languages), serves as a foundational resource for linguistic and AI research, facilitating a wide range of studies into machine translation, natural language processing, and language model training.
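
For readers who want to inspect the corpus, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that network access is available. The repository id is the one listed above; the helper name `preview_rows`, the use of streaming, and the "train" split are illustrative assumptions, and the configuration names are queried from the Hub rather than guessed, since this post does not list them.

```python
from itertools import islice

REPO_ID = "soketlabs/bhasha-wiki"  # repository id as listed in this post


def preview_rows(config_name=None, n=3):
    """Stream the first n rows of one configuration instead of downloading everything.

    Assumption: the repository exposes a "train" split; adjust if it does not.
    """
    # Imported lazily so this module can be loaded without the library present.
    from datasets import get_dataset_config_names, load_dataset

    if config_name is None:
        # Ask the Hub which configurations exist instead of hard-coding one.
        config_name = get_dataset_config_names(REPO_ID)[0]
    stream = load_dataset(REPO_ID, config_name, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `preview_rows()` and printing `sorted(row.keys())` for each returned row is a quick way to discover the column names before committing to a full download.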

Dataset Characteristics:

  • Extensive Lexical Volume: The corpus is substantial, with a total size of 117 GiB, containing 44,418,479 rows and over 20 billion words.
  • Linguistic Diversity: This dataset supports a multilingual framework, including Hindi, Gujarati, Urdu, Tamil, Kannada, Bengali, and English, crucial for cross-linguistic studies.
  • Translation Methodologies: Utilising IndicTrans2, backed by significant computational resources (3360 GPU-hours on AWS), each article was translated with high fidelity to the original content. Segmentation and translation were handled sentence-by-sentence, with adaptations made for longer sentences to maintain semantic integrity.
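
The sentence-by-sentence approach can be sketched roughly as follows. This is not the pipeline used for the dataset: the post does not say how longer sentences were adapted, so the fixed word-count chunking here is an assumption, the regex splitter is deliberately naive, and `translate_fn` is a hypothetical stand-in for a call into a translation model such as IndicTrans2.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_long_sentence(sentence, max_words=40):
    """Split an over-long sentence into consecutive chunks of at most max_words words."""
    words = sentence.split()
    if len(words) <= max_words:
        return [sentence]
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def translate_article(text, translate_fn, max_words=40):
    """Translate sentence by sentence, chunking long sentences before translating."""
    translated = []
    for sentence in split_sentences(text):
        pieces = [translate_fn(chunk) for chunk in chunk_long_sentence(sentence, max_words)]
        translated.append(" ".join(pieces))
    return " ".join(translated)
```

With a real model, `translate_fn` would wrap batched inference; passing `str.upper` is enough to check that segmentation and reassembly behave as expected.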

Language       Sentences       Characters           Words          Tokens        Rows
-------------------------------------------------------------------------------------
english      149,636,946   19,009,297,439   2,954,105,643   5,430,358,976   6,345,497
hindi        149,636,946   18,622,892,252   3,382,736,074   6,635,241,630   6,345,497
kannada      149,636,946   19,679,016,421   2,349,908,384   6,083,839,825   6,345,497
bengali      149,636,946   18,741,174,694   2,663,832,869   8,248,287,687   6,345,497
gujarati     149,636,946   18,453,210,446   2,867,239,209   6,032,149,490   6,345,497
tamil        149,636,946   21,457,803,696   2,441,061,609   6,777,927,962   6,345,497
urdu         149,636,946   17,921,351,051   3,641,717,085   5,966,954,204   6,345,497
-------------------------------------------------------------------------------------
Total      1,047,458,622  133,884,745,999  20,300,600,873  45,174,759,774  44,418,479

Note: Tokens are calculated using the pragna-1b tokenizer.
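
As a worked check on the table, the short sketch below recomputes the totals from the per-language figures and derives a tokens-per-word ratio for each language. The ratio is our own illustrative metric, not one reported by Soket Labs, and the variable names are assumptions.

```python
# Per-language (words, tokens) as reported in the table.
STATS = {
    "english":  (2_954_105_643, 5_430_358_976),
    "hindi":    (3_382_736_074, 6_635_241_630),
    "kannada":  (2_349_908_384, 6_083_839_825),
    "bengali":  (2_663_832_869, 8_248_287_687),
    "gujarati": (2_867_239_209, 6_032_149_490),
    "tamil":    (2_441_061_609, 6_777_927_962),  # token value consistent with the reported column total
    "urdu":     (3_641_717_085, 5_966_954_204),
}
ROWS_PER_LANGUAGE = 6_345_497
SENTENCES_PER_LANGUAGE = 149_636_946

# Column totals recomputed from the individual rows.
total_words = sum(words for words, _ in STATS.values())
total_tokens = sum(tokens for _, tokens in STATS.values())
total_rows = ROWS_PER_LANGUAGE * len(STATS)
total_sentences = SENTENCES_PER_LANGUAGE * len(STATS)

# Tokens in the six Indic languages only (everything except English).
indic_tokens = total_tokens - STATS["english"][1]

# Average number of tokenizer tokens produced per word, per language.
tokens_per_word = {lang: tokens / words for lang, (words, tokens) in STATS.items()}

for lang, ratio in sorted(tokens_per_word.items(), key=lambda kv: kv[1]):
    print(f"{lang:<9}{ratio:.2f} tokens per word")
```

On these figures Urdu has the lowest tokens-per-word ratio and Bengali the highest, which gives a rough sense of how the tokenizer's efficiency varies across scripts.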

The character, word, and token distributions for each language are shown in the image.

bhasha-wiki on huggingface

Bhasha-wiki-indic: Tailored Dataset for Enhanced Indian Contextual Relevance

Available on Huggingface: soketlabs/bhasha-wiki-indic

The "bhasha-wiki-indic" dataset, a refined subset of "bhasha-wiki", is specifically curated to enrich models with a deeper understanding of the Indian context. This subset was selected to include content with significant relevance to India, enhancing the potential for developing culturally resonant AI applications.

Methodology:

  • Focused Semantic Filtering: Initial filtering employed keyword detection ('india' or 'indian'), refined by a topic classifier that achieved 84% accuracy in distinguishing relevant content.
  • Content Extraction and Processing: Approximately 208,000 articles were identified as contextually relevant and subsequently extracted for six Indian languages, preparing this dataset as a specialised tool for AI models requiring deep cultural comprehension.
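
The two-stage filter described above might look something like this. It is a sketch under stated assumptions: whole-word, case-insensitive matching is our choice (the post only names the keywords), and `is_india_topic` is a hypothetical callable standing in for the topic classifier, whose model and features are not described here.

```python
import re

# Assumption: match the keywords as whole words, ignoring case.
KEYWORD_RE = re.compile(r"\b(india|indian)\b", re.IGNORECASE)


def has_keyword(text):
    """Stage 1: cheap keyword check for 'india' or 'indian'."""
    return KEYWORD_RE.search(text) is not None


def select_relevant(articles, is_india_topic):
    """Stage 2: keep keyword hits that the topic classifier also accepts.

    is_india_topic is a hypothetical callable (text -> bool), not the
    classifier actually used to build the dataset.
    """
    return [a for a in articles if has_keyword(a) and is_india_topic(a)]
```

Running the keyword check first keeps the more expensive classifier off the bulk of the 6.3 million articles, while the classifier removes keyword hits that only mention India in passing.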

Dataset Specifications:

  • Content Volume: The dataset comprises 200,820 rows with nearly 1.54 billion tokens distributed among several languages, providing a rich linguistic base for detailed computational analysis.

Contributions to the Field and Future Directions

These datasets are expected to significantly impact research in computational linguistics and AI by providing high-quality, large-scale resources for training models that require a nuanced understanding of Indian languages and contexts. They also serve as a platform for further scholarly inquiry into algorithmic translations and cultural specificity in AI technologies.

Open Access and Collaborative Engagement

Released under the CC-BY-SA-3.0 licence, these datasets support both academic and commercial use, promoting wide dissemination and application in diverse settings. We invite the global research community to engage with these resources, further enriching the datasets and exploring new frontiers in AI research.

As we continue to develop the "Bhasha" series, we remain committed to advancing the state of AI with a focus on ethical considerations and inclusivity in technology.

About Soket Labs:

Soket Labs, a visionary AI research firm, is at the forefront of promoting advancements towards ethical Artificial General Intelligence (AGI). Our mission is to foster a form of general intelligence that excels in efficiency and accessibility, thereby democratising cutting-edge technology for diverse applications, including autonomous robots, edge devices, and large clusters.