17th April, 2024
Soket Labs is pleased to announce 🥳 the release of the "Bhasha" series, commencing with two significant datasets: "bhasha-wiki" and "bhasha-wiki-indic". These datasets are engineered to support the development of AI models that are attuned to the linguistic and cultural nuances of India 🫡, representing a crucial step forward in the diversification of linguistic resources in computational linguistics. By making these datasets available in an open-source format, we aim to foster a collaborative environment where developers and researchers across India can contribute to and benefit from inclusive and contextually aware AI technologies.
Stay tuned for exciting updates by following us on LinkedIn.
The "bhasha-wiki" dataset presents a comprehensive corpus consisting of 44.1 million Wikipedia articles translated into six major Indian languages from 6.3 million English articles. This corpus, encompassing over 45.1 billion Indic tokens, serves as a foundational resource for linguistic and AI research, facilitating a wide range of studies into machine translation, natural language processing, and language model training.
| Language | Sentences | Characters | Words | Tokens | Rows |
|---|---|---|---|---|---|
| English | 149,636,946 | 19,009,297,439 | 2,954,105,643 | 5,430,358,976 | 6,345,497 |
| Hindi | 149,636,946 | 18,622,892,252 | 3,382,736,074 | 6,635,241,630 | 6,345,497 |
| Kannada | 149,636,946 | 19,679,016,421 | 2,349,908,384 | 6,083,839,825 | 6,345,497 |
| Bengali | 149,636,946 | 18,741,174,694 | 2,663,832,869 | 8,248,287,687 | 6,345,497 |
| Gujarati | 149,636,946 | 18,453,210,446 | 2,867,239,209 | 6,032,149,490 | 6,345,497 |
| Tamil | 149,636,946 | 21,457,803,696 | 2,441,061,609 | 6,777,927,962 | 6,345,497 |
| Urdu | 149,636,946 | 17,921,351,051 | 3,641,717,085 | 5,966,954,204 | 6,345,497 |
| Total | 1,047,458,622 | 133,884,745,999 | 20,300,600,873 | 45,174,759,774 | 44,418,479 |
Note: Token counts are calculated using the pragna-1b tokenizer.
The character, word, and token distributions for each language are shown in the accompanying image.
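As a rough illustration of how these statistics could be reproduced, the sketch below streams one language of the corpus and counts tokens with the pragna-1b tokenizer. The Hugging Face Hub identifiers ("soketlabs/bhasha-wiki", "soketlabs/pragna-1b"), the per-language configuration name, and the "text" field are assumptions made for the example, not confirmed details.

```python
# Minimal sketch: count pragna-1b tokens over a sample of one language.
# Assumed (hypothetical) identifiers: "soketlabs/bhasha-wiki" for the corpus,
# "hindi" as a per-language configuration, "soketlabs/pragna-1b" for the
# tokenizer, and a "text" column holding the article body.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("soketlabs/pragna-1b")

# Streaming avoids downloading the full multi-billion-token corpus.
articles = load_dataset("soketlabs/bhasha-wiki", "hindi",
                        split="train", streaming=True)

token_count = 0
for i, article in enumerate(articles):
    token_count += len(tokenizer(article["text"])["input_ids"])
    if i + 1 == 1_000:  # sample only the first 1,000 articles
        break

print(f"pragna-1b tokens in the first 1,000 articles: {token_count:,}")
```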
The "bhasha-wiki-indic" dataset, a refined subset of the "bhasha-wiki", is specifically curated to enrich models with a deeper understanding of the Indian context. This subset was meticulously selected to include content with significant relevance to India, enhancing the potential for developing culturally resonant AI applications.
These datasets are expected to significantly impact research in computational linguistics and AI by providing high-quality, large-scale resources for training models that require a nuanced understanding of Indian languages and contexts. They also serve as a platform for further scholarly inquiry into algorithmic translations and cultural specificity in AI technologies.
Released under the CC BY-SA 3.0 licence, these datasets facilitate both academic and commercial use, promoting wide dissemination and application in diverse settings. We invite the global research community to engage with these resources, further enriching the datasets and exploring new frontiers in AI research.
As we continue to develop the "Bhasha" series, we remain committed to advancing the state of AI with a focus on ethical considerations and inclusivity in technology.
Soket Labs, a visionary AI research firm, is at the forefront of promoting advancements towards ethical Artificial General Intelligence (AGI). Our mission is to foster a form of general intelligence that excels in efficiency and accessibility, thereby democratising cutting-edge technology for diverse applications, including autonomous robots, edge devices, and large clusters.