19th April, 2024
We are pleased to announce the release of the Bhasha SFT dataset, an extensive collection curated by Soket AI Labs for the supervised fine-tuning of multilingual Large Language Models (LLMs), with a focus on Indic languages. Compiled from several sources, the dataset contains over 13 million instruction-response pairs in four languages: Hindi, Gujarati, Bengali, and English. It includes both human-annotated and synthetic data to support a range of complex NLP tasks.
The Bhasha SFT dataset is designed to aid the development of language models that can perform NLP tasks such as multi-turn conversation, question answering, text summarization, context-based Q&A, and natural language generation. It gives researchers training data for improving model performance across diverse linguistic settings.
The dataset is available in English, Hindi, Bengali, and Gujarati and is released under the CC-BY-4.0, Apache-2.0, and MIT licenses, which allow extensive use, modification, and sharing within the community.
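As a rough illustration of how instruction-response pairs in several languages might be handled before fine-tuning, here is a minimal Python sketch. The field names (`instruction`, `response`, `language`) and the sample records are assumptions made up for this example; they are not taken from the dataset's actual schema, so adjust them to match the real fields.

```python
from collections import Counter

# Made-up sample records. The keys "instruction", "response", and "language"
# are hypothetical and only stand in for whatever fields the dataset defines.
SAMPLE_RECORDS = [
    {"instruction": "Summarize the paragraph.", "response": "A short summary.", "language": "English"},
    {"instruction": "इस अनुच्छेद का सारांश लिखिए।", "response": "संक्षिप्त सारांश।", "language": "Hindi"},
    {"instruction": "অনুচ্ছেদটির সারসংক্ষেপ লিখুন।", "response": "সংক্ষিপ্ত সারাংশ।", "language": "Bengali"},
    {"instruction": "આ ફકરાનો સારાંશ લખો.", "response": "ટૂંકો સારાંશ.", "language": "Gujarati"},
    {"instruction": "What is the capital of India?", "response": "New Delhi.", "language": "English"},
]


def filter_by_language(records, language):
    """Return only the records whose (hypothetical) language field matches."""
    return [r for r in records if r.get("language") == language]


def count_by_language(records):
    """Count records per language; records without the field count as 'unknown'."""
    return Counter(r.get("language", "unknown") for r in records)


if __name__ == "__main__":
    english_only = filter_by_language(SAMPLE_RECORDS, "English")
    print(len(english_only), "English records")
    print(dict(count_by_language(SAMPLE_RECORDS)))
```

Checking the per-language counts this way is a simple sanity step before training, for example to spot a language that is badly under-represented in a sampled subset.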
Each entry in the dataset is meticulously organized with several fields to support a variety of NLP tasks:
The dataset aggregates contributions from various sources, ensuring a rich and varied compilation:
The Bhasha SFT dataset is intended to support the ongoing development and refinement of language technologies, particularly for Indic languages. Curated by Soket AI Labs from multiple sources, it is meant to serve as a resource for multilingual NLP research. We invite researchers and developers to use this dataset in their work towards innovative solutions and improved language understanding.