
30th April, 2024

Introducing Pragna-1B: Soket AI Labs' Multilingual Language Model for Indian Languages


Available on Hugging Face 🤗: soketlabs/pragna-1b

Introduction: Unleashing the Power of Multilingual AI

We at Soket AI Labs are thrilled to unveil Pragna-1B, India's first open-source multilingual language model, available in four languages: Hindi, Gujarati, Bangla, and English. The model is designed to cater to the rich tapestry of Indian languages, significantly expanding the horizons of AI inclusivity and accessibility. As we step into an era where technology transcends linguistic boundaries, Pragna-1B is built to bridge language barriers and enhance user engagement across diverse linguistic landscapes.

Highlights of Pragna-1B:

  • Architecture: Pragna-1B is a Transformer decoder-only model with 1.25 billion parameters and a context length of 2,048 tokens.
  • Designed for Edge AI: Engineered to deliver state-of-the-art performance for vernacular languages in the smallest form factor, ideal for deployment on-device.
  • Small Language Model (SLM) with Robust Capabilities: Even with only 1.25 billion parameters, Pragna-1B's performance is comparable to that of much larger 7-billion-parameter models, offering comprehensive multilingual support for English, Hindi, Bangla, and Gujarati.
  • Culturally Contextualised Training: The model has been meticulously trained on curated datasets specifically designed to encompass the Indian context, ensuring accurate and culturally relevant outputs.
  • Ethical and Responsible AI: Committed to upholding human values, the model is aligned to generate ethical responses.
  • Open Source Availability: The base version of Pragna-1B is accessible as an open-source model on Hugging Face, facilitating development and collaboration within the community.

Architecture Overview

Pragna-1B is a decoder-only transformer model inspired by TinyLlama, featuring the following specifications:

  • Layers: 22
  • Attention Heads: 32
  • Context Length: 2048
  • Hidden Dimension: 2048
  • Expansion Dimension: 5632
  • Vocabulary Size: 69632

This model incorporates Rotary Positional Encoding to infuse positional information into the embeddings, using a base of 10,000. It employs RMSNorm with an epsilon value of 1e-5 and the Sigmoid Linear Unit (SiLU) as the activation function. Additionally, Pragna-1B adopts Grouped Query Attention, an alternative to Multi-Head Attention, which improves training and inference speed while reducing memory bandwidth requirements and makes inference on lower-compute devices more practical.
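
For concreteness, these hyperparameters map onto a standard Llama-style configuration. The sketch below (using Hugging Face transformers) is illustrative only; in particular, the number of key-value heads used for Grouped Query Attention is our assumption, not a published figure.

```python
# Illustrative only: a Llama-style config matching the published Pragna-1B
# hyperparameters. num_key_value_heads is an assumption (not stated by Soket),
# included to show how Grouped Query Attention is configured.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=69632,              # expanded Indic vocabulary
    hidden_size=2048,              # hidden dimension
    intermediate_size=5632,        # feed-forward expansion dimension
    num_hidden_layers=22,          # layers
    num_attention_heads=32,        # attention heads
    num_key_value_heads=4,         # GQA: assumed value (fewer KV heads than query heads)
    max_position_embeddings=2048,  # context length
    rope_theta=10000.0,            # Rotary Positional Encoding base
    rms_norm_eps=1e-5,             # RMSNorm epsilon
    hidden_act="silu",             # SiLU activation
)
```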

Pragna-1B is trained on our proprietary platform, GenAI Studio, a modular AI Developer Platform designed to support any GenAI model architecture. It is capable of scaling across thousands of GPUs or accelerators and is built to be fault-tolerant. The development of this model leveraged Triton, an open-source language from OpenAI, for crafting high-performance custom fused CUDA Kernels for various operations. Furthermore, the model uses Fully Sharded Data Parallel (FSDP) for distributed and parallel training and incorporates the state-of-the-art FlashAttention2 to accelerate training and inference.
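
GenAI Studio itself is proprietary, but two of the open-source ingredients mentioned here, FSDP sharding and FlashAttention-2, can be reproduced with a standard PyTorch + transformers stack. This is a minimal sketch under that assumption, not the actual training code:

```python
# Minimal sketch (not GenAI Studio): shard a causal LM with PyTorch FSDP and
# enable FlashAttention-2 kernels via transformers. Launch with torchrun;
# requires the flash-attn package to be installed.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "soketlabs/pragna-1b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
)

# In practice an auto_wrap_policy would shard per transformer layer; wrapping
# the whole model is the simplest valid FSDP setup for a sketch.
model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
```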

Developing an Efficient Tokenizer for Indian Languages

Pragna employs a Byte-Pair Encoding (BPE) tokenizer trained specifically for Indian languages. Tokenizers were first trained separately on six Indian languages (Hindi, Bangla, Urdu, Tamil, Kannada, and Gujarati) and then combined into a unified tokenizer through a union set operation, ensuring each language is represented equitably. The merge rules were selected to optimise the compression of text into tokens, giving an effective vocabulary size of 69,632.
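
The exact merging pipeline is not published; the sketch below only illustrates the union-set idea with the Hugging Face tokenizers library, assuming hypothetical per-language text files (hi.txt, bn.txt, and so on). Unifying the BPE merge rules themselves requires an analogous union over each tokenizer's merges.

```python
# Schematic of the union-merge idea (not Soket's actual pipeline): train one
# BPE tokenizer per language, then combine their vocabularies with a set union
# so every language is represented.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

def train_bpe(files, vocab_size):
    tok = Tokenizer(models.BPE(unk_token="<unk>"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train(files, trainer=trainers.BpeTrainer(vocab_size=vocab_size,
                                                 special_tokens=["<unk>"]))
    return tok

# Hypothetical per-language corpora (hi.txt, bn.txt, ...).
per_lang = {lang: train_bpe([f"{lang}.txt"], vocab_size=16000)
            for lang in ["hi", "bn", "ur", "ta", "kn", "gu"]}

# Union of the per-language vocabularies; merge rules would be unified analogously.
union_vocab = set()
for tok in per_lang.values():
    union_vocab |= set(tok.get_vocab().keys())
print(len(union_vocab))
```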

The tokenizer's efficiency shows in its fertility score, the average number of tokens produced per word. Fertility reflects how well the tokenizer compresses text into tokens (the integer IDs a model actually consumes); a lower score indicates more efficient tokenization. Notably, our tokenizer not only performs comparably to tokenizers designed for English but excels at Hindi, Bangla, Gujarati, Tamil, Kannada, and Urdu, as shown below. For example, for Kannada the Gemma-7B tokenizer produces 5.8 tokens per word whereas Pragna-1B produces 2.8, roughly doubling throughput from tokenization alone.
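
Fertility is straightforward to measure for any Hugging Face tokenizer; the sketch below uses a whitespace word count and a single example sentence, which is a simplification of a proper corpus-level evaluation.

```python
# Sketch: fertility = average number of tokens produced per whitespace word.
# Lower is better. Model IDs are real Hugging Face repos; the sample sentence
# is just an example (some repos are gated and require access).
from transformers import AutoTokenizer

def fertility(model_id: str, sentences: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = sum(len(tok.encode(s, add_special_tokens=False)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

sample = ["मेरा नाम प्रज्ञा है"]  # "My name is Pragna" in Hindi
print(fertility("soketlabs/pragna-1b", sample))
print(fertility("meta-llama/Llama-2-7b-hf", sample))
```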

Moreover, our tokenizer addresses a significant issue prevalent in existing models: the inadequate representation of Indic languages. Traditional models often dissect Indic characters into bytes, severely hindering performance and requiring an excessive number of tokens for effective training. In contrast, our approach significantly enhances model efficiency and performance across Indian languages.

[Figure: fertility scores of the Pragna-1B tokenizer compared with other LLM tokenizers]

The graph above shows the fertility scores for six Indian languages when tokenized using various LLMs; the same values are tabulated below.

| Model | English | Hindi | Kannada | Bangla | Gujarati | Tamil | Urdu |
| --- | --- | --- | --- | --- | --- | --- | --- |
| soketlabs/pragna-1b | 2.1 | 2.26 | 2.85 | 3.3 | 2.41 | 2.94 | 1.87 |
| meta-llama/Llama-2-7b | 2.03 | 6.24 | 20.22 | 9.12 | 16.36 | 10.66 | 5.49 |
| meta-llama/Llama-3-8b | 1.76 | 3.07 | 14.68 | 8.6 | 10.96 | 11.4 | 3.3 |
| openai/gpt-2 | 1.78 | 8.89 | 22.28 | 14.2 | 17.12 | 23.08 | 6.68 |
| openai/gpt-3.5 | 1.78 | 5.73 | 14.68 | 8.68 | 10.96 | 11.41 | 4.68 |
| openai/gpt-4 | 1.78 | 5.73 | 14.68 | 8.68 | 10.96 | 11.41 | 4.68 |
| google/gemma-7b | 1.74 | 2.52 | 5.81 | 4.2 | 4.57 | 4.39 | 2.34 |
| microsoft/phi-2 | 1.78 | 8.89 | 22.28 | 14.2 | 17.12 | 23.08 | 6.68 |
| microsoft/phi-3-mini | 2.03 | 6.24 | 20.22 | 9.12 | 16.36 | 10.66 | 5.49 |
| sarvamai/OpenHathi-7B-Hi | 2.03 | 2.35 | 20.22 | 9.12 | 16.36 | 10.66 | 5.48 |
| mistralai/Mistral-7B | 2.04 | 5.97 | 11.72 | 7.93 | 13.5 | 10.15 | 5.15 |
| GenVRadmin/AryaBhatta-GemmaUltra | 1.74 | 2.52 | 5.81 | 4.2 | 4.57 | 4.39 | 2.34 |
| TinyLlama/TinyLlama-1.1B | 2.03 | 6.24 | 20.22 | 9.12 | 16.36 | 10.66 | 5.49 |
| Tensoic/Kan-LLaMA-7B | 1.9 | 5.56 | 2.94 | 7.38 | 6.95 | 8.59 | 5.37 |
| abhinand/tamil-llama-7b | 1.91 | 6.14 | 20.07 | 8.99 | 16.23 | 2.55 | 5.38 |

Training Data

The quality and quantity of training data are crucial in imparting linguistic and semantic understanding into any language model. One significant challenge in developing language models for Indian languages is the scarcity of large-scale corpora. With all 22 scheduled Indian languages representing less than 1% of internet-scale datasets like mC4, we were compelled to create "Bhasha," a series of high-quality datasets designed for pretraining and instruction fine-tuning of Indian models.

Bhasha-wiki: Starting with 6.3 million English Wikipedia articles, we translated these into six Indian languages to create a dataset of 44.1 million articles. With over 45.1 billion Indic tokens, Bhasha-wiki serves as a foundational resource for linguistic and AI research. It supports a broad spectrum of applications, including machine translation, natural language processing, and language model training.

Bhasha-wiki-indic: A refined subset of the Bhasha-wiki, Bhasha-wiki-indic is curated to enrich models with an in-depth understanding of the Indian context. This subset specifically includes content of significant relevance to India, aiming to develop culturally resonant AI applications.

Bhasha-SFT: Designed to facilitate the development of language models capable of handling various NLP tasks such as multi-turn conversation, question-answering, text summarization, context-based Q&A, and natural language generation, the Bhasha-SFT dataset provides essential tools for enhancing language model performance across diverse linguistic landscapes.

We have also incorporated external datasets like SlimPajama, a clean and deduplicated dataset by Cerebras featuring 627 billion tokens with a majority in English, and Sangraha-Verified by AI4Bharat, a 15 million tuple dataset in multiple Indian languages, curated from human-verified sources.

Training

Model Weight Initialization: Instead of starting from a random distribution, we initialise the weights from TinyLlama-1.1B-Chat-v1.0, a model released under the Apache-2.0 licence. This approach leverages the foundational knowledge of an existing model, reducing cost and facilitating transfer learning from the English domain to the Indian language domain.

Indic Tokenizer: Our base tokenizer, borrowed from Llama-2, initially had a vocabulary size of 32,000. To this, we added new tokens for six Indian languages, expanding the vocabulary to 69,632. This enhancement significantly improved the fertility score across all six languages.
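
A hedged sketch of these two steps, assuming the standard transformers API; the file of new Indic tokens is a hypothetical placeholder for the output of the Bhasha tokenizer training.

```python
# Sketch: initialise from TinyLlama weights, add Indic tokens to the Llama-2
# tokenizer, and resize the embedding matrix to the expanded vocabulary.
# "indic_tokens.txt" is a hypothetical placeholder file.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

new_tokens = open("indic_tokens.txt", encoding="utf-8").read().splitlines()
num_added = tokenizer.add_tokens(new_tokens)

# 69,632 is a multiple of 64, so padding keeps the embedding matrix aligned.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
print(num_added, model.get_input_embeddings().weight.shape)
```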

Embedding Initialization for New Tokens: Drawing inspiration from John Hewitt's research, "Initializing New Word Embeddings for Pretrained Language Models", we opted for a targeted approach to initialising embeddings for new tokens. Instead of averaging all existing embeddings with added noise, we used a subset of relevant embeddings. This method assumes each new Indic token can be tokenized using Llama-2's existing vocabulary. By averaging these related embedding vectors, we achieved a more accurate representation with lower KL-divergence. We are preparing a detailed blog post to share our methodology and findings on efficient vocabulary expansion.
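
A minimal sketch of this initialisation, assuming new tokens are appended after the original Llama-2 vocabulary and ignoring SentencePiece marker handling for brevity; this is our reading of the method, not the exact implementation.

```python
# Sketch: seed each new token's embedding (and lm_head row) with the mean of
# the embeddings of the sub-tokens it decomposes into under the original
# Llama-2 vocabulary, rather than the mean of the whole matrix plus noise.
import torch

@torch.no_grad()
def init_new_token_embeddings(model, old_tokenizer, new_tokenizer, old_vocab_size):
    emb = model.get_input_embeddings().weight    # (new_vocab, hidden)
    head = model.get_output_embeddings().weight  # lm_head rows (may be tied)
    for new_id in range(old_vocab_size, len(new_tokenizer)):
        text = new_tokenizer.convert_ids_to_tokens(new_id)
        # Decompose the new token with the original vocabulary.
        sub_ids = old_tokenizer.encode(text, add_special_tokens=False)
        if sub_ids:  # otherwise keep the default initialisation
            emb[new_id] = emb[sub_ids].mean(dim=0)
            head[new_id] = head[sub_ids].mean(dim=0)
```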

Embedding Alignment: Initially, we aligned only the embedding and lm_head tensors, keeping all other tensors frozen. We utilised a parallel-sentence dataset from Bhasha-wiki, pairing sentences in each of the six Indian languages with their English counterparts. Loss was computed solely on the English sentences, which aligned the model embeddings and allowed it to generate grammatically coherent, though semantically nonsensical, text in all six languages with minimal training.
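
As a sketch, freezing everything except the embeddings and the lm_head looks like this; the masking of the loss to the English half of each sentence pair is omitted.

```python
# Sketch: train only the embedding matrix and lm_head during the alignment
# stage, keeping all other tensors frozen.
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```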

Continual Pretraining: We subsequently unfroze all 1.25 billion parameters for further training, focusing on three languages (Hindi, Bangla, and Gujarati) due to compute limitations. We maintained high, equal sampling probabilities for these languages while incorporating data from SlimPajama, which is predominantly English. The model processed approximately 150 billion tokens (3.15 trillion including the base model's pretraining) over roughly 8,000 GPU-hours in total: about 3,000 GPU-hours on 4xA100-40 GiB and 5,000 GPU-hours on 8xA100-80 GiB systems.

Training Parameters:

  • Precision: BFloat16
  • Batch Size: 2k - 2.5k
  • Context Length: 2,048
  • Learning Rate: 3e-5
  • Optimizer: AdamW
  • LR Scheduler: Cosine
  • Mixed Precision Training
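
Put together, the parameters above imply a fairly standard loop. The sketch below is illustrative only: warmup steps, total step count, and the dataloader are assumptions, `model` is assumed to exist, and the real run used FSDP rather than a single device.

```python
# Illustrative training loop matching the listed hyperparameters: AdamW,
# cosine schedule, BFloat16 mixed precision. Warmup and step counts are
# placeholders, not published values.
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=60_000)

for batch in dataloader:  # batches of 2,048-token sequences
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```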

Instruction Fine-Tuning: The model is undergoing supervised fine-tuning across multiple tasks, including conversation, question-answering, summarization, and paraphrasing, using Bhasha-SFT, which comprises over 13 million instruction-response instances in three Indian languages drawn from the Aya Collection, Indic-align, and Samvaad.

Ethics and Safety Alignment: Safety and ethics alignment is an important part of the supervised fine-tuning step. To ensure Pragna is aligned with human values, we have been building our own datasets specifically curated to prevent the model from generating unethical or potentially harmful content. This data is exposed to the model during SFT.

Ongoing Developments and Community Engagement: As the model is currently undergoing further instruction tuning and safety alignment, we plan to release pragna-1b-it model weights to the community under an open-source licence soon. An initial research preview of the instruction-tuned model is available via a chat interface, though it is not recommended for production use due to its tendency to generate convincing yet factually incorrect information. We eagerly anticipate feedback from the community to further refine and enhance our models.

Inference

Next Token Prediction

Context | Tokens Predicted (each generation below repeats its context as a prefix)
India is one of the mostIndia is one of the most exciting places to travel to in Asia. The country hosts a wide variety of cultures and offers a plethora of attractions across the country. There's a lot you can do here and the tourists are always on the lookout for the new things to do and see. If you're looking to travel to India for your next adventure, you'll be pleased to know that you can find a ton of things to do in the country. From the temples to the beaches, there's a lot you can see and do here. One of the best things for visitors to do in India is to visit Jaipur. The capital city of Rajputana, Jaipur is a beautiful city that's known as the "City of Lives".
рдирд░реЗрдВрджреНрд░ рдореЛрджреА рдХрд╛ рдЬрдиреНрдордирд░реЗрдВрджреНрд░ рдореЛрджреА рдХрд╛ рдЬрдиреНрдорджрд┐рди рдЖрдЬ рдордирд╛рдпрд╛ рдЬрд╛ рд░рд╣рд╛ рд╣реИред рдЗрд╕ рдореМрдХреЗ рдкрд░ рдкреНрд░рдзрд╛рдирдордВрддреНрд░реА рдирд░реЗрдВрджреНрд░ рдореЛрджреА рдиреЗ рдЯреНрд╡реАрдЯ рдХрд░ рд▓рд┐рдЦрд╛ рдХрд┐ 'рдЖрдЬ рджреЗрд╢ рдХреЛ рдПрдХ рдирдпрд╛ рд╕рдВрд╕рдж рднрд╡рди рдорд┐рд▓рд╛ рд╣реИ, рдЬрд┐рд╕реЗ рдЖрдк рд▓реЛрдЧреЛрдВ рдиреЗ рдмрдзрд╛рдИ рджреА рд╣реИред рднрд╛рд░рдд рдФрд░ рднрд╛рд░рддрд╡рд╛рд╕рд┐рдпреЛрдВ рдХреЛ рдпрд╣ рд╕рдВрд╕рдж рдмрд╣реБрдд рд╣реА рдорд╣рддреНрддреНрд╡рдкреВрд░реНрдг рд╣реИ рдФрд░ рдЗрд╕рдХрд╛ рдЙрдкрдпреЛрдЧ рджреЗрд╢ рдХреЗ рд╡рд┐рдХрд╛рд╕ рдореЗрдВ рдмрд╣реБрдд рдЕрдзрд┐рдХ рдпреЛрдЧрджрд╛рди рджреЗрдЧрд╛ред рдЖрдЬ рд╣рдорд╛рд░реЗ рджреЗрд╢ рдореЗрдВ рдПрдХ рд╕рдВрд╕рдж рдмрдиреА рд╣реБрдИ рд╣реИ рдЬреЛ рд╣рдорд╛рд░реЗ рджреЗрд╢ рдХреА рд▓реЛрдХрддрд╛рдВрддреНрд░рд┐рдХ рдкреНрд░рдХреНрд░рд┐рдпрд╛рдУрдВ рдХреЛ рдордЬрдмреВрдд рдмрдирд╛рддреА рд╣реИред
Prime minister ModiPrime minister Modi's visit to China: PM Modi is scheduled to arrive in Beijing on Thursday. PM Narendra Modi will be the first Indian prime minister to visit China since former President Mao Zedong, who visited India in 1959 and was assassinated by Chinese soldiers a year later. The two leaders are expected to discuss regional security issues such as terrorism, trade and energy cooperation during their talks.
рдорд╣рд╛рддреНрдорд╛ рдЧрд╛рдВрдзреАрдорд╣рд╛рддреНрдорд╛ рдЧрд╛рдВрдзреА рдПрдХ рд╕реНрд╡рддрдВрддреНрд░ рд╡реНрдпрдХреНрддрд┐, рдПрдХ рдЖрддреНрдордирд┐рд░реНрднрд░ рдФрд░ рдПрдХ рдЕрддреНрдпрдВрдд рд▓реЛрдХрдкреНрд░рд┐рдп рдиреЗрддрд╛ рдереЗред рдЙрдиреНрд╣реЛрдВрдиреЗ рдЕрдкрдиреЗ рдЬреАрд╡рдирдХрд╛рд▓ рдореЗрдВ рджреЗрд╢ рдХреЛ рдПрдХ рдирдИ рджрд┐рд╢рд╛ рджреЗрдиреЗ рдХреЗ рд▓рд┐рдП рдмрд╣реБрдд рдХреБрдЫ рдХрд┐рдпрд╛ред рд╕рдорд╛рдЬ рдХреЗ рд╡рд┐рднрд┐рдиреНрди рдХреНрд╖реЗрддреНрд░реЛрдВ рдореЗрдВ рдЙрдирдХреЗ рдпреЛрдЧрджрд╛рди рдХреЗ рд▓рд┐рдП рдЙрдиреНрд╣реЗрдВ рд╕рдореНрдорд╛рдирд┐рдд рдХрд┐рдпрд╛ рдЧрдпрд╛ рдерд╛ред
рдмреНрд░рд┐рдЯрд┐рд╢ рд░рд╛рдЬрд╢рд╛рд╣реАрдмреНрд░рд┐рдЯрд┐рд╢ рд░рд╛рдЬрд╢рд╛рд╣реА рдХреЗ рд╡рд┐рдШрдЯрди рдХреЗ рдмрд╛рдж, рдЗрдВрдЧреНрд▓реИрдВрдб рдФрд░ рд╡реЗрд▓реНрд╕ рдореЗрдВ рдмреНрд░рд┐рдЯреЗрди рдХреА рд╕рдВрд╕рдж рдХрд╛ рдЧрдарди рд╣реБрдЖред рд╕рдВрд╕рдж рдиреЗ 1807 рдореЗрдВ рдЕрдкрдиреЗ рдкрд╣рд▓реЗ рдЕрдзрд┐рдирд┐рдпрдо, "рдж рдХреНрд░рд╛рдЙрди рдХреА рд╡рд┐рдзреЗрдпрдХ" (рдПрдХреНрдЯ рдСрдл рдж рдбреЗрдореЛрдХреНрд░реИрдЯрд┐рдХ) рдХреЛ рдордВрдЬреВрд░реА рджреА, рдЬрд┐рд╕рдореЗрдВ рдкреНрд░рддреНрдпреЗрдХ рд░рд╛рдЬреНрдп рд╕реЗ рдПрдХ рдирд┐рд░реНрд╡рд╛рдЪрд┐рдд рд╕рджрд╕реНрдп рдЪреБрдирд╛ рдЧрдпрд╛ред
ржорж╣рж╛рждрзНржорж╛ ржЧрж╛ржирзНржзрзАржорж╣рж╛рждрзНржорж╛ ржЧрж╛ржирзНржзрзА ржЫрж┐рж▓рзЗржи ржЖржирзНрждрж░рзНржЬрж╛рждрж┐ржХ рж╕ржорж╛ржЬрждрж╛ржирзНрждрзНрж░рж┐ржХ ржЖржирзНржжрзЛрж▓ржирзЗрж░ ржЕржирзНржпрждржо ржкрзНрж░ржзрж╛ржи ржирзЗрждрж╛ред рждрж┐ржирж┐ ржнрж╛рж░рждрзЗрж░ рж╕рзНржмрж╛ржзрзАржирждрж╛ рж╕ржВржЧрзНрж░рж╛ржорзЗ рждрж╛ржБрж░ ржнрзВржорж┐ржХрж╛ ржЕржкрж░рж┐рж╕рзАржоред рззрзпрзкрзн рж╕рж╛рж▓рзЗрж░ рззрзл ржЖржЧрж╕рзНржЯ ржнрж╛рж░рждрзЗ ржмрж┐ржкрзНрж▓ржм ржШржЯрзЗред рззрзпрзкрзо рж╕рж╛рж▓рзЗ ржорж╣рж╛рждрзНржорж╛ ржЧрж╛ржирзНржзрзАрж░ ржорзГрждрзНржпрзБ рж╣ржпрж╝ред
ржнрж╛рж░рждрзЗрж░ рж╕рзНржмрж╛ржзрзАржирждрж╛ ржжрж┐ржмрж╕рзЗржнрж╛рж░рждрзЗрж░ рж╕рзНржмрж╛ржзрзАржирждрж╛ ржжрж┐ржмрж╕рзЗ (рззрзл ржЖржЧрж╕рзНржЯ) рж╕ржХрж╛рж▓ ржерзЗржХрзЗржЗ ржЖрж▓рзЛржЪржирж╛ ржЪрж▓ржЫрзЗ ржжрзЗрж╢ржЬрзБржбрж╝рзЗред рж╕рзЗржЗрж╕ржЩрзНржЧрзЗ рж╕рж╛ржорж╛ржЬрж┐ржХ ржпрзЛржЧрж╛ржпрзЛржЧ ржорж╛ржзрзНржпржорзЗржУ ржнрж╛ржЗрж░рж╛рж▓ рж╣ржпрж╝рзЗржЫрзЗ ржнрж╛рж░рждрзАржпрж╝ ржирж╛ржЧрж░рж┐ржХржжрзЗрж░ ржПржХржЯрж┐ ржЫржмрж┐ ржпрж╛ ржжрзЗржЦрж▓рзЗ ржоржирзЗ рж╣ржмрзЗ ржПржХржЬржи ржмрж┐ржкрзНрж▓ржмрзАред
ркнрк╛рк░ркдркорк╛ркВ рк╕рлНрк╡ркдркВркдрлНрк░ркдрк╛ ркжрк┐рк╡рк╕ркнрк╛рк░ркдркорк╛ркВ рк╕рлНрк╡ркдркВркдрлНрк░ркдрк╛ ркжрк┐рк╡рк╕ рккрк░ ркжрлЗрк╢ркнрк░ркорк╛ркВ рк╡рк┐рк╡рк┐ркз ркХрк╛рк░рлНркпркХрлНрк░ркорлЛ ркпрлЛркЬрк╛ркпрк╛ рк╣ркдрк╛. ркЖ ркжрк░ркорк┐ркпрк╛рки ркШркгрк╛ рк▓рлЛркХрлЛ рккрлЛркдрк╛ркирк╛ рккрк░рк┐рк╡рк╛рк░ рк╕рк╛ркерлЗ ркПркХрк╕рк╛ркерлЗ ркЬрлЛрк╡рк╛ ркорк│рлНркпрк╛ рк╣ркдрк╛.

ркжрлЗрк╢ркирк╛ рк╡ркбрк╛рккрлНрк░ркзрк╛рки ркирк░рлЗркирлНркжрлНрк░ ркорлЛркжрлАркП ркнрк╛рк░ркдркирк╛ рк╕рлНрк╡рк╛ркдркВркдрлНрк░рлНркп ркжрк┐рк╡рк╕ рккрк░ рккрлЛркдрк╛ркирк╛ ркШрк░рлЗ ркПркХ ркЦрк╛рк╕ рк╡рлАркбрк┐ркпрлЛ рк╢рлЗрк░ ркХрк░рлНркпрлЛ ркЫрлЗ.
ркХ рк╡ркЦркдрлЗркХ рк╡ркЦркдрлЗ, ркдрлЗркоркгрлЗ ркдрлЗркоркирк╛ ркорк┐ркдрлНрк░ ркЕркирлЗ рк╕рк╣рк╛ркпркХ рк╡рк┐рк▓рк┐ркпрко рк╣рлЗркирк░рлА ркорлЗркХркбрлЛркирк╛рк▓рлНркб рк╕рк╛ркерлЗ ркорк│рлАркирлЗ "ркз ркУрк░рк┐ркЬрк┐ркирк▓ ркорлЗрки ркУркл ркз ркбрлЗркб" ркирк╛ркоркирк╛ рккрлБрк╕рлНркдркХркирлБркВ ркирк┐рк░рлНркорк╛ркг ркХрк░рлНркпрлБркВ рк╣ркдрлБркВ; ркдрлЗ 1961ркорк╛ркВ рккрлНрк░ркХрк╛рк╢рк┐ркд ркеркпрлБркВ рк╣ркдрлБркВ. 1958ркорк╛ркВ, ркдрлЗркоркгрлЗ ркз ркирлНркпрлВ ркпрлЛрк░рлНркХрк░ ркорк╛ркЯрлЗ рк▓рлЗркЦркХ ркЕркирлЗ рк╕ркВрккрк╛ркжркХ рккрк┐ркдрк╛ ркдрк░рлАркХрлЗ ркХрк╛рко ркХрк░рлНркпрлБркВ рк╣ркдрлБркВ.
ркПркХ рк╕ркоркпрлЗ ркПркХ рк░рк╛ркЬрк╛ ркЬркВркЧрк▓ркорк╛ркВ рк░рк╣рлЗркдрлЛ рк╣ркдрлЛ. ркдрлЗркоркгрлЗркПркХ рк╕ркоркпрлЗ ркПркХ рк░рк╛ркЬрк╛ ркЬркВркЧрк▓ркорк╛ркВ рк░рк╣рлЗркдрлЛ рк╣ркдрлЛ. ркдрлЗркоркгрлЗ, ркПркХ ркжрк┐рк╡рк╕ рк░рк╛ркЬрк╛ркирлЗ ркХрк╣рлНркпрлБркВ, "рк╣рлБркВ ркдркоркирлЗ ркорк╛рк░рлА ркЬрк╛ркдркирлЗ ркПркХ рк╕рлНркдрлНрк░рлА ркдрк░рлАркХрлЗ ркЖрккрлБркВ ркЫрлБркВ." рк░рк╛ркЬрк╛ ркЦрлБрк╢ ркеркпрлЛ. ркдрлЗркоркгрлЗ ркдрлЗркирлЗ ркдрлЗркирк╛ ркШрк░ркорк╛ркВ рк▓ркИ ркЧркпрк╛. ркдрлЗ ркжрк┐рк╡рк╕рлЗ рк░рк╛ркгрлА рккрлЛркдрк╛ркирк╛ рккркдрк┐ рк╕рк╛ркерлЗ ркШрк░рлЗ ркЖрк╡рлА ркЕркирлЗ ркХрк╣рлНркпрлБркВ ркХрлЗ ркдрлЗ ркдрлЗркирлА рккрк╛рк╕рлЗ ркЫрлЗ. рк░рк╛ркЬрк╛ркП ркдрлЗркирлЗ рккрлВркЫрлНркпрлБркВ ркХрлЗ рк╢рлБркВ ркдрлЗркгрлЗ ркдрлЗркирлА рк╕рк╛ркерлЗ рк▓ркЧрлНрки ркХрк░рлНркпрк╛ ркЫрлЗ ? рк░рк╛ркЬрк╛ ркЕркирлЗ рк░рк╛ркгрлАркП рккрлВркЫрлНркпрлБркВ - рк╣рк╛. ркдрлНркпрк╛рк░рлЗ ркдрлЗркоркгрлЗ ркХрк╣рлНркпрлБркВ ркХрлЗ, "ркдркорлЗ ркоркирлЗ ркдркорк╛рк░рлА рк╕рк╛ркерлЗ рк▓ркЧрлНрки ркХрк░рк╡рк╛ркирлБркВ рк╡ркЪрки ркЖрккрлНркпрлБркВ ркЫрлЗ. рк╣рк╡рлЗ ркорк╛рк░рлЗ ркдркорк╛рк░рк╛ рккрк░ рк╡рк┐рк╢рлНрк╡рк╛рк╕ рк░рк╛ркЦрк╡рлЛ ркЬрлЛркИркП ркирк╣рлАркВ".

SFT outputs

Instruction | Response
Tell me about IndiaIndia is a vast and diverse country located in South Asia. It is the seventh-largest country in the world by land area and is also the most populous country, with over 1.4 billion people as of my last update. Here are some key points about India:

Inchurching History: Modern India has a long and rich history that spans thousands of years. It was under various empires for many centuries before it became part of the British Empire in the 18th century. The Indian Rebellion of 1857 was a significant event that led to the independence of India from British rule and its unification into one nation.

Economy: India's economy is one of the largest in the world, with economic growth reportedly averaging around 7% per year since 1990. It is considered one of the most advanced economies in the world and hosts several multinational companies, including Tata Motors, Airtel, HCL Teletracns, and Infosys among others.

Society: India's society is diverse and pluralistic, with individuals belonging to different religions, ethnicies, and social classes. It faces challenges such as poverty, inequality, and environmental degradation.
рдореБрдЭреЗ рднрд╛рд░рдд рдХреЗ рдмрд╛рд░реЗ рдореЗрдВ рдмрддрд╛рдУ?рднрд╛рд░рдд рджрдХреНрд╖рд┐рдг рдПрд╢рд┐рдпрд╛ рдореЗрдВ рд╕реНрдерд┐рдд рдПрдХ рд╡рд┐рд╢рд╛рд▓ рдФрд░ рд╡реИрд╢реНрд╡рд┐рдХ рджреЗрд╢ рд╣реИред рдпрд╣ рднреВрдорд┐ рдХреНрд╖реЗрддреНрд░ рдХреЗ рд╣рд┐рд╕рд╛рдм рд╕реЗ рджреБрдирд┐рдпрд╛ рдХрд╛ рд╕рд╛рддрд╡рд╛рдВ рд╕рдмрд╕реЗ рдмрдбрд╝рд╛ рджреЗрд╢ рд╣реИ рдФрд░ рд╕рдмрд╕реЗ рдЕрдзрд┐рдХ рдЖрдмрд╛рджреА рд╡рд╛рд▓рд╛ рджреЗрд╢ рд╣реИ, рдЬрд╣рд╛рдВ рдореЗрд░реА рдкрд┐рдЫрд▓реА рдЕрджреНрдпрддрди рдЬрд╛рдирдХрд╛рд░реА рдХреЗ рдЕрдиреБрд╕рд╛рд░ 12.2 рдХрд░реЛрдбрд╝ рд▓реЛрдЧ рд╣реИрдВред рд╡реЗ рд╣рд┐рдВрджреВ рд╕рдореБрджрд╛рдп рдХрд╛ рдиреЗрддреГрддреНрд╡ рдХрд░рддреЗ рд╣реИрдВ, рдФрд░ рдПрдХ рд╕рдВрдШреАрдп рд╕рдВрд╕рджреАрдп рд▓реЛрдХрддрд╛рдВрддреНрд░рд┐рдХ рдЧрдгрд░рд╛рдЬреНрдп рд╣реИред
рднрд╛рд░рдд рдХреЗ рдкреНрд░рдзрд╛рди рдордВрддреНрд░реА рдХреМрди рд╣реИрдВрдирд░реЗрдВрджреНрд░ рдореЛрджреА рднрд╛рд░рдд рдХреЗ рд╡рд░реНрддрдорд╛рди рдкреНрд░рдзрд╛рдирдордВрддреНрд░реА рд╣реИрдВред
рднрд╛рд░рдд рдХреЛ рдЖрдЬрд╛рджреА рдХрдм рдорд┐рд▓реА1942 рдореЗрдВ рднрд╛рд░рдд рдХреЗ рд╕реНрд╡рддрдВрддреНрд░рддрд╛ рд╕рдВрдЧреНрд░рд╛рдо рдХреЗ рджреМрд░рд╛рди рднрд╛рд░рдд рдиреЗ рдЕрдкрдиреА рд╕рд╛рд░реНрд╡рдЬрдирд┐рдХ рд░реВрдк рд╕реЗ рдШреЛрд╖рд┐рдд рд╕реНрд╡рддрдВрддреНрд░рддрд╛ рдХреЗ рд▓рд┐рдП рд▓рдбрд╝рд╛рдИ рд▓рдбрд╝реА рдереАред рдпрд╣ рдпреБрджреНрдз 1947 рддрдХ рдЪрд▓рд╛, рддрдм рднрд╛рд░рдд рдХреА рд╕реНрдерд╛рдкрдирд╛ рд╣реБрдИ рдФрд░ рдЗрд╕реЗ рднрд╛рд░рддреАрдп рд╕рдВрд╡рд┐рдзрд╛рди рджреНрд╡рд╛рд░рд╛ рдорд╛рдиреНрдпрддрд╛ рдкреНрд░рд╛рдкреНрдд рд╣реБрдИред
ржЖржорж╛ржХрзЗ ржнрж╛рж░ржд рж╕ржорзНржкрж░рзНржХрзЗ ржмрж▓рзБржиржнрж╛рж░ржд ржПржХржЯрж┐ ржмрж┐рж╢рж╛рж▓ ржПржмржВ ржмрзИржЪрж┐рждрзНрж░рзНржпржоржпрж╝ ржжрзЗрж╢ ржпрж╛ ржмрзИржЪрж┐рждрзНрж░рзНржпржоржпрж╝ рж╕рж╛ржВрж╕рзНржХрзГрждрж┐ржХ ржУ ржзрж░рзНржорзАржпрж╝ ржжрзГрж╖рзНржЯрж┐ржнржЩрзНржЧрж┐ ржкрзНрж░ржжрж╛ржи ржХрж░рзЗред ржПржЯрж┐ ржмрж┐ржнрж┐ржирзНржи ржнрж╛рж╖рж╛рж░ ржХрзЗржирзНржжрзНрж░рзАржнрзВржд ржХрж╛ржарж╛ржорзЛ ржПржмржВ ржмрж┐ржнрж┐ржирзНржи ржзрж░рзНржорзАржпрж╝ ржкржжрзНржзрждрж┐рж░ ржЬржирзНржп ржкрж░рж┐ржЪрж┐рждред
ржмрзИржЬрзНржЮрж╛ржирж┐ржХ ржЕрж░рзНржЬржи ржорж╛ржирзЗ ржХрж┐?ржмрзИржЬрзНржЮрж╛ржирж┐ржХ ржЕрж░рзНржЬржиржЧрзБрж▓рж┐ ржЧржгржирж╛, ржпрзЛржЧрж╛ржпрзЛржЧ ржПржмржВ ржмрзИржЬрзНржЮрж╛ржирж┐ржХ рж╕ржВрж╕рзНржерж╛рж░ ржорж╛ржзрзНржпржорзЗ ржмрж╛ ржмрзНржпржХрзНрждрж┐ржЧрждржнрж╛ржмрзЗ ржЕрж░рзНржЬрж┐ржд ржмрж┐ржнрж┐ржирзНржи ржЕрж░рзНржЬржи рж╣рждрзЗ ржкрж╛рж░рзЗред
ркнрк╛рк░ркдркирлА рк░рк╛ркЬркзрк╛ркирлА рк╢рлБркВ ркЫрлЗ?ркжрк┐рк▓рлНрк╣рлАркирлБркВ рк░рк╛ркЬркзрк╛ркирлА рк╕рлНркерк╛рки ркЫрлЗ.
ркоркирлЗ ркнрк╛рк░ркд рк╡рк┐рк╢рлЗ ркХрк╣рлЛркнрк╛рк░ркд ркПркХ рк╡рк┐рк╢рк╛рк│ ркЕркирлЗ рк╡рлИрк╡рк┐ркзрлНркпрк╕ркнрк░ ркжрлЗрк╢ ркЫрлЗ ркЬрлЗ рк╡рк┐рк╡рк┐ркз рк░рк╛ркЬркХрлАркп рккрлНрк░ркжрлЗрк╢рлЛ ркЕркирлЗ рк╡рлНркпрк╡рк╕рк╛ркпрк┐ркХ рк╡рк┐ркнрк╛ркЧрлЛ ркЫрлЗ. ркдрлЗ ркжркХрлНрк╖рк┐ркг ркПрк╢рк┐ркпрк╛ркорк╛ркВ ркЖрк╡рлЗрк▓рлЛ ркЫрлЗ ркЕркирлЗ ркорк╛ркирк╡ркдрк╛рк╡рк╛ркжрлА ркЕркирлЗ ркЖрк░рлНркерк┐ркХ ркмрк╛ркмркдрлЛркорк╛ркВ рк╕ркХрлНрк░рк┐ркп рк░рлАркдрлЗ ркнрк╛ркЧ рк▓рлЗ ркЫрлЗ.
ркжрк┐рк▓рлНрк╣рлА ркХрлНркпрк╛ркВ ркЫрлЗ ркЕркирлЗ ркдрлЗ рк╢рлЗркирк╛ ркорк╛ркЯрлЗ рккрлНрк░ркЦрлНркпрк╛ркд ркЫрлЗ?ркжрк┐рк▓рлНрк╣рлАркирлБркВ рк╕рлНркерк╛рки ркнрк╛рк░ркдркирк╛ ркжрк┐рк▓рлНрк╣рлАркорк╛ркВ ркЫрлЗ. ркдрлЗ ркдрлЗркирлА рк╕рк╛ркВрк╕рлНркХрлГркдрк┐ркХ ркЕркирлЗ ркРркдрк┐рк╣рк╛рк╕рк┐ркХ ркорк╣ркдрлНрк╡ ркорк╛ркЯрлЗ ркЬрк╛ркгрлАркдрлБркВ ркЫрлЗ.
ржХрж▓ржХрж╛рждрж╛рж░ ржРрждрж┐рж╣рж╛рж╕рж┐ржХ рждрж╛рзОржкрж░рзНржп ржХрзАред ржмрж┐рж╕рзНрждрж╛рж░рж┐ржд ржмрж▓рзБржиредржХрж▓ржХрж╛рждрж╛, ржнрж╛рж░рждрзЗрж░ ржПржХржЯрж┐ ржРрждрж┐рж╣рж╛рж╕рж┐ржХ рж╢рж╣рж░ред ржПржЯрж┐ ржнрж╛рж░рждрзАржпрж╝ ржЙржкржорж╣рж╛ржжрзЗрж╢рзЗрж░ ржмрзГрж╣рждрзНрждржо рж╢рж╣рж░ржЧрзБрж▓рж┐рж░ ржоржзрзНржпрзЗ ржПржХржЯрж┐ред рж╢рж╣рж░рзЗрж░ ржЕржнрзНржпржирзНрждрж░рзАржг ржЕржЮрзНржЪрж▓рзЗ ржмрж┐ржнрж┐ржирзНржи ржкрзНрж░рж╛ржХрзГрждрж┐ржХ ржжрзГрж╢рзНржп ржПржмржВ ржкрзНрж░рж╛рж╕рж╛ржж рж░ржпрж╝рзЗржЫрзЗ ржпрж╛ ржПржЯрж┐ржХрзЗ ржмрж┐рж╢рзНржмрзЗрж░ рж╕рзЗрж░рж╛ рж╢рж╣рж░ ржХрж░рзЗ рждрзБрж▓рзЗржЫрзЗред ржПржЗ рж╢рж╣рж░рзЗ ржмрзЗрж╢ ржХржпрж╝рзЗржХржЯрж┐ ржкрзНрж░рж╛ржЪрзАржи ржПржмржВ ржоржзрзНржпржпрзБржЧрзАржпрж╝ рж╕рзНржерж╛ржкрждрзНржпрзЗрж░ ржЙржжрж╛рж╣рж░ржг рж░ржпрж╝рзЗржЫрзЗ ржПржмржВ ржПржЦрж╛ржирзЗ ржЕржирзЗржХ ржмрзИржЪрж┐рждрзНрж░рзНржпржоржпрж╝ ржЬрзАржмржиржпрж╛рждрзНрж░рж╛ рж░ржпрж╝рзЗржЫрзЗред

Research Preview: spaces/pragna-chat

Disclaimer: An initial checkpoint of the instruction tuned model is made available as a research preview. It is hereby cautioned that the model has the potential to produce hallucinatory and plausible yet inaccurate statements. Users are advised to exercise discretion when utilizing the generated content.

Performance and Benchmarking

Hindi

| Model | Arc-Easy | Arc-Challenge | Hellaswag | Average |
| --- | --- | --- | --- | --- |
| pragna-1b | 0.33 | 0.22 | 0.35 | 0.30 |
| sarvamai/OpenHathi-7B-Hi-v0.1-Base | 0.3582 | 0.2645 | 0.4315 | 0.35 |
| meta-llama/Llama-2-7b-hf | 0.295 | 0.2406 | 0.3789 | 0.30 |
| google/gemma-7b | 0.5926 | 0.4258 | 0.6341 | 0.55 |
| meta-llama/Meta-Llama-3-8B | 0.5354 | 0.3541 | 0.6072 | 0.50 |

Gujarati

| Model | Arc-Easy | Arc-Challenge | Hellaswag | Average |
| --- | --- | --- | --- | --- |
| pragna-1b | 0.32 | 0.22 | 0.37 | 0.30 |
| google/gemma-7b | 0.4954 | 0.3208 | 0.5673 | 0.46 |

English

| Model | Arc | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pragna-1b | 0.3 | 0.51 | 0.27 | 0.38 | 0.56 | 0 | 0.34 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 0.36 | 0.61 | 0.25 | 0.37 | 0.61 | 0.02 | 0.37 |
| meta-llama/Meta-Llama-3-8B-Instruct | 0.6 | 0.82 | 0.67 | 0.51 | 0.77 | 0.68 | 0.66 |
| meta-llama/Llama-2-7b-hf | 0.53 | 0.78 | 0.46 | 0.39 | 0.74 | 0.14 | 0.51 |

Eval numbers for Hindi and Gujarati are taken from the Indic LLM Leaderboard and English from Hugging Face's Open LLM Leaderboard. We are currently running benchmarks for Bangla and will update the results soon.

Pragna-1B, with 1.25 billion parameters, demonstrates performance comparable to that of much larger models such as OpenHathi-7B and Llama-2-7b, though gemma-7b remains ahead. There is, however, a trade-off in English proficiency compared to the base model, TinyLlama, which arises from packing extensive information from three Indian languages into the same parameter space. In effect, Pragna-1B compresses knowledge from a training corpus of close to 1 TiB into just 2.3 GiB of BFloat16 weights.

These are preliminary benchmarking results for Pragna-1B. For Indian languages in particular, the figures should be interpreted with care, as the quality of available benchmarking data is currently suboptimal, largely because it relies on machine-translated datasets. We are working to develop comprehensive benchmarks that assess both the linguistic understanding and the contextual intelligence of large language models across a variety of fields and tasks for all Indian languages.


Pragna-1B achieves substantial speed gains from efficient tokenization alone: approximately 3x for Hindi, 7x for Kannada, 3x for Bangla, and 6.5x for Gujarati relative to Llama-2-7B. The actual throughput improvements are expected to be even greater, owing to the reduced computational demands of a parameter count more than five times lower than that of Llama-2-7B.
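
These speedup figures follow directly from the fertility table above: tokens per word for Llama-2-7B divided by tokens per word for Pragna-1B.

```python
# Tokenization-only speedup = Llama-2-7B fertility / Pragna-1B fertility,
# using the values from the fertility table above.
llama2 = {"hi": 6.24, "kn": 20.22, "bn": 9.12, "gu": 16.36}
pragna = {"hi": 2.26, "kn": 2.85, "bn": 3.30, "gu": 2.41}

for lang in llama2:
    print(lang, f"{llama2[lang] / pragna[lang]:.1f}x")
# hi 2.8x, kn 7.1x, bn 2.8x, gu 6.8x -- close to the ~3x, 7x, 3x, 6.5x quoted.
```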

Future Work

Alignment for Factual Accuracy: While the model demonstrates good linguistic understanding, it is imperative to align it more closely with factual veracity. Addressing instances of factual hallucination remains a top priority, as we strive to ensure that the model consistently delivers accurate and reliable information.

Expansion of Linguistic Scope: In our pursuit of linguistic inclusivity, we aim to broaden the knowledge base of the model by incorporating additional Indian languages. This expansion involves not only introducing more linguistic tokens but also deepening the model's understanding of the nuances inherent in diverse language structures.

Mixture of Experts for Multilingual Proficiency: To facilitate seamless communication across a spectrum of languages, we are exploring a Mixture of Experts approach. This will enable the model to handle the complexities of multilingual interaction more adeptly, thereby enriching its linguistic capabilities.

Architectural Innovation and Experimentation: As a research-first team, we will continue to research and experiment with novel architectures, seeking innovative solutions and optimising the model's performance across various domains and use cases.

Decoupling Knowledge and Language Understanding: A key focus of our future endeavours involves refining the model's architecture to effectively separate knowledge acquisition from language understanding. By delineating these two components within the model's architecture, we aim to bolster its capacity for multilingual knowledge transfer.

Distillation and Quantization for Efficiency: Recognising the importance of efficiency without compromising performance, we will develop distilled and quantized versions of the model. These streamlined variants use compression techniques to maximise computational efficiency while preserving the model's performance.

Conclusion

The introduction of Pragna-1B holds significant promise for India, with the potential to transform sectors like education, governance, and commerce. By supporting Indian languages, it fosters inclusivity, expands access to information, and empowers communities in the digital economy. As it evolves, Pragna-1B can bridge linguistic divides, foster cross-cultural understanding, and drive socioeconomic development, making a lasting impact on India's digital landscape.