Soket Labs

LLM FOR INDIC LANGUAGES

Open Source Products (to be released)

Multilingual Tokenizer

An efficient tokenizer for multilingual encoding

BHASHA-7B-8K-HI-Base

A 7B transformer model with a context length of 8K pre-trained on Hindi

BHASHA-7B-8K-Base

A 7B transformer model with a context length of 8K pre-trained on the 22 scheduled languages of India

BHASHA-7B-8K-Instruct

A 7B transformer model, instruction-tuned to align with human intent

Research Interests

Efficient and Representative Tokenisation

Efficient, representative tokenisation is a fundamental building block for making language models work well with low-resource languages. A tokenizer that represents a language's script and vocabulary compactly supports accurate language understanding and efficient processing, which in turn improves cross-cultural communication.
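As a small illustration of why tokenisation efficiency matters for Indic scripts, the sketch below compares UTF-8 byte counts per character for an English word and a Hindi word. It is not the Soket Labs tokenizer; the helper name `utf8_bytes_per_char` and the sample words are assumptions chosen for illustration only.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


# Sample words (illustrative assumptions, not taken from any training corpus).
english = "hello"
hindi = "नमस्ते"  # "namaste" written in Devanagari

english_ratio = utf8_bytes_per_char(english)  # ASCII letters take 1 byte each
hindi_ratio = utf8_bytes_per_char(hindi)      # Devanagari code points take 3 bytes each

print(f"English: {english_ratio:.1f} bytes per character")
print(f"Hindi:   {hindi_ratio:.1f} bytes per character")
```

A byte-level vocabulary therefore starts from three units per Devanagari character against one per ASCII letter, so a tokenizer whose merges are learned mostly from English text tends to spend more tokens on the same amount of Hindi.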

LLMs for Low Resource Languages

Building large language models for low-resource languages is not just about advancing technology; it's about promoting cultural diversity, inclusivity, and access to information for all communities, regardless of the size of their language group.

Cross-Lingual Knowledge Transfer

Cross-Lingual Knowledge Transfer involves transferring insights, information, or expertise gained in one language to another, facilitating learning across language barriers. It is important as it enables efficient utilization of existing knowledge in multiple languages, enhancing accessibility, collaboration, and innovation in diverse linguistic contexts.

Faster and More Efficient Inference

Fast, efficient inference means generating accurate responses or predictions quickly while using as few computational resources as possible. It is crucial because real-time applications such as chatbots, search engines, and voice assistants depend on it for timely, seamless interactions, which improves the user experience and supports widespread adoption.

Fact Alignment

Fact Alignment in large language models is the process of ensuring that generated text matches factual information. It is vital for maintaining credibility and preventing misinformation, and it makes the model more reliable for tasks such as answering questions, generating summaries, and aiding decision-making.