Project EKA — Soket AI

// WHAT WE'RE BUILDING

Foundation models built for Bharat & the Global South

Project EKΛ is Soket’s boldest vision: building AI for a billion, from the heart of India. Our mission is to create world-class models that master math, code, and reasoning while speaking the languages of Bharat and the Global South. We believe talent and ambition from India can shape the frontier of AI. We work at the edge of research in architecture, large-scale training, and language resources, reimagining what’s possible for low-resource and diverse languages, and we are dedicated to fundamental advances in both the science and engineering of AI. Join us and help put India at the center of global AI innovation.

We open-source where we can, train on sovereign compute, and publish research that advances Indic NLP, systems for ML, and efficient inference.

1536

NVIDIA H100 GPUs

Sovereign compute via the IndiaAI Mission

25T

Tokens curated

High-quality pre-training corpus to date

120B+

Parameters (Sparse MoE)

Frontier-scale architecture in training

60+

Languages

Indian, Global South, and programming languages

// CORE CAPABILITIES

Two tracks: technical reasoning & multilingual AI

Math

Advanced mathematical reasoning, proofs, and symbolic computation — models tuned for rigorous step-by-step logic.

Code

Multi-language code generation, debugging, and optimization across 20+ programming languages.

Reasoning

Logical deduction, complex analysis, and long-horizon problem solving for high-stakes workflows.

Multilingual

22 Indian languages, 20+ Global South languages, and English — a dedicated vertical for sovereign text and speech, with curated data and tokenization built for linguistic diversity.

// LANGUAGE COVERAGE

60+ languages — not an afterthought

Most frontier labs optimize for English. EKΛ runs two parallel verticals: frontier math, code, and reasoning models for rigorous technical work — alongside sovereign multilingual modeling for Indian and Global South languages, with efficient tokenization and curated corpora per script for regional text, speech, and domain-specific use cases.

22 Indian languages
Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Odia, Punjabi, Assamese, Urdu, Sanskrit, and more
20+ Global South languages
Arabic, Indonesian, Thai, Vietnamese, Burmese, Kazakh, Portuguese, Spanish, and more
20+ programming languages
Python, Rust, Go, TypeScript, C++, Java, SQL, Julia, and more

// RESEARCH DIRECTIONS

Problems we're actively working on

If you care about data systems, training at scale, tokenizers, post-training, or ethical AI — these are the threads where your work ships into a national-scale model, not a side project.

01

High-quality data for pre- and post-training

Curation, filtering, and synthesis pipelines for Indic and Global South text, code, math, and speech — built for both pre-training and alignment stages.

02

Efficient model architecture for language variance

Sparse MoE and routing strategies that cope with widely varying token efficiency and morphological diversity across scripts and domains.
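
For readers new to sparse MoE, the routing step can be sketched as follows. This is a generic illustration of top-k softmax gating, a common approach, and not a description of EKΛ's router, which this page does not specify. The names `top_k_route` and `router_logits` are hypothetical.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, gate_weight) pairs. Gate weights are a softmax
    over the selected logits only, so they sum to 1. This is an assumed,
    generic scheme, not a confirmed detail of any specific model.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]

    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# One token scored against four hypothetical experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why a sparse model's total parameter count can far exceed the compute spent per token.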

03

Efficient tokenization

Building one of the most token-efficient vocabularies for Indian and Global South languages: maximizing bytes per token (fewer tokens for the same text) for Indic scripts, which enables longer context at lower compute cost.
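
One way to make token efficiency concrete is to measure how many UTF-8 bytes each token covers on average: the more bytes per token, the shorter the sequence for the same text. The sketch below is a minimal illustration under stated assumptions. The helper names and the three stand-in tokenizers are hypothetical, and none of them is the EKA tokenizer.

```python
from typing import Callable, Sequence


def bytes_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average UTF-8 bytes covered by one token (higher = shorter sequences)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def tokens_per_word(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average tokens per whitespace-separated word (lower = more efficient)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


# Hypothetical stand-in tokenizers, for illustration only.
def byte_tokenizer(text: str) -> list:
    return list(text.encode("utf-8"))   # one token per byte


def char_tokenizer(text: str) -> list:
    return list(text)                   # one token per code point


def word_tokenizer(text: str) -> list:
    return text.split()                 # one token per word


sample = "भारत"  # four Devanagari code points, three UTF-8 bytes each
for name, tok in [("byte", byte_tokenizer),
                  ("char", char_tokenizer),
                  ("word", word_tokenizer)]:
    print(f"{name}: {bytes_per_token(sample, tok):.1f} bytes per token")
```

Devanagari code points take three bytes each in UTF-8, so a byte-level vocabulary with no learned merges spends three tokens on a character that a script-aware vocabulary can cover with one token or fewer.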

04

Post-training methods

SFT and preference optimization for math, code, and reasoning models; separate alignment pipelines for multilingual foundations and regulated-sector workflows.

05

Efficient inference & algorithmic design

Kernel fusion, speculative decoding, quantization, and systems co-design from training through production serving.

06

Sustainable & efficient AI research

Optimal power and water usage for large-scale training — measuring and minimizing the environmental cost of frontier runs.

07

AI for critical sectors

Applied research for defence, cybersecurity, military and civilian intelligence, finance, banking, and information technology — with auditable, on-premise deployment paths.

08

Ethical AI

Safety, alignment, bias mitigation, and responsible deployment — embedding ethical constraints into data curation, training, evaluation, and release for sovereign and high-stakes use cases.

// RELATED RELEASES

Pragna-1B

1.25B-parameter open multilingual model — Hindi, English, Gujarati, Bengali.

Hugging Face

Dhrith ASR

Speech recognition for Indic and voice-first markets.

Read the blog

EKA tokenizer

Among the most token-efficient vocabularies for Indian and Global South languages — built in-house to cut sequence length and training cost versus mainstream open tokenizers.

Sovereign models for reasoning & multilingual AI