// PROJECT EKΛ
Backed by the IndiaAI Mission
Project EKΛ is Soket's flagship research program — advancing frontier math, code, and reasoning models in parallel with sovereign multilingual AI for Indian and Global South languages.
// WHAT WE'RE BUILDING
Project EKΛ is Soket’s boldest vision: building AI for a billion, from the heart of India. Our mission is to create world-class models that master math, code, and reasoning while speaking the languages of Bharat and the Global South. We believe talent and ambition from India can shape the frontier of AI. We work at the edge of research in architecture, large-scale training, and language resources, reimagining what’s possible for low-resource and diverse languages, and we are dedicated to fundamental advances in both the science and engineering of AI. Join us at the vanguard and help put India at the center of global AI innovation.
We open-source where we can, train on sovereign compute, and publish research that advances Indic NLP, systems for ML, and efficient inference.
Sovereign compute via the IndiaAI Mission
High-quality pre-training corpus to date
Frontier-scale architecture in training
Indian, Global South, and programming languages
// CORE CAPABILITIES
Advanced mathematical reasoning, proofs, and symbolic computation — models tuned for rigorous step-by-step logic.
Multi-language code generation, debugging, and optimization across 20+ programming languages.
Logical deduction, complex analysis, and long-horizon problem solving for high-stakes workflows.
22 Indian languages, 20+ Global South languages, and English — a dedicated vertical for sovereign text and speech, with curated data and tokenization built for linguistic diversity.
// LANGUAGE COVERAGE
Most frontier labs optimize for English. EKΛ runs two parallel verticals: frontier math, code, and reasoning models for rigorous technical work, and sovereign multilingual modeling for Indian and Global South languages, with efficient tokenization and per-script curated corpora for regional text, speech, and domain-specific use cases.
Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Odia, Punjabi, Assamese, Urdu, Sanskrit, and more
Arabic, Indonesian, Thai, Vietnamese, Burmese, Kazakh, Portuguese, Spanish, and more
Python, Rust, Go, TypeScript, C++, Java, SQL, Julia, and more

// RESEARCH DIRECTIONS
If you care about data systems, training at scale, tokenizers, post-training, or ethical AI, these are the threads where your work ships into a national-scale model, not a side project.
Curation, filtering, and synthesis pipelines for Indic and Global South text, code, math, and speech — built for both pre-training and alignment stages.
Sparse mixture-of-experts (MoE) architectures and routing strategies built for high token efficiency and for the morphological diversity found across scripts and domains.
Building one of the most token-efficient vocabularies for Indian and Global South languages: maximizing bytes per token for Indic scripts (fewer tokens for the same text) and enabling longer context at lower compute cost.
SFT and preference optimization for math, code, and reasoning models; separate alignment pipelines for multilingual foundations and regulated-sector workflows.
Kernel fusion, speculative decoding, quantization, and systems co-design from training through production serving.
Optimal power and water usage for large-scale training — measuring and minimizing the environmental cost of frontier runs.
Applied research for defence, cybersecurity, military and civilian intelligence, finance, banking, and information technology — with auditable, on-premise deployment paths.
Safety, alignment, bias mitigation, and responsible deployment — embedding ethical constraints into data curation, training, evaluation, and release for sovereign and high-stakes use cases.
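To make the sparse MoE routing direction above concrete, here is a minimal sketch of generic top-k gating in plain Python. It is not a description of EKΛ's architecture, which this page does not specify. The names `top_k_route` and `moe_forward`, the scalar "experts", and the choice of k = 2 are all illustrative assumptions.

```python
import math
from typing import Callable, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their gates.

    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    Hypothetical helper for illustration only.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    router_logits: List[float],
    experts: List[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Combine only the selected experts' outputs, weighted by their gates."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_logits, k))


if __name__ == "__main__":
    # Four toy experts; only two run per token, which is the source of sparsity.
    experts = [lambda v: v, lambda v: 2 * v, lambda v: -v, lambda v: v + 1]
    logits = [0.1, 2.0, -1.0, 1.5]
    print(top_k_route(logits, k=2))
    print(moe_forward(3.0, logits, experts, k=2))
```

Only k of the experts are evaluated for each token, so compute per token stays roughly constant as the number of experts grows. Production routers usually add load-balancing terms, which this sketch omits.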
// RELATED RELEASES
Among the most token-efficient vocabularies for Indian and Global South languages — built in-house to cut sequence length and training cost versus mainstream open tokenizers.
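Token efficiency of this kind is commonly reported as bytes per token (higher is better) or tokens per word, sometimes called fertility (lower is better). The sketch below shows how those two numbers could be computed for any tokenizer. The two toy tokenizers are stand-in assumptions for illustration, not Soket's tokenizer or any mainstream open tokenizer, and the function names are hypothetical.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def bytes_per_token(text: str, tokenize: Tokenizer) -> float:
    """UTF-8 bytes of the input divided by the number of tokens produced."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def tokens_per_word(text: str, tokenize: Tokenizer) -> float:
    """Average tokens per whitespace-delimited word (often called fertility)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def char_tokenizer(text: str) -> List[str]:
    """Toy worst case: one token per non-space code point."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Toy best case: one token per whitespace-delimited word."""
    return text.split()


if __name__ == "__main__":
    sample = "नमस्ते दुनिया"  # Hindi for "hello world"; Devanagari is 3 bytes per code point in UTF-8
    for name, tok in [("char", char_tokenizer), ("word", word_tokenizer)]:
        print(name, bytes_per_token(sample, tok), tokens_per_word(sample, tok))
```

A real comparison would pass each tokenizer's encode function as `tokenize` and average over a held-out corpus per language, since a single sentence is not representative.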
We're hiring researchers and engineers across data, training, inference, and applied ML. If you want hard systems problems at sovereign scale, we'd like to hear from you.