AI Research & Engineering

Research Scientist, Data - Foundation Models

Bengaluru, India

About Soket AI

Soket is an AI research firm headquartered in Bengaluru with a mission to build efficient and generalized intelligence for humanity. We are focused on advancing frontier AI research through the development of large-scale foundation models in math, code, and reasoning that are open, energy-efficient, multilingual, and responsible by design. We are funded and supported by the IndiaAI Mission, Government of India. Our work places a strong emphasis on India and the Global South, where access to high-quality AI systems remains limited despite immense linguistic and cultural diversity.

At Soket, we believe the future of AI should be accessible, scalable, and aligned with real-world societal needs. Our teams work across large language models, multimodal systems, speech technologies, reasoning systems, and large-scale AI infrastructure, with a strong focus on open research and practical deployment. We are deeply passionate about pushing the boundaries of AI research while building systems that are useful, trustworthy, and globally impactful.

Compensation

Rs 55,00,000 – Rs 1,00,00,000 (Includes Equity Benefits)
Compensation will be commensurate with industry standards and will be determined based on the candidate's current compensation, relevant experience, skills, and overall qualifications.

Workloads you would be involved in

Lead capability-driven data strategy for frontier foundation model development across pretraining and post-training stages.
Design and own end-to-end datasets that improve model capabilities in code generation, mathematical reasoning, multi-step problem solving, and agentic tool use.
Define data mixture strategies, inclusion/exclusion policies, and quality standards that improve reasoning depth, robustness, and model generalisation.
Build scalable synthetic data generation and execution-grounded training systems inspired by modern LLM frameworks, including self-play, self-improvement, programmatic task generation, and automated code test-case creation.
Design and maintain sandboxed execution environments where model outputs can be executed, verified, critiqued, and incorporated into iterative training loops.
Establish large-scale data curation and evaluation infrastructure involving filtering, deduplication, contamination detection, and reasoning trace validation.
Design annotation systems, quality rubrics, and human/AI-in-the-loop workflows to create high-signal datasets for pretraining, supervised fine-tuning, and reinforcement learning.
Drive the data → model → evaluation feedback loop by diagnosing capability gaps and mapping failures to targeted data interventions.
Design and run controlled experiments across synthetic versus curated data, process versus outcome supervision, and post-training alignment methodologies.
Develop post-training datasets and workflows for supervised fine-tuning, preference optimisation, and reinforcement learning from human or AI feedback.
Build systems that improve long-horizon reasoning, execution-grounded decision-making, and tool-augmented model behaviour.
Provide hands-on technical leadership by building systems, writing production-grade code, mentoring teams, and translating research and product goals into scalable data systems that improve real-world model performance.

You are a good fit if you:

Hold a Master’s or PhD in Computer Science or related fields (PhD preferred).
Have 6+ years of experience in AI research, data systems, machine learning infrastructure, or related domains.
Possess strong expertise in Python and demonstrate a sound understanding of programming concepts across languages such as C++, Rust, Go, Java, JavaScript, or TypeScript.
Have a deep understanding of LLM training pipelines, synthetic data systems, and modern post-training methodologies.
Enjoy designing high-quality datasets and building scalable systems that directly improve model capabilities.
Have experience bridging research and engineering by converting experimental ideas into reliable, production-scale systems.
Enjoy working across the full lifecycle of data strategy, from acquisition and curation to evaluation and model alignment.

You are a strong candidate if you have experience with:

Deep learning and model development frameworks such as PyTorch, Hugging Face Transformers, Hugging Face Datasets, vLLM, lm-evaluation-harness, and SGLang.
Large-scale training and inference ecosystems including TensorFlow, JAX, DeepSpeed, Megatron-LM, Megatron-Core, TRL, verl, NeMo-RL, TensorRT-LLM, ONNX Runtime, Triton Inference Server, and llama.cpp.
Synthetic data and execution sandbox frameworks such as OpenSandBox, NemoGym, Magpie, and Math-Verify.
Distributed infrastructure and MLOps tooling including Docker, Kubernetes, Apache Spark, Ray, Prometheus, and Grafana.
Version control and CI/CD systems including Git, GitHub, GitHub Actions, Jenkins, and related engineering workflows.
Experiment tracking and evaluation tooling such as Weights & Biases and MLflow.
Publishing research at leading AI and NLP conferences such as ACL, EMNLP, and ICML.
Demonstrable software or research impact through public adoption (for example GitHub adoption, contributors, or downloads) or internal production-scale systems such as GPU clusters, data pipelines, or inference infrastructure.

Why work with Soket?

At Soket, you will get the chance to work on problems that only a handful of teams in the world are solving today - building frontier foundation models at scale. You will see first-hand how intelligence is baked into large models and work across the entire stack that powers modern AI systems. You will work with supercomputing-scale GPU clusters and tackle challenging problems in petabyte-scale data aggregation and processing, distributed training, model architectures, infrastructure, inference optimization, and large-scale AI deployment.

One day you might be debugging CUDA kernels or NCCL issues, another day optimizing throughput for multi-GPU training runs, building new infrastructure tooling, or experimenting with ideas that make training faster and more efficient. We are a deeply research-driven and engineering-focused team that loves nerding out about systems, scaling laws, training stacks, and AI research. If you enjoy going deep into technical problems and learning from highly talented researchers and engineers, you will feel right at home here. Most importantly, we are building efficient, open, and accessible AI systems for India, the Global South, and ultimately for humanity as a whole.

If this sounds exciting to you, come build the future with us.

Apply Now!

Soket AI Labs is a research-first AI company headquartered in Bengaluru. We are an equal opportunity employer and strongly encourage applications from people of all genders, backgrounds, and ethnicities. We offer competitive compensation, equity participation opportunities, flexible work arrangements across office and remote settings, comprehensive leave policies including parental and wellness leave, and regular team offsites designed to foster collaboration and innovation.

As an AI-native organization, we use AI systems as part of our candidate assessment and interview processes. Please make sure your resume aligns with the job description. More details about how candidate data is processed and used will be available on the application page.
