AI Research & Engineering

Machine Learning Engineer - Foundation Models

Bengaluru, India

About Soket AI

Soket is an AI research firm headquartered in Bengaluru with a mission to build efficient and generalized intelligence for humanity. We are focused on advancing frontier AI research through the development of large-scale foundation models in math, code, and reasoning that are open, energy-efficient, multilingual, and responsible by design. We are funded and supported by the IndiaAI Mission, Government of India. Our work places a strong emphasis on India and the Global South, where access to high-quality AI systems remains limited despite immense linguistic and cultural diversity.

At Soket, we believe the future of AI should be accessible, scalable, and aligned with real-world societal needs. Our teams work across large language models, multimodal systems, speech technologies, reasoning systems, and large-scale AI infrastructure, with a strong focus on open research and practical deployment. We are deeply passionate about pushing the boundaries of AI research while building systems that are useful, trustworthy, and globally impactful.

Compensation

Rs 55,00,000 – Rs 75,00,000 (Includes Equity Benefits)
Compensation will be commensurate with industry standards and will be determined based on the candidate's current compensation, relevant experience, skills, and overall qualifications.

What you will work on

Design, implement, and optimise large-scale pretraining and post-training pipelines for foundation models.
Build efficient and scalable training algorithms for large language and multimodal models.
Translate state-of-the-art research papers into production-grade, performant, and maintainable training code.
Optimise distributed training workloads across multi-node and multi-GPU environments.
Work with distributed deep learning training frameworks to improve training efficiency, scalability, and fault tolerance.
Design and run ablation studies to identify effective architectures, hyperparameters, training strategies, and scaling behaviors.
Improve training throughput, memory efficiency, and hardware utilisation through algorithmic and systems-level optimisation.
Develop and integrate training pipelines with continuous evaluation, benchmarking, and validation frameworks.
Work on efficient checkpointing, model serialisation, export, and deployment-ready artifact generation.
Debug and resolve training instability, convergence issues, gradient pathologies, and distributed systems failures.
Implement and optimise post-training workflows including instruction tuning, alignment, preference optimisation, and model refinement pipelines.
Build tooling for experiment tracking, reproducibility, monitoring, and large-scale training orchestration.
Collaborate closely with research, infrastructure, and model teams to rapidly prototype and productionise new ideas.
Benchmark models across quality, efficiency, latency, and resource utilisation metrics.
Contribute to the design and improvement of internal training stacks, libraries, and developer tooling.

You are a good fit if you:

Have 3+ years of experience in machine learning engineering, deep learning systems, or related domains.
Have strong expertise in deep learning and modern neural network training methodologies.
Are proficient in turning research ideas and papers into efficient, production-grade code.
Have a deep understanding of distributed training paradigms including data, tensor, pipeline, and expert parallelism.
Have strong programming skills in Python and performance-oriented ML development.
Have experience debugging and optimising complex ML training pipelines with profilers such as nsys (NVIDIA Nsight Systems).
Understand training dynamics, hyperparameter tuning, scaling laws, and model evaluation methodologies.
Enjoy working at the intersection of research and engineering to build scalable AI systems.
Care deeply about performance, reproducibility, reliability, and engineering quality.

You are a strong candidate if you have experience with:

Distributed training frameworks such as Megatron-LM (hands-on experience is required), PyTorch Distributed, DeepSpeed, FSDP, or similar systems.
Large-scale pretraining and post-training workflows for foundation models.
Optimisation techniques including mixed precision training, activation checkpointing, gradient accumulation, and memory-efficient training.
Continuous evaluation, benchmarking, and experiment management systems.
Model export and deployment with frameworks such as SGLang, vLLM, or TensorRT.
CUDA, GPU kernel development, or GPU architecture and performance optimisation.
Profiling and debugging tools for distributed and GPU workloads.
HPC or large-scale GPU cluster environments and training infrastructure.
Preference optimisation, alignment methods, or reinforcement learning-based post-training techniques.
Building internal ML tooling, training infrastructure, and reusable libraries for large-scale experimentation.

Why work with Soket?

At Soket, you will get the chance to work on problems that only a handful of teams in the world are solving today: building frontier foundation models at scale. You will see first-hand how intelligence is baked into large models and work across the entire stack that powers modern AI systems. You will work with supercomputing-scale GPU clusters and tackle challenging problems in petabyte-scale data aggregation and processing, distributed training, model architectures, infrastructure, inference optimisation, and large-scale AI deployment.

One day you might be debugging CUDA kernels or NCCL issues, another day optimising throughput for multi-GPU training runs, building new infrastructure tooling, or experimenting with ideas that make training faster and more efficient. We are a deeply research-driven and engineering-focused team that loves nerding out about systems, scaling laws, training stacks, and AI research. If you enjoy going deep into technical problems and learning from highly talented researchers and engineers, you will feel right at home here. Most importantly, we are building efficient, open, and accessible AI systems for India, the Global South, and ultimately for humanity as a whole.

If this sounds exciting to you, come build the future with us.

Apply Now!

Soket AI Labs is a research-first AI company headquartered in Bengaluru. We are an equal opportunity employer and strongly encourage applications from people of all genders, backgrounds, and ethnicities. We offer competitive compensation, equity participation opportunities, flexible work arrangements across office and remote settings, comprehensive leave policies including parental and wellness leave, and regular team offsites designed to foster collaboration and innovation.

As an AI-native organization, we use AI systems as part of our candidate assessment and interview processes. Please make sure your resume aligns with the job description. More details about how candidate data is processed and used will be available on the application page.
