AI Research & Engineering

Kernel Engineer - Foundation Models

Bengaluru, India

About Soket AI

Soket is an AI research firm headquartered in Bengaluru with a mission to build efficient and generalized intelligence for humanity. We are focused on advancing frontier AI research through the development of large-scale foundation models in math, code, and reasoning that are open, energy-efficient, multilingual, and responsible by design. We are funded and supported by the IndiaAI Mission, Government of India. Our work places a strong emphasis on India and the Global South, where access to high-quality AI systems remains limited despite immense linguistic and cultural diversity.

At Soket, we believe the future of AI should be accessible, scalable, and aligned with real-world societal needs. Our teams work across large language models, multimodal systems, speech technologies, reasoning systems, and large-scale AI infrastructure, with a strong focus on open research and practical deployment. We are deeply passionate about pushing the boundaries of AI research while building systems that are useful, trustworthy, and globally impactful.

Compensation

Rs 55,00,000 – Rs 75,00,000 (Includes Equity Benefits)
Compensation will be commensurate with industry standards and will be determined based on the candidate's current compensation, relevant experience, skills, and overall qualifications.

What you will work on

Design, develop, and optimise high-performance CUDA kernels for large-scale foundation model training and inference workloads.
Build and tune kernels specifically for modern NVIDIA GPU architectures including Hopper and Blackwell.
Develop efficient GPU compute primitives for transformer operations including attention, GEMM, softmax, normalisation, and MoE workloads.
Optimise model execution pipelines through kernel fusion, warp specialisation, asynchronous execution, and memory hierarchy optimisation.
Design and implement high-throughput kernels for inference acceleration and low-latency serving systems.
Work on efficient attention implementations, FlashAttention-style kernels, KV-cache management, and sequence-parallel inference pipelines.
Develop and optimise quantisation and dequantisation kernels across 1-bit, 4-bit, and 8-bit precision schemes.
Improve training and inference throughput through memory access optimisation, tensor core utilisation, occupancy tuning, and launch configuration optimisation.
Benchmark, profile, and debug GPU kernels across distributed training and inference environments.
Analyse bottlenecks at the kernel, compiler, and hardware levels using low-level GPU profiling and debugging tooling.
Implement and optimise kernels using CUDA, Triton, and TileLang while ensuring portability and performance across architectures.
Work closely with model training and systems teams to integrate custom kernels into large-scale foundation model pipelines and runtime systems.
Contribute to performance-critical components within training frameworks, inference engines, and compiler/runtime stacks.

You are a good fit if you:

Have 3+ years of experience in GPU kernel engineering, HPC systems, compiler optimisation, or high-performance ML infrastructure.
Have a deep understanding of NVIDIA GPU hardware architecture, including SM design, memory hierarchy, tensor cores, warp scheduling, and execution pipelines.
Have strong expertise in CUDA programming and GPU performance optimisation.
Understand modern kernel optimisation techniques including kernel fusion, warp specialisation, persistent kernels, pipelining, and asynchronous execution.
Enjoy working close to hardware and solving performance bottlenecks at the systems and compiler level.
Have strong debugging, profiling, and benchmarking skills for performance-critical GPU workloads.
Possess a strong algorithmic understanding of deep learning workloads and transformer execution patterns.
Love squeezing every microsecond of performance from kernels and understanding why a kernel performs the way it does.

You are a strong candidate if you have experience with:

CUDA kernel development and optimisation for Hopper, Blackwell, or comparable GPU architectures.
GPU programming frameworks and DSLs including CUDA, Triton, and TileLang.
Low-level GPU profiling and debugging tools such as Nsight Systems, Nsight Compute, cuda-gdb, CUPTI, and nvprof.
PTX and low-level GPU ISAs, with the ability to read, analyse, and hand-optimise PTX code.
Memory optimisation techniques including shared memory tuning, register pressure optimisation, cache-aware design, coalesced memory access, and bank-conflict reduction.
Tensor Core programming, WMMA operations, mixed-precision compute, and BF16/FP8 optimisation.
Kernel scheduling and execution optimisation techniques including occupancy tuning, instruction-level parallelism, stream overlap, and launch optimisation.
Training and inference libraries and compiler stacks such as CUTLASS, cuBLAS, cuDNN, TensorRT, TorchInductor, or XLA.
Efficient transformer and inference algorithms including FlashAttention, paged attention, speculative decoding, KV-cache compression, and sequence batching.
Quantisation formats and methods including INT8, INT4, FP8, GPTQ, AWQ, and custom quantisation pipelines.
Distributed training and inference systems where kernel performance directly impacts cluster-scale efficiency.
Benchmarking and performance modelling using roofline analysis, latency/throughput profiling, and bottleneck decomposition.

Why work with Soket?

At Soket, you will get the chance to work on problems that only a handful of teams in the world are solving today: building frontier foundation models at scale. You will see first-hand how intelligence is baked into large models and work across the entire stack that powers modern AI systems. You will work with supercomputing-scale GPU clusters and tackle challenging problems in petabyte-scale data aggregation and processing, distributed training, model architectures, infrastructure, inference optimisation, and large-scale AI deployment.

One day you might be debugging CUDA kernels or NCCL issues, another day optimising throughput for multi-GPU training runs, building new infrastructure tooling, or experimenting with ideas that make training faster and more efficient. We are a deeply research-driven and engineering-focused team that loves nerding out about systems, scaling laws, training stacks, and AI research. If you enjoy going deep into technical problems and learning from highly talented researchers and engineers, you will feel right at home here. Most importantly, we are building efficient, open, and accessible AI systems for India, the Global South, and ultimately for humanity as a whole.

If this sounds exciting to you, come build the future with us.

Apply Now!

Soket AI Labs is a research-first AI company headquartered in Bengaluru. We are an equal opportunity employer and strongly encourage applications from people of all genders, backgrounds, and ethnicities. We offer competitive compensation, equity participation opportunities, flexible work arrangements across office and remote settings, comprehensive leave policies including parental and wellness leave, and regular team offsites designed to foster collaboration and innovation.

As an AI-native organization, we use AI systems as part of our candidate assessment and interview processes. Please make sure your resume aligns with the job description. More details about how candidate data is processed and used will be available on the application page.
