About Soket AI
Soket is an AI research firm headquartered in Bengaluru with a mission to build efficient and generalized intelligence for humanity. We are focused on advancing frontier AI research through the development of large-scale foundation models in math, code, and reasoning that are open, energy-efficient, multilingual, and responsible by design. We are funded and supported by the IndiaAI Mission, Government of India. Our work places a strong emphasis on India and the Global South, where access to high-quality AI systems remains limited despite immense linguistic and cultural diversity.
At Soket, we believe the future of AI should be accessible, scalable, and aligned with real-world societal needs. Our teams work across large language models, multimodal systems, speech technologies, reasoning systems, and large-scale AI infrastructure, with a strong focus on open research and practical deployment. We are deeply passionate about pushing the boundaries of AI research while building systems that are useful, trustworthy, and globally impactful.
Compensation
Rs 35,00,000 – Rs 50,00,000 (Includes Equity Benefits)
Compensation will be commensurate with industry standards and will be determined based on the candidate's current compensation, relevant experience, skills, and overall qualifications.
Workloads you would be involved in
- Build and maintain scalable, production-grade data pipelines for LLM training data.
- Process large-scale datasets across ingestion, parsing, cleaning, normalization, filtering, and dataset packing.
- Engineer high-throughput distributed data systems for web, code, document, synthetic, and structured datasets.
- Design and manage dataset versioning, lineage, reproducibility, and governance infrastructure.
- Optimize ETL and data processing systems for throughput, cost, reliability, and scalability.
- Build scalable data quality and validation pipelines including deduplication, contamination detection, and metadata enrichment.
- Integrate data systems with large-scale training and inference infrastructure.
- Support synthetic data generation, validation, and automated processing workflows.
- Build developer tooling, CI/CD, observability, and reliability systems for data infrastructure.
- Collaborate closely with research and model teams to deliver training-ready datasets and scalable data platforms.
You are a good fit if you:
- Have a Master's or Bachelor's degree in Computer Science or a related field (Master's preferred).
- Have strong software engineering and data systems fundamentals.
- Enjoy building reliable, modular, and scalable infrastructure.
- Love working with large-scale datasets and distributed systems.
- Have strong debugging and performance optimization skills.
- Care deeply about reproducibility, data quality, and engineering rigor.
- Enjoy solving hard systems and infrastructure problems powering frontier AI models.
You are a strong candidate if you have experience with:
Data & Processing Frameworks
- Apache Spark, Hugging Face Datasets, Apache Arrow, Pandas, Parquet, and JSONL pipelines.
- Distributed and parallel data processing tools like Dask, Ray, Polars, Apache Beam, or Kafka.
- Dataset storage and versioning systems like Delta Lake, lakeFS, DVC, or WebDataset.
Data Quality & Validation
- Exact and semantic deduplication.
- MinHash / LSH-based duplicate detection.
- Contamination detection and heuristic filtering.
- Metadata enrichment and dataset validation systems.
Software Engineering & Infrastructure
- Python as a primary language, with familiarity with C++, Rust, Go, Java, or TypeScript.
- Git, GitHub, CI/CD, testing, and production engineering workflows.
- Building scalable, production-grade software or distributed data systems.
- Large-scale data processing and LLM data pipelines, including synthetic data workflows and training data engineering.
Why work with Soket?
At Soket, you will get the chance to work on problems that only a handful of teams in the world are solving today: building frontier foundation models at scale. You will see first-hand how intelligence is baked into large models and work across the entire stack that powers modern AI systems. You will work with supercomputing-scale GPU clusters and tackle challenging problems in petabyte-scale data aggregation and processing, distributed training, model architectures, infrastructure, inference optimization, and large-scale AI deployment.
One day you might be debugging CUDA kernels or NCCL issues, another day optimizing throughput for multi-GPU training runs, building new infrastructure tooling, or experimenting with ideas that make training faster and more efficient. We are a deeply research-driven and engineering-focused team that loves nerding out about systems, scaling laws, training stacks, and AI research. If you enjoy going deep into technical problems and learning from highly talented researchers and engineers, you will feel right at home here. Most importantly, we are building efficient, open, and accessible AI systems for India, the Global South, and ultimately for humanity as a whole.
If this sounds exciting to you, come build the future with us.
Soket AI Labs is a research-first AI company headquartered in Bengaluru. We are an equal opportunity employer and strongly encourage applications from people of all genders, backgrounds, and ethnicities. We offer competitive compensation, equity participation opportunities, flexible work arrangements across office and remote settings, comprehensive leave policies including parental and wellness leave, and regular team offsites designed to foster collaboration and innovation.
As an AI-native organization, we use AI systems as part of our candidate assessment and interview processes. Please make sure your resume aligns with the job description. More details about how candidate data is processed and used will be available on the application page.