AI Research & Engineering

AI Data Curator - Foundation Models

Bengaluru, India

About Soket AI

Soket is an AI research firm headquartered in Bengaluru with a mission to build efficient and generalized intelligence for humanity. We are focused on advancing frontier AI research through the development of large-scale foundation models in math, code, and reasoning that are open, energy-efficient, multilingual, and responsible by design. We are funded and supported by the IndiaAI Mission, Government of India. Our work places a strong emphasis on India and the Global South, where access to high-quality AI systems remains limited despite immense linguistic and cultural diversity.

At Soket, we believe the future of AI should be accessible, scalable, and aligned with real-world societal needs. Our teams work across large language models, multimodal systems, speech technologies, reasoning systems, and large-scale AI infrastructure, with a strong focus on open research and practical deployment. We are deeply passionate about pushing the boundaries of AI research while building systems that are useful, trustworthy, and globally impactful.

Compensation

Rs 18,00,000 – Rs 25,00,000

Workloads you would be involved in

Source, curate, and organize high-quality datasets for frontier model development across web, code, document, speech/audio, and synthetic data sources.
Build and maintain datasets aligned with target model capabilities including reasoning, coding, mathematics, multilinguality, and instruction following.
Work with Data Engineering teams to support scalable data acquisition pipelines, web crawling, API-based ingestion, and dataset extraction workflows.
Evaluate dataset quality at scale using manual review and automated validation pipelines.
Identify and filter duplicated, noisy, unsafe, contaminated, or low-signal data across structured and unstructured corpora.
Design and execute annotation, labeling, and quality assurance workflows for supervised fine-tuning and post-training datasets.
Curate and validate synthetic datasets including reasoning traces, self-instruction data, tool-use trajectories, and AI-generated preference datasets.
Maintain dataset metadata, provenance records, versioning, and reproducible curation workflows across iterative releases.
Monitor benchmark contamination, training-test leakage, and evaluation overlap to ensure reliable model benchmarking.
Collaborate closely with Data Scientists, Research Scientists, Model Engineers, and Data Engineers to align datasets with evolving model capability goals.
Continuously improve dataset quality, diversity, and coverage through iterative review, feedback loops, and model-informed curation strategies.

You are a good fit if you:

Hold a Bachelor's or Master's degree in Computer Science, Data Science, Artificial Intelligence, Computational Linguistics, Information Science, or related fields.
Have demonstrated experience in dataset curation, annotation, corpus development, or AI training data workflows.
Have strong analytical rigor and attention to detail with the ability to make nuanced data quality judgments.
Enjoy discovering, organizing, and improving large-scale datasets.
Understand AI/LLM data ecosystems and appreciate the importance of diversity, quality, reliability, and traceability in training data.
Can work closely with engineering and research teams to translate model needs into practical curation strategies.
Are comfortable working with large-scale datasets and iterative data improvement workflows.

You are a strong candidate if you have experience with:

Programming & Data Tooling

Python for data analysis, validation, scripting, and workflow interaction.
Dataset tooling, formats, and processing frameworks such as Hugging Face Datasets, Pandas, Apache Spark, Apache Arrow, Apache Parquet, JSONL pipelines, and Dask.

Web & Multi-Source Data Acquisition

Web crawling, API-driven data acquisition, HTML parsing, and large-scale dataset ingestion workflows.
Tools such as Scrapy, BeautifulSoup, and Selenium.
Common Crawl-style datasets and large-scale document extraction workflows.

Speech & Audio Dataset Curation

Speech and audio datasets for ASR, TTS, speech translation, conversational AI, or multimodal systems.
Datasets such as Mozilla Common Voice and OpenSLR.

Data Quality, Filtering & Validation

Exact and semantic deduplication, contamination detection, heuristic filtering, metadata enrichment, and corpus diversity analysis.
MinHash/LSH pipelines, FAISS, and Elasticsearch.

Annotation & Quality Assurance

Human-in-the-loop annotation workflows, reasoning trace review, code correctness validation, mathematical solution assessment, and preference/ranking datasets.
Annotation guidelines, QA systems, and annotator consistency evaluation.

Workflow & Collaboration Tools

Git and GitHub.
Experience supporting NLP, AI data operations, benchmark creation, or model training workflows is considered a strong advantage.

Why work with Soket?

At Soket, you will get the chance to work on problems that only a handful of teams in the world are solving today: building frontier foundation models at scale. You will see first-hand how intelligence is baked into large models and work across the entire stack that powers modern AI systems. You will work with supercomputing-scale GPU clusters and tackle challenging problems in petabyte-scale data aggregation and processing, distributed training, model architectures, infrastructure, inference optimization, and large-scale AI deployment.

One day you might be debugging CUDA kernels or NCCL issues, another day optimizing throughput for multi-GPU training runs, building new infrastructure tooling, or experimenting with ideas that make training faster and more efficient. We are a deeply research-driven and engineering-focused team that loves nerding out about systems, scaling laws, training stacks, and AI research. If you enjoy going deep into technical problems and learning from highly talented researchers and engineers, you will feel right at home here. Most importantly, we are building efficient, open, and accessible AI systems for India, the Global South, and ultimately for humanity as a whole.

If this sounds exciting to you, come build the future with us.

Apply Now!

Soket AI Labs is a research-first AI company headquartered in Bengaluru. We are an equal opportunity employer and strongly encourage applications from people of all genders, backgrounds, and ethnicities. We offer competitive compensation, equity participation opportunities, flexible work arrangements across office and remote settings, comprehensive leave policies including parental and wellness leave, and regular team offsites designed to foster collaboration and innovation.

As an AI-native organization, we use AI systems as part of our candidate assessment and interview processes. Please make sure your resume aligns with the job description. More details about how candidate data is processed and used will be available on the application page.
