Senior Lead Engineer
Job Summary:
Capital One is searching for a Senior Lead Engineer in Generative AI Infrastructure to play a pivotal role in advancing foundational AI capabilities. The position contributes to initiatives that include building distributed training clusters, deploying large language models (LLMs) on GPU instances, and supporting AI research and development on public cloud infrastructure.
Job Duties and Responsibilities:
- Deploy large-scale distributed training clusters in the public cloud, optimizing the storage and networking stack and employing multiple parallelism strategies.
- Design and implement fault-tolerant infrastructure to support long-running, large-scale training tasks, using containers and checkpointing libraries.
- Develop runtime infrastructure to serve large ML models, such as LLMs and foundation models (FMs), in the public cloud.
- Establish infrastructure for deploying search indexes and embeddings in vector databases, keeping it closely aligned with other capabilities.
- Collaborate with cloud and container infrastructure teams, as well as AI researchers, to design and implement key capabilities.
Qualifications and Experience:
- Bachelor's degree in Computer Science, Computer Engineering, or a related technical field.
- At least 8 years of experience designing and constructing data-intensive solutions using distributed computing.
- Proficiency in Python, Go, Scala, or Java, with a minimum of 8 years of programming experience.
- Minimum of 1 year of experience with high-performance computing (HPC), vector embedding, or semantic search technologies.
- At least 1 year of experience building, scaling, and optimizing training or inference systems for deep neural networks.
Preferred Qualifications:
- Master's or Doctoral degree in Computer Science, Computer Engineering, Electrical Engineering, Mathematics, or a related field.
- Background in machine learning with experience in large-scale training and deployment of deep neural networks and/or transformer architectures.
- Experience with machine learning frameworks such as TensorFlow, PyTorch, Lightning, or MosaicML.
- Ability to thrive in a fast-paced environment with ambiguity and competing priorities and deadlines.
- Previous experience at technology- and product-driven companies or startups.
- Ability to iterate rapidly with researchers and engineers to enhance product experience while building foundational capabilities.
- Familiarity with deploying large neural network models in demanding production environments.
- Experience building GPU clusters in the public cloud with tightly coupled storage and networking.
Salary:
- New York City (Hybrid On-Site): $234,700 - $267,900 for Sr. Lead Machine Learning Engineer
Benefits:
- Performance-based incentive compensation, including cash bonuses and long-term incentives.
- Comprehensive health, financial, and other benefits supporting total well-being.
- Opportunities for career growth and development within Capital One.
- Inclusive and supportive work environment fostering personal and professional growth.
About the Company:
Join Capital One in its mission to create trustworthy, reliable, and human-in-the-loop AI systems, transforming the banking industry for good. Help us reimagine how we serve our customers and businesses with emerging AI capabilities.