[Remote] Principal GPU Infrastructure Engineer – AI/HPC Systems
Note: This is a remote position open to candidates in the USA.
Axiom Recruit is partnering with a rapidly scaling technology business that is building advanced compute infrastructure for next-generation AI systems. They are seeking a Principal GPU Infrastructure Engineer to design and operate large-scale GPU environments supporting demanding enterprise-grade workloads across high-performance compute platforms.
Responsibilities
- Own the lifecycle management of large-scale GPU infrastructure, from provisioning and firmware validation through to operational reliability
- Lead operations across high-density, liquid-cooled compute environments supporting next-generation AI workloads
- Build automated observability and remediation systems using Prometheus, Grafana, NVIDIA DCGM, and infrastructure automation tooling
- Drive NetBox DCIM integration, asset management, IPAM, and infrastructure compliance across complex compute environments
- Act as a senior technical lead for infrastructure operations, incident response, vendor management, and enterprise-level infrastructure support
Skills
- Strong experience managing large-scale GPU, HPC, or high-performance compute infrastructure
- Deep hands-on expertise with NVIDIA GPU systems, including H200, B200, or B300 environments
- Advanced knowledge of InfiniBand, NVLink, NVSwitch, and high-throughput networking architectures
- Strong Linux systems engineering background with infrastructure automation using Python or Go
- Experience with observability and monitoring tooling including Prometheus, Grafana, NVIDIA DCGM, and SNMP
- Proven experience across bare-metal provisioning, infrastructure lifecycle management, and automated/self-healing systems
- Experience with liquid-cooled or high-density compute environments
- Familiarity with NVIDIA Mission Control and GPU cluster management
- Exposure to confidential compute technologies and attestation workflows
- Experience building infrastructure standards in fast-scaling environments
Benefits
- Competitive salary and benefits package
- Opportunity to build next-generation AI infrastructure
- Exposure to cutting-edge GPU and HPC environments
- Strong ownership across infrastructure and automation
- Engineering-led culture working on mission-critical systems
Company Overview