[Remote] Principal GPU Infrastructure Engineer – AI/HPC Systems

Remote | Full-time | Now Hiring

Note: This is a remote position open to candidates in the USA. Axiom Recruit is partnering with a rapidly scaling technology business that is building advanced compute infrastructure for next-generation AI systems. They are seeking a Principal GPU Infrastructure Engineer to design and operate large-scale GPU environments that support demanding, enterprise-grade workloads across high-performance compute platforms.

Responsibilities

Own the lifecycle management of large-scale GPU infrastructure, from provisioning and firmware validation through to operational reliability
Lead operations across high-density, liquid-cooled compute environments supporting next-generation AI workloads
Build automated observability and remediation systems using Prometheus, Grafana, NVIDIA DCGM, and infrastructure automation tooling
Drive NetBox DCIM integration, asset management, IPAM, and infrastructure compliance across complex compute environments
Act as a senior technical lead for infrastructure operations, incident response, vendor management, and enterprise-level infrastructure support

Skills

Strong experience managing large-scale GPU, HPC, or high-performance compute infrastructure
Deep hands-on expertise with NVIDIA GPU systems, including H200, B200, or B300 environments
Advanced knowledge of InfiniBand, NVLink, NVSwitch, and high-throughput networking architectures
Strong Linux systems engineering background with infrastructure automation using Python or Go
Experience with observability and monitoring tooling including Prometheus, Grafana, NVIDIA DCGM, and SNMP
Proven experience across bare-metal provisioning, infrastructure lifecycle management, and automated/self-healing systems
Experience with liquid-cooled or high-density compute environments
Familiarity with NVIDIA Mission Control and GPU cluster management
Exposure to confidential compute technologies and attestation workflows
Experience building infrastructure standards in fast-scaling environments

Benefits

Competitive salary and benefits package
Opportunity to build next-generation AI infrastructure
Exposure to cutting-edge GPU and HPC environments
Strong ownership across infrastructure and automation
Engineering-led culture working on mission-critical systems

Company Overview

Axiom Recruit is a Web3, blockchain, and AI recruitment firm. It was founded in 2019 and is headquartered in Dubai, UAE, with a workforce of 11-50 employees. Its website is https://www.axiomrecruit.com/.
