Posted Apr 16, 2026

REMOTE AI Support Operations Engineer

Title: AI Support Operations Engineer
Location: Fully REMOTE!
Salary: $150-200k/year + BONUS + RSUs

We're not following someone else's cloud blueprint - we're creating the next one. While legacy providers hand you a finished process, we're engineering the next generation of AI-optimized data center infrastructure from the ground up.

As our first internal Staff AI Support Operations Engineer, you'll be a foundational technical leader on a brand-new Ops team. This is a role for an architect-practitioner: the kind of engineer who can untangle a complex InfiniBand issue one hour and automate away the root cause the next. You won't just maintain systems - you'll build the operational standards and technical foundations that every future engineer will rely on.

Key Responsibilities
• Cluster Engineering & Operations: Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments.
• Infrastructure Source of Truth: Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained.
• Automation & Tooling: Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes.
• Tier 3 Escalation Lead: Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams.
• Documentation Excellence: Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard".
• Technical Leadership & Mentorship: Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales.

Qualifications
• Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and SuperMicro platforms, including IPMI, BMC, iDRAC workflows, and familiarity with Redfish-based management.
• Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management.
• Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level.
• Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures.
• Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments.

Nice to have
• Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures.
• High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data.
• Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure.

Legacy is predictable. Safe. Slow. We're none of those things. We're building the Neo-Cloud at AI speed, and the rules aren't handed to you - you define them. If you're ready to trade routine for impact and build systems that actually move the company forward, let's talk.
Interested in this role? Apply on iHire.