IT Operations Technical Lead
This role explicitly uses Vibe Coding (AI-assisted development) to accelerate automation and reduce toil.
About the Role
Axle is hiring an IT Operations Technical Lead to own and optimize hybrid cloud and on-premises infrastructure, ensuring reliability, scalability, and operational readiness for mission-critical systems. The role combines hands-on Linux systems administration, incident and patch management, automation (including AI-assisted "Vibe Coding"), and leadership of a team supporting AI/ML and GPU-enabled workloads.
Job Description
Role
Axle is seeking an IT Operations Technical Lead to oversee hybrid cloud and on-premises infrastructure with a focus on reliability, scalability, automation, and operational excellence. The lead will provide hands-on technical direction, manage incident response and remediation, and mentor systems engineers supporting AI/ML workloads and GPU-enabled environments.
Key Responsibilities
- Lead IT operations aligned with ITIL processes (Incident, Problem, Change, Release Management).
- Provide hands-on management of Linux and Windows environments across cloud and on-premises infrastructure.
- Drive incident response, root cause analysis, and service restoration for mission-critical systems.
- Design, build, and maintain golden images, patching strategies, and system hardening standards.
- Lead patch management and vulnerability remediation programs.
- Develop and implement automation solutions, including AI-assisted development (Vibe Coding), to reduce toil.
- Support and optimize infrastructure for AI/ML workloads, including provisioning, scaling, and performance tuning.
- Manage GPU-enabled compute environments for high-performance computing and machine learning.
- Oversee monitoring, logging, alerting, and observability frameworks.
- Manage and mentor a team of systems engineers and collaborate with architecture, security, and development teams.
- Maintain documentation, runbooks, and SOPs; ensure operational readiness.
Requirements
- 5+ years leading operations teams with hands-on experience driving operational process improvements.
- 10+ years of hands-on Unix/Linux experience, including CentOS / Red Hat administration in large-scale distributed environments.
- Proven experience implementing and operating within ITIL frameworks.
- Hands-on incident management, patching, system hardening, and production support experience.
- Experience building and maintaining golden images and standardized environments.
- Strong scripting/automation skills (Python, Bash, PowerShell or similar).
- Experience with configuration management and automation tools (Ansible, Terraform, Puppet, Chef or similar).
- Strong networking fundamentals (DNS, TCP/IP, firewalls, load balancing).
- Experience with monitoring and logging tools (Nagios, Splunk, ELK, Prometheus, Grafana).
- Cloud build-out or migration experience with at least one provider: Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
- 2+ years with CI/CD and automation tools such as Terraform, Ansible, Chef, Puppet, Jenkins, GitHub.
- Experience supporting AI/ML workloads and GPU-based compute environments (e.g., NVIDIA GPU instances).
- Knowledge of security and compliance frameworks (NIST 800-53, FedRAMP, FISMA) is preferred.
Preferred Certifications
- ITIL, Linux, AWS, Azure, or Kubernetes (CKA/CKAD) certifications.
- Networking certifications (CCNA/CCNP) considered a plus.
Benefits
- 100% medical, dental & vision coverage for employees
- Paid time off and paid holidays
- 401(k) match up to 5%
- Educational benefits for career growth
- Employee referral bonus
- Flexible spending accounts: Healthcare (FSA), Parking (PRK), Dependent Care (DCAP), and Transportation (TRN)