Major Incident Manager
Crusoe
The Major Incident Manager role is critical to maintaining service reliability and preserving customer trust. The position directly affects company success by minimizing downtime, managing high-severity incidents, and ensuring rapid resolution of complex technical challenges. You will lead the response to high-visibility incidents and customer escalations, acting as the central point of coordination to drive timely, effective outcomes.
In this role, you’ll spearhead the management of critical incidents from identification through resolution, while continuously improving incident response processes and support readiness. You’ll work cross-functionally with engineering, product, and customer teams to design scalable self-service support workflows, contribute to product improvements, and develop robust incident response strategies. You’ll also play a key role in mentoring team members, delivering training, and building knowledge resources that strengthen both internal teams and customer success.
We’re looking for a technically skilled professional with strong Linux expertise, excellent communication skills, and 4–5 years of customer-facing experience. Prior experience in incident management and on-call rotations is essential.
What You’ll Be Working On
Troubleshoot & Resolve
- Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training environments
- Lead high-severity incident response efforts to ensure rapid mitigation and minimal disruption to customer operations
- Manage customer escalations with professionalism, clarity, and urgency, ensuring stakeholder confidence throughout the incident lifecycle
Implement & Optimize
- Guide customers through the implementation, configuration, and optimization of HPC infrastructure
- Partner with customers to improve performance, scalability, and efficiency across their environments
Educate & Empower
- Develop and deliver internal and external training materials, including live training sessions, documentation, and knowledge base articles
- Provide ongoing enablement to help customers effectively adopt and maximize the value of company solutions
- Lead incident response training and preparedness initiatives for internal teams
Collaborate Internally
- Work closely with engineering and product teams to share customer feedback and operational insights
- Influence product enhancements and reliability improvements based on real-world incident data
- Contribute to the continuous improvement of incident management processes and the overall customer experience
What You’ll Bring to the Team
Technical Proficiency
- Strong hands-on experience with Linux, virtualization, and Kubernetes, and with managing customer incidents
- Solid understanding of the TCP/IP stack
- Working knowledge of Infrastructure-as-Code (IaC) practices
Essential Skills
- Excellent written and verbal communication skills, with the ability to clearly explain complex technical issues
- Proven problem-solving mindset with strong diagnostic and analytical abilities
- At least 3 years of experience in a team leadership role, serving as a liaison between internal teams and external customers
- 4–5 years of customer-facing experience in a technical environment
- Direct experience participating in or leading incident management efforts and on-call rotations
Bonus Skills
- Programming experience in one or more languages
Benefits & Perks
- Industry-competitive compensation
- Restricted Stock Units (RSUs) in a fast-growing, well-funded technology company
- Comprehensive health insurance options, including HDHP and PPO plans, plus vision and dental coverage for you and your dependents
- Employer HSA contributions
- Paid parental leave
- Company-paid life insurance, short-term disability, and long-term disability coverage
- Teladoc access
- 401(k) plan with a 100% company match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Tuition reimbursement
- Subscription to the Calm app
- MetLife Legal benefits
- Company-paid Commuter FSA benefit of $200 per month