Overview
Salary: $60-$63.83 hourly (up to $63.81/hr, W2)
Position Summary
The OMS RUN Platform Reliability Lead is a highly technical leadership role responsible for ensuring the reliability, stability, scalability, and continuous improvement of the Fluent Commerce Order Management (OMS) platform. This role sits at the intersection of Software Engineering, Site Reliability Engineering (SRE), and Technical Operations, leading the RUN support organization while driving automation-first operational excellence.
Unlike a traditional application support role, this position requires deep engineering expertise to troubleshoot complex production issues, develop self-healing automation, optimize platform performance, and partner closely with engineering teams to improve the resilience of the Order Management ecosystem.
The ideal candidate is experienced working within high-volume, event-driven SaaS platforms and possesses strong technical knowledge of Fluent Commerce, GraphQL APIs, Java extensions, SQL, Python automation, and cloud-based observability tools.
Key Responsibilities
Platform Reliability & Self-Healing Automation
- Design and implement automated remediation solutions that reduce manual operational effort and improve platform resiliency.
- Build automated Order Replay capabilities to recover synchronization failures across event-driven integrations.
- Develop utilities and automation using Python, Fluent Commerce APIs, and SDKs for bulk updates, data remediation, and operational clean-up activities.
- Create predictive monitoring and proactive alerting that identifies issues before they impact customers.
- Continuously identify opportunities to eliminate operational toil through automation.
Observability & Platform Monitoring
- Design and maintain advanced monitoring dashboards using Datadog, Splunk, New Relic, or similar observability platforms.
- Monitor GraphQL performance, API latency, webhook processing, order throughput, and platform health.
- Configure intelligent alerting for:
  - Stuck Orders
  - Inventory synchronization failures
  - API degradation
  - Event processing delays
  - Integration failures
- Analyze production trends to proactively improve platform stability and performance.
Technical Incident Management
- Serve as the highest-level technical escalation point for complex production incidents.
- Perform deep code-level troubleshooting involving:
  - Java custom extensions
  - Fluent Commerce workflows
  - GraphQL mutations
  - REST integrations
  - Event processing
- Lead technical Root Cause Analysis (RCA) and develop permanent corrective actions.
- Document technical findings, workarounds, automation opportunities, and platform improvements.
- Drive continuous improvement of operational processes through lessons learned from production incidents.
Performance Engineering
- Analyze platform performance, API response times, database interactions, and integration bottlenecks.
- Recommend architectural changes that improve scalability and system performance.
- Partner with engineering teams to optimize application performance and reduce operational risk.
- Identify opportunities to improve platform efficiency across high-volume transactional environments.
Engineering Collaboration
- Act as the primary technical liaison between:
  - Software Engineering
  - Enterprise Architecture
  - E-Commerce Product Teams
  - Infrastructure & Platform Operations
  - Fluent Commerce Engineering
- Ensure operational considerations are incorporated into product design and development roadmaps.
- Collaborate with Fluent Commerce product teams on:
  - Platform upgrades
  - API versioning
  - New platform capabilities
  - Production issue resolution
Team Leadership
- Lead and mentor the OMS RUN support engineering team.
- Develop technical capabilities across the organization through coaching and knowledge sharing.
- Establish operational best practices for:
  - GraphQL optimization
  - Java troubleshooting
  - API diagnostics
  - Incident response
  - Automation development
- Foster a culture focused on engineering excellence, reliability, and continuous improvement.
Change Management & Release Integrity
- Review technical configurations and platform extensions before production deployments.
- Validate production readiness and deployment integrity.
- Support CI/CD processes and operational release governance.
- Manage operational configuration changes using Git version control.
- Ensure proper branching strategies for hotfixes, emergency changes, and production support.
Required Qualifications
Education
- Bachelor's degree in Computer Science, Software Engineering, Information Technology, or a related technical discipline.
Experience
- 5+ years of experience supporting enterprise Order Management Systems, Platform Engineering, Site Reliability Engineering, or Technical Operations.
- Experience supporting high-volume, mission-critical SaaS applications.
- Experience leading technical production support or reliability engineering teams.
- Demonstrated success implementing operational automation and reducing manual support activities.
Technical Qualifications
Preferred OMS Experience
- Advanced experience with Fluent Commerce including:
  - GraphQL API
  - Webhooks
  - Essential Rules
  - Event Processing
  - Order Lifecycle Management
  - Inventory Management
Programming & Development
- Strong SQL skills for operational analysis and complex transactional querying.
- Proficiency in Python for automation, scripting, and API integrations.
- Ability to read, troubleshoot, and debug Java applications and custom extensions.
- Experience developing against REST APIs and GraphQL APIs.
- Strong understanding of JSON schemas and API payload structures.
Integration & Event-Driven Architecture
Experience with modern distributed systems including:
- RESTful services
- Event-driven architectures
- Pub/Sub messaging
- Kafka
- Azure Event Grid
- Webhooks
- Asynchronous processing
Observability & Monitoring
Experience with one or more of the following:
- Datadog
- Splunk
- ELK Stack
- New Relic
- Grafana
- Prometheus
DevOps & Source Control
- Strong Git experience including branching strategies and release management.
- Familiarity with CI/CD deployment pipelines.
- Experience supporting production releases within Agile environments.
Professional Competencies
- Strong analytical and problem-solving skills with the ability to diagnose complex production issues across multiple technology layers.
- SRE mindset focused on automation, scalability, reliability, and reducing operational toil.
- Deep understanding of ITIL principles applied within modern cloud-native environments.
- Excellent communication skills with the ability to translate technical concepts into business impact.
- Strong collaboration and leadership skills with experience working across engineering, infrastructure, and business teams.
- Ability to thrive in fast-paced, high-availability production environments supporting mission-critical commerce platforms.
Preferred Qualifications
- Fluent Commerce implementation or platform engineering experience.
- Experience supporting enterprise retail or eCommerce platforms.
- Experience with cloud-native SaaS architectures.
- Knowledge of microservices architecture and distributed systems.
- Experience building automation frameworks and self-healing operational tooling.
- Familiarity with Azure, AWS, or Google Cloud Platform.
- Experience working within Agile and DevOps environments.
Success Measures
Success in this role will be measured by:
- Increased platform availability and reliability
- Reduced Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR)
- Increased operational automation and self-healing capabilities
- Reduced manual production support activities
- Improved incident prevention through proactive monitoring
- Enhanced system performance and scalability
- Strong cross-functional engineering partnerships
- Development of a high-performing, technically proficient RUN support team
**About Aquent Talent:** Aquent Talent connects the best talent in marketing, creative, and design with the world's biggest brands.
Our eligible talent get access to amazing benefits like subsidized health, vision, and dental plans, paid sick leave, and retirement plans with a match. We also offer free online training through Aquent Gymnasium.
Aquent is an equal-opportunity employer. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, and other legally protected characteristics. We're about creating an inclusive environment, one where different backgrounds, experiences, and perspectives are valued, and everyone can contribute, grow their careers, and thrive.