Engineer, Sr. Site Reliability

Royal Caribbean Group


Date: 1 day ago
City: Pasay
Contract type: Full time
The Senior Site Reliability Engineer (Senior SRE) reports to the SRE Manager and supports the Royal Caribbean website, using application and user performance data to guide decision-making. The Senior SRE will use application and user performance metrics collected from various sources and tools to support tasks such as initial triage of critical production incidents, bug analysis, implementation of site reliability engineering best practices, infrastructure optimization, and collaboration between internal teams and external service providers, among other operational initiatives.

The ideal candidate will have a deep understanding of, and a proven track record in, a senior IT support role and can provide leadership in developing new employees. The ideal candidate will also keep an eye on the rapidly evolving technology landscape and lead the research and implementation of proactive and preventative measures, drawing on advanced and emerging concepts, to avoid technical incidents.

S/he must be able to work with multiple product and project teams simultaneously, thrive in a fast-paced and dynamic environment and connect unexpected threads across disparate teams.

Essential Duties And Responsibilities

At a high level, responsibilities for this role will include:

  • Product Health. Is responsible for the Incident Management, Application Performance, Configuration Management, and Operational Readiness of the products within her/his ownership. Partners and collaborates closely with stakeholders from the various teams within IT to ensure that performance, configuration, and monitoring tools meet the needs of her/his products.
  • Incident Management. Is responsible for a team of resources prepared to react quickly to production incidents, with the goal of restoring systems/applications to normal service operation as quickly as possible and minimizing the impact on guest/crew experience or business operations, thus ensuring the best possible service levels and availability are maintained. Reviews ticket analysis and approves closure of tickets/incidents. Understands the architecture of the Royal Caribbean website and escalates incidents as needed to the appropriate team for further triage. Synthesizes and communicates incident details to the production team and stakeholders, including executive-level stakeholders. Documents incidents, performs postmortems, and creates next steps as needed. Reviews postmortem/RCA documents and follows up.
  • Application Performance Management (APM). Ensures the proactive monitoring and management of the performance and availability of the software applications within the products s/he is responsible for. Strives to detect and diagnose complex application performance problems to maintain an expected level of service. Provides insight into application performance metrics (errors, exceptions, baseline violations, etc.) to identify the technical impact of bugs and enhancements. Understands key performance metrics (traffic volumes, booking volumes, response times, etc.) to identify the business value of bug fixes and enhancements. Builds the case for prioritizing bug and enhancement tickets using these metrics. Creates reports on the performance of new deployment builds for product teams to ensure build quality.
  • Configuration Management. Leads the team(s) in implementing and maintaining the technology standards and practices across product definition and product configuration. Adjusts health thresholds and other monitoring settings based on historical performance. Creates and maintains performance dashboards used by support and product teams. Maintains the alerting, communication, and documentation toolchain to ensure it is up to date and efficient.
  • Change Control Governance. Ensures all production changes required by the product teams are carried out in a planned and authorized manner, within established change control policies and procedures, and that all changes are thoroughly tested and validated from the monitoring perspective.
  • Production Operations Readiness. Ensures all product implementations go through an operational readiness review. Establishes and maintains clear communication channels (e.g., Slack, Teams) with the scrum and marketing teams. Ensures all team members are informed about relevant updates and changes that may affect the website.

Qualifications

  • 6-10 years of experience in Site Reliability Engineering (SRE), DevOps, QA, or a related IT operations role.
  • Bachelor’s degree in Computer Science, Information Technology, Computer Engineering, or other relevant advanced degree preferred.

Knowledge And Skills

  • Technical Expertise:
      • Proficiency in cloud platforms such as AWS, including AWS Elastic Beanstalk.
      • Understanding of API design principles: REST, SOAP, GraphQL.
      • Advanced knowledge of monitoring and logging tools (AppDynamics, DataDog, Splunk, New Relic, etc.).
      • Extensive experience with Adobe AEM Cloud is preferred to enhance system performance and reliability.
  • Problem-Solving Skills:
      • Strong analytical and troubleshooting skills to diagnose and resolve complex production issues swiftly.
      • Ability to develop and implement effective incident response plans.
  • Communication and Collaboration:
      • Excellent written and verbal communication skills for effective interaction with cross-functional teams and documentation.
      • Ability to collaborate with Development, QA, IT, and external managed service providers to ensure seamless operations.

Work Environment

  • The Senior SRE may be required to participate in an on-call rotation to handle urgent incidents and ensure 24x7 system reliability.
  • On-call duties may include evenings, weekends, and holidays as needed.
