Senior Site Reliability Engineer

EPAM Systems (Poland) sp. z o.o.

Kraków, Grzegórzki +1 more

Hybrid

🐍 Python

JavaScript

☕ Java

ServiceNow

Splunk

Git

GitLab

☁️ Microsoft Azure

☁️ Azure Kubernetes Service

🚢 Kubernetes

🐳 Docker

Requirements

Python

JavaScript

Java

ServiceNow

Splunk

Git

GitLab

Microsoft Azure

Azure Kubernetes Service

Linux

Minimum of 3 years programming experience in Python, JavaScript, or Java
At least 3 years of experience in DevOps including building and troubleshooting pipelines
Proficiency in automation using Python or other scripting languages
Knowledge of Unix administration
Familiarity with ITIL processes
Experience using ServiceNow for operational support
Experience with Azure Log Analytics and query languages such as KQL or Splunk's SPL
Hands-on experience implementing monitoring solutions
Experience with Git version control systems, preferably GitLab
Previous experience in L2/L3 application or infrastructure support
Working knowledge of containers and Azure Kubernetes Service
Strong experience with Microsoft Azure platform
Demonstrated ability to deploy enterprise applications using infrastructure as code

Responsibilities

Drive reliability improvements in production systems to reduce escalations and enable faster feature development
Handle support escalations with a thorough understanding of environment, code, and logs
Manage incident response, change management, and business continuity activities
Analyze and document system issues from business and technical perspectives
Identify and implement solutions and system improvements, including automation of manual tasks
Collaborate with product managers, developers, quality analysts, and support teams to support project delivery and onboarding
Provide regular updates to management on system status and issues
Develop technical fixes and scripts to support operational needs
Investigate problems to determine root cause and provide workarounds
Create and maintain known error documentation
Own the lifecycle of problem resolution
Perform daily system monitoring and troubleshoot production issues
Support and configure global production environments
Manage release processes for UAT and production environments
Document support procedures, releases, and troubleshooting guides
Provide coverage during weekdays and weekends as needed
