Role Overview:
Our client is expanding the engineering team responsible for ensuring the stability and predictable behaviour of their distributed services and platforms. This role involves hands-on production work, including monitoring, incident response, troubleshooting, and continuous improvements that increase platform reliability over time.
You will work as part of an SRE shift rotation covering late-evening and night hours, ensuring end-to-end ownership of incidents — from identifying user impact to post-incident follow-ups and preventive improvements.
Key Responsibilities:
- Working in shift-based operations: monitoring, alert response, incident handling, escalation when needed;
- Participating in incident handling: initial classification, technical investigation, coordination with engineering teams, and follow-up improvements;
- Developing and refining observability across platforms (metrics/alerts, dashboards, logs);
- Reducing operational toil: small automation, runbooks, and repeatable processes (the “make it easier next time” mindset);
- Collaborating with development teams to improve production readiness (basic reliability practices, cleaner incident follow-ups).
Ideal profile for the position:
Core skills:
- Good Linux skills in production environments (debugging basics, system services, logs, performance basics);
- Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancing basics, TLS fundamentals);
- Experience with containers and image lifecycle basics (Docker or compatible runtimes);
- Ability to troubleshoot across application, network, and infrastructure layers using logs/metrics and simple tools (curl, basic traffic/log analysis; scripting is a plus);
- Basic familiarity with observability: metrics and alerting, dashboards, logging (any modern stack is fine).
Experience:
- 1+ year in a production-focused role (Ops / Support L2+ / DevOps / Junior SRE — what matters is real production exposure);
- Participation in production incidents (triage, investigation, escalation, basic follow-ups);
- Availability to cover late-evening and night shifts, in rotation.
SRE fundamentals (basic understanding):
- You understand the difference between “just running infra” and SRE as a discipline: reliability targets, fast detection, clear escalation, and consistent follow-up;
- You’re familiar with SLI/SLO and can explain them in simple terms (high-level understanding is enough).
What would be an advantage:
- Familiarity with Kubernetes (deep production ownership is not required yet);
- Exposure to AWS services such as EC2, ALB/NLB, RDS, S3, and IAM basics;
- Exposure to Terraform and/or Ansible (small changes, basic understanding of principles);
- Experience working in high-availability environments where downtime actually matters.
The company guarantees you the following benefits:
- Global Collaboration: Join an international team where everyone treats each other with respect and moves towards the same goal;
- Autonomy and Responsibility: Enjoy the freedom and responsibility to make decisions without the need for constant supervision;
- Competitive Compensation: Receive a competitive salary that reflects your expertise and knowledge, as our partner seeks top performers;
- Remote Work Opportunities: Embrace the flexibility of fully remote work, with the option to visit company offices near your current location;
- Paid Time Off: Prioritise work-life balance with paid vacation and sick leave days to prevent burnout;
- Career Development: Access continuous learning and career development opportunities to enhance your professional growth;
- Corporate Culture: Experience a vibrant corporate atmosphere with exciting parties and team-building events throughout the year;
- Referral Bonuses: Refer talented friends and receive a bonus after they successfully complete their probation period;
- Medical Insurance Support: Choose the private medical insurance that suits you and receive full or partial compensation, depending on the cost;
- Flexible Benefits: Customise your compensation by selecting activities or expenses you'd like the company to cover, such as a gym subscription, language courses, Netflix subscription, spa days, and more;
- Education Foundation: Participate in a biannual raffle for a chance to learn something new unrelated to your job, as part of the company's commitment to ongoing education.
Interview process:
- A 30-minute interview with a Recruiter to get to know you and your experience;
- 1st stage of the technical interview (1 h) with the DevOps team to assess your theoretical knowledge;
- 2nd stage of the technical interview (1 h) with the DevOps team to assess your hands-on skills;
- A final interview to gauge your fit with the company culture and working style.