Role Overview:
We are assisting our partner iGaming company in expanding their engineering team responsible for ensuring the stability and predictable behaviour of their distributed services and platforms. The role involves working with production infrastructure, analysing system behaviour, and implementing practices that improve reliability across multiple platforms.
This position is intended for engineers who clearly understand the difference between SRE and DevOps practices, and for whom SLOs, error budgets, and availability targets such as 99.85–99.95% are practical tools rather than abstract concepts.
The engineer will work as part of an SRE shift schedule covering late-evening and night hours (17:00–01:00 and 00:00–08:00 CET, in rotation) to ensure end-to-end ownership of incidents, from user impact to root cause and follow-up improvements.
Key Responsibilities:
- Contributing to architectural changes affecting the reliability and scalability of services and platforms;
- Operating and improving Kubernetes clusters (cluster model, networking, ingress, load balancing);
- Working with AWS-based environments (networking, storage, compute, managed services);
- Managing infrastructure using Terraform and configuration management with Ansible;
- Developing and refining monitoring and observability across platforms (Prometheus, Alertmanager, Grafana, and log aggregation such as ELK / Loki);
- Participating in incident handling: initial classification, technical investigation, coordination with product/engineering teams, and follow-up improvements;
- Reducing operational toil and building tools that support reliability and efficiency (internal utilities, automation, CI/CD improvements);
- Collaborating with development teams to embed SRE practices into the lifecycle of services (SLIs/SLOs, error budgets, readiness for production).
Ideal profile for the position:
Core skills:
- Strong Linux skills in production environments (debugging, performance, system services);
- Solid understanding of networking (TCP/IP, DNS, HTTP, load balancing, TLS);
- Hands-on experience operating Kubernetes in production (not just local clusters);
- Experience with AWS cloud services (for example: EC2, ALB/NLB, RDS, S3, IAM, EKS or self-managed Kubernetes);
- Confident use of Terraform and Ansible in real environments (multi-environment IaC, reusable modules/roles);
- Experience with observability tools:
  - metrics and alerting (Prometheus/Alertmanager or similar),
  - dashboards (Grafana or similar),
  - logging (ELK stack, Loki or comparable solutions);
- Ability to troubleshoot across application, network, and infrastructure layers, using scripting and tools (Python/Go/Bash, curl, tcpdump, log analysis, etc.);
- Experience with containers and image lifecycle (Docker or compatible runtimes).
Experience:
- Participation in production incidents and technical post-incident reviews (not just on-call escalation);
- 2–5 years of practical experience in SRE, infrastructure, platform or production-focused DevOps engineering;
- Experience working within CI/CD pipelines (for example: Jenkins, GitLab CI, GitHub Actions, ArgoCD or similar);
- Exposure to environments with high availability requirements (e.g. low tolerance for downtime, strict SLAs/SLOs);
- Availability to work between 17:00 and 08:00 CET, in the following rotating shifts: 17:00–01:00 and 00:00–08:00.
What will be an advantage:
- Experience with high-load or real-time systems (payments, finance, gaming, streaming);
- Experience with CDNs or real-time log aggregation/analytics;
- Familiarity with databases and message systems (for example: PostgreSQL, MySQL, MongoDB, Kafka, Redis, RabbitMQ);
- Experience with external integrations and third-party APIs (payment providers, KYC, risk/anti-fraud, content providers);
- Experience with service meshes, API gateways or ingress controllers (Istio, Linkerd, NGINX, Envoy, etc.).
Success Metrics:
- Maintain and improve SLOs for key services in the 99.85–99.95% availability range, with clear SLIs and error budgets;
- Keep unplanned downtime below 1% for critical user-facing functionality;
- Ensure that the majority of infrastructure and platform configuration (target ≥ 90–95%) is managed as code (Terraform, Ansible, Kubernetes manifests/Helm charts);
- Systematically reduce MTTR (Mean Time To Recovery) for incidents by improving detection, diagnostics and standard operating procedures;
- Prevent repeated high-severity incidents by driving post-incident reviews and concrete follow-up actions (configuration changes, automation, runbooks, architectural adjustments);
- Maintain up-to-date operational documentation and runbooks for core services, so that incidents can be handled consistently across the team.
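For orientation, the availability range above can be translated into a concrete downtime allowance. The following is a minimal sketch, not the partner's actual tooling: it assumes a rolling 30-day measurement window (the window is not specified in this description), and the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window.

    The 30-day default window is an assumption made for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of error budget left after the observed downtime.

    A negative result means the budget for the window is exhausted.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # Print the budget for both ends of the availability range.
    for slo in (99.85, 99.95):
        budget = error_budget_minutes(slo)
        print(f"{slo}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, 99.85% corresponds to roughly 64.8 minutes of downtime per 30 days and 99.95% to roughly 21.6 minutes, which is the budget that incident response and release decisions would be measured against.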
The company guarantees you the following benefits:
- Global Collaboration: Join an international team where everyone treats each other with respect and moves towards the same goal;
- Autonomy and Responsibility: Enjoy the freedom and responsibility to make decisions without the need for constant supervision;
- Competitive Compensation: Receive competitive salaries reflective of your expertise and knowledge as our partner seeks top performers;
- Remote Work Opportunities: Embrace the flexibility of fully remote work, with the option to visit company offices near your current location;
- Paid Time Off: Prioritise work-life balance with paid vacation and sick leave days to prevent burnout;
- Career Development: Access continuous learning and career development opportunities to enhance your professional growth;
- Corporate Culture: Experience a vibrant corporate atmosphere with exciting parties and team-building events throughout the year;
- Referral Bonuses: Refer talented friends and receive a bonus after they successfully complete their probation period;
- Medical Insurance Support: Choose the right private medical insurance and receive compensation (full or partial) based on the cost;
- Flexible Benefits: Customise your compensation by selecting activities or expenses you'd like the company to cover, such as a gym subscription, language courses, Netflix subscription, spa days, and more;
- Education Foundation: Participate in a biannual raffle for a chance to learn something new unrelated to your job as part of your commitment to ongoing education.
Interview process:
- A 30-minute interview with a Recruiter to get to know you and your experience;
- 1st stage of technical interview (1 h) with the DevOps team to assess your theoretical knowledge;
- 2nd stage of technical interview (1 h) with the DevOps team to assess your hard skills;
- A final 1-hour interview to gauge your fit with the company culture and working style.