A high-performance trading technology firm is hiring a Platform/Site Reliability Engineer to join its Infrastructure team. The firm develops all core systems in-house and operates large-scale, latency-sensitive research and production platforms.
The Role
This is a senior, hands-on SRE position with ownership of cloud and on-prem infrastructure supporting research, batch compute, and production workloads. You will work closely with engineering and research teams to design, operate, and evolve highly reliable systems, while helping embed a strong SRE culture across the organisation.
What You’ll Do
- Build and operate observability platforms (monitoring, logging, tracing, alerting) for high availability and rapid incident response
- Architect and maintain scalable infrastructure across cloud and on-prem environments
- Support and evolve research compute clusters, including batch and workflow-driven workloads
- Investigate and resolve live production issues end-to-end
- Improve CI/CD pipelines, tooling, and developer experience in partnership with engineers
- Drive SRE best practices and operational excellence
What We’re Looking For
- 8+ years of experience in SRE / Platform / Infrastructure engineering
- Background in trading, quantitative research, or other performance-critical environments
- Strong Kubernetes experience (design and operations)
- Practical knowledge of GitOps and modern CI/CD workflows
- Experience supporting batch, workflow, or HPC-style systems
- Solid cloud fundamentals (AWS or GCP)
- Proficiency in Python and/or Go
- Comfortable owning production systems and incidents
- Strong communication skills and an ownership mindset
Nice to Have
- Kubernetes operators
- Bare-metal or hybrid infrastructure
- Containerisation and configuration management tools
- Security-aware infrastructure engineering
- Observability tooling (Prometheus, ELK)
- Enterprise Linux (Red Hat / CentOS)
- CI/CD tooling (GitLab CI, Jenkins)
- Open-source contributions