Cloud Operations & Site Reliability Engineer

Overview

We're looking for a Cloud Operations & Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.

Engagement Details

Our client builds the database for data-intensive apps that require high performance and low latency. It enables teams to harness the ever-increasing computing power of modern infrastructure, eliminating barriers to scale as data grows. Unlike any other database, it's built with deep architectural advancements that enable exceptional end-user experiences at radically lower costs. Over 300 game-changing companies, including Disney+ Hotstar, Expedia, FireEye, Discord, Crypto.com, Zillow, Starbucks, Comcast, and Samsung, use it for their toughest database challenges. The database is available as free open-source software, a fully supported enterprise product, and a fully managed service on multiple cloud providers.

Role

We're seeking experienced and dynamic individuals to join our Cloud Operations & Site Reliability Engineering team. As a Cloud Operations & Site Reliability Engineer, you will play a vital role in maintaining the operational excellence of our cutting-edge NoSQL database platform. Leveraging your expertise in cloud infrastructure, Kubernetes, and system operations, you will ensure the reliability, scalability, and performance of our cloud offerings. If you are passionate about working in a fast-paced environment, collaborating with cross-functional teams, and driving continuous improvement, this role is tailored for you.

Arrangement

Location: Remote (in APAC)
Availability: Full-time. You’ll need to start your work day between 2pm and 4pm Pacific Time.

Responsibilities

Collaborate with the Cloud Operations & SRE team to ensure the smooth day-to-day operation of our Cloud platform. Monitor system health, troubleshoot issues, and proactively address any operational challenges.
Assist with and perform upgrades for our Cloud platform, including database versions, OS upgrades, and security patches. Collaborate with DevOps/Cloud Engineering to ensure seamless upgrade processes.
Participate in scaling Monitor & Manager servers up and down based on demand. Employ proactive monitoring strategies to identify and address potential performance bottlenecks and resource constraints.
Act as a liaison with the Support Organization to address cloud platform-related issues. Respond to tasks and tickets escalated by Support Staff, and collaborate to ensure timely resolutions.
Develop and maintain a comprehensive runbook that can be leveraged by Support Staff to troubleshoot and resolve common issues, improving efficiency in issue resolution.
Create scripts and automation solutions to streamline operational tasks and enhance efficiency. Contribute to the development of automation strategies for cloud infrastructure management.
Collaborate with the Cloud Engineering team to define and create feature requests that enhance the functionality and performance of the Cloud platform.
Conduct regular cluster health and performance audits, identifying areas for optimization. Implement strategies to enhance the efficiency and reliability of Cloud clusters.
Work closely with the Customer Success team to ensure that provisioned resources align with customer needs and purchased packages. Provide insights into potential scaling opportunities and usage optimization.
Demonstrate a deep understanding of public cloud environments (AWS, GCP, Azure), Kubernetes, Linux system operations, and NoSQL database deployment/management. Apply this knowledge to resolve complex technical challenges.
Use scripting languages such as Python and Bash, together with infrastructure tools such as Terraform and Ansible, to create automation that enhances operational efficiency.
Collaborate closely with Support and Engineering teams to address issues, drive improvements, and implement customer-focused solutions.

What’s in it for you

Learn and evolve your skills using the latest and greatest technology tools in a rapidly growing company.
Learn from the best people around you. We constantly challenge the status quo and invent new ways of building a great product.
Flexible hours. Join daily standups, sprint planning, and retrospective meetings. Other than that, you’re in control of your own schedule.
100% remote. Work from anywhere: the comfort of your home, a shared co-working space, an RV on the beach, or another country as a nomad.
Work on challenging problems, innovate, and positively impact many people's lives while having fun doing it.

Required Qualifications

Upper-intermediate to fluent spoken and written English. Able to hold a real-time conversation.
3+ years of full-time hands-on Cloud Platform (AWS, GCP, Azure) experience.
3+ years of full-time hands-on Linux (System Operations and Metrics Analysis) experience.
2+ years of full-time hands-on Python experience.
Strong scripting skills in Python and Bash.
Experience with reporting and visualization tools such as Splunk, Grafana, Prometheus, and Kibana.
Exceptional organizational skills and ability to manage multiple projects concurrently.
Ability to work both independently and collaboratively within cross-functional teams.
Strong problem-solving skills, especially under pressure.
Eagerness to continuously learn and adapt to emerging technologies.
Familiarity with container technologies like Docker and Kubernetes.
Familiarity with automation tools such as Ansible and Terraform.

Nice to haves

3+ years of Kubernetes experience.
Proficiency with automation tools such as Ansible and Terraform.
Proven expertise in NoSQL database deployment, management, and data modeling.
Bachelor’s degree in Computer Science or equivalent demonstrated ability.

Frequently Asked Questions

What are your typical clients?

The majority of our clients are venture-backed startups at the growth stage. Usually, at this stage, the company has already achieved product-market fit and is looking to expand rapidly. That’s where we bring the best engineering practices, strong architecture, the latest technologies, and consistent processes to help companies scale.

What is the length of your engagements?

Most of our long-term full-time engagements last multiple years. This allows you to grow your career with the client company, taking on more responsibilities over time.

What’s your company size?

The Right Balance team is 60+ engineers, growing to 100+ by the end of the year. The client's team currently has 210+ people. It's a great time to join a rapidly growing team and make meaningful contributions.

What happens if the engagement is completed?

Most of our engagements are long-term in nature. That said, if the current engagement is ramping down, we’ll present you with more long-term opportunities to transition into.

What are your core values?

Client First: we only win when our clients win. We treat client challenges as our own.

Ownership: we embrace responsibility, taking on challenges, getting them to completion, and enjoying getting things done.

Quality: we’re passionate about achieving quality outcomes by applying meticulous attention to detail.

Get in touch

Interested in this opportunity?
