Site Reliability Engineer

Argano

Full Time

Remote

Posted: Just posted

Job description

About Argano

Argano is a business modernization partner, purpose-built to give rise to the possibilities of the Digital Renaissance for companies with complex sales and operating environments. We innovate adaptive, efficient, cloud-based digital operating foundations on which the transformational businesses of the 21st century must be built. These modern, scalable, and sustainable foundations integrate operations from commerce to cash to close to consolidation and free our clients to innovate and respond in new and cost-effective ways. The Argano platform uniquely offers the advantage of integrated, world-class capability partners, working together to solve complex challenges across the full spectrum of our clients' businesses. For more information, visit www.argano.com.

Job Description:

We are looking for an experienced Senior Site Reliability Engineer to join our team. The successful candidate will be responsible for ensuring the reliability, availability, and performance of our production systems. The Senior SRE will work closely with development teams to ensure that new systems are designed with reliability and scalability in mind.

Responsibilities:

Design and implement systems to ensure the reliability, availability, and performance of our production systems.

A primary focus will be cloud environment support, build automation, and developer productivity.
Work with development teams to ensure that new systems are designed with reliability and scalability in mind.
Develop and maintain monitoring and alerting systems to proactively detect and resolve issues.
Continuously improve system reliability and performance through the development of automated tools and processes.
Implement DevOps pipelines and infrastructure automation.

Participate in on-call rotations and respond to incidents in a timely manner.
Reduce mean time to identify (MTTI) and mean time to recovery (MTTR) by helping to troubleshoot, monitor, alert, and automate recovery.
Improve mean time between failures (MTBF) by helping teams define SLIs/SLOs and prioritize proactive investment tasks.
Diligently observe and interpret cloud monitoring dashboards and alerts.
Resolve basic issues; escalate urgent and complex issues. Know the difference.

Be aware of customer SLAs and escalate issues if cases are taking too long to resolve.
Monitor the case backlog to ensure we meet agreed SLAs with customers and internal KPI targets; share regular status reports with stakeholders.
Document all troubleshooting and issue management actions via the electronic case management system.
Investigate and troubleshoot complex system issues and provide root cause analysis.
Develop and maintain disaster recovery and business continuity plans.

Collaborate with cross-functional teams to improve system scalability, security, and performance.
Stay up to date with industry trends and emerging technologies.

Key Skills and Competencies:

Bachelor's or Master's degree in Computer Science, Engineering, or a related field

5+ years of experience in site reliability engineering or a related field
Strong understanding of the following monitoring concepts: infrastructure, system, and application health; system availability; latency; performance; and end-to-end monitoring.
Strong monitoring and debugging skills.
Strong experience with cloud infrastructure and services (AWS, GCP, or Azure)
Strong expertise and hands-on project experience in enterprise-level development and maintenance of infrastructure as code using Terraform.

Good practical Linux/Windows systems administration skills in a cloud or virtualized environment.
Strong hands-on experience with network, storage, and compute configuration and setup.
Experience with container orchestration platforms such as Kubernetes
Experience with automation and configuration management tools (e.g., Ansible, Puppet, Chef, Terraform)
DevOps: experience creating, maintaining, and managing CI/CD pipelines for infrastructure.

Experience with monitoring and logging tools such as Prometheus, Grafana, and ELK stack
Strong understanding of network protocols and infrastructure security best practices.
Experience with scripting and configuration languages such as Python, Ruby, Bash, or Terraform (HCL)
Strong analytical and troubleshooting skills
Experience (1 year) with ITIL processes, including Incident, Problem, Change, Knowledge, and Event Management
Excellent communication and collaboration skills

#ArganoMS3
