Hyundai Autoever America 10471 – Manager, SRE in Fountain Valley, California

Purpose:

The Site Reliability Engineering (SRE) Manager will work with the development and operations teams to ensure that connected car systems work as expected and that the underlying infrastructure and network run smoothly. This role is responsible for the day-to-day operations of the DevOps team and combines project management, team management, and engineering duties. The DevOps team members are subject-matter experts in the Telematics domain and provide insight and engineering advice to development and product teams, with the goal of creating a highly reliable and scalable software system that runs with minimal failure.

Essential Functions:

  • Act as the primary point of contact (PoC) for all connected car infrastructure operations and projects

  • Work collaboratively with software engineering to define infrastructure and deployment requirements; be a sounding board and provide recommendations to the engineering team on infrastructure design and deployment.

  • Set and pursue the goal of creating a highly reliable and scalable software system that runs with minimal failure

  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding

  • Be the driving force behind our automation and observability initiatives. Build tools and automation that eliminate repetitive tasks and prevent incidents.

  • Build and maintain operational tools for deployment, monitoring, and analysis of connected car infrastructure and systems

  • Perform infrastructure cost analysis and optimization

  • Provide project management, sprint planning, and road-mapping support to the DevOps team

  • Design, develop, install, and maintain software solutions.

  • Work with engineering teams to refine deployment and release processes.

  • Collaborate with the engineering team on projects as the expert on reliability, performance, and efficiency.

  • Manage on-call rotations across connected car applications, using a follow-the-sun model.

  • Participate in 24x7 operational support and on-call rotation shifts.

  • Ensure that all system design and procedures are documented and up to date.

  • Monitor and stress test systems to collect metrics for tuning and capacity planning.

  • Work to automate detection and resolution of recurring issues.

  • Ensure safety, predictability, repeatability, and auditability of all build and deploy processes.

  • Partner with development teams to improve services through rigorous testing and release procedures

  • Participate in system design consulting, platform management, and capacity planning

  • Create sustainable systems and services through automation and uplifts

  • Balance feature development speed and reliability with well-defined service level objectives

Basic Requirements:

  • Bachelor’s or Master’s degree, or equivalent, in computing, information systems, or a related field.

  • 2+ years of experience as a manager, project manager (PM), or technical lead, preferably in the automotive industry within the Telematics domain.

  • Programming experience with one or more high level languages, such as Python, Java, C/C++, Ruby, and JavaScript

  • Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale.

  • Experience with distributed systems in a production operations environment

  • Expertise in analyzing complex application, database, network, and OS issues across a large-scale, distributed, customer-facing system

  • Strong communication skills and ability to work effectively across multiple business and technical teams

  • Demonstrated ability to deliver results on time with high quality

  • Extensive experience leading customer-facing systems in a high-uptime 24/7 environment

  • Depth and breadth of experience with server-side Java development, Oracle, and distributed databases

  • A well-developed understanding of the theory and principles of operation of the internet and packet data protocols.

  • Exposure to Cloud, SaaS, and virtualization concepts and performance concerns.

  • Working knowledge of operating system design, processes, and threading model.

  • Knowledge of defining and monitoring system quality measures, including SLOs and SLAs.

  • Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability.

  • Experience with different flavors of Linux, e.g. RedHat, Ubuntu, CentOS.

  • Hands-on experience collecting performance data, analyzing, troubleshooting, and tuning.

  • Experience operating applications with high concurrency, scalability, or availability requirements.

  • Experience leading high-performing engineering teams.

  • Experience with containers and container orchestration tools (Docker, Kubernetes)

  • Experience with MySQL, Elasticsearch, Couchbase, MongoDB, and Redis

Nice to Have:

  • Experience with open-source stream-processing frameworks and systems, e.g. Kafka, Spark.

  • Experience with distributed storage technologies such as NFS, HDFS, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)

Salary Range: $112,830 to $173,756
