HPC Systems Engineer
This job posting is no longer active.
Winston Salem, NC, United States
Job ID: 80998
Job Family: Information Services
Status: Full Time
Remote Opportunity: Yes
Job Type: Regular
Department Name: 12531088942449-Infrastructure Support
Overview
JOB SUMMARY:
The HPC Systems Engineer is part of a technical team responsible for the overall operation of the High Performance Compute (HPC) cluster and the UNIX and Linux systems environments at Wake Forest University Baptist Medical Center (WFUBMC). This individual will be highly skilled in HPC environments and possess excellent communication skills and leadership capabilities. The HPC Systems Engineer is responsible for solutions design, project management, and management of the HPC environment. The candidate must have experience with high performance computing environments, specifically with provisioning and monitoring tools such as Bright Cluster Manager and with workload management, queueing, scheduling, and load-balancing tools such as SLURM on a high-speed InfiniBand network. The candidate must also have experience with workload parallelization using MPI/OpenMPI, HPC application installation on NFS-mounted shared storage, and basic software development principles and practice sufficient to assist end users with their code, and must be able to design, implement, and maintain the environment. Experience with Azure CycleCloud is a plus. This position works directly with researchers and faculty to assist in training on and use of the HPC environment.
EDUCATION/EXPERIENCE: High School Diploma or GED and 4 or more years’ relevant experience architecting, implementing, and managing High Performance Computing Cluster systems using Linux; or a BS in Information Systems or a related field of study (or equivalent work experience) and 2 or more years’ relevant experience architecting, implementing, and managing High Performance Computing Cluster systems using Linux.
LICENSURE, CERTIFICATION, and/or REGISTRATION: Certifications in RHEL Linux are a plus.
ESSENTIAL FUNCTIONS:
- Provide in-depth working knowledge of High Performance Computing Cluster systems using Linux.
- Design and implement RHEL Linux HPCC systems and solutions integrating key HPCC technologies including but not limited to: visualization, provisioning, monitoring, performance & tuning, workload balancing, application installation, batch job scripting, backup/DR, and troubleshooting all aspects of HPC software/user applications.
- Ensure that system hardware, operating systems, software systems, and related procedures adhere to organizational standards and disaster recovery requirements.
- Perform project coordination duties for assigned projects to include providing detailed implementation/migration planning, managing to schedules and due dates, coordinating communication between project stakeholders, and providing project status reports.
- Develops and reviews processes and technical documentation (run books) to transition new solutions to the Administration Team.
- Recommend and guide improvements across the converged infrastructure service layers of network connectivity, server consumption, storage, virtualization configuration, and operating systems integration.
- Identifies and understands issues, problems, and opportunities; compares data from different sources to draw conclusions; uses effective approaches for choosing a course of action or developing appropriate solutions; and takes action that is consistent with available facts, constraints, and probable consequences.
- Participates in the development and implementation of technical standards, policies, and procedures.
- Works as a liaison with contractors and vendors as needed.
- Works independently or as a member of interdisciplinary teams on complex projects. Provide HPC subject matter expertise and consulting services to our internal clients.
- Ensure system practices are followed for all change, capacity, performance, and threshold management.
- Constantly seek out areas for improvement, automation, and efficiencies.
- Serves as a mentor and resource for less experienced team members.
SKILLS/QUALIFICATIONS:
- At least 5 years of practical experience as a Linux HPC Systems Engineer in a complex multi-platform, multi-protocol environment.
- Experience with provisioning tools like Bright Cluster Manager for HPC images.
- Experience with two or more of the following:
  - Linux kernel programming
  - Concurrent programming via multi-threading
  - Parallel and/or distributed programming
  - Containers and container orchestration tools (e.g. Singularity, Docker, Kubernetes)
  - Infrastructure as Code and other development and automation tools, such as Ansible, Puppet, and Git
  - Scripting and programming languages such as Bash, PowerShell, Perl, Ruby, C/C++, or Python
- Proficiency with OpenMPI and/or MPI
- General HPCC technical knowledge regarding compute, network, memory, and storage components
- Familiarity with workload managers/schedulers such as SLURM for job submission, job queueing, job load balancing, and reporting.
- Expertise in Linux system installation, configuration, administration, maintenance, tuning, and troubleshooting.
- Working knowledge of VMware and a variety of server/systems monitoring and alerting, analysis, management, and configuration tools.
- Knowledge of computer security systems, applications, procedures and techniques.
- Knowledge of common network protocols (e.g. TCP/IP, HTTP, FTP, NTP, SFTP).
- Familiarity with HIPAA, PHI, and FISMA requirements a plus.
- Experience in managing technical projects, setting realistic timelines, and communicating status.
- Able to solve problems independently, quickly, and completely and communicate them clearly to upper management.
- Must be independent, organized, self-motivated and responsible.
- Demonstrated technical aptitude and critical thinking skills.
- Excellent interpersonal/communication skills, and the ability to work collaboratively as part of a team.
WORK ENVIRONMENT:
- Fast paced and occasionally long hours.
- May occasionally be exposed to patient elements.
- Subject to varying and unpredictable situations.
- Handles emergency or crisis situations.
- Subject to many interruptions.
- Occasionally subjected to irregular hours.
- Occasional pressure due to multiple priorities.
- Annual flu and COVID-19 vaccinations are required; medical and religious exemptions are available.
PHYSICAL REQUIREMENTS:
| Activity | 0% to 35% | 35% to 65% | 65% to 100% | N/A |
|---|---|---|---|---|
| Standing | X | | | |
| Walking | | X | | |
| Sitting | | | X | |
| Bending | X | | | |
| Reaching with arms | X | | | |
| Finger and hand dexterity | X | | | |
| Talking | | X | | |
| Hearing | | X | | |
| Seeing | | X | | |
| Lifting, carrying, pushing and/or pulling: | | | | |
| 20 lbs. maximum | X | | | |
| 50 lbs. maximum | X | | | |
| 100 lbs. maximum | X | | | |