
[Remote] Staff Data Engineer - Emerald

Remote | Full-time | Now Hiring

Note: This is a remote position open to candidates in the USA. H1 is dedicated to providing optimal access to healthcare information and is seeking a Staff Data Engineer for its Emerald team. This role involves leading the architecture and scalability of H1's healthcare entity resolution platform while managing a small team and collaborating with various stakeholders to improve the platform's efficiency and accuracy.

Responsibilities

  • Lead the design, optimization, and scalability of distributed Spark/PySpark pipelines powering entity resolution and large-scale healthcare data processing
  • Own systems supporting automatching, identity mapping, grouping logic, deduplication, enrichment, and auto-approval workflows across healthcare provider and organization datasets
  • Build and maintain scalable processing frameworks for PubMed, clinical trial, ct.gov, conference, and other healthcare data sources
  • Drive infrastructure optimization initiatives focused on improving throughput, runtime, observability, and cloud compute cost efficiency
  • Partner closely with AI/ML teams to integrate matching and resolution models into Emerald and improve matching precision and recall
  • Lead complex technical initiatives from architecture and design through deployment, monitoring, and long-term production support
  • Serve as a technical leader and mentor across the team through code reviews, technical guidance, and engineering best practices
  • Collaborate directly with Product and business stakeholders to align technical solutions with operational and customer needs
  • Support production operations, incident response, troubleshooting, and ongoing platform reliability

Skills

  • 8+ years of experience building and maintaining large-scale distributed data systems and pipelines
  • Demonstrated technical leadership experience mentoring engineers and driving complex technical initiatives
  • Extensive experience with Apache Spark and AWS-based big data technologies including EMR, S3, and distributed compute environments
  • Strong coding experience in Python (PySpark), Scala, Java, or equivalent languages used for distributed processing systems
  • Experience optimizing large-scale Spark workloads for performance, scalability, and infrastructure cost efficiency
  • Experience with streaming and event-driven architectures using technologies such as Kafka or Spark Streaming
  • Experience with orchestration and lakehouse technologies such as Argo and Hudi or comparable platforms
  • Experience with containerization and infrastructure technologies such as Docker, Kubernetes, and Terraform
  • Experience working with relational or distributed databases such as PostgreSQL or Redshift
  • Proven ability to operate effectively within highly scalable, production-grade distributed systems
  • Deep expertise with distributed data processing frameworks such as Apache Spark and Hadoop, particularly within AWS environments
  • Strong proficiency in Python (PySpark), Scala, Java, or other modern programming languages used for large-scale distributed processing
  • Experience building scalable ETL/ELT frameworks across both batch and streaming architectures
  • Strong understanding of distributed file formats including Apache Parquet and Apache Avro
  • Experience with streaming technologies such as Kafka, Spark Streaming, or KSQL
  • Strong grasp of software engineering fundamentals including distributed systems, data structures, concurrency, and system design
  • Experience performing root cause analysis across large-scale distributed systems and complex data pipelines
  • Ability to write clean, maintainable, modular, and production-grade code
  • Experience improving performance, scalability, observability, and infrastructure efficiency within distributed systems
  • Strong communication and collaboration skills across both technical and non-technical stakeholders
  • Familiarity with modern development and infrastructure tooling including Git, CI/CD pipelines, Docker, Kubernetes, Terraform, Argo, Hudi, and Jira
  • Experience with entity resolution, identity mapping, automatching, deduplication, or large-scale matching systems is strongly preferred
  • Experience working with healthcare, life sciences, Real World Evidence (RWE), or large-scale healthcare datasets is strongly preferred

Benefits

  • Stock options
  • Full suite of health insurance options
  • Generous paid time off
  • Pre-planned company-wide wellness holidays
  • Retirement options
  • Health & charitable donation stipends
  • Impactful Business Resource Groups
  • Flexible work hours & the opportunity to work from anywhere
  • The opportunity to work with leading biotech and life sciences companies in an innovative industry with a mission to improve healthcare around the globe

Company Overview

  • H1 is on a mission to connect the world with the right doctors. It was founded in 2017 and is headquartered in New York, New York, USA, with a workforce of 201-500 employees. Its website is https://www.h1.co.

Company H1B Sponsorship

  • H1 has a track record of offering H1B sponsorships, with 5 in 2025, 6 in 2024, 4 in 2023, 9 in 2022, and 7 in 2021. Please note that this does not guarantee sponsorship for this specific role.