Site Reliability Engineer (Senior DevOps Engineer) – AWS

Sofia, BG
Hybrid Office
Full Time

Who We Are

Yamasoft is a leading software solutions provider specializing in IoT & IIoT technologies powered by AI/ML. With over 25 years of experience in the software industry, our team brings expertise in building high-performing teams and delivering top-notch software solutions. We focus on delivering high-quality software that aligns seamlessly with our customers’ objectives.

Description

We are building a new team around a biotechnological product for infectious disease diagnostics, pharmaceutical discovery, and microbiome analysis. The team will focus on designing and implementing a distributed, cloud-based SaaS bioinformatics solution for both research and clinical diagnostics.

As an experienced Site Reliability Engineer (SRE), you will be responsible for designing, implementing, and maintaining scalable, secure, and efficient cloud infrastructure solutions. This role involves building and managing Infrastructure as Code (IaC) systems using tools like Terraform, ensuring adherence to security best practices, and managing incident mitigation procedures. As a technical expert, you will design the infrastructure to meet defined Service Level Objectives (SLOs) and apply FinOps principles to manage cost efficiency.

Key Responsibilities

1. Infrastructure Automation & Governance

      • Design, develop, and maintain Infrastructure as Code solutions using Terraform.
      • Automate infrastructure provisioning, scaling, and configuration management.
      • Establish and enforce IaC standards to ensure the modularity, reusability, and maintainability of configurations.
      • Optimize IaC pipelines for efficiency, reliability, and auditability.
      • Enforce IaC-only governance for all production changes to ensure auditability and security.

2. Cloud Architecture & Optimization

      • Provide technical expertise in designing and using AWS cloud services to optimize performance and cost.
      • Design highly available and resilient infrastructure solutions across AWS services.
      • Apply AWS Well-Architected Framework principles across all architectural decisions.
      • Implement and optimize cloud resources following FinOps practices to ensure cost efficiency.

3. CI/CD and Automation

      • Build and maintain CI/CD pipelines to support automated deployment and testing.
      • Develop automation tools and scripts to reduce manual operational overhead.
      • Collaborate with development teams to implement DevOps best practices.
      • Create and maintain runbooks and operational documentation.

4. Platform Reliability and Compliance

      • Implement comprehensive monitoring solutions, including Site Performance Monitoring (SPM) and Application Performance Monitoring (APM).
      • Design and implement Service Level Objective (SLO) frameworks to drive platform reliability, release governance, and continuous toil reduction.
      • Develop, test, and manage Business Continuity Plans (BCP) and Disaster Recovery Plans (DRP).
      • Design and implement an automated incident response framework, focusing on mitigation strategies, root cause analysis, and post-incident review processes.
      • Architect and automate governance controls to ensure high-velocity, auditable compliance with highly regulated standards (GxP, HIPAA, GDPR, ISO 27001), leveraging DevSecOps to minimize manual audit requirements.
      • Configure and maintain secure and efficient networks, including AWS VPCs, subnets, routing, and security groups.

Technologies Scope:

1. Infrastructure as Code (IaC):

      • Tools: Terraform and associated AWS SDKs or APIs.
      • IaC pipelines integration with CI/CD platforms.

2. Cloud Services:

      • Strong proficiency with AWS services: Active Directory (AD), LDAP, EC2, ASG, ECS, Lambda, MQ, RDS, S3, VPC, subnets, security groups, AWS Lake Formation, AWS Glue, Amazon Athena, Amazon OpenSearch Service, Amazon Bedrock, etc.

3. Operating Systems:

      • Linux: OS management, shell scripting, performance tuning, and security hardening.

4. Automation and Containers:

      • Docker, ECS, k8s
      • Jenkins pipelines
      • Azure DevOps pipelines

5. Networking:

      • Security, proxies, load balancing, OpenVPN, etc.

6. Observability and Monitoring:

      • Prometheus, Loki, Grafana, ELK stack, or equivalent tools (such as Datadog, Sentry, or Zipkin).

7. Standards and Practices:

      • Security best practices, performance optimization, and high availability principles.
      • Knowledge of GDPR/HIPAA and data privacy principles.

8. Data Lakehouse Ecosystem (Nice to Have)

      • Apache Spark, OpenSearch, Solr, Iceberg, JupyterHub, etc.
      • Core Web Stack: ReactJS, TypeScript, JavaScript, CSS, HTML.
      • Tooling & Frameworks: Vite, Webpack, Redux, Material Design, Plotly JS, D3.js, and other modern libraries.
      • APIs: REST, WebSockets/SSE, GraphQL.
      • Standards & Practices: Expertise in software design patterns, algorithms, and development best practices.
      • DevOps & Environment: Proficiency with Linux as a base system, shell scripting, Docker containerization, and configuration/understanding of Nginx and web security principles.
      • Testing & Optimization: Proficiency with testing frameworks (e.g., Jest, Cypress) and performance optimization tools.

Qualifications:

      • 5+ years of experience in DevOps, SRE, or a similar role.
      • Proven expertise in Terraform for Infrastructure as Code, including advanced use of modules and state management.
      • Strong understanding of AWS cloud services, including automation, scaling, and cost optimization.
      • Solid Linux administration skills, including performance tuning and shell scripting.
      • Experience with containerization (Docker) and CI/CD tools (Jenkins).
      • Strong networking knowledge, including VPN solutions like OpenVPN.
      • Demonstrated experience with disaster recovery planning, business continuity strategies, and incident management.
      • Familiarity with or strong interest in AWS Lake Formation and contributing to the design and implementation of our data lakehouse.

What we offer

      • 25 Days Paid Time Off
      • Additional Health Insurance
      • Multisport card
      • The opportunity to be among the very first team members
      • Excellent career development opportunities
      • Attractive remuneration package

 

If you are interested in this job offer, please send your CV in English.

Please apply only if you are located in Bulgaria, as we have a hybrid office policy in Sofia.

All CVs will be treated in strict confidentiality. Only shortlisted candidates will be contacted.