Masrur Hossain

Site Reliability Engineer | Cloud Infrastructure Specialist

masrur [at] masrur [dot] org

About

Site Reliability Engineer with 10+ years of experience designing, building, and operating highly available, scalable cloud-native applications on AWS and GCP. Proven expertise in infrastructure automation, DevOps practices, and leading cross-functional teams to deliver production-ready, reliable systems.

Skills

Cloud

AWS, GCP, IBM Cloud

Infrastructure as Code

Terraform, Ansible, Puppet, Helm

Containers

Kubernetes, Docker

Languages

Python, Go, C++, Bash

CI/CD

GitLab, GitHub, ArgoCD

Monitoring

Prometheus, Grafana, Zabbix

Big Data

Kafka, Splunk, Flink

Databases

MySQL, PostgreSQL

Experience

Principal Engineer, AI Infrastructure @ IBM Feb 2023 - Present

Leading cross-functional teams to ensure 99.9%+ uptime for critical AI infrastructure services. Designing reliable, secure cloud infrastructure on AWS and IBM Cloud.

Staff Site Reliability Engineer @ Amazon Aug 2021 - Jan 2023

Maintained 5+ customer-facing services with high availability requirements. Built distributed data architectures for streaming data and real-time analytics.

Staff Site Reliability Engineer @ Clover Network Jul 2018 - Aug 2021

Built and operated cloud-native applications on AWS handling high-volume transactions. Maintained 10+ customer-facing services and ensured their production readiness.

Lead Site Reliability Engineer @ ASML Jun 2017 - Jul 2018

Created automation tools for cross-compilation of a 500k+ line codebase. Implemented comprehensive monitoring and observability solutions.

Education

Ph.D. in Physics - University of British Columbia

Writing

Build Your Own AI Agents: Markdown Files Over Generic Frameworks

Why custom markdown instructions beat generic automation tools

AI-Enhanced Infrastructure: Why Markdown Beats Terraform Modules

Rethinking infrastructure-as-code for the age of AI agents