A Week in the Life: On-Call as a Software Engineer at Amazon
This article provides a glimpse into the life of a software engineer at Amazon during an on-call rotation. It covers the challenges, the learning experiences, and the everyday tasks involved in keeping the systems running smoothly.
What Does "On-Call" Mean?
Being on-call means being responsible for handling any issues or emergencies that arise with the systems and services your team supports. It's like being the first line of defense when something breaks. The engineer is responsible for fixing it, no matter the time of day. While stressful, it's a valuable learning opportunity to understand the inner workings of the system.
The Stressful Reality of On-Call
- The on-call shift can be incredibly stressful. One particular week was so intense that the engineer couldn't even find time to eat, working from early morning until late at night.
First Day Nightmares
On the very first day of one on-call rotation, the team experienced a denial-of-service attack from another internal team. This resulted in an outage lasting several hours, with a significant financial impact.
- The cause of the attack remained unclear, sparking debate and confusion.
Normally, a reverse shadow assists during the first on-call shift, providing guidance and support. However, the assigned senior engineer called in sick. The manager stepped in, but the severity of the issue required the involvement of all senior engineers, even those from external teams. Despite having been on the team for only three months, the engineer was tasked with driving the resolution, implementing a hotfix while sharing their screen with experienced engineers. The engineer described this as one of the most stressful situations they had ever experienced.
Weekly Operations Meeting
Every week, the team holds an operations meeting.
- The engineer who was on-call the previous week presents a document summarizing the shift.
- They review major incidents, highlight important events, and share key learnings.
- The primary goal is to hand off any unfinished but important tickets to the next on-call engineer.
Handling Support Tickets
Aside from high-severity incidents, the on-call engineer is also responsible for handling support tickets. These tickets can involve:
- Minor bugs.
- User-reported issues.
- Questions about features.
- Operational maintenance ("janitor work").
While less exciting, this maintenance is crucial for maintaining system stability.
Paged After Work
The engineer describes a situation where they were paged immediately upon arriving home after work. They had to address the issue before even having dinner.
Investigating Alarm Tickets
One of the most common and challenging types of tickets involves alarms that trigger when something goes wrong.
- Determining the root cause requires digging through logs, reviewing recent code changes, and checking pipelines.
- It often involves reaching out to other teams for context, much like detective work.
In one particular case, an internal customer needed an attribute updated in DynamoDB, but the change wasn't reflected in the service.
Seeking Help and Collaboration
Despite attempting to resolve the issue independently, the engineer eventually sought help from a teammate. They hopped on a call at 9:30 PM. The engineer notes that it's always acceptable to page a teammate or manager when a customer-impacting issue has gone unresolved for too long.
Caching Issues
The engineer eventually pinpointed the root cause of the issue after getting help from the team during standup. The problem was caused by the cache not being updated properly, and they began diving into the code to understand why.
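To make this failure mode concrete, here is a minimal sketch of how a cache can keep serving an old value after the underlying record changes. It is not the team's actual code: the `ReadThroughCache` class, the `table` dict (a plain in-memory stand-in for a DynamoDB table), and the key and attribute names are all hypothetical, and the real cause in the engineer's service may have been different.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Tiny read-through cache with explicit invalidation (illustrative only)."""

    def __init__(self, load: Callable[[str], dict]) -> None:
        self._load = load
        self._entries: Dict[str, dict] = {}

    def get(self, key: str) -> dict:
        # Serve from the cache when possible; otherwise load and remember.
        if key not in self._entries:
            self._entries[key] = self._load(key)
        return self._entries[key]

    def invalidate(self, key: str) -> None:
        # Drop the cached entry so the next read goes back to the store.
        self._entries.pop(key, None)


# Hypothetical backing store: a plain dict, not a real DynamoDB client.
table = {"customer-1": {"tier": "basic"}}


def load_from_table(key: str) -> dict:
    # Return a copy so cached values do not alias the store's records.
    return dict(table[key])


cache = ReadThroughCache(load_from_table)

before = cache.get("customer-1")["tier"]   # first read populates the cache
table["customer-1"]["tier"] = "premium"    # attribute updated directly in the store
stale = cache.get("customer-1")["tier"]    # cache still returns the old value
cache.invalidate("customer-1")
fresh = cache.get("customer-1")["tier"]    # reloaded from the store after invalidation

print(before, stale, fresh)
```

In this sketch the update only becomes visible once the entry is invalidated. Other common remedies, such as a time-to-live on entries or updating the cache on every write, trade freshness against load on the store in different ways.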
Balancing Workload and Team Dynamics
The intensity of the on-call experience varies depending on the team's responsibilities.
- Teams managing services with a large user base (like AWS) face more intense shifts.
- Teams with smaller audiences (like Audible) have lighter workloads.
- This team falls somewhere in between.
The engineer emphasizes the importance of asking questions, especially of senior engineers, but doing so strategically. Before asking for help, they explain the problem, their attempts to solve it, and ask if they are on the right track.
Getting Coverage and Personal Time
The engineer needed to find someone to cover their on-call shift so they could go snowboarding and celebrate a friend's birthday.
Additional Activities
The engineer also mentions participating in lunch and learn sessions.
- These sessions involve team members presenting on various topics, ranging from system architecture to personal projects.
Conclusion
The video shows both the difficult and valuable aspects of being an on-call software engineer at a large company. It can be stressful, but with a good team and the proper approach to learning and seeking help, it can also be a great opportunity for career growth.