Recently I joined a new company and was assigned to run a release meeting. After about three months on this journey, we have streamlined the process, and the structure is now in place. I wish there had been some guidelines available when I first started on this endeavor, so here I am, documenting and sharing my experiences and thoughts in the hope of helping others.
Is the Release Meeting a Gatekeeper?
The first question is, what exactly is a "release meeting"? Is it akin to a gatekeeper, welcoming everyone and regulating the release of changes? This can feel like an intrusive and even offensive practice. But there must be a reason for its existence, right?
The rationale behind this process is a history of incidents: changes were released without undergoing the necessary checks or considerations before being deployed to production. This may be the result of several factors:
Knowledge Silos: The current system is a large-scale complex service, with over 30 engineers working on it simultaneously, making it challenging for everyone to be on the same page.
Reorganizations: Due to reorgs, many team members with contextual knowledge have left, and the existing practices have been lost.
Reviewer Culture: New engineers hired as part of the reorgs lack the historical knowledge and context, and there are not as many experienced reviewers as before.
New stack: We migrated from Java to Kotlin and from Cassandra to PostgreSQL, and introduced a new proxy service. It takes time and knowledge for a new stack to mature.
Urgency of Delivery: Fast delivery is expected as part of the performance review process, but not necessarily quality; engineers are focused on delivering quickly.
Tooling: We have transitioned from micro-repos to a mono-repo, and practices such as default reviewers, local builds, and testing are no longer as straightforward. There is a gap in tooling.
Closing these gaps in a short timeframe (say a quarter or two) is an impossible task, and introducing a gate between development and production is an approach to bridge this gap.
Quality at Speed or Quality Over Speed 🙇?
As an industry leader in the space, we aim to move as fast as possible while maintaining quality. At Atlassian, there are practices in place to achieve quality at speed. But what happens when speed does not meet the quality bar, especially when you inherit a business with a history of reorgs?
Firstly, you will need to define the quality bar for yourself. The bar varies at different stages of a business: it is expected for a startup to try quickly, fail, and get the first MVP out the door, but it does not make sense for an established platform to build services like a startup and to try and fail. Take some time to understand the legacy bar and incorporate your own understanding and thinking.
Secondly, you will need buy-in. Seek advice from passionate engineers, empower engineers to create code review checks, and support engineers as role models who showcase the right behavior.
Thirdly, apply the bar gradually and incrementally. I still remember the first time I pushed back on an untested change. The angry voices that came through the internet cable surprised me. Three months later, I wouldn't expect any engineer to come to the release meeting and push untested changes to production. It has become the new normal. A few things I found useful:
Conduct a survey 💬. The release meeting survey helped me understand what the bottom line is and where the pushback might come from. The code review survey legitimized the change, since there was a demand for it.
Review changes that did not go through the release meeting 🚪. The release meeting is a mandatory process enforced by humans; it is a cultural thing that requires communication. Observe the changes that bypassed it, ask for the reasons, and socialize them to help formalize the culture.
Empathy. It is easy to say no from an approver's perspective, but we all want to move fast. As a team, putting yourself in the other person's shoes to help the engineer succeed matters.
Fourthly, prioritize quality over speed when speed has taken precedence over quality. This requires a mindset change and lessons for everyone. We found it useful to treat incidents as educational opportunities: when there is an incident, the first thing to do the next day is to review it and assign the mitigation action items.
Lastly and most importantly, account for reliability when planning. We are fortunate to have the support of TPMs (Technical Program Managers) and engineering managers. By sharing this common understanding, we can assist the leaders in creating sensible plans that ensure quality and reliability are thoroughly considered as part of the planning process, including the readiness review, the progressive rollout plan, and the rollback procedure.
The Release Train
A release has an owner, a frequency, and a speed, and it varies from team to team. Given the nature of perms, we find that daily releases, service ownership, risk classification, a progressive rollout, and deployment ordering are helpful.
Release frequency varies from monthly, weekly, and bi-daily to daily. Our system is operated by a large team, with over 30 engineers working together every day, and about 6 changes need to be released to production daily. These changes range from VULN fixes and service onboarding to feature changes and major production changes. Piling up changes for a week (more than 30 changes in practice) makes them hard to evaluate, as critical, high-risk, and low-risk changes all have to go together, which is difficult for both the reviewer and the code owner.
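To make a daily batch of roughly six changes easier to reason about, each change can carry an explicit risk label so reviewers tackle critical items first. A minimal sketch in Python; the risk levels, fields, and example changes are illustrative, not our actual tooling:

```python
from dataclasses import dataclass

# Illustrative risk levels; real categories and criteria are team-specific.
RISK_ORDER = {"critical": 0, "high": 1, "low": 2}

@dataclass
class Change:
    title: str
    owner: str
    risk: str  # "critical", "high", or "low"

def review_order(changes):
    """Sort the daily batch so critical changes are reviewed first."""
    return sorted(changes, key=lambda c: RISK_ORDER[c.risk])

# Hypothetical daily batch.
daily_batch = [
    Change("bump library for VULN fix", "alice", "high"),
    Change("copy change on settings page", "bob", "low"),
    Change("switch reads to new PostgreSQL cluster", "carol", "critical"),
]

for change in review_order(daily_batch):
    print(f"[{change.risk}] {change.title} (owner: {change.owner})")
```

Tagging risk up front also makes it obvious when a low-risk change can skip the heavyweight discussion entirely.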
The release typically goes through an automatic release pipeline with automatic checkpoints in between. However, there are issues and misses as part of the process. It is important to assign a change owner to monitor the change in staging and then region by region. Having an on-call person oversee every change is humanly impossible, so the owner must be accountable and budget the time as part of planning.
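Region-by-region monitoring can be made concrete as a staged rollout loop that halts on the first unhealthy stage. A sketch under assumed stage names and a hypothetical health check; none of these come from our actual pipeline:

```python
# Hypothetical rollout stages; real stage names and bake times
# depend on the team's pipeline.
STAGES = ["staging", "region-1", "region-2", "all-regions"]

def rollout(change, healthy):
    """Advance a change stage by stage; stop and report on the first unhealthy stage.

    `healthy` is a callback the change owner supplies, e.g. a check
    against dashboards or alerts for that stage.
    """
    for stage in STAGES:
        if not healthy(change, stage):
            return f"halted at {stage}: roll back {change}"
    return f"{change} fully rolled out"

# Illustrative run: the change looks unhealthy in region-2.
print(rollout("PR-1234", lambda change, stage: stage != "region-2"))
```

The point of the sketch is the shape, not the code: each stage has an owner-verified gate, so a bad change stops partway instead of reaching every region.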
When there is a major change on a given day, we classify that day as special, for example, an event release day. We hold other changes to focus on it. This practice has been applied multiple times, as part of one critical service migration, a staging DB change, and others.
We also classify critical packages so that only specific owners can change them. For example, when the main focus was the reader migration, we restricted ownership of those packages to the reader migration owners.
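In a mono-repo, one common way to enforce that only specific owners can approve changes to critical packages is a CODEOWNERS-style rule. The paths and team name below are hypothetical, not our actual layout:

```
# Hypothetical CODEOWNERS entries: only the reader-migration owners
# can approve changes under the reader packages.
/services/reader/**        @reader-migration-owners
/services/reader-proxy/**  @reader-migration-owners
```

This moves the "who can change this" decision out of the meeting and into tooling, which is cheaper to enforce and survives reorgs better than tribal knowledge.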
Sometimes there are deployment orderings between releases within a day. Identifying those, and asking when to release, who coordinates, and what to verify, helps engineers streamline the changes and reduce the risks. Within one deployment, leveraging the progressive rollout guideline helps manage timeline pressure and empowers the reviewer to push back on dates.
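The deployment orderings within a day can be written down explicitly as dependencies and resolved mechanically. A sketch using Python's standard graphlib; the change names and dependencies are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each change lists the changes that must be
# deployed before it (e.g. a schema migration before the code that uses it).
deps = {
    "deploy-new-column-migration": set(),
    "release-service-using-column": {"deploy-new-column-migration"},
    "flip-feature-flag": {"release-service-using-column"},
}

# static_order() yields a valid deployment order that respects the dependencies,
# and raises CycleError if the dependencies contradict each other.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Even a table like `deps` on the release page, without any code, answers "what goes first" in the meeting instead of during the incident.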
What's Next?
Now that we have established a structure and mindset change within perms, the following steps are being taken:
Grant more freedom, starting with low-risk changes. Not every change is high risk; after verifying many times that a class of change is safe, we will classify it as such and trust the existing checkpoints.
Empower reviewers. Software is built by engineers with culture, practices, and knowledge, all of which are driven by humans. Coaching is the path to cultivating pragmatic engineers and constructing resilient systems, and code review is one of the best ways to coach.
Bridge the gaps within the system, particularly in design, development, testing, monitoring, and alerting. Embracing systems over human intervention is key to minimizing "bitter lessons" like this one.