Do the riskiest thing first – Sachin Dhamdhere blogs

Today’s blog is inspired by my experience delivering software on what most teams would describe as ‘aggressive timelines’, even by the standards of a typical high-flying software shop. A meeting at work prompted me to write about it.

So the general theme is this: at the outset, list all the things that scare you about a project, all the things you don’t know, and all the things that could break your customers’ experience. Once you have those, stack rank them and pick the riskiest thing(s) to do first. You’ll have a much better outcome than otherwise, because you’ll have de-risked the future. It is true that your estimates will be inaccurate in the beginning, which may make your stakeholders question the team’s ability to deliver, but with open and transparent communication you should be able to persuade them of this approach. The other advantage is that early in the project you have a lot more control over schedule slips and more room to deal with their downstream impact. As you make progress, you will deliver the necessary features on time and with high certainty, because the riskiest items are either done or no longer unknown. Avoid the trap of starting with the things you already know how to do. The reason is simple: you know them, so you can execute them later.
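The stack ranking described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the 1 to 5 scales, the `uncertainty * customer_impact` score, and the example entries are all assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    uncertainty: int      # assumed scale: 1 (well understood) to 5 (no idea how to do it)
    customer_impact: int  # assumed scale: 1 (cosmetic) to 5 (breaks the customer experience)

    @property
    def score(self) -> int:
        # Assumed scoring rule: unknown AND customer-facing items float to the top.
        return self.uncertainty * self.customer_impact


def stack_rank(risks):
    """Return the risks ordered from riskiest to safest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries, loosely modelled on the project described in this post.
risks = [
    Risk("Re-run the existing per-cluster performance suite", 1, 2),
    Risk("Load-test the new shared routing/auth components", 5, 5),
    Risk("Project traffic for the new shared components", 4, 4),
]

ranked = stack_rank(risks)
for r in ranked:
    print(f"{r.score:>2}  {r.name}")
```

The item the team already knows how to do lands at the bottom of the list, which is the point: it can wait.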

Now, to the specific trigger: we are re-architecting and releasing a service in which a few new components are in the mix. These components will serve traffic from all customers and route it appropriately (think API Gateway-like services for routing, authenticating and authorizing traffic). In the existing version of the service, each cluster of software is isolated, so there’s no “noisy neighbor” problem. The new shared components sit in front of all customers, so that isolation no longer comes for free. Given that the current incarnation of the service has been available for many years, there’s robust tooling for performance testing, but only on the existing software stack. The engineers working on performance testing the new version of the service had fallen into this line of thinking: “this is how it’s done in the existing service”, “this is something we can do now”, “we don’t know how to write tests and harnesses for the new service, and it’ll take time”, “we don’t know how to project traffic patterns for the new common components: is it 10K requests/second, or 100K, or a million?” … you see the trap, right? The customer experience will largely be determined by the scalability of the new components, and we were punting on them in favor of things we can do, and can do now.

This will lead to rush and stress later on, when the team finally figures out how to write those tests, how to measure performance and how to simulate traffic, only to find out that the new components don’t scale well, need a lot of tuning, need a lot more hardware behind them or, worse, need some re-design. Any of these puts the go-live date at risk, or misses it outright. If it turns out we need to redesign a few things, the delays will be even costlier. It’d be best to invest time in the unknowns, the riskiest things, right now rather than later. Yes, it’ll be uncomfortable to write tests with unknowns, but the discovery phase will help not only the performance engineering team but the overall org, by bringing to the front questions we may not have even thought about.
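One way to start when the traffic projection itself is unknown is to bracket it: plan for 10K, 100K and a million requests/second and see what each would demand. Here is a rough sketch of that arithmetic. The per-instance throughput, the 50% headroom, and the assumption that the components scale linearly are all hypothetical placeholders. Whether they hold is exactly what the new tests would need to find out.

```python
import math


def instances_needed(target_rps, per_instance_rps, headroom=0.5):
    """Instances required to serve target_rps if each runs at `headroom` utilization.

    Assumes perfectly linear scaling, which a shared component may not deliver.
    """
    usable_rps = per_instance_rps * headroom
    return math.ceil(target_rps / usable_rps)


# Hypothetical figure: replace with a number measured on the new components.
PER_INSTANCE_RPS = 2_000

# The three traffic guesses mentioned in the post, in requests/second.
scenarios = [10_000, 100_000, 1_000_000]

plan = {rps: instances_needed(rps, PER_INSTANCE_RPS) for rps in scenarios}
for rps, count in plan.items():
    print(f"{rps:>9,} req/s -> {count:>5,} instances")
```

Even with made-up inputs, laying the three scenarios side by side shows how much the answer swings, and which number is worth measuring first.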