Lessons I learned from working on a high-load legacy application

Published in Emarsys Craftlab · 8 min read · Apr 8, 2020

In the last six months, I had the opportunity to join a different developer team in the company. I wanted to get deeper into our architecture, maybe try out the newest and shiniest JavaScript framework. However, the team where I was placed was responsible for operating one of the cornerstone services, which handles a huge load and also has a massive legacy codebase.

Oh well, do I really want this?

On second thought, I started to see the potential. Whoever operates and develops this part of our application must be battle-tested and pretty experienced. I would learn a lot in this new environment that I could use on my next team as well. (And maybe there I would get to use that newest and shiniest JavaScript framework.)

Sure, let’s hop on board!

The six months passed quickly, and my anticipation was right: I have learned a lot since. To summarize the experience, I collected takeaways in three categories that I think could be useful for any developer:

change management
coding practices
and teamwork

Change management

̶M̶o̶v̶e̶ ̶f̶a̶s̶t̶ ̶a̶n̶d̶ ̶b̶r̶e̶a̶k̶ ̶t̶h̶i̶n̶g̶s̶ Be careful and deliver reliably

Facebook’s former motto, “Move fast and break things”, can be read as a description of how the XP (Extreme Programming) development methodology works. We tend to value a broken release followed by a fast hotfix more than getting it right the first time, as long as the solution ships earlier. Rapid, one-week iterations, continuous integration and deployment all emphasize that we can fix every mistake fast, and you should just keep shippin’. But if your team is responsible for operating a service that is continuously under high load, then even the difference between zero downtime and a 2-minute downtime can be significant. That is why we introduce every new code change behind a feature flag (called flippers internally) and roll it out to our servers gradually.

If any code modification has the slightest side-effect we didn’t anticipate, the change can be instantly disabled and examined further before restarting the rollout.

The team uses feature flags extensively, and we even created our own tooling to help us manage the parallel rollouts using custom Slack notifications and a rollout timetable. Now that I have learned why and how we work with feature flags (and operate them reliably), I see how a team can still be agile and ship features fast while also managing change responsibly and trying to avoid even the slightest downtime.
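To make the idea concrete, here is a minimal sketch of a percentage-based flag in Python. It is not the team's internal flipper tooling, which the article does not describe in detail. The names `FeatureFlag`, `rollout_percent`, `handle_request`, and the two code paths are all made up for illustration, and hashing a server ID into one of 100 buckets is just one common way to get a stable, gradual rollout.

```python
import hashlib


class FeatureFlag:
    """Hypothetical percentage-based flag, enabled for a stable subset of keys."""

    def __init__(self, name: str, rollout_percent: int = 0) -> None:
        self.name = name
        # 0 means the new code runs nowhere, 100 means it runs everywhere.
        self.rollout_percent = rollout_percent

    def _bucket(self, key: str) -> int:
        # Hash the flag name together with the key so that every key lands
        # in the same bucket (0 to 99) each time it is checked.
        digest = hashlib.sha256(f"{self.name}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, key: str) -> bool:
        # A key that is enabled at 10% stays enabled at 50%, because its
        # bucket number does not change when the percentage grows.
        return self._bucket(key) < self.rollout_percent

    def disable(self) -> None:
        # Kill switch: route everything back to the old code path.
        self.rollout_percent = 0


def old_code_path(payload: str) -> str:
    return payload.upper()


def new_code_path(payload: str) -> str:
    return payload.strip().upper()


def handle_request(flag: FeatureFlag, server_id: str, payload: str) -> str:
    # The new behaviour is only reachable through the flag check.
    if flag.is_enabled(server_id):
        return new_code_path(payload)
    return old_code_path(payload)


if __name__ == "__main__":
    flag = FeatureFlag("strip-whitespace", rollout_percent=10)
    servers = [f"server-{i}" for i in range(20)]
    print([s for s in servers if flag.is_enabled(s)])
    flag.disable()  # an unexpected side effect showed up, so switch it off
    print(any(flag.is_enabled(s) for s in servers))
```

Raising `rollout_percent` step by step widens the set of servers running the new code, and `disable()` sends all of them back to the old path without a deployment.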

Know the magnitude of your impact

When code runs on clusters of computers in hundreds of parallel threads, new challenges can quickly arise. As I learned, changing such code requires you to consider factors that hardly matter in a small microservice. The most important one is the potential impact of your change: how much extra data, traffic or processing resources will it use?

You don’t need to be exact, but at least think about its magnitude:

Sometimes fine-tuning a SQL query is not worth the debugging and optimizing effort. In other cases, it can mean the difference between acceptable execution time and disastrous database delays.
Logging a new message can mean one more harmless line in the crowd. Or, if it’s repeated at every iteration, it can also mean that you instantly generate double the logs you used to, and your infrastructure will get overloaded in minutes.
The one-in-a-million edge case you decided not to handle? Well, it might never happen, but it can also occur on a daily basis causing invalid values and alerts all the time.

Hard-to-debug race conditions, actual hardware limitations, and software optimizations are things you always need to think about when the application is continuously running on hundreds of parallel processes. To avoid unnecessary resource consumption on the one hand, and wasted hours of optimization that don’t add enough value on the other, first think about the magnitude of your impact.
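A back-of-the-envelope estimate is usually enough to get the magnitude. The sketch below does it for the logging example. The function name `daily_log_volume_gb` and every number in it (5,000 events per second, 200 bytes per line, 20 loop iterations per event) are invented for illustration and are not measurements from the application described here.

```python
def daily_log_volume_gb(
    events_per_second: float,
    lines_per_event: float,
    bytes_per_line: float,
) -> float:
    """Rough daily log volume in gigabytes (1 GB = 10**9 bytes)."""
    seconds_per_day = 24 * 60 * 60
    total_bytes = events_per_second * lines_per_event * bytes_per_line * seconds_per_day
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed baseline: one log line per processed event.
    baseline = daily_log_volume_gb(5_000, 1, 200)
    # One extra line per event doubles the volume.
    one_more_line = daily_log_volume_gb(5_000, 2, 200)
    # The same line inside a loop with 20 iterations per event
    # gives 1 + 20 lines per event.
    inside_loop = daily_log_volume_gb(5_000, 21, 200)

    print(f"baseline:      {baseline:.1f} GB/day")       # about 86.4
    print(f"one more line: {one_more_line:.1f} GB/day")  # about 172.8
    print(f"inside a loop: {inside_loop:.1f} GB/day")    # about 1814.4
```

The exact figures do not matter. What matters is seeing whether a change costs a few percent more or twenty times more before it ships.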

Coding practices

Clean code matters

This one probably sounds basic, but if part of your job is to maintain and extend a legacy application, you might tend to give up on those precious clean code principles here and there.

You might even think it’s a lost cause for this codebase anyway, why bother? Well, bother, you should!

Just because the first lines of the codebase were committed to version control long enough ago to be labeled legacy, it doesn’t mean you are fighting a lost battle. On the contrary, you can achieve some quick wins when you start contributing.

Adding new classes and refactoring from time to time following clean code guidelines means you are continuously improving the overall quality and leaving the codebase better than you found it. After some time, the team should realize that they have come a long way. The most used classes already look the way you like them, and most of the code can easily be tested or reused. One note, though: you have to get used to the feeling of incompleteness. Most probably, some of the files will remain untouched because they are just not worth refactoring.

Our team’s project is not just legacy: to this day, it also handles an increasingly high load. There were multiple iterations to speed up the application, and most of them included some code refactoring. We always kept clean code principles in mind when touching the codebase. Thanks to these efforts, even though the whole team has changed (multiple times) since the project started, we can still comfortably understand and extend the code. We gained significant speed and now run this not-so-legacy-looking legacy application more reliably and efficiently than ever before. This achievement is in large part due to following clean code principles.

Smart logging makes a difference

Adding thorough logging to your code is one of the best things you can do to help your future self in possibly every investigation that will occur. This step might seem easy when coding: you get a logger from somewhere and shout your success message into it.

But on a large-scale application, logging doesn’t end here. Well, it doesn’t even start here.

Everything you might forget to think through with your microservice’s logging starts to make sense now. Carefully chosen identifiers can mean the difference between going through the 10,000 lines your application generates every minute and being able to filter them down to just the ones you need. Separating your logs into debug, info, error, and other levels lets you see every minor detail in staging without generating gigabytes of logs every hour in production. When you need to debug, you can even enable the otherwise disabled debug-level logs to get more insight into the fine details.
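As a rough illustration of both ideas, the sketch below uses Python's standard `logging` module. The identifier fields `customer_id` and `request_id`, the environment names, and the helper functions `build_logger` and `request_logger` are assumptions made for the example, not details of the application described here.

```python
import logging


def build_logger(name: str, environment: str) -> logging.Logger:
    """Create a logger whose verbosity depends on the environment."""
    logger = logging.getLogger(name)
    # Assumed policy: every detail in staging, info and above in production.
    logger.setLevel(logging.DEBUG if environment == "staging" else logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler()
        # The identifiers are part of every line, so logs can be filtered by them.
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s %(levelname)s %(name)s "
                "customer=%(customer_id)s request=%(request_id)s %(message)s"
            )
        )
        logger.addHandler(handler)
    logger.propagate = False
    return logger


def request_logger(
    logger: logging.Logger, customer_id: str, request_id: str
) -> logging.LoggerAdapter:
    # The adapter attaches the identifiers to every message logged through it.
    # Always log through the adapter: the format above expects both fields.
    return logging.LoggerAdapter(
        logger, {"customer_id": customer_id, "request_id": request_id}
    )


if __name__ == "__main__":
    log = request_logger(build_logger("mailer", "production"), "c-42", "r-1001")
    log.debug("rendered template in 12 ms")  # dropped in production
    log.info("message queued")
    log.error("delivery failed, will retry")
```

Turning on debug output during an investigation is then a single call, `logging.getLogger("mailer").setLevel(logging.DEBUG)`, rather than a code change.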

Proper logging is just the first building block for more advanced instrumentation. With the help of an adequate log processing and aggregation platform, you can create different alerts. You can get a notification if an exception happens (using proper log levels helps a lot here) or if a certain kind of log entry occurs too often (404 responses, for example; the right identifiers make this easy). On top of logs and alerts, you can even visualize the status of your system based on aggregated log lines. The application’s throughput or average response time is an excellent first candidate to put on a chart, so that a glance at the dashboard tells you whether everything is all right.
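In practice a log platform does this aggregation for you, but the underlying idea fits in a few lines. The sketch below groups already-parsed log entries by minute and derives throughput, average response time, and a simple 404 alert. The `AccessLogEntry` fields, the `summarize` function, and the threshold are hypothetical and only serve the example.

```python
from dataclasses import dataclass


@dataclass
class AccessLogEntry:
    minute: str         # for example "2020-04-08T10:15"
    status: int         # HTTP status code
    response_ms: float  # response time in milliseconds


def summarize(entries: list[AccessLogEntry], max_404_per_minute: int) -> dict[str, dict]:
    """Aggregate log entries into per-minute metrics and alert flags."""
    per_minute: dict[str, dict] = {}
    for entry in entries:
        stats = per_minute.setdefault(
            entry.minute, {"requests": 0, "total_ms": 0.0, "not_found": 0}
        )
        stats["requests"] += 1  # throughput: requests in this minute
        stats["total_ms"] += entry.response_ms
        if entry.status == 404:
            stats["not_found"] += 1
    for stats in per_minute.values():
        stats["avg_ms"] = stats["total_ms"] / stats["requests"]
        # Alert when this kind of log line occurs too often in one minute.
        stats["alert"] = stats["not_found"] > max_404_per_minute
    return per_minute


if __name__ == "__main__":
    entries = [
        AccessLogEntry("2020-04-08T10:15", 200, 100.0),
        AccessLogEntry("2020-04-08T10:15", 404, 300.0),
        AccessLogEntry("2020-04-08T10:16", 404, 50.0),
        AccessLogEntry("2020-04-08T10:16", 404, 50.0),
        AccessLogEntry("2020-04-08T10:16", 404, 50.0),
    ]
    for minute, stats in summarize(entries, max_404_per_minute=2).items():
        print(minute, stats["requests"], stats["avg_ms"], stats["alert"])
```

The `requests` and `avg_ms` values per minute are exactly the kind of series that can go straight onto a dashboard chart.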

Teamwork

Avoid one-man armies

One of the biggest lessons our team learned in the last months is to avoid the formation of one-man armies. A one-man army evolves when the same team member is working on a specific part of the application for a long time.

As their knowledge grows, they quickly become the go-to-person for every other team member regarding that code segment.

Although this might even seem right at first, the situation can quickly backfire. What if you can’t reach this person? A sudden illness or other matter can leave a hole in your team’s domain knowledge. If more development requests come in than usual, you can’t move forward on multiple tracks either, since one track will always have to pause and ask for guidance, interrupting the other track as well.

We use pair coding, where two developers work closely at one computer to discuss challenges and solve problems together. If pair coding is done right, it can keep the team from forming one-man armies. At every stand-up, you should think about whether you have changed coding pairs frequently enough. Did you pick the task that is the easiest for you, or did you move out of your comfort zone? By regularly doing the latter, you can learn about every part of your codebase and avoid one-man armies at the same time.

Learn to make decisions from half information

Even after six months of working in the team, I cannot confidently say that I know the project by heart. Sure, I understand how it works, and I have a high-level overview of everything, but when we start to go into fine details, I can lose my confidence. This uncertainty is all right for a while. No one expects a new team member to know everything instantly. But as time passes, your opinion is asked for more and more often, even though you still feel like you don’t know everything. At first, I thought it was only my concern, but after talking to the others, it seems to be a common issue.

Let me prepare you for the sad truth: in spite of your domain knowledge growing steadily, where millions of lines of legacy code meet multiple outdated frameworks, the feeling of completeness might never come.

To overcome this obstacle, you need to learn how to make decisions based on half information.

After gaining a general understanding of the system, join the team’s technical discussions as soon as possible.

They will help you fill in some of the gaps, and you will realize it’s not necessary to know everything before you can start contributing to the codebase. Don’t forget that you have aids, and don’t be afraid to use them. If you are in doubt, try to find the relevant lines of source code and understand them locally. You can also always create spike solutions and performance tests.

The last six months have been one hell of a journey. They passed fast and provided lots of great insights. After internalizing most of these concepts, I have come to think that they are not just separate observations but a whole cohesive way of thinking. They reinforce and support each other.

Although I learned this mindset when working on a high-load legacy application, I think it is useful for any codebase. Before making a change, be sure to know the magnitude of your impact. Sometimes it won’t be precise, but you still need to deal with it. That’s when being able to make decisions from half the information becomes very useful. As a general rule for the team: avoid one-man armies, use pair programming. And most importantly, code clean!

This will make a difference.


Written by Soma Erdélyi