In software, every decision comes at a cost.
In this episode, we talk about software mistakes and tradeoffs with Tomasz Lelek, senior software engineer at DataStax and co-author of the book, "Software Mistakes and Tradeoffs: Making good programming decisions." After listening if you want to get a copy of the book, go to the link in our show notes and use offer code poddevdisc21 for a 35% discount.
Ben Halpern is co-founder and webmaster of DEV/Forem.
Tomasz currently works at DataStax, building products around one of the world's favorite distributed databases, Cassandra. He contributes to Java-Driver, Cassandra-Quarkus, Cassandra-Kafka connector, and Stargate. He is also co-author of the book, "Software Mistakes and Tradeoffs: Making good programming decisions."
[00:00:00] TL: Often the smaller part of the code is responsible for most of the users. And once we detect such a code, I'm calling it hot-path, it’s a lot easier to focus optimization efforts.
[00:00:24] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all of our lives as developers. I’m Ben Halpern, a Co-Founder of Forem. And today, we’re talking software mistakes and trade-offs with Tomasz Lelek, Senior Software Engineer at DataStax, and author of the book, Software Mistakes and Trade-Offs: Making Good Programming Decisions. Thank you for being here.
[00:00:47] TL: Thanks for inviting me.
[00:00:49] BH: I imagine you are abundantly qualified, if you’re going to be writing a book about mistakes and trade-offs, such a fundamental part of software development. So is there any part of your career or challenge that particularly led you to wanting to author a book called Software Mistakes and Trade-Offs? Or was it sort of more just an aggregate thing? Like any particular challenge informed this book more than something else?
[00:01:16] TL: I think it was an aggregated list of those trade-offs, decisions, and also mistakes. I was collecting a list of those things from the beginning of my engineering career. It’s some kind of decision log with pros and cons and the different paths that we could take as a team. It was at a low level, like code level, but also architectural: which solution to pick, how to build a more complex system. And then I was observing how it evolved over time, because I was maintaining those systems. So I had the possibility to observe how systems behave in production: what our assumptions were, how it turned out in real life, and also what I would do differently if I were able to go back in time and maybe pick something else.
[00:02:12] BH: How much do you consider these topics to apply in all domains of software development? Or is there any kind of application layer of the stack like if I do systems development versus front-end development? Do you have a sense that like some of these themes apply more to certain domains and are there some that sort of stand out as like the most fundamental no matter what sub domain of this craft you’re in?
[00:02:45] TL: The book was started by two back-end developers, me and Jon. On the front end, there are different challenges, different patterns. For all of my professional career I have worked as a back-end developer, so this is my expertise. Maybe in the future someone will write a similar book about the front end. But as long as you need to write some back-end stuff, even if you are a full-stack engineer, I think you will benefit from this book. I mean, you don’t need to read all the chapters. For example, there is a chapter about big data processing and data locality. If you’re building services or APIs that don’t need to deal with storage, maybe you can skip just this chapter. But on the other hand, there is a chapter about distributed systems and calls in them: how to retry operations, how to design your API to be idempotent, and so on. So even if you are a full-stack engineer, I think you might benefit from this chapter.
[00:03:43] BH: Another question about what the fit of this book is for different folks in different parts of their career. And I am certain this is broadly relevant to anyone, but I’m curious if you feel like, across specific programming languages and maybe the cultures that surround them, as well as specific programming environments, like maybe working in a startup versus a bigger organization, is there a sense of like trade-offs within the trade-offs or anything where you feel certain lessons can apply better for a C++ developer versus a Ruby developer and this sort of trade-offs within these paradigms?
[00:04:22] TL: Yeah. So I had the pleasure to work in both environments, in small startup environments and also in more corporate environments, bigger companies that were not purely technology companies, where technology was just a tool to achieve business goals. And all the patterns in this book apply in both of those domains, for sure. So I would not say that if you are working in one of those you will not benefit from the book. I would even say it differently: maybe you will see different patterns for solving the same problem, how to evolve your data, how to build an anti-corruption layer, what technology is used to encapsulate all of that. And regarding the experience of developers, I personally would have loved to read such a book at the beginning of my career. But on the other hand, I think that if someone has some experience, like one or two years, such a person will benefit from this book more, because they can relate the examples and the problems to their personal experience. However, at Manning, the publisher, during the review stage the whole book or its chapters were reviewed by 30 people, 30 engineers, and their experience varied a lot. There were some junior developers, and there were some architects with like 15 years of experience. And I think both of those groups benefitted from the book.
[00:05:52] BH: So the first chapter of the book is called, “Code duplication is not always bad,” with the subheading of Code Duplication versus Flexibility. Can you dive into this topic a little bit? And this seems very broadly applicable as well. Any domain of software development is going to have to think about this type of trade-off. And can you expand on this one a little bit?
[00:06:16] TL: Yeah, this is a good example, also referring to the previous question. If you are at university, everyone is teaching you the DRY rule: do not repeat yourself. And this is a good rule in most contexts. At the beginning, it’s a great one. But as you gain more experience, you start noticing that maybe you shouldn’t follow this rule blindly. This is where this chapter comes in. There are two parts in this chapter. One is at the architectural level and the second one is at the low, code level. At the architectural level, I’m considering a case where we have multiple services. Microservice architecture, or multi-service architecture, is pretty popular now, and a lot of architectures evolve toward this separation. And if you have, for example, two codebases, it is often the case that there is some common code between them. It could be some very simple logic for validating your input, or for validating the token for authorization, and so on. And if you follow DRY, it would mean that every such piece of code needs to be factored out, abstracted away, and shared by those two microservices. It could be shared using a library or another microservice, but it comes with a cost. Firstly, you need to maintain this separate part of the code. It also introduces tight coupling between this code and your microservices, and it may turn out that you started to abstract away your stuff too early. Maybe this common code should evolve in different ways. And if you have this abstraction and you are tightly coupled to it, it’s harder to evolve, so you have less flexibility, which might impact the speed of delivery of your software. So this is the trade-off, basically.
And the second part is at the code level. I’m considering inheritance and the design of components that use inheritance as a pattern to reduce duplication. In some cases it also limits the flexibility of the components that extend the parent class, and I also consider composition there.
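As a rough illustration of the inheritance versus composition point above, here is a minimal Python sketch. All class and function names (`BaseHandler`, `Handler`, the validators) are hypothetical and are not taken from the book; the token format is an assumption made only for the example.

```python
from dataclasses import dataclass
from typing import Callable


# Inheritance: the validation logic lives in a shared parent class.
class BaseHandler:
    def validate(self, token: str) -> bool:
        return token.startswith("tok-")


class PaymentHandler(BaseHandler):
    # Reuses the parent's rule; any change to BaseHandler.validate
    # changes this handler too, whether it wants that or not.
    pass


# Composition: each handler is handed its own validator and can
# evolve it independently, at the price of some duplication.
@dataclass
class Handler:
    validate: Callable[[str], bool]


def prefix_validator(token: str) -> bool:
    return token.startswith("tok-")


def strict_validator(token: str) -> bool:
    # Similar to prefix_validator, but free to diverge.
    return token.startswith("tok-") and len(token) >= 8


payments = Handler(validate=strict_validator)
mail = Handler(validate=prefix_validator)

inherited_ok = PaymentHandler().validate("tok-1")
payments_ok = payments.validate("tok-1")
mail_ok = mail.validate("tok-1")
```

With composition, the payments handler can tighten its rule without touching the mail handler, which is the flexibility the shared parent class takes away.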
[00:08:36] BH: It seems like other chapters in this book kind of help support the argument you’re just making now. So we have a chapter called Balancing Flexibility and Complexity, which is kind of like a supporting argument for the whole idea of Code Duplication versus Flexibility, I imagine. And there’s also like premature optimization versus optimizing hot-path decisions that impact code performance. So when you’re talking about trade-offs of duplication versus flexibility, sometimes the trade-off is a matter of complexity and tying yourself to an abstraction, which may be wrong, and that’s probably a matter of complexity. And then there’s also the optimization decision. And these both kind of seem to tie into this first idea of duplication versus flexibility. Sometimes it’s a matter of complexity as the trade-off and sometimes it’s a matter of performance. Would you say those topics all fit together reasonably effectively? There’s also a chapter called Simplicity of Your API Versus Cost of Maintenance. So I’m just sort of plucking a lot of things that I think are highly related thematically, and I imagine that’s why they fit into a single book. Can you speak to kind of the relationship between some of these chapters?
[00:09:52] TL: Yeah. So about Balancing Flexibility and Complexity, one of the aspects could be, as we mentioned, duplication, which is described in the second chapter. In this one, I’m going in a bit of a different direction. It’s from the perspective of someone who is developing code that will be used by other people, right? It could be within your company, designing a component that will be used by other code paths, other parts of your organization. That’s often the case, right? In such a context, it is often tempting to over-engineer your component, to guess all possible functionalities, all possible use cases up front. For example, I’m showing one of the patterns for achieving that: a library that allows clients to plug in some custom behavior through a Listener API or Hook API. And this sounds awesome because you are leaving a lot of flexibility to your clients. I think this is one of the most flexible mechanisms, because clients can do anything they want in those code paths. But on the other hand, you are raising the complexity. And where does the complexity arise? Mainly it arises because you are not able to anticipate all the ways that customers can use your code. For example, in JVM languages, they can throw runtime exceptions anywhere. You don’t have influence over it and you need to expect anything. And expecting anything means that your code will grow more and more complex. Then there is also the threading model. If you don’t have clear documentation of how it should be used, the customers may block processing, and that can impact your whole application, right? If you have a non-blocking event loop, that could be blocked, and that also increases complexity. Even if you are documenting this well enough, maybe not everyone will read the documentation. So in the end, we still need to handle those cases, even if it’s documented.
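The hook idea and its hidden cost can be sketched in a few lines of Python. This is only an illustration under assumptions: `Processor`, `register`, and `hook_failures` are hypothetical names, not an API from the book or from any real library.

```python
from typing import Callable, List

Hook = Callable[[str], None]


class Processor:
    """Hypothetical component that lets clients plug in custom hooks."""

    def __init__(self) -> None:
        self._hooks: List[Hook] = []
        self.hook_failures = 0

    def register(self, hook: Hook) -> None:
        self._hooks.append(hook)

    def process(self, event: str) -> str:
        for hook in self._hooks:
            try:
                hook(event)  # client code: it may raise anything
            except Exception:
                # The component cannot predict client errors, so it has
                # to contain them. This is the extra complexity.
                self.hook_failures += 1
        return event.upper()


def bad_hook(event: str) -> None:
    raise RuntimeError("client bug")


seen: List[str] = []
processor = Processor()
processor.register(bad_hook)
processor.register(seen.append)
result = processor.process("order-created")
```

The sketch only guards against exceptions. A hook that blocks for a long time would still stall `process`, which is the threading-model concern raised above and would need timeouts or a separate executor to address.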
[00:12:13] BH: It’s one thing to develop your own awareness of some of these concepts to be able to make the trade-offs yourself between code duplication and flexibility. But how do you go about having this conversation effectively with team members who might see a PR you put up and complain that maybe this is code duplication, whereas your feeling is that it’s an appropriate amount of duplication given the trade-off? How do you go about having these conversations with colleagues and perhaps managers or other stakeholders in the business?
[00:12:53] TL: Yeah. So the goal is to just propose two solutions. If you have enough time, maybe create pull requests with both approaches and discuss them, right? You can decide which one is better by looking at the code, if it’s a quite isolated and small change. At the higher level, you need to understand that duplication also requires some synchronization effort, maybe between teams, maybe between the code paths. But those two chapters that I’ve discussed are more low level, at the code level, and this is the first part of the book. The second part transitions to a more architectural and end-to-end understanding of your systems, and to the fifth chapter, as we mentioned, about premature optimization and optimizing the hot-path. Avoiding premature optimization is a well-known rule [INAUDIBLE 00:13:46] and everyone’s afraid of optimizing code up front. However, I think, and I’m trying to convince readers, that if we have enough data up front, we can turn this premature optimization into a just-in-time optimization or something like this. We need two things there. The first is some kind of SLA or some kind of guidance on how big the traffic will be, so the throughput of your application, of your endpoints, of your APIs. But this is not enough, and this [INAUDIBLE 00:14:22] proposing. You also need to have the latencies collected per endpoint. They could be collected using performance tests, and I’m presenting an approach to do this, or taken from the SLA. Once you have both of those data points, you can calculate the relevance of the code paths in your code. Often it turns out that most of the endpoints, most of the features, affect a smaller part of the users, while a smaller part of the system, a smaller part of the code, provides most of the business functionality. It also means that you can concentrate your efforts on optimizing that small part of the code.
This is also observed in a lot of systems, not only software systems. There is the Pareto principle, which states that 80% of the value is delivered by 20% of the work, and it applies very well to most software systems, because often a smaller part of the code is responsible for most of the users. And once we detect such code, which I’m calling the hot-path, it’s a lot easier to focus optimization efforts on this part of the code.
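One plausible way to combine the two data points mentioned above (throughput and per-endpoint latency) is to weight each endpoint by requests per second times average latency. The sketch below assumes that weighting; the endpoint names and numbers are invented for illustration and the book may calculate relevance differently.

```python
# Hypothetical endpoints: (requests per second, average latency in ms).
endpoints = {
    "/search": (900, 20),
    "/checkout": (80, 50),
    "/admin/report": (1, 2000),
    "/settings": (19, 100),
}


def time_share(stats):
    """Fraction of total processing time spent in each endpoint."""
    cost = {name: rps * latency for name, (rps, latency) in stats.items()}
    total = sum(cost.values())
    return {name: c / total for name, c in cost.items()}


shares = time_share(endpoints)

# The endpoint with the largest share is the candidate hot-path.
hot_path = max(shares, key=shares.get)
```

In this made-up data set the slowest endpoint (`/admin/report`) is not the hot-path, because it is called so rarely. The fast but heavily used `/search` endpoint dominates the total time, so that is where optimization effort pays off first.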
[00:15:43] BH: So once you’ve identified this hot-path and you created the opportunity for some isolation and generally like you’ve gone through the framework you’re proposing, when it comes to optimization at that point, in the optimization process, what are the trade-offs you have to balance in terms of maintainability versus optimization once you’ve identified isolation? So that’s probably the first step ensuring that the optimizations you made can be done in an isolated enough way and structured well enough. But when it comes to doing the work, how do you then ensure the trade-off between quality, maintainability and optimization and performance?
[00:16:29] TL: That’s a good question. So once you detect this hot-path in your code, depending on your guidelines, for example if you detect code that is touched by every user, it means that you need to focus your efforts there, right? And maybe in such a case you can sacrifice some code readability for the performance of the code, because often when you are highly optimizing code, it stops being so readable. Maybe you need to reduce the usage of some patterns and so on. But you have hard data justifying your decision that this is the part of the code where we should favor performance over readability, maybe even over maintainability, because we are improving it for all of the users.
[00:17:17] BH: And this might be a time when you actually create duplication even when don’t-repeat-yourself practices were already in place, because maybe you might come across a shared implementation of a solution that’s used across the application. But in this case, it’s worthwhile to rewrite that shared code in such a way that it’s going to have a meaningful performance impact.
[00:17:42] TL: Maybe yes. But I would say there are slightly different trade-offs here, for example the cost that you can spend there, right? It’s development cost versus improvement of the performance. What’s also interesting is that when you focus on improving the hot-path and you improve it by, let’s say, 10%, 20%, and so on, there may be a point in time when optimizing the hot-path no longer gives enough overall benefit to justify changing it. Maybe then you can focus on those paths that are executed less often, like some initialization logic that takes a lot of time but is invoked rarely, like one request per, I don’t know, minute or hour. But as I said, you need to have this data regarding the number of requests, the throughput, and also the latency. You can base it on average latency, but also on your outliers, how many outliers you have. You can find them based on higher percentiles, like the 99th percentile.
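To see why higher percentiles reveal outliers that an average hides, here is a small sketch using the nearest-rank definition of a percentile. The latency samples are invented for illustration.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow outliers.
latencies = [10] * 98 + [1000] * 2

mean = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Here the median stays at the fast value and the mean moves only modestly, while the 99th percentile lands directly on the slow requests. That is why tail percentiles are a useful complement to average latency.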
[00:18:51] BH: Are there any other chapters in the book that you find yourself wanting to talk to people more about that come up a lot?
[00:19:01] TL: Yeah. So I think Chapters 10 and 11, and the reason is that they focus on distributed systems and the trade-offs in them. Nowadays, I think almost every engineer needs to tackle some kind of distributed system, to interact with such a system, call such a system, or design one.
[00:19:22] BH: So Chapter 10 is Consistency and Atomicity in Distributed System and Chapter 11 is Dealing with Delivery Semantics in Distributed Systems. So before we get into this, can we even just briefly define distributed systems?
[00:19:41] TL: Yeah. So I can define it as a system where you need to call something that can fail, and you are calling it maybe over the network.
[00:19:51] BH: That’s very concise and well put. So consistency and atomicity. So what do you mean when you even use those words in this context?
[00:20:02] TL: Those words may be complex. I mean, someone can be discouraged by them. That’s why in this chapter I’m trying to build on a very simple example. Assume that you have one service and you need to call a second API, and that’s it. We have the HTTP protocol. You are executing an HTTP call. And what happens when the call fails? Well, the call failed and you need to make a decision. In that moment, you have a distributed system problem. Why? Because when you are calling from one service to the other, the call may succeed, but when the response was coming back to your system, there could be a network partition, a network failure, and the response didn’t arrive at your system. And what does it mean? It means that from the perspective of the second service, to which you are sending the request, everything went well, but from the perspective of the caller, there was a failure. In that scenario, we have an inconsistent view of the system, although the data may be good, and we need to make a decision on what to do at the caller’s side. Once [INAUDIBLE 00:21:14] retry the request, but you need to have an idempotent operation, meaning that you can retry it safely without any problems. All data retrieval operations should be idempotent. A delete operation by a primary key that’s unique should also be idempotent, because it doesn’t matter if you did it once or two times. It always will be deleted. But there are also other operations that may not be idempotent. And in that situation, it may not be so easy to just retry, for example if the second system is sending an email based on this notification. This is connected with the next chapter, about delivery semantics.
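The lost-response scenario described above can be simulated in a few lines. This is a toy, in-process sketch: `Service`, `LostResponse`, and `call_with_retry` are hypothetical names, and the "network failure" is simulated by raising an exception after the work has already been done.

```python
class LostResponse(Exception):
    """The callee did the work, but the reply never reached the caller."""


class Service:
    def __init__(self):
        self.rows = {"user-1": "Alice"}
        self.emails_sent = 0
        self._drop_next_reply = True  # simulate one lost response

    def _maybe_drop_reply(self):
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise LostResponse()

    def delete(self, key):
        self.rows.pop(key, None)  # idempotent: repeating it changes nothing
        self._maybe_drop_reply()

    def send_email(self):
        self.emails_sent += 1     # not idempotent: every call has an effect
        self._maybe_drop_reply()


def call_with_retry(operation, attempts=3):
    """Retry on a lost response; the caller cannot tell whether the work ran."""
    for _ in range(attempts):
        try:
            return operation()
        except LostResponse:
            continue
    raise LostResponse()


delete_service = Service()
call_with_retry(lambda: delete_service.delete("user-1"))

email_service = Service()
call_with_retry(email_service.send_email)
```

Retrying the delete is harmless, since the row is simply gone. Retrying the email sends it twice, because the first attempt succeeded on the callee's side even though the caller saw a failure.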
This behavior, that you are retrying, also means that your system is basically providing an at-least-once guarantee on this code path, because the receiver application can receive the message once or more times, depending on the retry behavior, but it will be delivered eventually, assuming that the network will work again. On the other hand, you could, at the caller side, not retry such a request, but in that case you are risking a different situation. When you are calling initially, the failure may happen not when the response comes back to the first application but on the request itself. And in that case, you have, again, an inconsistent system. If you are not retrying, the receiver application may not receive it at all. So you have at-most-once: it could be received zero times or once at most. And there is, like, the golden solution for some people, which is effectively exactly-once: trying to implement the system in a way that duplication will not occur at some level. It could be achieved using some kind of global locks on the system, but that’s not performant. It could be assisted by some deduplication mechanism, so deduplicating those retries based on a specific ID, but that also has some problems. The deduplication mechanism often needs to be persistent, because the retry could arrive after one hour, depending on the system. There are some [INAUDIBLE 00:23:39] systems that would retry even after quite a long time if you have a backoff strategy and so on. In that case, you need to persist the information about processed events for the deduplication mechanism, and when persistence comes in, you have another distributed call, right? So you have the same problems, and you also need to solve them. Ideally, you would like to have some kind of atomic way to modify the structure in the database. And some databases provide such a capability.
It’s called lightweight transactions, but also you need to be aware of those mechanisms and this API and those problems.
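A deduplication mechanism keyed on a request ID can be sketched as follows. This is a single-process toy under stated assumptions: `Receiver` and `handle` are hypothetical names, and the in-memory set stands in for the persistent store the speaker describes.

```python
class Receiver:
    """Toy receiver that deduplicates retried requests by ID."""

    def __init__(self):
        # Stand-in for a persistent store of processed request IDs.
        self.processed = set()
        self.emails_sent = 0

    def handle(self, request_id: str) -> str:
        # In a real distributed setup, this check and the insert below
        # must be a single atomic operation in the database. Here they
        # are two steps, which is only safe in a single thread.
        if request_id in self.processed:
            return "duplicate"
        self.processed.add(request_id)
        self.emails_sent += 1  # the side effect we want to happen once
        return "sent"


receiver = Receiver()
first = receiver.handle("req-42")
retry = receiver.handle("req-42")  # same ID delivered again by a retry
```

The atomic "record this ID only if it is not already there" step is what a conditional write provides. In Cassandra's CQL, lightweight transactions express it with an `IF NOT EXISTS` clause on the insert.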
[00:24:38] BH: The final chapter in the book is called, “Keeping up to date with trends versus the cost of maintenance of your code.” This one seems like a good final chapter, has a lot to do with where you’re spending your time, even the trade-offs in your own learning, but then obviously rewriting code or reimplementation. Can you get into the lessons you’re trying to promote in this chapter?
[00:25:07] TL: Yeah. So I think every couple of months or every couple of years, there are new technologies, new opportunities that gain a lot of traction, like microservice architecture, which was introduced years ago, reactive programming, some specific patterns like functional or [INAUDIBLE 00:25:25] programming, dependency injection as a pattern, and so on. And there is a moment when everyone wants to use it, regardless of the context in which people operate, and I’m arguing that this is a mistake in most cases, because every new technology comes with a cost. Sometimes the cost is hidden because it’s not tested by everyone yet, or there are some inherent complexities, for example in reactive processing. This processing is totally different from imperative processing, and you can reason about your code more easily if you learn this model first. Picking reactive processing patterns may add complexity, maintenance cost, and even bugs, and if you do not benefit from the pros of such an approach, maybe it’s not worth adding it to your codebase, because you also need to understand the threading model of the solution that you picked, the ecosystem, and so on.
[00:26:29] BH: What about the trade-offs in your own process and career between staying focused on the problem at hand and writing a book, taking a step back to take the time to introspect and create some of these rules? What about your own trade-offs in even deciding to do this? Like if I want to write a book, I feel like even the decision is a trade-off between where I want to spend my time and even where I want to deepen my expertise. And maybe where I’m getting with this in some capacity is that you’re most likely going to be writing a book about things you deeply understand in the moment. So there’s maybe a trade-off versus like expanding to a new technology. So while you were writing this book, you were probably not learning a new paradigm, which could be interesting or valuable. What was that process like as a sort of framework-oriented thinker in that regard?
[00:27:32] TL: So from my experience, when you are trying to write about some topic, you need to express your thinking in a different way, and it lets you understand things in a different way and sometimes even notice something that was unnoticeable before. That’s also why engineers, when they are solving some problem, sometimes want to talk with another person. Even if the other person does not respond, just asking the question will sometimes allow you to solve the problem. I have the same experience when I’m doing workshops on specific technologies. I think in an average workshop there are always one or two questions that allow me to expand my knowledge. In the process of defining what should be written, I also needed to learn new things.
[00:28:23] BH: Do you consider any of the chapters to be controversial at all? Any major pushback on the premise you’re putting out there with any of this?
[00:28:32] TL: I think maybe the second one, a bit. I’m proposing an approach to actually calculate the cost of the duplication that comes from the fact that you need additional synchronization once the code is shared between people. I’m adapting Amdahl’s law for that and doing some simple calculations. But I don’t think so.
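For reference, the standard form of Amdahl's law is S(N) = 1 / ((1 - P) + P / N), where P is the fraction of work that can proceed in parallel and N is the number of workers. The sketch below computes only that standard formula. Reading "workers" as teams and the non-parallel fraction as synchronization on shared code is an assumption made for illustration; the exact adaptation used in the book is not described in this conversation.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Standard Amdahl's law: S(N) = 1 / ((1 - P) + P / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed numbers: 80% of the work is independent, 20% needs
# synchronization (for example, coordinating changes to shared code).
two_teams = amdahl_speedup(0.8, 2)
ten_teams = amdahl_speedup(0.8, 10)

# Upper bound as the number of workers grows without limit.
ceiling = 1.0 / (1.0 - 0.8)
```

Under these assumed numbers, going from two to ten teams raises the speedup from about 1.67 to about 3.57, and no number of teams can exceed 5. The part that requires synchronization sets the ceiling.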
[00:28:56] BH: So like the book's place in the ecosystem is not necessarily to tell the world something they’re not ready to accept, but to sort of explain it in such a way that is helpful to the individual?
[00:29:12] TL: I think there is a stage in a career when everyone wants to jump to the next level or to something new, or to, I don’t know, work at a bigger scale, on different problems, maybe combine [INAUDIBLE 00:29:24] complex technical problems, technology problems. Then such a person will need to learn about it, and I think books are good resources for that, for sure.
[00:29:38] BH: Thanks so much for joining us today.
[00:29:41] TL: Thank you also for inviting me.
[00:29:52] BH: This show is produced and mixed by Levi Sharpe. Editorial oversight by Jess Lee, Peter Frank, and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, email [email protected] and make sure to join us for our DevDiscuss Twitter chats every Tuesday at 9:00 PM US Eastern Time. Or if you want to start your own discussion, write a post on DEV using the #discuss. Please rate and subscribe to this show on Apple Podcasts.