Think fast, not furious.
In this episode, we talk about web performance with Todd Underwood, senior director of engineering and SRE at Google.
Ben Halpern is co-founder and webmaster of DEV/Forem.
Jamie Gaskins is principal site reliability engineer at Forem.
Todd Underwood is a director at Google. He leads machine learning for site reliability engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every product area at Google. He is also the engineering site lead for Google’s Pittsburgh office.
[00:00:00] TU: We often subdivide these things and say, “I got to do this part this fast.” You’re like, “No, the only thing you have to do is you have to meet your user’s expectations and you have to meet those users’ expectations all the way at the end.” Right? Like the thing that they get.
[00:00:24] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all of our lives as developers. I’m Ben Halpern, a Co-Founder of Forem.
[00:00:31] JG: I’m Jamie Gaskins, Principal Site Reliability Engineer at Forem. And today, we’re talking about web performance with Todd Underwood, Senior Director of Engineering and SRE at Google. Thank you for joining us.
[00:00:40] TU: Yeah, thanks for having me.
[00:00:41] BH: So today’s episode is all about performance and I think really about performance within the context of site reliability engineering. We have Jamie from our team. We have Todd from Google. And before we get started, Todd, can we talk about your career and your background and how you got to where you are?
[00:01:00] TU: So my background is like most people who are of a certain age. I got hired into an ISP that I had no business working in because I didn’t really know what an ISP was, but that’s okay because nobody else did either. We made it up as we went along. I did a bunch of systems work, a bunch of networking work, a certain amount of software development and security, got puzzled why these things were separate. They didn’t seem like they should be separate, but everybody in them was convinced that they were completely separate disciplines. So I found that confusing. When I moved from New Mexico to New Hampshire, I started working at Renesys, which is this company that did internet analytics, and it was a really interesting way to get in on the rapid growth period of the internet and understand how things were interconnecting. I came to Google a long time ago, 2009, and have been doing mostly machine learning work at Google, so back before other people were talking about it. So I started the first SRE team for our machine learning teams here and I’ve been growing those ever since. So the first one we started, not surprisingly because it’s Google, was an advertising machine learning platform team, because advertising makes some money and we wanted to target those ads well so that they were useful and valuable. Since then, I’ve sort of expanded that to a team that works on machine learning and other sorts of software related to ML for the whole company. That’s it in a nutshell. In a side job, I also am the engineering site lead for the Google Pittsburgh Office. Pittsburgh is a wacky, interesting, cool place to live. So feel free to move here if you want to. I think we have some cheap housing.
[00:02:26] JG: At what point did you start focusing on performance? At Google or even before?
[00:02:31] TU: What’s interesting about performance is I think we set it aside as this extra characteristic, but it’s really not. So one of the things that happens when I see people talking about performance, they say like, “We’re given some idea from someone else. We’re trying to achieve that thing.” And that’s fine. There’s nothing in principle wrong with that. But what I see is when that’s divorced from the whole point of the service you’re building or the offering that you have, it doesn’t make any sense. I can give lots of examples where people say like, “I need this X thing to be Y fast.” You’re like, “Okay, cool. Tell me what happens when it doesn’t do that.” And they say like, “Oh, well, nothing in particular.” You’re like, “Okay, well then I don’t think you actually need it to be Y fast. Maybe you need it to be 2Y fast or 10Y fast.” So often I’ll see performance ideas when people focus on it as a standalone aspect, I see that as divorced from the whole point of the service, whereas the reality, like one of the things that I found fascinating at Google is we bake these things into service level objectives, and frequently performance is part of another reliability service level objective. So I’ll give a concrete example. One of the tricks that I saw the ads team use a decade or two ago was they would talk about, so you have this problem if you don’t serve the ad, you don’t make the money. So you want to serve the ads, but how slow is too slow to serve the ads? And I think we can all do this reductio ad absurdum of, “Well, what happens if I serve the ad tomorrow?” Well, that can’t count, right? Like tomorrow, you can’t send me an ad tomorrow for a web click I just made. That doesn’t work. What if it’s in 10 seconds? No, 10 seconds does not count. What if it’s in 40 milliseconds? Absolutely counts. 
I’m like, “Okay, we’ve bracketed it. Somewhere between 40 milliseconds and 10 seconds is this line where we stopped counting it.” And so what they did is they would put the line into the availability metrics. So then you say, “Is that a performance metric?” And they would say, “Look, we count five nines or four nines and a half or something of availability at an X 100-millisecond deadline.” And then you say like, “Oh, those are baked into one metric, but the metric is tied to what is acceptable to the end users, what is going to count as performing the needs of that service.” So that’s for me, I don’t enjoy the people who fetishize performance for performance’s sake, and I’ve seen plenty of bad examples of that. And so I’m much more interested in baking it into the purpose of the service itself. I hope that didn’t derail the whole conversation…
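The idea of folding a latency deadline into an availability metric can be sketched in a few lines. This is a minimal illustration, not Google's actual metric: the `Request` type, the function name `availability_at_deadline`, and the 100-millisecond deadline in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    ok: bool           # did the request succeed at all?
    latency_ms: float  # how long the user waited for the response


def availability_at_deadline(requests, deadline_ms):
    """Fraction of requests that succeeded AND finished within the deadline.

    A response that arrives after the deadline is counted the same as a
    failure, so slowness and errors end up in one number.
    """
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= deadline_ms)
    return good / len(requests)


# Four sample requests: two fast successes, one slow success, one fast failure.
sample = [
    Request(ok=True, latency_ms=40),
    Request(ok=True, latency_ms=250),
    Request(ok=False, latency_ms=30),
    Request(ok=True, latency_ms=90),
]
```

With a 100 ms deadline, `availability_at_deadline(sample, 100)` counts only the 40 ms and 90 ms requests as good, so the slow success hurts the metric exactly as much as the outright failure does.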
[00:04:57] JG: No. Actually, I thought that was a really good way to talk about it because even before I ever heard the term site reliability engineering, I heard a lot of discussions about it and I participated in a lot of discussions about performance budgeting. So if you have some performance budget of X number of milliseconds or even X number of seconds, when you take it out of site reliability engineering, performance can often be a feature, especially on something you’re making money on. If something takes too long, people are going to close the tab. You’re not going to make the money, similar to how you were talking about with the ads. Right? You have a very low number of milliseconds to get that ad, especially when you’re talking about header bidding and things like that, because you’re racing against everybody else as well.
[00:05:35] TU: Yeah.
[00:05:35] JG: I’ve never actually had to implement that, but I imagine that the performance requirements on that are much tighter than your typical web page.
[00:05:43] TU: Yeah, I think they are. But I also think like what’s tricky is it’s tied into expectations and it’s tied into user behavior and culture and those change over time in different contexts. So I’ll give it like an example from close to home. If you think about Google Search, right? So if you used Google Search 10 or 15 years ago, it was 10 blue links, right? You did a thing, you got 10 links and they were links to web pages. There was really nothing else on the page. Some people pine for that, but what people don’t think about is the other information that Google delivers now without making you click on any page. So when you say like, “Hey, what’s the capital of Ethiopia?” Or you say, “When did the Spanish-American War end?” Google will just tell you. You don’t have to click anything. You have web pages you can click on to get more context, but the answer’s right there. And so what’s interesting is there’s this question of, “If the first page, if the 10 blue links I could serve in 200 milliseconds, but the answer I can serve in 500 milliseconds, is the answer slower?” Well, in the first case you had to open the page, look at it with your slow human eyes made of meat, trying to figure out what’s happening on there, click on a thing, wait for that page to load, to find the answer, which is the Spanish-American War was in 1898. Right? So that takes a while. That’s not that fast. Whereas like if you take two and a half times as long to load the page, you just get the answer right away. I often think of that case, not because I think there’s an obvious answer, like I think most users within limits prefer the second to the first, which is why Google and other people have headed in that direction, but I just think it’s an interesting way to think about it. 
We often subdivide these things and say, “I got to do this part this fast.” You’re like, “No, the only thing you have to do is you have to meet your user’s expectations and you have to meet those users’ expectations all the way at the end.” Right? Like the thing that they get. You can divide it however you want and measure it in little slices, but we shouldn’t kid ourselves at pretending that those little slices are the truth or necessary or absolute. Those are just like monitoring and analysis artifacts while we’re trying to achieve this objective, which is to meet our users’ expectations.
[00:07:54] JG: Right, the whole cumulative performance, like the end to end.
[00:07:56] TU: That’s right.
[00:07:57] JG: What happens when I click a link or type google.com in my browser?
[00:08:01] TU: That’s right. And there are some of them that are super crispy and clean, but there’s more of them recently, like even if you say, like, “I type google.com into my browser,” and you’re like, “Yeah, okay, cool, but what were you trying to do?” You’re like, “Well, actually I was trying to send email.” Well, maybe there’s a way to just get you to your mail client faster than that. You know what I’m saying? So at every given point, just when you think you’re like, “Look, there’s a simple thing and I’m measuring it and I’m doing it.” And ultimately, we’re trying to make lots of things easier for lots of people and some of these tasks should go away. So the next step would be, you’re trying to send an email, but why are you trying to send an email? Because my dad wrote to me about this thing and I had to give an answer. What if I could just have automatically answered that for you? Oh, well then there’s no web page load. There’s no client load. There’s no mail message. There’s no typing this because you already answered it for me. There’s no way to measure that. But anyway, sorry, I’m digressing, but I do think there’s some really interesting points about like trying to actually meet users’ expectations at a deeper sense.
[00:08:59] BH: How does Google actually go about embedding some of these ideas into onboarding or team dialogue? And how do you even speak to the end users’ needs if maybe your work is lower level or further from the end user? How do those things go together? If you have an SLA and it’s sort of abstract, but maybe you’re talking about the end user, how does that dialogue come together in decision-making?
[00:09:29] TU: I actually think it’s really difficult. I’ll give a concrete example. A lot of engineers early in Google were working on AdWords and working on like the Google advertising system. Well, you would go talk to those people and you say like, “Hey, how many ad campaigns have you run?” And there’s not a single engineer who’d ever placed an AdWords ad. They didn’t know what a campaign was. They didn’t know how to type a creative. They didn’t know why something worked or why it didn’t. If you ask people like, “Hey, roughly, how much does a click cost? Is it like a 10th of a cent, a cent, 10 cents, a dollar, $10?” People would just guess. They’ve no idea. Right? And so there’s this interesting complete separation between the people implementing these things and the people using them. And actually, I think that’s not just early Google working on ads, that’s across the industry. Most of the people working at Salesforce who are working on customer relationship management systems, they’ve never dealt with a customer in their whole life. They’re software developers. They hope never to talk to other people. It’s just like we don’t. Right? There’s a big gap here that’s really tricky. So I think the most important thing to acknowledge is it’s difficult. It’s really, really hard, and it’s not a thing you just do. It’s a thing you have to keep doing. I’ll tell you what Google does. We try to bake some of that function into the product management function because we really think of product managers as specifying what it counts to be a successful version of this product. So if you tell me, “Make a web server.” I’m like, “Okay, cool. Does it have to be like HTTP 3.0 compliant or can I ditch all of that? Does it have to be fast? Does it have to be certain kinds of secure? Give me some requirements about what it has to do. Does it have to scale really well? Does it have to be super memory light? 
What are you expecting from that?” Well, the product managers are supposed to specify that for us and then do that on an ongoing basis by talking to customers. There’s another thing we do. You had asked about the sort of lower-level engineers or the more focused engineers who are working on a particular feature. I think there’s as much of a problem in management and like leadership as well, that if you look at big chunks of not just Google, but the industry who work on cloud services or work on enterprise software, it is led by people who have never consumed cloud services as a customer or never consumed enterprise software as a customer. And so really trying to deeply understand with some empathy what those other people are going through is pretty tough. So I think for those, there’s two solutions. One is to really make those products available for free to internal users and let them play with them. It’s not a substitute, but at least it gives you, like you try to use something and you’re like, “This is just unusable. I can’t even figure out what to do. The thing throws errors. My config was wrong. I spent 17 hours and I didn’t even get hello world up,” whatever, like you get these experiences. And the other is to really have more extensive engagements between leadership and large customers where large customers on an ongoing basis, not once, not on one project, but like every week, every month, every quarter come back and say like, “Hey, we’re doing this implementation. It’s not going well. Let’s talk through why this has been hard. Let’s talk through why you think your service is fantastic, but I think your service is terrible,” and both of us could be right. That’s what’s frustrating is I can have a service that’s amazing and a customer is using it in a way that might be perfectly reasonable, but it’s terrible. So I have some service that returns single records pretty fast and a different API call to return thousands of records. 
And the customers put it in a tight loop and they’re returning single records one after the other and it’s pretty slow. You're like, “Your service is slow. Don’t use it that way. Use it this other way.” And they’re like, “Well, we didn’t know about that or we didn’t implement it that way.” That’s a simplistic example, but there’s lots of examples like that. So those are the techniques I know, but I think then it’s a good question. It’s really hard to have that level of understanding and empathy.
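The single-record-in-a-tight-loop pattern described here can be sketched with a toy service that counts round trips instead of measuring real network time. Everything in this sketch is hypothetical: `RecordService`, `get_record`, `get_records`, and the two fetch helpers are illustrative names, not any real API.

```python
class RecordService:
    """Hypothetical in-memory stand-in for a remote record service.

    It counts round trips, since on a real network each call would pay
    its own request/response overhead.
    """

    def __init__(self, records):
        self._records = dict(records)
        self.round_trips = 0

    def get_record(self, key):
        """Single-record call: one round trip per record."""
        self.round_trips += 1
        return self._records[key]

    def get_records(self, keys):
        """Batch call: one round trip for many records."""
        self.round_trips += 1
        return [self._records[k] for k in keys]


def fetch_in_loop(service, keys):
    """What the customer did: the single-record call in a tight loop."""
    return [service.get_record(k) for k in keys]


def fetch_in_batch(service, keys):
    """What the service owner intended: one bulk call."""
    return service.get_records(keys)


keys = list(range(1000))
loop_service = RecordService({k: f"record-{k}" for k in keys})
batch_service = RecordService({k: f"record-{k}" for k in keys})
loop_result = fetch_in_loop(loop_service, keys)
batch_result = fetch_in_batch(batch_service, keys)
```

Both calls return the same records, but the loop pays for 1000 round trips and the batch call pays for one. Neither side is wrong about what they observed, which is the point of the anecdote.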
[00:13:34] BH: What would you say are the important moments in the history of performance on the web, maybe from your perspective? We didn’t necessarily bring you on to be a web historian, but the Google we’re talking about today is different from the Google as it launched. Although I think performance was a big part of that product forever and probably why it’s still so important today. How have our ideas had to shift about what performance even means as a developer who’s been around a while?
[00:14:07] TU: The analogy I’ll use of what I’ve seen in my time in the industry and mostly just to clarify, you’re right, I am not a web historian. I have grown up with this stuff and helped build some of the systems and network side of it, but I’ve definitely not studied the history of it at all. But I think like the analogy I’ll use is that similar to Gould’s punctuated equilibrium, as you think about evolution, that you don’t have a smooth improvement of performance. What you usually have is an accepted standard and then someone who leapfrogs, someone who does something massively better. Everyone else freaks out because they thought like, “I didn’t think that was a reasonable thing to do. I might not have thought it was possible, but I knew it wasn’t reasonable and now somebody else, one of my competitors or some new upstart just did it. So now I have to do that too.” And then the whole industry gets better. So I think for me, there’s all kinds of moments like that where I think the way Google did really distributed computing on a bunch of crappy computers, like most people before that were like, “I don’t think you’re going to get the system software right.” Like, “I just don’t think you’re going to get the message passing, the RPC stuff.” I mean, going back into the dawn of time, we thought SOAP was an improvement. We were like, “SOAP, we’re going to use SOAP for a message.” I mean, it was terrible. Marshaling like API calls into like these huge XML documents and throwing them around like it was an RPC system. And so when Google came out with this Protobuf thing, a lot of people were like, “What is that? That seems fascinating. That seems utterly unlikely to succeed. It’s not enterprise quality.” And then Google was like, “I know, yeah, but we can run a search query on a thousand computers all at once and return results in like under half a second or in a few hundred milliseconds.” And like a lot of us, like I was not at Google at the time, freaked out. 
We’re like, “Really? That seems both impossible and completely amazing.” And I think there were other examples like that. I mean, I’ll give a good early example, web serving backend, like I don’t know if you remember when the sendfile API call came out, but like web serving in the late ’90s, early 2000s, there were early versions of web servers. And Microsoft was just coming out with IIS and it had like NT 3.5 and NT 4.0. And NT 4.0 came out with a web server that was just massively faster than everybody else’s web server at static web serving. And then if you go back and read like the Linux kernel archives and the Apache mailing list, basically people say, “Well, they’re just cheating.” You’re like, “Okay, cool. How is it they’re cheating?” And they’re like, “Well, they made up a new system call that takes a file handle and just sends it to a socket without bouncing back through user space and that’s cheating.” Like, “Okay, that just sounds different. Is there a security vulnerability or problem?” They’re like, “Well, no, not really because we already have the file. We’re just sending it to the socket.” You’re like, “Okay. So instead of cheating, could we say hyper optimized and massively better than what we do?” And in the end, that’s what happened, right? Like Linux experimented with an in-kernel web server, which is madness. That’s not something you want to do, but then what they did from that is they took out a few of the things, like doing zero-copy sendfile so that you basically just have a file and you send it to a network socket, and that changed everything about web serving. I think there were some other moments, encryption, when we all decided we wanted things to start to be secure, but we’re like, “Actually, CPUs don’t do that that well,” and we’re all buying accelerator cards for encryption and stuff. So I think there were these moments where things got a lot better. 
And what I would say is I’m as skeptical and cranky as the next person, possibly more so since it’s in my job description, but I think it’s important to watch the people who are doing something wacky, but something that’s much, much better. It’s not important because they’re going to succeed because they’ll be amazing or something, but it is important because they will inspire the rest of us either with their example or their bad example, but they will inspire the rest of us. So every time I see somebody doing something much bigger, much more ridiculous, much different, I try to pay enough attention to figure out what it is we should all be learning from it. So those are some of the examples that I know.
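The "take a file handle and send it to a socket" idea is still available to application code today. As a rough sketch, Python's standard `socket.sendfile()` method uses the operating system's `sendfile` call where it is available and falls back to ordinary `send()` otherwise. The helper names `send_file_to_socket` and `demo`, and the use of a local socket pair in place of a real client connection, are assumptions made for this example.

```python
import os
import socket
import tempfile


def send_file_to_socket(path, sock):
    """Send a whole file over a connected stream socket.

    socket.sendfile() hands the copy to the kernel via os.sendfile when
    the platform supports it, so the file bytes need not pass through a
    user-space buffer. Returns the number of bytes sent.
    """
    with open(path, "rb") as f:  # binary mode is required by sendfile()
        return sock.sendfile(f)


def demo():
    """Round-trip a small file through a local socket pair."""
    payload = b"hello, static file\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_to_socket(path, sender)
        sender.close()  # signal end-of-stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks), payload
    finally:
        receiver.close()
        sender.close()  # closing an already closed socket is harmless
        os.remove(path)


sent, received, payload = demo()
```

A conventional server would instead loop over `read()` and `send()`, copying each chunk into the process and back out again. The payload here is kept small so it fits in the socket buffer before anything reads from the other end.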
[00:18:37] JG: You touched on a little bit of this a little earlier, but at what point are you actually considering performance?
[00:18:43] TU: There’s the answer to that question and then there’s the answer to a variant of that question, which maybe that question is getting at. So the answer to that question for Google is we start right away. It’s right from the beginning. And in part that’s because we’re blessed/cursed with a very high profile and a very large amount of users and a very large amount of external connections and networks and data. It’s not freedom. It is a kind of set of requirements and responsibilities that you can’t just do a thing here. You can do a thing and it could be amazing, but you can’t just like, “I’m going to slap this together and see what happens,” because see what happens, like unsuccessful products at Google, get a few million monthly active users. And so like that is an unsuccessful product that is basically ignored by everyone. It’s like, “Oh, man! If people like this, I’m going to have tens or hundreds of millions of monthly active users.” Before I came here, we would measure things in like hits per day or queries per hour. And at Google, it’s usually measured in thousands or millions of queries per second. Sorry. There’s a device over there that heard me say Google. Are you listening to me? But the question I think you might’ve asked is like, “When should other people worry about performance?” And I actually think our industry worries about performance too soon.
[00:20:07] JG: Interesting.
[00:20:07] TU: Because there’s no point in having an amazing product that nobody wants to use because it doesn’t solve an interesting problem. If I’m staring at the early product implementation, I would have a fork in the road, not in the Yogi Berra sense of, “When you come to a fork in the road, take it.” But you stare at that and you say, “Look, on the left is the performance of the system is critical to whether my users will accept it, whether they will get value out of it, in which case I better think about this right now. And the right hand side is actually like people are willing to tolerate something slow or clunky or a little bit not stellar because of the new thing it might do and let me have a path to making it better in the future, but let me first figure out whether I’ve got something that people actually want.” There’s been a lot written about this, but I think like the essence of the argument was what we’re looking for is learning more about what our users want as fast as possible. So it’s not acquiring new users, it’s improving the value we’re delivering to them. So that’s what I would say, like, yeah, you worry about performance just as soon as you have to but not before that, because if you worry about it before that, it actually has a cost to it. I know this is a bummer, right? You guys are like doing this show on performance and I’m like, “Forget it. Don’t think about performance.” Sorry about that. But what I’m trying to do is say like it’s important when it’s important. There’s no substitute for it, like availability as a feature, speed as a feature, they’re frequently some of the most important features, but not always. Sometimes if I want a report, I’m like, “I got a big report,” and I’m like, “I don’t care if you do it for me in five minutes or five hours, I’m going to look at it tomorrow anyway.” And so like, “Okay, well, five hours is cheaper and easier than five minutes. 
So let me do that instead.” So I think that’s the way I think about it.
[00:21:51] JG: Most of my career, I’ve worked with startups, and startups tend to prioritize features above all else. And one of the things I’ve seen is that a lot of times engineers are so focused on building those features that either product people or salespeople are throwing their way when you’re doing sales-driven development and that performance is often neglected, at least at the companies that I’ve worked with throughout my career. Performance is neglected until a point where it’s like they can’t ignore it any longer.
[00:22:19] TU: Yeah.
[00:22:19] JG: Probably the first half of my career, the first 10 years or so, was a bunch of startups coming to me and going, “We paid somebody pennies an hour to build this out for us, but it’s too slow now. We’ve neglected performance too long. These features got too successful. We need to go faster.” And so it’s a flip side of the same coin, right? Like people that did forget about performance for way too long. And it’s probably different for every company where there’s like a sweet spot when you should probably start thinking about performance. Not necessarily trying to make everything fast, but at least thinking about what the performance requirements are and what they’re going to look like six months or a year from now.
[00:23:00] BH: And to speak to that from my perspective as the founder of our startup and sort of like the first person to write code for a while, I have my own ideas about the whole conversation, especially with performance. So early on, we wrote all of our admin dashboards in such a way where they had N+1 queries sometimes and sometimes they felt like really slow and we sort of knew the users who were using them were like sitting in the room with us when we were very first starting, like before we went remote and when we were just a few people, like admin dashboards are very, very slow. And I personally, as the one writing a lot of the code at the time, was really, really tolerant of slow admin dashboards, as long as people knew that like the way to use them is to open a tab and then go get coffee. As long as I knew the person who was using them, I knew how to engage with them in that way. But on the flip side, our user facing site, DEV, DEV.to, was literally like only about performance. And from a user experience perspective, I took the approach of really feeling like time was like the one thing to give people. And I actually got into this space having been kind of like adjacent to the New York City media tech scene and I thought there was like a lot of bad ideas about how to serve a web page in news media and stuff. These are the sort of practices that led to Facebook instant pages, accelerated mobile pages, and that was like sort of a coincidence, but a reaction to a lot of slow web stuff going on. That was actually like part of the original fabric of like my mental model for what mattered, which was on the backend, like utilitarianism was important. So we just splattered everything in there as needed. And on the front end it was about minimalism and a notion that we had to care for the user experience, but the only objective user experience thing that comes down to a number is, like, performance. And Google is a company that kind of preaches this sort of stuff. 
So when you’re talking about it, the best words to use are stuff that Chrome puts forward in their web tools and things like that. But yeah, that was the drum I beat. The only thing we can possibly measure from a UX standpoint is performance, assuming people want to start reading as fast as possible. And then a few other things were just kind of like random ideas. I had a thought that a lot of Stack Overflow’s success is that you sort of know where your eyes need to go on the page before you land there, and that’s like a sense of the user experience and that’s a huge part of performance. If I’m on Google first, and I’m looking at a bunch of links, I have trust in Stack Overflow from an expectation that I know where to put my eyes when I land there and I know the page is going to load fast and the average blog out there on the internet is not going to deliver that kind of trust to me and a lot of that stuff is based on performance. So we kind of baked that need for performance into the product early on. But I observed that like from a cultural perspective, it was hard to communicate stuff like, “Oh, I don’t care how slow the admin backend is,” but then it was also tough to kind of change people’s perspective or at least even change my own perspective when we started needing to. So we transitioned to Forem, which is our open source extracted version of DEV. So we sort of took all the performance stuff we’ve built and all the other features and we shipped them, so we want other people to use them as well. And now when the admins are outside of our walls, we need to deliver them a certain type of experience, and we can’t just trust that people are going to adopt this because they’re paid to adopt it by us. The admins don’t work for the company anymore. 
So that is a subtle shift where we didn’t necessarily need to build a new feature because the buttons were right, but the performance was no longer up to the needs now that we can no longer expect people to kind of just deal with a little bit of slowness. And then there’s also stuff, like we’ve been starting a few new projects, which do not have the same needs as our core project and we sort of need to teach people a little bit to remind them that they maybe actually don’t need to cache things in the same way they would. We’ll build an internal tool that serves a different type of user and stuff, and that’s sort of like the journey we’ve been on. But I think, of course, for someone like you, Jamie, like it’s probably a little easier to come in to a company that has this performance mindset baked in to some extent. I don’t know how much we even talk about these things, like the journey of admin versus the front end and some of those differences, but those ideas kind of get baked in and the nuance just has to come through. So from our startup perspective, like the topic of performance has always been a thing, but we’re a content platform. So our answer has to be yes to performance. But depending on where else we fit in the tooling spectrum, it’s probably a little bit more nuanced.
[00:28:45] TU: I think both of those are good points. So one of the points that Jamie was making was like when do you start thinking about performance? And I really didn’t answer that question. I answered when you start doing something, but I think you start thinking about it right away, right? Because I do think you need to know, in particular, whether you need to do something about it. But then Ben as you were talking, one of the thoughts I started having is there’s two aspects to getting your approach to any kind of service level objective right. And the first is, do you have the service level objective right? And the second is when you’re doing engineering, are you interpreting it, thinking about it? Are you just cargo culting the last thing you did? And so one of the things that I see us do is we just guess what people want. So like, “I don’t know what people want. I bet they want fast.” And that’s not a bad answer. In the end, we’re all working with less information than we’d like. Right? I think at some point we are all just guessing, especially if you’re doing something kind of new or you’re doing it for the first time or you’re doing it for a new group of people. You’re like, “I don’t know what they want.” This happens to me a lot in the machine learning space where what I see is like companies shipping stuff where they’re like, “We think you want this.” And customers are saying, “I don’t know what I want. Maybe I want that. Let me try it out and see.” And you’re like, “Wow! This is going to take us all a while because you don’t know what you want and I don’t either. So we’re just going to build stuff and we’re going to see how it works.” So that happens a lot. But I also see the other thing, which is we forget what we actually need and we just build the last thing again. And that’s what I heard you say, Ben, where you’re like, “I don’t know. We’ll just use that cache because we have that cache lying around.” You’re like, “Hold on a second. 
What are we trying to achieve again? Do we need to use that cache? Does that cache even make any sense? Maybe we should definitely not use that path.” A lot of us sort of get in these ruts of like just do the last thing. And so one of the things I think is useful, that’s been tough during the pandemic to have really any energy to do this, but the more you can get some perspective and take a step back and say, “Okay, maybe I won’t just do the thing in front of me. Maybe I’ll pause for a second and reflect on what do these people really want. Or if I think that’s what they want, what do I have to do? What might I do? What could I do next and really have a little bit more thoughtful and reflective approach to this?” And it seems exhausting because it is, but the more you can do this from first principles every time, the more likely you are to save engineering time later because you don’t do a bunch of stuff because it turns out that’s not important. You just do the stuff that’s important first, which, to your point, Jamie, could be performance, could be performance earlier, but it doesn’t necessarily have to be.
[00:31:19] JG: Ben, you were talking about performance in terms of things like latency to the end user, and especially when you’re a content platform. And one of the things that I typically think about, sometimes I forget about latency entirely and I’m thinking more along the lines of throughput. And sometimes the distinction between those two can be hard to grok because a lot of people think that throughput is basically just latency divided by N, right? Or N divided by latency. And at some level that’s probably right, but maybe we can talk about some of those distinctions between latency and throughput.
[00:31:55] TU: So when I did super-computing work, I think what we learned is that interleaving and pre-computation hide latency, that’s it. Right? So if you interleave your requests such that you’re always returning stuff, it’s just stuff from a while ago, but you have more stuff pending, you definitely can hide some of the perceived latency. But in the end, actually, I wanted to make a different observation, which is it’s about percentiles and it’s about actual user experience. So I ran Google’s payments system for a while and one of the things you learn is if you’re failing a hundred percent of the time for 0.01% of the users, that’s still pretty bad. So you can have three nines of everything perfect. But if the people who are having problems are almost always having problems, that’s really bad for them. Not to be too gruesome, but a medical risk is like this too. When people say like, “You’ve got an 80% chance of living five years after this particular diagnosis.” Well, that’s not what they need. You have a hundred percent chance of the thing that’s about to happen to you happening. We just don’t know which bucket you’re in. We just don’t know which experience you’re going to get. So when I think about the way we interact with applications, I often think like, yeah, looking at average cases and looking at aggregate cases. So in some ways what I think is when you look at throughput, you’re looking at aggregate latency, and that’s not actually what you want to do. What you want to do is look at the worst latency. So you want to say like, “Well, show me the 99th percentile latency, show me 95th, 99th, and 99.9th.” Because if I’m like, “Well, 5% of the time people wait 48 seconds to load my web page, but the average is like 20 milliseconds,” that’s still a pretty bad web page. Right? That’s not a good web page, even if the average looks really good. So that’s one thought I had about throughput versus latency.
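A minimal Python sketch of the percentile point above. The sample is made up to mirror the example (95% of loads at 20 milliseconds, 5% at 48 seconds), and the nearest-rank `percentile` helper is a hypothetical illustration, not something named in the conversation. With these particular numbers it is the median that stays at 20 milliseconds; the mean is pulled up to a couple of seconds, and only the high percentiles show the 48-second experience.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical page-load latencies in milliseconds: 95% fast, 5% very slow.
latencies_ms = sorted([20] * 950 + [48_000] * 50)

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms} median={median_ms} p95={p95_ms} p99={p99_ms} p99.9={p999_ms}")
```

The summary statistic you pick decides whether the slow 5% is visible at all, which is why the tail percentiles are reported next to the middle of the distribution.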
[00:33:47] BH: One thing your point, Todd, just made me think of is a very concrete little thing, something I’d bring up to Jamie if it were more urgent, but a thing I noticed about pages on our site, DEV. The way a typical user interacts with our site is that they interact with popular pages. So people getting pages in their feed are going to be interacting with pages that a lot of other people are getting in their feed, because what those pages have in common is that they’re recent. And for people coming from search to find content that isn’t recent, what they have in common with other folks is that they’re landing on pages that are high ranking. So they’re also going to land on pages that other folks are landing on. And the way that works out well on average for our platform is that that means max caching, because there are just far more reads than writes on that type of content, and that works really well on average. But we actually have noticeably worse performance on just any page that is old and not popular. If it hasn’t been accessed by anyone in the last 24 hours, your access isn’t going to be fast: not so slow that we really pay a lot of attention, but not fast enough to be really useful. And it’s the sort of thing that is addressable, but it just means like our baseline ways we deal with things on average just don’t quite apply there. But they could in different ways. So when I, as a user, started noticing this more, and I kind of like knew this was the reality of how the system worked, but I started noticing it more when I was playing around with a different search engine. So not even our internal search engine, but a non-Google Search engine, which generally works like Google. So I was playing with like a different type of software and searching for content in our ecosystem, and it was just returning results that weren’t the same ones Google was. 
So my experience just was a little different, coincidentally just a different page was ranked or hit more often and things like that.
[00:36:05] TU: Yeah.
[00:36:05] BH: That was like an experience where my takeaway, and it wasn’t like an emergency, but there actually are a lot of use cases on our pages where the average is great and the long tail or the long, long tail, like not even long tail like a couple of people read this every day, but basically zero people do, but then there’s a vicious cycle that those pages maybe crawlers, they’re slower for crawlers because not enough people see them. So because they’re slower for crawlers, they’re going to not rank as well and then there’s the cycle where less popular stuff is going to continue to be less popular and from certain things we care about as a business, like search engine optimization, it has an unknown effect of like how much does the crawler care that this like long tail of pages is slow and maybe they will actually never allow those to improve because they might crawl them again every few days and they’re the only visitor, that spider is the only visitor, so they’re always getting a cold version of the page. So from a concrete thing, affecting our work and our business, we have good average numbers on all of these fronts. And we don’t even have egregious worst case scenarios, but it’s like that performance in between those two areas, like it’s not so bad enough that we can pay attention to and the average is good enough that we might be able to sort of avoid it, but it’s just this scenario that just doesn’t get the same attention all the time.
[00:37:40] TU: Yeah. You know, it makes sense. I think it feels super common to me that lots of the systems lots of us have worked on have had like a fast path and a fallback path or like the cached version and like going to get the live version from the scrolls that we locked in the basement behind the vault that Geraldo opened or whatever, right? Yeah, there’s this slow version there. There’s a couple of techniques that I think might be helpful in that case that are good generally. One is what you started to do, and I think as we’re thinking through it, is decide, “Why do you care about those slower ones?” Like, “What are the use cases that matter?” You’re like, “Well, we care about robots. We might care about people seeing things for the first time.” You’re like, “Okay, well then like let’s measure those things.” Let’s measure like page abandonment rate or click-through rate on people who land on a slow page versus a fast page and let’s see like is it a lot higher. If it is a lot higher, like, okay, we’re burning new users because new users on average 83% of the time show up on a goofy page that nobody else has looked at, but they’re going to end up on the popular pages and also form part of our community and make new popular pages. So we want to keep them and we’re losing new users or you’re not. Like I don’t know, but I think that would be a thing to look at. The other thing I would say is I would measure those separately. Now this is one thing I tried to do. I would treat those like they’re two separate systems because they are. One of them is like sending out web pages basically from RAM or from memcache or something like that: pre-computed web pages, not generated. We’re not thinking about it. There’s no database fetch. This is a bundle of bytes we already had ready and we send it out on the network. And the other area is a complex multi-tier web application that was a hassle to put together and difficult to maintain and has all kinds of performance constraints. 
If you measure those separately, one, and then two, know why it matters in which case, you can look at that second one to make the rational decision: “Actually, we should not ignore this.” Or, I think it’s equally likely that your instinct is correct, in fact, that this is ignorable. You’re like, “Yeah, it’s not great, but it’s not worth five engineer months to fix.” Or, “It’s not worth a big effort or a bunch of hardware.” It would be nice, but lots of things would be nice, right? A pony and a unicorn would be nice, but not all of this happens. So those two techniques might help with figuring out whether that stuff is ignorable or not.
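One way to picture measuring the two paths separately is the Python sketch below. The class name `PathLatencyRecorder`, the path labels, and every latency value are hypothetical and chosen only for illustration. It records latency per serving path so the cached path and the slow path each get their own percentile instead of being blended into one number.

```python
import math
from collections import defaultdict


class PathLatencyRecorder:
    """Keeps latency samples per serving path so each path is judged on its own."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, path, latency_ms):
        self._samples[path].append(latency_ms)

    def percentile(self, path, p):
        values = sorted(self._samples.get(path, []))
        if not values:
            raise ValueError(f"no samples for path {path!r}")
        rank = max(1, math.ceil(p * len(values) / 100))  # nearest-rank method
        return values[rank - 1]

    def blended_percentile(self, p):
        values = sorted(v for samples in self._samples.values() for v in samples)
        if not values:
            raise ValueError("no samples")
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]


recorder = PathLatencyRecorder()
for _ in range(98):
    recorder.record("cached", 15)   # pre-computed page served from memory
recorder.record("uncached", 850)    # full render through the application
recorder.record("uncached", 900)

blended_p95 = recorder.blended_percentile(95)
cached_p95 = recorder.percentile("cached", 95)
uncached_p95 = recorder.percentile("uncached", 95)
print(blended_p95, cached_p95, uncached_p95)
```

In this made-up data the blended 95th percentile looks identical to the cached path, so the slow path only becomes visible once it is measured as its own system.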
[00:40:03] JG: Absolutely. And that’s actually a really good point about talking about the trade-offs that you make for any sort of performance. What are some of the trade-offs that we ended up making in order to make this page serve 20 milliseconds faster or 100 milliseconds faster?
[00:40:17] TU: Yeah. There was a funny time I ran into that when I was working on a payment system and someone was like, “What are our plans to get this payment system to five nines of reliability on a transaction basis?” I was like, “I got no plans. I don’t think anybody runs a payment system that works like that.” I’m not saying we can’t do it. We definitely can do it. It’s going to cost tens of millions of dollars. After rewriting things from scratch, we got to start from first principles. We’re going to have to do some basic research. And they were like, “Oh, okay, never mind.” I’m like, “Good. That’s okay.” It’s actually okay. If somebody says like, “What’s it going to take to get this unicorn?” And you’re like, “I don’t know. Give me a penny and I’ll give it to you.” Well, then you got a unicorn for a penny. That’s amazing. It’s okay to ask. As SREs, as developers, as curators of performance and of our systems, we shouldn’t be offended when people ask. We should do our best to answer the question because it’s fun to work on this stuff when it makes sense, but it’s pretty disappointing to work on this stuff when it gets canceled in the middle because people didn’t realize how expensive it was and actually were never committed to it. So it’s really good to have those conversations that are like, “Hey, it’s going to be a big effort. If you think it’s worth it, I’m down. I’m going to do it. This is going to be amazing. But if you think this is kind of a lark and I’m going to do it on a Tuesday before lunch, this is not what this is. This is a big thing.” And just making sure we have those conversations between sort of business and product leaders and engineers and reliability engineers is like, “Let’s make sure we’re on the same page of the kind of thing we want to accomplish, what it’s going to do for us, and what it’s going to cost.” Because everything we do in our organizations has these trade-offs and everything we do in life has trade-offs. 
So now we’re back doing philosophy. I don’t know.
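For a sense of what a per-transaction reliability target in nines means, here is a small arithmetic sketch in Python. The transaction volume is a made-up figure, and `allowed_failures` is a hypothetical helper written for this illustration.

```python
def allowed_failures(nines: int, transactions: int) -> float:
    """Failures permitted out of `transactions` at a success target of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return transactions / 10 ** nines


monthly_transactions = 1_000_000  # hypothetical volume, for illustration only
budget = {n: allowed_failures(n, monthly_transactions) for n in (3, 4, 5)}

for n, failures in budget.items():
    print(f"{n} nines: at most {failures:.0f} failed transactions per {monthly_transactions:,}")
```

Each additional nine cuts the failure budget by a factor of ten, which is part of why the cost of the next nine tends to be a conversation worth having up front.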
[00:42:16] BH: Are there any types of problems you’ve observed in your career that transitioned from being the type where you have to tell your boss or the client or some ambitious product visionary that this is impossible, you can’t get this many nines? Has there been a problem that transitioned from that type of instinctive reaction to one you were actually able to deliver, where over time you had to reevaluate how hard that was because things change? For example, just think of something in computer vision where not too long ago that was an impossible problem and now it’s like table stakes, like not from scratch, but you download a library and it can recognize a butterfly. What are the types of problems in your space and career that have transitioned from seemingly kind of impossible or not worth attempting to actually not so bad and there’s a library for that?
[00:43:12] TU: I mean, so working at Google, not to give Google too much credit, but that kept happening to me when I got here. It was like, “Oh, you can’t do that.” I’m like, “But we already did it.” They’re like, “Oh!” There are things that happened when I got here where I was like, “We’re going to need, I don’t know, it’s going to be like a terabyte of data. We’ll have to store it somewhere.” And people are like, “Did you mean a petabyte or a terabyte?” Like, “A terabyte.” Like, “Okay. Did you mean RAM or DIMMs?” It was like, “All right. I need to go educate myself a little bit.” I was like, “You have a terabyte of RAM?” They’re like, “Well, we do, but only in a few data centers for our user right now.” It’s like, “All right, I’m going to come back when I know what’s going on.” So I definitely had some of those right at the beginning where I was like, “I don’t understand the scale of what you guys just did.” And then of course I work in machine learning. So your computer vision example but times a hundred, like when I look at what we’ve done with language understanding. So Google Translate, and a lot of the natural language processing and natural language translating efforts, not just at Google, are like Babelfish science fiction stuff when we think back six, eight years ago. Like you’re going to translate from Korean to Urdu and it’s not perfect, but it is absolutely intelligible. If you speak a few languages and you look at this, you’re like, “I don’t know if I would have done it exactly that way, but I know exactly that you’re trying to get to your hotel and you wanted to buy a book, but it was too expensive.” Like, you probably don’t want to have sophisticated, moral, ethical, or philosophical conversations straight up with Google Translate, but you can definitely live your life and maybe even negotiate a job or something. And so I find that stuff really both humbling and encouraging. 
There was a big effort at Google to transition and the industry has done this as well, to transition from doing machine learning on general-purpose CPUs to using special-purpose machines. So we use TPU. We use these tensor processing units. Most of the rest of the world is using GPUs from companies like NVIDIA and ATI. All these things are is, like, very low-precision linear algebra. So take a vector and do some stuff to the vector and don’t worry too much about the exact details, but do it fast. And importantly, for us, when we think about scale, you all know as well as I do, we think about like power budgets, like how much are you spending on electricity. Because in the end, the computer you buy once, but the electricity you pay for every single second. So the electricity is usually what costs money. And these kinds of accelerators are just phenomenal at doing things that we couldn’t conceive of doing at this level of cost a few years ago. So I’m super humbled by those things and it continues to happen. What I think is another good aspect of your question, Ben, is like I really think it is incumbent upon all of us to remember why something is hard and remember why we said that it couldn’t be done. Because we all get lazy and we get sloppy and we’re like, “Nah, that doesn’t work. That doesn’t work. We tried. It didn’t work.” Well, when did you try it and how did you try it and what has changed since then? And in particular, like, I think here’s a super concrete example that most of us can understand. If you take a POSIX file system semantic, so open, close with a pointer in the file that you move around, being able to add stuff anywhere in the file and append to the file, close the file. Nobody I’ve ever seen has made a scalable, incredibly high-performant, incredibly reliable POSIX file system semantic compatible cluster file system. 
It’s just very hard, in particular because the offset in the file and the ability to append stuff to add data at different places in the file is very difficult to do in a clustered file system. But it turned out we didn’t need all of that. We liked it. It was fine. And we got used to it. I like reading and writing local files. That’s super convenient. But if you give up just a couple of those things, then we get things like S3 where you’re just like, “Oh, give me a bucket, I’ll get the stuff. I’ll give you a new bucket.” You’re like, “Oh, but I didn’t have a pointer or I didn’t have an offset.” They’re like, “No, you don’t get any of that stuff, but what you do get is horizontal scaling forever. You do get terabytes and terabytes and terabytes. You get petabytes. You get exabytes because you gave up these things.” And that’s the other thing I think is like often what happens is we remember that something doesn’t work, but we forget why it doesn’t work. And then if we remember that, we’re like, “Oh, maybe I don’t care about that as much anymore. Maybe it’s okay. I punt a little bit of complexity from this layer to this layer for what I get.” I think NoSQL databases are like that. Right? Relational database, like the relational calculus is amazing. It’s great. It does all these things for us. If you give up on it, you don’t have to have your database in one place anymore. Okay. Well, that’s the trade-off. For some of us, it’s a pretty great trade-off. For other people, it’s a disastrous trade-off. It depends on how much you care about consistency and how much you care about horizontal scaling. But anyway, that’s sort of the thought I had is yeah, I’ve had that happen continuously. And I think if we’re all lucky enough to stay employed in this industry, I hope it keeps happening to us. If somebody comes along and smacks you with a macro and is like, “Hey, buddy, that thing you said can’t be done? I did it yesterday. 
Let me show you.” Those are actually humbling and also really rewarding experiences.
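The difference between the two sets of semantics can be sketched in a few lines of Python. The first half uses ordinary local file calls (open, seek to an offset, overwrite in place). The second half uses `ToyObjectStore`, a hypothetical in-memory stand-in invented for this illustration; it is not the S3 API, only the whole-object put and get shape described above.

```python
import os
import tempfile

# POSIX-style semantics: open the file, move the offset, overwrite bytes in place.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.txt")
    with open(path, "wb") as f:
        f.write(b"hello world")
    with open(path, "r+b") as f:
        f.seek(6)            # move the file pointer to byte offset 6
        f.write(b"there")    # overwrite five bytes in the middle of the file
    with open(path, "rb") as f:
        posix_result = f.read()


class ToyObjectStore:
    """Hypothetical in-memory bucket/key store: whole objects in, whole objects out."""

    def __init__(self):
        self._objects = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        # No offsets and no partial writes: a put replaces the entire object.
        self._objects[(bucket, key)] = bytes(data)

    def get(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]


store = ToyObjectStore()
store.put("reports", "report.txt", b"hello world")

# To change the middle of an object, read it all, edit locally, and write it all back.
current = store.get("reports", "report.txt")
store.put("reports", "report.txt", current[:6] + b"there")
object_result = store.get("reports", "report.txt")

print(posix_result, object_result)
```

Both halves end with the same bytes, but the object-store version never needs a shared, mutable file offset, which is the piece that is hard to coordinate across many machines.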
[00:48:21] BH: If you’re the one that says it can be done and your belief is it can be done because you can give up a few things, and S3 was a good example, how do you then go about convincing anyone, either a whole market like S3 trying to go to market as this whole gigantic thing, or just your colleagues, like, “Oh, if we can give up this one thing or like start just shifting how we think about it, this whole area is going to be more performant or more reliable or anything like that,” and then obviously, maybe your boss, how do you convince the designers that this one component of the page, which they love so much, is actually probably the best thing to let go? And how do you have that conversation?
[00:49:09] TU: I mean, I think in the first case where you’re trying to convince people to use something and do something, you really just have to give people options when you can. So sometimes you just can’t. Sometimes you’re like, “We can only do one of these. The one we’ve got is this. We’re going to swap it with the other one.” But usually what you do is you say, “Hey, there’s this other one. It doesn’t work the same way, but it works better in these other ways.” And then you have to do a little bit of evangelism and a little bit of handholding. And I think the main thing that I’ve seen work really well is we all respond to someone like us who says it’s good. If I’m the service operator and I say like, “Hey, I got this thing with buckets,” like you get a bucket and you put stuff in it. If I’m writing bytes to a file or records to a file, why am I writing buckets? What’s a bucket even? I don’t know if you all remember, but S3 took a really long time to catch on. S3 was, like, I think I played with it in like 2008 or ’09 or so. It was super early. I can’t remember the exact date. You couldn’t do anything with it. There were these command line interfaces, but really, it wasn’t integrated into anything. But once people figured out what you could do with it and they started doing those things, there were backup solutions implemented on top of S3 pretty quickly, because they’re like, “This is write-mostly, read-almost-never stuff and so we’re just going to do that and that’ll be good enough.” So I think giving people options is one. I respond well when somebody who’s like me says, “Yeah, I was skeptical too, but I tried it and here’s what was good about it.” So that’s one thing. But in the case you’re talking about Ben, I just think it’s tough. There’s the sunk cost fallacy and there’s a sense of personal ownership. People don’t like giving up something that they’ve worked hard on, that they love, that they care about, that they put a bunch of themselves into. 
What you really have to do is convince them that somehow it will be better for… like if we’re all in this together for some set of users and some set of customers and what they want, if we can agree on that, then we start measuring that and we start talking about that and we realize, “Hey, this is the least clicked part of the page or this is the part of the page that is not used by anyone and it’s consuming 40-60% of our CPU budget or 30% of our latency budget, we’ve got to make some trade-offs.” And often I think if you can get someone to make some trade-offs within their own sphere of control. So you say like, “Hey, you’ve got three elements on this page. Which two do you want to keep? Because we really need to cut this down.” Then people will start to make good decisions and some of the right decisions. But it’s a tough thing. None of us likes giving up something we love.
[00:51:45] JG: So we talked about anything regarding performance. What are some of your favorite tools, Todd, that you’ve used for measuring performance and monitoring and uptime and all that stuff?
[00:51:54] TU: I think one of the things that surprised me when I came to Google and I didn’t really understand how it was going to work was the idea that you could easily export and then analyze and understand internal metrics from the applications in a way that was meaningful to that application at that time. So one of the things we do a lot of in the industry is we measure like start to end of something and then whatever happens in the middle. It’s just this black box. A miracle occurs and 800 milliseconds later a web response comes. Right? But in reality, we’re like, “Well, why is it…?”
[00:52:30] JG: Why does it take 800 milliseconds?
[00:52:31] TU: What is the web server? Yeah. Why does it? Who’s spending all that time? And so I think this has shown up recently. So internally Google has had a set of frameworks for doing that within all of Google’s binaries for many, many years. And so you’re able to set alerts on, able to build dashboards on, able to do analysis on sub metrics. So there can be things like class initialization time, like if I’m running on an object-oriented platform and I’m like initializing a class before I instantiate an object, you can measure that if you want to or you can measure memory fetch times for particular things or particular times for individual method calls. Now you can drown in that stuff. You have to be a little bit thoughtful about it. There’s a bunch of modern observability platforms that are now trying to make that stuff make more sense and give you tools for analyzing it. So that’s stuff that excites me because I think it does two things. One is just the mechanical thing, but one is the cultural thing. So the mechanical thing is as somebody who didn’t write the code, if you’re interacting with the service whose code you didn’t write, it gives you an enormous amount of visibility of what is happening, what is going on here, and you can see problems where you’re like, “Hey, the service is taking twice as long today as it did yesterday.” Cool! What’s slower today than it was yesterday? And you can see it. And you’re like, “Oh, it’s this RSA subroutine because somebody switched the keys from 1K to 2K and we don’t actually have hardware optimized 2K keys. We’re going to switch to 4K tomorrow.” Okay. So fine. So you can find those kinds of problems. But what I think is even more important is on a cultural basis, it makes developers think, “What is someone who’s not me or me in the future?” Because me in the future is also not me, but what is someone who’s not me right now want to know about this code? What should I export? 
What variables should I keep track of, like success of this, failure of that, number of times of this, delay from this, total space for that, retries from this, latency for this sub call? And because you’re doing it inside of the binary, the semantics you get are what the binary sort of knows about or thinks about, the binary’s collection of views of the world. That’s enormously powerful, and the cultural impact is that developers think about what’s going on in the binary from a functional point of view, rather than getting lost in the implementation of the code itself. So I think the ability to do that is probably the single most important shift that I’ve seen and I really enjoy it.
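A minimal Python sketch of what exporting metrics from inside the application can look like. This is not Google’s internal framework, which the conversation does not describe in detail; the `Metrics` class, the metric names, and the toy `handle_request` function are all hypothetical. It keeps counters for successes and failures and timings for named sub-steps, so the request is no longer one black box.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process registry: named counters plus timings for named sub-steps."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings_ms = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raised.
            self.timings_ms[name].append((time.perf_counter() - start) * 1000)

    def snapshot(self):
        """What a dashboard or alerting system could scrape from this process."""
        return {
            "counters": dict(self.counters),
            "timings_ms": {
                name: {"count": len(values), "total": sum(values)}
                for name, values in self.timings_ms.items()
            },
        }


metrics = Metrics()


def handle_request(payload):
    """Toy request handler that reports what it thinks is worth knowing."""
    with metrics.timed("handle_request"):
        if not isinstance(payload, str):
            metrics.increment("requests_failed")
            return None
        with metrics.timed("parse"):
            words = payload.split()
        with metrics.timed("render"):
            body = " ".join(word.upper() for word in words)
        metrics.increment("requests_succeeded")
        return body


ok_body = handle_request("hello metrics")
bad_body = handle_request(42)
snapshot = metrics.snapshot()
print(snapshot["counters"])
```

Because the handler names its own sub-steps, a question like which part got slower since yesterday can be answered from the exported timings rather than from a single start-to-end number.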
[00:55:13] JG: Just to kind of clarify, make things a bit more concrete there. You’re talking about things like traces. You mentioned that those were extreme low level details, like how long does it take to initialize this class and fetch from memory. Is that kind of what you’re talking about there?
[00:55:28] TU: Yeah. So I think one of the problems with a stack trace is that you get the stack and then you get another stack and you’re just like, “That’s a lot of stuff. It’s all the stuff.” Right? So that’s great, but it’s a lot of stuff. So I think it’s a little bit like the difference between data and information. So data is just all the stuff, but information is sort of what some human being thinks is important. So it depends on what your application is doing, but if you have something that’s serving stuff from cache, one of the things you could export is cache hit rate over the last 60 seconds. Right? Now something that’s serving from a cache can just do that like how many times did I have the thing that we’re looking for? Just keep track of that and every second report a trailing 60-second average of that. That’s actually incredibly useful and no one other than the thing that’s serving the cache can do that for you easily. Right? Because otherwise you’ve got to kind of figure out what’s going on. Do you know what I mean?
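The trailing cache hit rate described above could be tracked with something like the following Python sketch. The class name `TrailingHitRate` is hypothetical, and timestamps are passed in explicitly so the example is deterministic; a real service would more likely read a monotonic clock.

```python
from collections import deque


class TrailingHitRate:
    """Cache hit rate over a trailing time window (60 seconds by default)."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, was_hit), oldest first
        self._hits = 0

    def record(self, was_hit, now):
        self._events.append((now, was_hit))
        if was_hit:
            self._hits += 1
        self._evict(now)

    def _evict(self, now):
        # Drop lookups that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_hit = self._events.popleft()
            if was_hit:
                self._hits -= 1

    def hit_rate(self, now):
        """Fraction of lookups in the window that were hits, or None if there were none."""
        self._evict(now)
        if not self._events:
            return None
        return self._hits / len(self._events)


tracker = TrailingHitRate(window_seconds=60)
tracker.record(True, now=0)
tracker.record(False, now=10)
tracker.record(True, now=30)
tracker.record(True, now=50)

rate_at_50 = tracker.hit_rate(now=50)    # all four lookups are inside the window
rate_at_65 = tracker.hit_rate(now=65)    # the lookup at t=0 has aged out
rate_at_200 = tracker.hit_rate(now=200)  # nothing left in the window
print(rate_at_50, rate_at_65, rate_at_200)
```

Only the component serving the cache sees every hit and miss, so it is the natural place to keep this count and report it once a second.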
[00:56:26] JG: Absolutely.
[00:56:26] TU: So yeah. It’s really that like export from the application or make queryable in the application what it cares about. So from a performance point of view, I mean, if you pick dumb stuff, time of day, your applications could export time of day continuously, that is not a metric that will help your applications be better in life. Or imputed temperature in the location of the query. Right? They could do some lat-long of like where the query is coming from, I think it’s 27 Celsius where this query is coming from. Well, that’s great and all, but I don’t think that’s going to help me serve this web page about like CSS. That’s not going to do my job for me. But I do think once you’re like, “Well, what am I trying to do and what information can I get to people who are interacting with this to really make it work better?” I think that can do a phenomenal amount of good.
[00:57:15] JG: Definitely.
[00:57:17] BH: Well, thanks for joining us today, Todd.
[00:57:19] TU: Yeah. Thanks so much for having me. It was a great conversation. I really enjoyed it.
[00:57:31] BH: This show is produced and mixed by Levi Sharpe. Editorial oversight by Jess Lee, Peter Frank, and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, email [email protected] and make sure to join us for our DevDiscuss Twitter chats every Tuesday at 9:00 PM US Eastern Time. Or if you want to start your own discussion, write a post on DEV using the #discuss. Please rate and subscribe to this show on Apple Podcasts.