Testing from A to B.
In this episode, we talk about A/B testing with David Sweet, adjunct professor at Yeshiva University and author of the book Tuning Up: From A/B Testing to Bayesian Optimization. After listening, if you would like a 35% discount on Tuning Up: From A/B Testing to Bayesian Optimization, go to the link in our show notes and use offer code devdsrf-38BF.
Ben Halpern is co-founder and webmaster of DEV/Forem.
David Sweet is an adjunct professor at Yeshiva University and author of the book Tuning Up: From A/B Testing to Bayesian Optimization.
[00:00:00] DS: So this is one of the things you want to specify at the beginning, at the outset: “What’s my practical significance? How big of a move do I need to make for anyone to care?” Whether I can measure it or not is a different question.
[00:00:21] BH: Welcome to DevDiscuss, the show where we cover the burning topics that impact all of our lives as developers. I’m Ben Halpern, a Co-Founder of Forem. And today, we’re talking about A/B testing with David Sweet, Quantitative Trader at 3Red Partners, and Author of the book, Tuning Up: From A/B Testing to Bayesian Optimization. Thank you so much for being here.
[00:00:45] DS: Thanks for having me, Ben. Excited to be here.
[00:00:47] BH: So David, can you tell us a bit about your background and what brings you here today?
[00:00:51] DS: Sure. I was educated as a physicist, but I went into industry right away to work in finance. I’ve worked in statistical arbitrage a little bit, mostly in high-frequency trading and US equities. And I co-founded a crypto trading company. We did crypto arbitrage in 2017 and 2018. That was called Galaxy Digital Trading. And it was one of multiple crypto ventures funded by a guy named Mike Novogratz. And we all got combined together into something called Galaxy Digital Holdings, which is the first investment bank for crypto. So my company became the quantitative trading desk and the OTC desk. We kind of split into two departments and we wrote all the trading infrastructure and models that were used for execution and arbitrage and various other trading operations in the company.
[00:01:44] BH: Can you give a sense of what it means to be a quantitative trader, like the nuts and bolts of what you do every day?
[00:01:50] DS: It’s a lot. I can actually think of it a little more broadly, like, “What’s it like to be a quantitative trader or an ML engineer?” I did that at Instagram for a while. And what I noticed was that there are really nice parallels between the two. Your typical daily flow is that you have some idea of how you can improve the system, whether it’s a trading strategy or a recommender system or an ad serving system. It could be any of various systems that have machine learning models at their core. So you have some idea, maybe adding a new feature to the machine learning model, as an example. Then you test it offline: do you get a better fit? Does it work better out of sample? Maybe you have some kind of simulator or a way to estimate an improvement in, if it’s trading, the revenue or P&L, the term we use for profit and loss. Or in the recommender system case, you might be interested in estimating offline whether people will click on more of whatever it is you’re showing, or press like more often on whatever it is you’re showing. And then finally, after all of that’s done, you go online and run an experiment to see how well it works. So in trading, you actually run your trading strategy with your new features or what have you, and you ask, “Well, does it make more money than the old version of the system?” And in a recommender system, you might actually run an A/B test and ask, “Well, do users stick around longer with my new feature? Do they click like more often with my new feature?” Whatever the metric happens to be. And so it was actually seeing these parallels between the two that made me think, “Hey, you know what? There’s a broader audience for these kinds of things I’d like to talk about than just quantitative trading.”
People across trading, ML engineers, and even software developers, especially those who work on infrastructure, like network infrastructure, that kind of thing, might follow this same process to improve the reaction time or the latency of a piece of software, like web server components. They estimate offline how long it takes for their new code to run. If it looks like it’s faster, they put it online to see if it really has some kind of impact on the latency or other business metrics that are relevant to a live running system.
[00:03:48] BH: So in high-frequency trading, is this the sort of thing where you’d need constant improvement in order to continue to exist, because there is a lot of competition here? Do profits kind of trend towards zero without making constant improvements?
[00:04:04] DS: For sure. Markets are, I would say, a special case of these kinds of systems in that all of the data, almost all of the data, is publicly available, and it is an objective of the exchanges to make that data available in a fair way. Right? So there’s a term that we use in finance, fair and orderly markets. Part of fairness is that everyone gets the data as quickly as everyone else, ideally, and there are mechanisms in place that drive all the systems toward that as a goal. That’s an objective of the exchanges. So you’ve got all this public data and everyone has fair access to the exchanges, ideally. So the question is, “Well, how do you make money?” You have to compete on the quality of your predictions, for example. An interesting effect happens, which is that as you capitalize on your predictions, the quality of the predictions decreases. It’s the act of trading on your predictions that makes them go away. So they go away for you, but they also go away for anyone else who might be capitalizing on them. And that’s one of the ways the competition eats away at your profits and you eat away at the competition’s profits, but that’s what makes markets efficient, right? It’s that competition, that eating away of these little predictions. These little predictions are really inefficiencies in a way.
[00:05:18] BH: So let’s get into the big topic and your book, Tuning Up: From A/B Testing to Bayesian Optimization. First, what inspired you to write this book, step away from your day job and get into this?
[00:05:32] DS: Well, I had been giving lectures, once a year, for a trading systems and strategies class at NYU. Professor Vasant Dhar invited me to do this for a few years in a row. So I was kind of collecting my thoughts on how do you build a high-frequency trading system, how do you tune a high-frequency trading system? To me, that’s always been the most interesting part of it. I feel like in some sense, it’s the hardest part. It’s also the part that I didn’t learn when I was a student. When I was a student, I was a theorist. And so I did a lot of work in simulation and work with pen and paper, but only touched on experiments a little bit. Tuning these live systems is more of an experimental process. So anyway, I had been talking about this and getting my thoughts together to put on these lectures year after year. When I went to Instagram, that was kind of the inspiration. When it clicked, I was like, “Hey, you know what? Everybody’s doing the same thing. Not just HFT. I’ve been very narrowly focused.” So I thought, “Well, maybe more people would like to hear about this than just HFT people or just this one class.”
[00:06:29] BH: Who would you say the main target audience of the book is?
[00:06:32] DS: I think of it as the intersection of the common work between quantitative traders, especially quantitative traders on a higher-frequency, shorter time scale, ML engineers, and software developers who do infrastructure. Now you could think of the engineers who work on systems like an HFT system, either building the trading strategy or building the infrastructure to make it fast and low latency with no jitter. You could think of a recommender system, putting the best tweets first, or an ad system, which is a kind of recommender system putting the best ads first, the ones that people are going to like and click on and make use of. Or say you’re building a web server. You want to build Google’s search engine. One of the famous results they have is that people are sensitive to very small amounts of latency. If someone hits your web page and it takes too long to load and render, and too long could be a second, less than a second, they might just say, “Forget about it. I’m not as interested as I thought I was and I’ll go somewhere else.” That’s the bounce rate. So if you’re an infrastructure engineer, you might be very sensitive to that latency number and work hard to get it down. These three types of engineers, I think, would benefit from this book. But I also think at an introductory level, coming out of school and going into industry, the book would be really useful because these kinds of experiments are done across technology and finance companies. And so it would be worthwhile to understand how they’re going to work when you go and do your job, even if you’re more experienced and you just make use of these kinds of systems, right? Like you’re not building the tool at Facebook or Google for everyone else to use. You might be a consumer of the tool, but understanding a little more deeply how it runs will make you a better user.
[00:08:17] BH: You know, depending on how low level you get, maybe these problems aren’t as different, but between high-frequency trading and working at a social media company or e-commerce, every industry might make use of this at some level of the stack. It’s going to be different contexts. But from a cultural perspective, have you thought about the organizational and cultural challenges around just having the right conversations about A/B testing? In high-frequency trading, maybe it is so obvious that this should be a core part of what we care about that you don’t even need to have that conversation. But in another type of organization, maybe it’s hard to get buy-in on the value of this sort of stuff. Or perhaps the buy-in that does exist is misaligned, or not taking the technical answers or the expertise of the actual A/B test implementer into consideration. So just at a high level, in different situations, is there a way to start that conversation effectively?
[00:09:22] DS: There are a couple of angles I can think of on this question. One is if you were coming from the perspective of the ML practitioner. If you were building models, you might’ve spent a lot of time reading and learning about fitting a model, testing it out of sample, and all the work that goes into that. It’s a lot of work. A lot of that has been studied and you have the opportunity to iterate very quickly because you’re completely offline. You get a dataset and you can try a feature, try another feature, try another feature. You can do this as fast as your model fits. You could do this multiple times an hour, multiple times a day, or multiple times a week if it’s a large model. Experiments, by contrast, are very slow and not the kind of thing you can necessarily run in the classroom unless you have a lab. So you might not have had much exposure to that. So when you get into industry, fresh from school, your mind might be geared towards answering a question like, “Can I lower RMSE?” or, “Can I reduce the loss on my out-of-sample set?” That’s kind of the gold standard metric for model building. But when you run a business, you mostly don’t care about that. Well, maybe you care inasmuch as it affects revenue or as it affects user retention or as it affects click rates. So these are the business objectives. Now the benefit of understanding how to translate your changes in loss, let’s say, into changes in revenue is that you can then go communicate to someone outside your specialty. Everyone at your company will speak the language of dollars. So I made this much money, this much more money this month, right? So that’s one thing. You want to speak the more common language, which is that of the business metrics. If someone just says, “Hey, we should run an A/B test,” the response might be, “No, I don’t think that’s a good idea. It’s a cost. 
What’s the benefit going to be?” But if you can come to someone and say, “Hey, you know what? Here’s how I would run a test to figure out how much extra money it’s worth,” they might start to see the value of it, right? To the question, “What am I going to do with this A/B test?” the answer is that it’s going to translate everything into the business metrics. One common pushback you might get is the idea of domain knowledge. Someone might say, “Well, I’ve seen the change to the system. I’m familiar with how Twitter works. I’m so familiar with Twitter and recommender systems in social media that I can see the change you made and I can tell that it’s going to be better because I’ve just seen this happen so many times. I don’t need to run a test.” But you’ll run into example after example after example where that just doesn’t pan out. There are these statistics that I love, which I looked up from Amazon, Microsoft, and Netflix, where they talk about the percentage of A/B tests that succeed. Now nobody’s going to run an A/B test unless they think the thing’s going to work. They have some reason to believe it’s going to work, like domain knowledge, right? Nobody runs an A/B test with an idea they think is going to be bad, that’s going to fail. They think it’s going to succeed. It’s going to be better than what’s existing. At Amazon, about 50% of these A/B tests will suggest that the change, the new idea, is actually better than the old one in a business metric sense, that it actually produces more money, for example. Microsoft did a similar kind of study of their A/B tests and they reported only 33% were actually showing improvements. And Netflix, perhaps they’re more aggressive with their tests, shows about 10%. That’s what they report. Now, over the years in quantitative trading and HFT, I have asked people, the engineers who build these systems, just informally, “Ah, you come up with a new idea. 
You try it live.” How often does the idea pan out and give you a better trading strategy? And every single person, except for one, answered 1 in 10. Everyone, except this one guy who answered 1 in 100. And I think it’s just because he’s cynical and he thought it was funny, but no one was telling me 90%. No one was telling me, “I’m always right.” No one was even telling me, “I’m right half the time.” They were telling me 10% of the time. And so this seems to be common. I think the reason is these systems are very complex. They’re so complex. I think an engineer will get this when I say it, though people outside of engineering I’ve said this to look at me like I’m making a joke: the system very quickly grows to be so complex that no one understands how it works. You might understand it at a broad level. You might understand your piece of it very well, but no one understands all of the pieces very well and certainly no one understands all the interactions between the pieces. Right? If there are n pieces, there are on the order of n squared interactions. And so it doesn’t take long for there to be too many interactions for you to understand. Right? And that’s just the complexity of the system. Now the system interacts with the rest of the world. So your trading system interacts with the exchanges and all of the other trading systems. Your recommender system interacts with the users, who themselves then go and interact with each other, with other systems and so on, and it’s just more than you can accurately model in your head just by kind of thinking it through, even if you do have good domain knowledge. Right? That doesn’t mean there aren’t people out there who understand their field. It’s just that even the best understanding can easily be thwarted, even 90% of the time, by the complexity of a real system. It’s kind of amazing. 
And the more I look into it, the more this idea is confirmed. Like I said, Amazon, Microsoft, and Netflix all kind of have the same report. Now there’s another counterargument that I’ve run across, which is, “If I build a simulator, a simulator that can produce a P&L answer or a click rate answer, I could run that offline and it’s cheap, I can run it over and over again, and I don’t need to run experiments.” In the simulator, you can put in whatever features you want, whatever model you want, or whatever ideas you want. It’s the same code, ideally. It will be the same code that you would run live. So you’re really testing your code. Now the catch is that a simulator is a model. The simulator itself is a model. So you’re evaluating all of your work with a model of reality. And so what happens when you build a model? Well, it’s less complex than the real system. And again, the complexity of the real system makes your model, your simulator, inaccurate. Call it simulator bias; it’s a kind of model bias. That being said, a simulator is a great complement to experiments. Simulators tend to be fast, cheap to run, safe to run, right? You run it offline. You’re not going to scare users away by putting a bad idea into a simulation of a recommender system. And they can be precise because you can put lots of data into them. You can run them over the past day, week, month, year. However much data and computing time you want to put into it, you can put into it, so you get a very precise result. Experiments, on the flip side, have low precision. You can run them for a few days, maybe a couple of weeks, and then you run out of time to run them. And also there’s risk, because you’re interacting with the real world. If you’re a trader, you’re interacting with exchanges, and you can lose money. If you’re running a recommender system, you can annoy users, and those users can go on some other social media and complain about you very loudly. 
So experiments are going to be less precise and more costly, but they’re going to be accurate. Accurate because you’re experimenting on the real system. It can’t be wrong. That is the standard by which you measure accuracy. Right? So simulation and experimentation are complements. Each is a good tool, and it’s not really an either/or question.
[00:16:49] BH: Do you have a small test or a rule of thumb for whether a problem is worth the setup cost to do this effectively? Like if I’m staring at something, how specialized should the people managing the infrastructure of the A/B test be, from a tooling perspective? What’s your sense of confidence in off-the-shelf tooling in different domains, or at least close to where you work? Is there a certain type of situation where people try to set up an A/B test and don’t know what they’re getting into in terms of the commitment?
[00:17:27] DS: I think the general pattern that probably holds everywhere is that when you’re first starting out, when the complexity of your system is small and you haven’t done a lot, you’re going to run into the problem of not being able to collect a whole lot of data. Right? If your only users are all from your high school and they’re just helping you out, that’s not going to be a great dataset. Right? But on the flip side, at that stage, you’re building all of the high-signal stuff. Right? You’re looking for big changes. Did the system go offline? Right? A big change like that. Or did anyone click on anything in any of my ads? Big changes. So you’ll see strong signals. So it’s a good match. From a statistical point of view, you have high noise, but you have large signals. And so you can kind of see them without designing an experiment because they’re big signals. But as time goes on and you get more data and you have more ideas, as you bake more ideas into the system, the size of the improvements that your ideas make will get smaller and smaller. And at some point, the noise will overwhelm the signal of your improvements. You’ll put something online and you won’t know if it helps. At some point, you won’t be able to answer the question, “Did I improve the system?” and then you’ll be forced to find a way to do it, and the way to do it is to design a test. Now the good news is that the basic A/B test is fairly simple to plan out and to run. You don’t really need a lot of special tooling just to get going. Right? And you can make an estimate on pen and paper if you know the noise level of your measurement, right? If you know the volatility of your P&L. And you will know this if you’ve been running your system for any reasonable amount of time, well, long enough to run an experiment. 
If you’ve been running it for a month or you’ve been running it for a year, however long it’s been, hopefully you’ll have some data logged and you can estimate the noise level. And given a certain noise level, you could say, “Well, if I were to run an experiment for a week, how big of an improvement could I detect?” And you can just do that on pen and paper. And you might find that, “Hey, you know what? This improvement is around the size that we need to measure. It’s large enough. It’s worth running an experiment.” Or you might find that, “You know what? If I wanted to get enough data to measure a one percent improvement in my KPI, in my metric of interest, I’d have to run for a year.” Then you don’t have to run the A/B test, and the only time it cost you was the time it took to analyze on pen and paper. So you can ease into it step-by-step, and what I would imagine is that you’ll get to the point where you can’t tell that you’re making improvements anymore. You do the pen and paper and you’ll say, “You know what? It’s going to take a week to measure this, two weeks to measure this. I’ll run the experiment, but I’ll set the whole thing up by hand.” You don’t have to build infrastructure, automation, or anything. Just like everything else you’d want to do, start simple. You do it by hand. You crank through it. And if you like it, then you start automating, then you start scaling, then you start improving. Right? And treat it just like you would your main product, with this kind of lean mentality or agile mentality of iterative improvements, to make an internal product. Now the other question is, “Why build it at all? Why not just use something off the shelf?” This is a question that applies to anything you would build, any tool you would build in line with this. I don’t have the best answers to this question, but I know kind of what people think about this and the way I feel about this. My feeling is build a little bit internally, right? 
So you can get a feel for how valuable the tool is, just in the general sense, how valuable it is to have a tool, and you can answer questions for yourself, like what features or capabilities would make the tool more valuable. Then you can start to estimate, “How hard would it be for us to build, and what outside tooling has this?” Also, if you want to evaluate an outside tool, you can compare it to what you already have and you’ll find yourself saying, “Oh, it would be great if we could do such and such,” or, “Such and such a feature is so precise or advanced or whatever, we won’t need that for a few years.” Really it’s going to depend on your business, but you need some way to evaluate. And I find that building at least a prototype, something simple that you can use internally, is a good way to start that evaluation process.
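A minimal sketch of the pen-and-paper estimate described above: given a noise level and an experiment length, how big an improvement could you detect? The function name, the one-sided 5% and 20% error limits, and the equal-noise assumption for A and B are illustrative assumptions, not something stated in the episode.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sigma, n_per_arm, alpha=0.05, beta=0.20):
    """Smallest true difference in means that a basic A/B test could detect.

    sigma      -- standard deviation of one measurement (e.g. daily P&L),
                  assumed to be the same for A and B
    n_per_arm  -- number of independent measurements collected for each arm
    alpha      -- accepted chance of adopting B when it is not better (assumed one-sided)
    beta       -- accepted chance of rejecting B when it really is better
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # Standard error of the difference between the two arm averages.
    standard_error = sigma * sqrt(2 / n_per_arm)
    return (z_alpha + z_beta) * standard_error


# Hypothetical numbers: $20,000/day P&L noise, one week of daily measurements per arm.
one_week = minimum_detectable_effect(sigma=20_000, n_per_arm=7)
one_year = minimum_detectable_effect(sigma=20_000, n_per_arm=252)
print(f"detectable after a week: ${one_week:,.0f}/day")
print(f"detectable after a year: ${one_year:,.0f}/day")
```

If the one-week number is far larger than any improvement you expect, that is the signal to skip the test or to find a less noisy measurement before building anything.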
[00:21:31] BH: So can you maybe give the process of how to design, run, and analyze an A/B test?
[00:21:40] DS: So the basic idea for an A/B test, actually for designing any experiment, the way I think of it, is you want to begin with the end in mind. If I had the data, if the experiment were done and it was time to analyze it, what question would I want to answer? Once I know that, then I can ask, “Well, what data do I need to collect to answer that question?” And so for an A/B test, a question you want to answer is, “Am I wrong in thinking that this new idea is better?” Right? And you can’t make it black and white. It can’t be a binary decision because it’s statistics, because there’s always uncertainty. So what you can say is, “Well, I want to probably not be wrong.” Right? So if I put the new idea online, I want the probability that it wasn’t actually better, the probability that I’m wrong, to be, let’s say, less than 5%. And if I reject this new idea because the old version of the system looks better, I want the probability of being wrong about that to be small as well, and people usually think about 20%. Right? So probably not wrong either way, whether I make the change to the new one or I don’t. So now once you’ve written that down, and your listeners might have heard of a t-test or the idea of statistical significance, you’re going to write down the formulas for those and require that the probabilities are small enough. And from that, you can then solve for the number of replications you have to do, the number of samples you have to collect, or the number of measurements you have to collect. I say the same thing in three different ways: samples, measurements, replications, all the same thing. Now, the estimate of the number of measurements you need, you can make that all before you run the experiment. Everything I’ve described so far, you can actually do before you run the experiment, just on pen and paper or in the computer. 
Then you run the experiment, and then you actually carry out this analysis and you ask the question, “Is it probably better? Is my new idea probably better?” That’s kind of the best you can do.
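The design step above, solving for the number of replications before running anything, can be sketched as follows. This uses the common normal-approximation formula with the 5% and 20% limits mentioned in the conversation; the function name, the one-sided test, and the equal-variance assumption are assumptions made for illustration, and the book’s own derivation may differ in detail.

```python
from math import ceil, sqrt
from statistics import NormalDist


def replications_needed(sigma, practical_significance, alpha=0.05, beta=0.20):
    """Measurements needed per arm so that an improvement of size
    `practical_significance` is probably not missed and a non-improvement
    is probably not adopted.

    sigma -- standard deviation of a single measurement, assumed equal for A and B
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # limit on wrongly adopting B
    z_beta = NormalDist().inv_cdf(1 - beta)    # limit on wrongly rejecting B
    # Standard deviation of the difference between one B and one A measurement.
    sigma_delta = sigma * sqrt(2)
    return ceil(((z_alpha + z_beta) * sigma_delta / practical_significance) ** 2)


# Hypothetical numbers: noise of 1.0 in the metric, and we care about a 0.5 move.
print(replications_needed(sigma=1.0, practical_significance=0.5))
```

Note how the count scales with the square of the noise-to-signal ratio: halving the improvement you want to detect roughly quadruples the number of measurements.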
[00:23:27] BH: Is there a rule of thumb where you can find, at the design stage, that you don’t have to run the test at all, where you might discover information that leads you to realize this is most likely not even worth following through on, because it’s going to be a big commitment?
[00:23:44] DS: Yeah. One of the things you can ask about, I like to think of it as the idea of practical significance, to distinguish it from statistical significance. That’s not my phrase. It’s from a book I read a long time ago, but I can’t remember which one it was. But the idea of practical significance is, “How big of a change or how big of an improvement would I actually care about?” Right? I could, in principle, run an experiment for a year. The longer you run your experiment, the smaller the change or the smaller the signal you can detect. I can run the experiment for a year and I can measure a very, very tiny change. Maybe $10 extra a day, some small number. But would anybody care if I made $10 extra a day? Probably not. Right? Not in a business like the ones we’re talking about here. Nobody would care about 10 extra dollars a day. So it wouldn’t be worth running that experiment. So this is one of the things you want to specify at the beginning, at the outset: “What’s my practical significance? How big of a move do I need to make for anyone to care?” Whether I can measure it or not is a different question. So once you’ve specified that, then you can look at the noise level in your system. If the noise level is too high for you to measure a large enough change in a reasonable amount of time, then you can’t run the experiment. But it’s an interesting situation to be in because everything I’ve said so far doesn’t actually depend on what change you made. It only depends on the system you have available to you. So say you’re stuck in a situation where you’d be interested in $10,000 a day extra, let’s say, but you can only detect a $100,000 a day change in one week. Then what do you do? Well, maybe you run a long experiment, but it’s going to be too long. Or maybe you have to find a better way to run experiments, right? Or maybe you have to find a better way to test things. And so you need to do some deeper thinking about the business. 
Maybe the business is just too small for you to be able to find that little bit of extra money. But when you get into that situation, you have a deeper problem of, “How can I reduce the noise and make better measurements?” The other piece of the puzzle is, what about the thing that I’m actually measuring? Can I ask the question, “How big of a change would I expect this thing to make?” before I even put it online? And there, what we typically do is run some kind of simulation, right? That’s what simulation is good for. If I can only detect changes in my live system of $10,000 a day or more, well, then what I can do is run a simulation. Say my simulation says the new feature I added to my trading strategy is only worth a thousand dollars a day. Should I bother putting it online? I could never tell if it worked. I know that putting it online runs the risk of breaking the system, and the simulation could be wrong, because it’s a model. It also requires me to change the system, and any change to the system is a risk. Right? We know that as engineers. That’s why in a continuous deployment pipeline, there’s test after test after test, lots and lots of stages of tests, because you run the risk of breaking the system. So do you bother putting this online? The answer is probably no. Don’t put this online because you’ll never know whether it works. Or what you can do is keep improving your idea, changing your feature, adding more features until the simulator says, “Ah, this has an expected improvement of more than $10,000 a day,” and then you go off and run the experiment.
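The reasoning above amounts to a simple gate with three inputs: the practical significance, the smallest change the live system can detect in a reasonable time, and the improvement the simulator predicts. Here is a small sketch of that gate; the names `Decision` and `experiment_gate` and the exact comparison rules are hypothetical, chosen to mirror the two dollar examples from the conversation.

```python
from enum import Enum


class Decision(Enum):
    IMPROVE_MEASUREMENT = "noise too high: find a better way to measure"
    IMPROVE_IDEA = "simulated gain too small to detect: keep working offline"
    RUN_EXPERIMENT = "run the live experiment"


def experiment_gate(practical_significance, detectable_change, simulated_improvement):
    """Decide what to do next, all amounts in the same units (e.g. dollars per day).

    practical_significance -- smallest improvement anyone would care about
    detectable_change      -- smallest change the live test can detect in the time available
    simulated_improvement  -- improvement the offline simulator predicts for the new idea
    """
    if detectable_change > practical_significance:
        # Depends only on the system, not on the idea being tested.
        return Decision.IMPROVE_MEASUREMENT
    if simulated_improvement < detectable_change:
        # Going live adds risk, and you could never tell whether it worked.
        return Decision.IMPROVE_IDEA
    return Decision.RUN_EXPERIMENT


# The two situations described above, plus one that clears both checks.
print(experiment_gate(10_000, 100_000, 50_000).value)
print(experiment_gate(10_000, 10_000, 1_000).value)
print(experiment_gate(10_000, 10_000, 12_000).value)
```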
[00:27:15] BH: So when you’re running these experiments in your specific domain, it seems like your sector seems really tuned in with this because you said like lots of public information. So are you running trading A/B tests on history, like you run on like what happened yesterday and how would this have turned out differently with a different model in place?
[00:27:41] DS: Yes, that is the goal of the simulation, to ask, “If I had run a different strategy, would things have been different?” You’re trying to tease out this counterfactual information. It’s very hard to be right in counterfactual situations like that, but that’s the goal. That is the intent of it.
[00:28:00] BH: And the financial space probably has at least a decent sense of being a closed game, a little bit more. There are probably certain types of domains where the simulation part is an order of magnitude more difficult to construct.
[00:28:16] DS: The key parameter, I think, to key in on for a question like this is, what is the holding time of your strategy? Right? Do you tend to buy and sell, enter a position and then exit one minute later, or is it one day later, or is it months later? The longer the holding time, the noisier whatever you’re measuring is going to be. Right? Because there’s more time for the price to change. There’s higher volatility over longer times, but it’s also harder to get the same relative amount of data. If I were interested in one-minute samples, in a single day in US equities you get 390 of them, right? I go back a month and I get 22 times 390 of them, because there are 22 trading days, and I could go back maybe three months. Maybe even a year and a half of the data still makes sense, but definitely three months on that time scale. The reason I worry about going back further at a certain point is non-stationarity, this idea that the data generating process, the system itself, is changing over time. Right? So we’re not just faced with the problem of out-of-sample failure. We’re faced with a problem that’s called out-of-distribution failure. The system last month is different from the system 12 months ago. So you can only go back so far in time and have the data make sense for tomorrow’s trading. Now if your holding time is three months, you can’t use one month of data to build a strategy because it won’t even cover a single position. If your holding time is three months, you’d need many, many years of data. Even if your holding time is two weeks or a month, you probably want something like 10 years of data. So then the question is, “Where is the simulation more or less accurate?” Well, where the simulation fails is at the time you trade. Right? You can say, “If I had bought Amazon in 1999 and I still owned it today, what would be my P&L? How much money would I have made?” 
And you can make a very, very precise estimate of that because you only traded once. In your simulation of this trading, you might have purchased that, I don’t know how low it was in 1999, but let’s say it was $50, right? You might have simulated a buy at $50 in 1 cent. But in real life, had you executed, it would have executed $50 and 5 cents because the market was moving fast at the moment you executed. So you’d be off by 4 cents, but that 4 cents is drown out by the phenomenal change in the last 24 years in the price of Amazon so you don’t really care. It’s a little bit of bias and a much larger signal. Now if you’re turning over a minute, there’s lots and lots of executions. There’s lots and lots of opportunities to be wrong. So the shorter the time scale you’re on, the harder it is to get the simulation to max because you’re faced with these, what are called counterfactuals. Like what would have happened had I done things differently? You’re faced with answering that question over and over and over again, and every time probably a little bit wrong. So that’s the main factor thing about simulation is how long, at least in a trading simulation how long.
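The arithmetic in this answer can be sketched in a few lines of Python. The 390 one-minute samples per day, the 22 trading days per month, and the 4-cent execution error are the figures quoted in the conversation; the function names are hypothetical, and the sketch assumes (purely for illustration) that every execution is off by the same amount in the same direction.

```python
MINUTES_PER_DAY = 390          # one-minute samples in a US equity session (quoted above)
TRADING_DAYS_PER_MONTH = 22    # trading days per month (quoted above)


def one_minute_samples(months: int) -> int:
    """Number of one-minute samples in `months` of history."""
    return months * TRADING_DAYS_PER_MONTH * MINUTES_PER_DAY


def complete_positions(history_months: int, holding_months: int) -> int:
    """Non-overlapping positions that fit entirely inside the history window."""
    return history_months // holding_months


def cumulative_bias_cents(bias_per_trade_cents: int, trades: int) -> int:
    """Total simulation error in cents, assuming (hypothetically) that every
    execution is off by the same amount in the same direction."""
    return bias_per_trade_cents * trades


if __name__ == "__main__":
    print(one_minute_samples(1))         # one month of one-minute samples
    print(complete_positions(1, 3))      # one month of data, three-month holding time
    print(complete_positions(120, 3))    # ten years of data, three-month holding time
    print(cumulative_bias_cents(4, 1))   # a single buy-and-hold execution
    print(cumulative_bias_cents(4, one_minute_samples(1)))  # one execution per minute
```

Under these assumptions, one month of history holds 8,580 one-minute samples but not a single complete three-month position, and a 4-cent error that is negligible for one buy-and-hold trade adds up across thousands of one-minute executions.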
[00:31:14] BH: I’d love to just dive into a few terms and you can maybe define them briefly for the audience, but also just like where to even get started as far as a preview for maybe reading deeper. So can you tell me about multi-armed bandits, that idea?
[00:31:28] DS: Sure. So the way the book is constructed is I start out by spending a lot of time on A/B testing, because that gives you all the basics of experimentation. And then I talk about other methods, which I think of as a little bit more specialized, and the first one is multi-armed bandits. With a multi-armed bandit, you can still do what an A/B test does and compare two different versions of your system, but the multi-armed bandit is going to give you two features beyond A/B testing. One is, instead of designing for the idea of being probably not wrong, like A/B testing, it’s worried about how much of your business metric, like your revenue or the number of likes, you are capturing while running the experiment. So if the multi-armed bandit sees that your Version B has been doing better than A, it’ll run B a little more often than A. It won’t just be 50-50. An A/B test is 50-50, half of the traffic goes to A, half goes to B, let’s say. But a multi-armed bandit will start biasing towards the better one and it’ll work its way back and forth until the end, where it’s almost completely focused on the better one. And in that process, it captures more of your business metric. The other is that it makes it very easy to compare multiple versions at once, A, B, C, D, to run a test with multiple versions. In A/B testing, you can do that as well, but multi-armed bandits will do it a lot more efficiently.
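One common way to get the behavior described here is Thompson sampling, sketched below in Python. This is an illustrative choice, not necessarily the algorithm the book presents; the function name `thompson_sampling` and the conversion rates are made up for the example, and the "true" rates are only known to the simulation, never to the bandit.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate a Bernoulli multi-armed bandit with Thompson sampling.

    true_rates: hypothetical success probability of each version (A, B, ...).
    Returns (pulls, successes) per version.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per version from its Beta posterior.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(n_arms)]
        # Show the version whose draw is highest; better versions win more often.
        arm = max(range(n_arms), key=lambda i: draws[i])
        pulls[arm] += 1
        # Observe a simulated outcome and update that version's counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls, successes


if __name__ == "__main__":
    pulls, successes = thompson_sampling([0.2, 0.5], rounds=2000, seed=1)
    print("traffic per version:", pulls)
    print("successes per version:", successes)
```

Early on the traffic is split roughly evenly because the posteriors overlap; as evidence accumulates, most of the traffic shifts to the better version. Adding versions C and D only means passing a longer `true_rates` list.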
[00:32:49] BH: I’ve always really been drawn to the use of that algorithm to feed my impatience in these types of processes. It has a major appeal in that sense. So can you tell us about Bayesian optimization? Obviously this is way bigger than the time I’m giving you, but why don’t you just tease out what you even mean by that and where that leads us from A/B testing?
[00:33:15] DS: Sure. So I’ll describe it the way I present it in the book. So in the book, I talk about multi-armed bandits and I talk about something called Response Surface Methodology, which I guess at this point is a little bit old fashioned, but it’s a nice idea for analyzing experiments where you have continuous parameters, right? So let’s say you have some weight in your system that can go from zero to one and you want to ask what value is the best value. What RSM does is it makes a model of the function of business metric versus that parameter. So you measure a few points and it interpolates between them and then says, “Oh, here’s where the best value of that parameter would have been.” So it interpolates between a few measurements to find the best value, which might not have been measured. Right? And you can do this with multiple parameters too. So you have a surface instead of just a curve that you’re interpolating, hence the name Response Surface Methodology. Now Bayesian optimization is kind of like a combination of Response Surface Methodology and multi-armed bandits. What it pulls from multi-armed bandits is this idea of actively and adaptively targeting whatever’s working better. And from Response Surface Methodology, it pulls in the idea of building a response surface. So a Bayesian optimizer will design an experiment for you. It automates the entire process as well. It’ll design an experiment. It’ll measure these few parameter values, then it’ll build a response surface. It’ll optimize over it and then it’ll say, “Okay, now you know what? Here’s the best place to measure.” And when it says, “Here’s the best place to measure next,” in your next experiment, it’s best in the sense that it’s balancing the idea of giving you as much business metric now as it can find, given the measurements you’ve taken, with the idea of saying, “You know what? 
If we explore a little bit, if we measure places on the response surface that we haven’t, maybe in the future we’ll get even better business metrics.” So exploration versus exploitation is the term they use for what you’re balancing. Another way to look at it is balancing reward now with reward later on. Right? So you invest in later reward by exploring now.
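The response-surface half of this description can be sketched for a single parameter in Python. This is a minimal illustration, assuming exactly three noise-free measurements and a parabola as the model; the function name `fit_parabola_peak` and the measurement values are hypothetical. A Bayesian optimizer would go further by modeling uncertainty and choosing the next measurement adaptively, which this sketch does not attempt.

```python
def fit_parabola_peak(points, lower=0.0, upper=1.0):
    """Fit a parabola through three (parameter, metric) measurements and
    return the parameter value in [lower, upper] where the fit is highest."""
    (x0, y0), (x1, y1), (x2, y2) = sorted(points)
    # Newton divided differences give the interpolating parabola:
    # p(x) = y0 + d01 * (x - x0) + curvature * (x - x0) * (x - x1)
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    curvature = (d12 - d01) / (x2 - x0)

    def fitted(x):
        return y0 + d01 * (x - x0) + curvature * (x - x0) * (x - x1)

    if curvature >= 0:
        # No interior maximum: the best value sits at one of the bounds.
        return lower if fitted(lower) >= fitted(upper) else upper
    # Setting p'(x) = 0 gives the vertex; clip it to the allowed range.
    vertex = (x0 + x1) / 2 - d01 / (2 * curvature)
    return min(max(vertex, lower), upper)


if __name__ == "__main__":
    # Hypothetical business metric measured at three values of a weight.
    measurements = [(0.2, 0.84), (0.5, 0.99), (0.8, 0.96)]
    print(fit_parabola_peak(measurements))
```

With these made-up measurements the fitted curve peaks near a weight of 0.6, a value that was never measured directly, which is the point of interpolating between measurements.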
[00:35:21] BH: To just wrap things up, what would your best advice be for like next steps or how to kind of go deeper or like build on what you’ve kind of like taken in today if somebody outside this domain has gotten curious?
[00:35:37] DS: Well, several chapters of the book are online. You can go and read some of it for free, or you could buy what’s called a MEAP and read the six chapters that are online so far. I think what I found when researching this book before writing it is that all of the information here is presented one way or another in free resources that you’ll find on the web. You find them in Coursera courses, you find them in blog posts and definitely in journal papers, I read a lot of journal papers to do this, and an occasional like 500-page specialty book on just one of the topics. So there’s definitely lots and lots of room to explore. A really nice place is blog posts from big tech companies where they like to brag about how great their system is. So like Google has the Vizier system. You can look up and see how they ran Bayesian optimization experiments on a chocolate chip cookie recipe in their kitchen. Right? And it’s interesting and it gives a good flavor, I don’t know, but a good sense of how the system works with something practical that’s easy to visualize, like baking and eating chocolate chip cookies. Or Facebook has got their Ax system, there’s papers and open-source software out there. That’s their Bayesian optimization system, like describing it, announcing it, kind of bragging about it. Twitter, LinkedIn, Uber, who am I leaving out? I think they’ve all got systems like this. And actually, if you find me on LinkedIn, I did a post where I just listed links to all of these different systems. I think that’s a great place to look. And if you just Google any one of these topics, you’re bound to find some educational material. But I hope at some point, you’ll want to come and read the book too, which brings it all together and presents it in a cohesive way. Like each of the chapters presents a new algorithm building upon the previous knowledge. I think it makes it easier to digest. So by the time you get to Bayesian optimization at the end, it’s a small step. 
You’re saying, “Oh, I already understand what a response surface is. I understand exploration versus exploitation. And now it’s just one small extra step to turn it into Bayesian optimization,” rather than just diving into Bayesian optimization cold, which can be daunting.
[00:37:40] BH: What advice would you give for a business leader who doesn’t understand this stuff well enough, may never read your book, but needs to hire for people who do this well, maybe make a choice on whether ultimately to make the trade-offs, like at a high level, to commit to this sort of stuff? Any high-level advice for that individual who will never have the full context, but shouldn’t get some of this stuff wrong at a high level?
[00:38:11] DS: Yeah. I think within the organization, if you’re the EM or a higher-up business manager, look for quick turnaround and results that you understand. Right? If someone’s running an A/B test, it doesn’t matter what their specialty is or what your specialty is, they should be able to communicate it to you. This is one of the great advantages, I think, of an A/B test: it’s simple to communicate to someone. You’re comparing two things, each of which is domain specific, right? I’m showing one out of two and you’re presenting them in terms of business objectives. Right? What you want to understand is, can the person do this reasonably quickly and get you some results that they can communicate to you and that you understand? Right? Because you’re going to have a lot of talks in the future. You want to go and do that. And you want to believe the results, right? Both intuitively and quantitatively. Right? If you’re not the quantitative person to analyze it, you trust someone with the analysis, but it should also be intuitive, right? It shouldn’t ever look mysterious. Right? If you find that you’re being presented with a result where someone says, “Hey, this works a lot better, you’ve got to trust me, I did the analysis, you wouldn’t understand because you’re not a math guy,” no. This stuff should be like hitting you in the face when you’re all done. That’s the whole point, to get the error bars so small that it doesn’t look mysterious anymore. That’s what experimentation is.
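The remark about error bars can be made concrete with a small Python sketch. It assumes, for illustration only, that both versions have the same per-observation standard deviation and the same number of observations, and it uses a rough two-standard-error rule of thumb; the function names and numbers are hypothetical.

```python
import math


def standard_error_of_difference(std_dev: float, n_per_version: int) -> float:
    """Standard error of (mean of B - mean of A) when both versions share the
    same per-observation standard deviation and sample size (an assumption)."""
    return math.sqrt(2.0 * std_dev ** 2 / n_per_version)


def looks_obvious(effect: float, std_dev: float, n_per_version: int,
                  z: float = 2.0) -> bool:
    """True when the measured effect is larger than z standard errors,
    a rough rule of thumb for a result that no longer looks mysterious."""
    return abs(effect) > z * standard_error_of_difference(std_dev, n_per_version)


if __name__ == "__main__":
    for n in (100, 10_000):
        se = standard_error_of_difference(1.0, n)
        print(n, round(se, 4), looks_obvious(0.1, 1.0, n))
```

The error bar shrinks with the square root of the sample size, so 100 times more data makes it 10 times smaller; an effect of 0.1 that is lost in the noise at 100 observations per version stands out clearly at 10,000.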
[00:39:32] BH: Awesome. Well, thank you so much for joining us. This was great.
[00:39:36] DS: Thanks for having me. This was a really good time.
[00:39:46] BH: This show is produced and mixed by Levi Sharpe. Editorial oversight by Jess Lee, Peter Frank, and Saron Yitbarek. Our theme song is by Slow Biz. If you have any questions or comments, email [email protected] and make sure to join us for our DevDiscuss Twitter chats every Tuesday at 9:00 PM US Eastern Time. Or if you want to start your own discussion, write a post on DEV using the #discuss tag. Please rate and subscribe to this show on Apple Podcasts.