Do the legality and ethics behind the creation of GitHub Copilot fly?
In this episode, we talk about Oculus' new experimental API, which blends virtual reality with your real surroundings, and we get into the sudden boom of QR codes and the security issues it brings. Then we talk about some potential ethical and legal issues regarding GitHub Copilot with Andres Guadamuz, Senior Lecturer in Intellectual Property Law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property. Then we speak with Laure Wynants, Assistant Professor at Maastricht University, Department of Epidemiology, about why hundreds of AI predictive models built to aid in the COVID-19 pandemic fell short.
Saron Yitbarek is the founder of Disco, host of the CodeNewbie podcast, and co-host of the base.cs podcast.
Josh Puetz is Principal Software Engineer at Forem.
Andres Guadamuz is Senior Lecturer in Intellectual Property Law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property. His main research areas are on artificial intelligence and copyright, open licensing, cryptocurrencies, and smart contracts. Andres has published two books, the most recent is "Networks, Complexity and Internet Regulation", and he regularly blogs at Technollama.co.uk.
Laure Wynants is interested in methods to handle heterogeneity between populations when developing and validating diagnostic and prognostic models, and in the utility of models in clinical practice. Her applied work includes models for gynecological cancers, hospital-acquired infections, and COVID-19. Since March 2020, she has led an international consortium to systematically review models for COVID-19, for which she has been awarded the Edmond Hustinx science prize. This review already has over 1300 citations and has been picked up by policymakers, including the European Commission and the WHO.
[00:00:10] SY: Welcome to DevNews, the news show for developers by developers, where we cover the latest in the world of tech. I’m Saron Yitbarek, Founder of Disco.
[00:00:19] JP: And I’m Josh Puetz, Principal Engineer at Forem.
[00:00:21] SY: This week, we’re talking about Oculus’ new experimental API, which blends virtual reality with your real surroundings, and we get into the sudden boom of QR codes and the security issues it brings.
[00:00:33] JP: Then we’ll talk about some potential ethical and legal issues regarding GitHub Copilot with Andres Guadamuz, Senior Lecturer in Intellectual Property Law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property.
[00:00:46] AG: It boils down to the question of whether or not it’s fair for all of this code to be used without people’s permission for something that they did not sign up to do.
[00:00:56] SY: And then we speak with Laure Wynants, Assistant Professor at Maastricht University, Department of Epidemiology, about why hundreds of AI predictive models built to aid in the COVID-19 pandemic fell short.
[00:01:09] LW: I think if the goal is really to help, to benefit patients, and to help clinical practice and this validation step that’s something that you cannot skip.
[00:01:21] SY: So if you’ve been listening to the show for a while, you know that I’m a big fan of virtual reality. I’ve had almost every headset except for the Index. It’s the one that I missed. That’s so expensive. But by the time it came out, the Oculus Quest was out. We kind of didn’t really need it. So we skipped that one. So I’m super excited about this story. So Oculus announced in a blog post that its Quest 2 VR system’s upcoming v31 SDK release will include an update which will allow users to incorporate their real-world surroundings into their virtual reality experience. Basically, augmented reality is coming to the virtual reality world, which is really cool. So this is going to be done by using something called the “Passthrough API,” which uses the headset’s sensors to scan in physical surroundings and allows users to customize how their surroundings appear, add filters, and render them onto in-game surfaces.
[00:02:17] JP: Whoa!
[00:02:18] SY: Yeah, pretty cool. In the announcement, the company says that this will enable users to collaborate remotely with friends and coworkers, engage with things in the real world, like their pets, while they’re wearing the headset. I assume babies might be useful too. You want to keep track of your baby while you’re doing Beat Saber.
[00:02:38] JP: Don’t Beat Saber while you’re home with a baby.
[00:02:40] SY: Beat Saber and baby, new VR game. “Create games that blend the excitement of a virtual world into the comfort and familiarity of the real world like zombies hiding in your living room.” The Passthrough API will first be available for Unity developers. So Josh, what do you think about all this?
[00:02:58] JP: This is really interesting. I have to give props to Oculus. They keep adding new features to this headset. I think it’d be really easy for a manufacturer to be like, “Yeah, we’re going to do this augmented reality thing.” They added hand tracking a couple of months ago. It’d be really easy for them to say, “We’ll do that in the Quest 3.” You’ll have to buy a whole new headset, but they keep adding stuff to this headset, which is really, really interesting to me. And I also think it’s really interesting that Oculus is approaching this. They’re like stepping down. They have the full VR headset outfit experience and now they’re blending in augmented reality where a lot of other developers and a lot of other devices are starting with augmented reality and that’s all they’re going to do or they might try to layer more and more stuff on there. This is the only developer I know of that’s like saying, “Oh, yeah, we have a VR system, but people are interested in augmented reality. So why don’t we work backwards a little bit and use the cameras that are on the device and bring in the real world?” I think it’s really interesting. Have you seen the demo videos of this stuff?
[00:03:59] SY: No, I don’t think I have. Actually, how are they?
[00:04:01] JP: Oh, it’s really interesting. There’s a user and they have like a little console and they’re in VR and they’re moving a slider all with hand tracking, of course. It’s so cool. They’re moving a little slider. And as they move the slider from a hundred percent back to the left to zero, the VR world fades away and the real world fades in. It’s very strange. It’s really cool.
[00:04:22] SY: So one thing we have to remember is that this isn’t really Oculus calling the shots. It’s Facebook calling the shots.
[00:04:30] JP: Okay. Yeah.
[00:04:31] SY: So that’s one thing to remember. So then they’re like, “I love VR. I think it’s awesome.” To me, it is absolutely fascinating how much I believe in the real… I told you about my Everest experience, right?
[00:04:46] JP: Yes.
[00:04:46] SY: When I was on top of Everest. Yeah.
[00:04:48] JP: Amazing.
[00:04:49] SY: I legit cried. I believed that I was on top of a freaking mountain watching a time-lapse sunset. That was very real to me. I’m a big believer. However, I’m also kind of like, “All right, why are we doing all this?” You know what I mean? I’m still a little suspicious of the whole thing.
[00:05:04] JP: Yeah.
[00:05:04] SY: So at the end of the day, Facebook’s agenda, as far as I can tell, as far as I’ve read up about it, is they believe that the future of social media and interactions and all that is going to be some form of VR.
[00:05:16] JP: Right.
[00:05:17] SY: So they want to create a system, an ecosystem, a world where you can hang out with friends in a very realistic, authentic, emotionally engaging way. They want you to be able to do your work. Don’t forget, Facebook has a work product. I always forget this. Facebook has a work product. They have a suite of tools for work.
[00:05:34] JP: Facebook for work is the worst idea ever.
[00:05:37] SY: But it’s there. So they can make that work. For remote, especially now that remote culture has totally sped up, if they can create real time collaboration and make it comfortable and make it real, there was a demo that came out a while ago, I think maybe a year or two ago, that was basically hand tracking, but it was like face tracking. Have you seen this?
[00:05:58] JP: Yes. I have. It’s crazy.
[00:05:59] SY: Yes. It’s insane. I mean, you can see the face, the head, and its movements and it looks really realistic and it’s really creepy, but that’s the future that they’re trying to push us towards. So this kind of falls right in line, bringing the real-life space into VR and making people feel more comfortable with that transition, making people feel like it’s all one big reality. It’s exciting, but also to me it feels very in line with the vision. You know?
[00:06:26] JP: Yeah. I forgot the Facebook thing. Oh!
[00:06:28] SY: Sorry.
[00:06:28] JP: No, that’s fair. It’s something to remember about the Oculus. There were a bunch of stories that went around about a week, week and a half ago where Mark Zuckerberg was going around to different media outlets, touting their plan for the future, which is something they called “The Metaverse” and they want to create experiences and a whole Facebook thing that spans physical and virtual and just bringing up the Facebook name reminded me of that. And it’s like, “Oh, yeah! There’s going to be a feed in here and there’s going to be my uncle’s updates.” And maybe it’s not going to be an actual Facebook feed. But yeah, I think this might be one reason Oculus is driving so hard on these augmented reality things: it meets up with the Metaverse goals of Facebook, which is kind of a bummer, I think, frankly. I would love this to stay a gaming and experience device, but you have to take the whole part and parcel when you talk about this device.
[00:07:22] SY: Yeah. I mean, the future of virtual reality to me is really interesting because when I think about what Facebook is, as you said, I mean, I’m not on Facebook these days, but Facebook is basically a feed, right? As far as I know. It’s a long list of things that are happening, things that people are posting, right? That’s basically what it is. How does that translate to a virtual reality? How does that translate into a 3D? So I can’t imagine you’re walking around and there’s literally a virtual list of things your friends will do. You know what I mean? That can’t be it.
[00:07:50] JP: That would feel like the worst Metaverse ever.
[00:07:53] SY: That’s a terrible Metaverse. No one would want to be in that. So what is it? I’m both afraid and insanely curious to see. How does that translate? How is the feed? How does requesting a friend? What does requesting a friend look like? The only example I can think of is, have you played Rec Room?
[00:08:12] JP: Yes, I have, actually.
[00:08:13] SY: Yeah. So Rec Room I’m pretty sure is the number one most downloaded VR game. It’s really interesting. They make no money, but they are valued at well over a billion dollars, a billion-dollar company, of course.
[00:08:23] JP: It’s like Roblox in terms of a space to hang out and play stuff, right?
[00:08:28] SY: Basically you're in this huge recreation center. There’s all these different games. There’s like bow and arrow. There is paintball. There’s basketball. There’s just a bunch of different rooms you can go into. You can go in with strangers. You can bring friends. You just play games. I think they’ve done a pretty good job of creating really good social cues. I actually interviewed the head of community there some years ago on moderation tools and how do you report people and how do you create a safe space, how do you block people. I mean, there’s all these things like ghosting and there’s putting people on mute. There’s all these tools. So I think that Rec Room is the closest I can get to imagining what social looks like when it comes to strangers, when it comes to friends in a VR world.
[00:09:12] JP: Right.
[00:09:12] SY: And that feels very different from what Facebook is. You know?
[00:09:16] JP: Right, but I think that’s kind of what they’re going for. I definitely don’t think they’re envisioning, I hope to God, they’re not envisioning a 3D feed just hovering in front of your face. That’s pretty dystopian, even for Facebook. But yeah, the idea is that, like Rec Room, it’s a space where a community can form and people play games. It’s all of course oriented towards gaming right now. And I think Facebook is seeing the future of not just like VR, but online communities and connections. They want to own that space. They don’t want another company like Rec Room to come in and be WorkRoom.
[00:09:50] SY: WorkRoom.
[00:09:51] JP: WorkRoom, the least popular VR game ever.
[00:09:53] SY: Worst Room ever. Right?
[00:09:55] JP: Or a hangout room or social room or anything like that. They want to be the ones to provide that. I think maybe that’s what they’re going for.
[00:10:05] SY: Yeah. I would not be surprised if Facebook acquired Rec Room for that reason.
[00:10:10] JP: I can see that.
[00:10:11] SY: You know what I mean, like their V1 of Facebook on VR, like kind of using that as the prototype, the starting point. I totally see that happening.
[00:10:19] JP: So they have this thing called Facebook Horizon, which I haven’t gotten to try out because I think it’s still in like invite-only phase right now, but that’s kind of their hang out, talk with people space. And what’s interesting about that is it’s both in VR and on 2D screens as well.
[00:10:37] SY: Oh, I didn’t know there’s a 2D component. Interesting.
[00:10:39] JP: Yeah. Yeah. They published it both on the Quest and Windows. So I don’t know. Maybe that’s what they’re going for. I think their sense is there’s something important happening here and the best way, frankly, for them to monetize and control it is to get in early.
[00:10:54] SY: Yeah, absolutely.
[00:11:05] JP: So one thing I immediately started to notice after coming out of my post vaccination social cocoon is the appearance of QR codes everywhere I went, pharmacies, restaurants, museums. There were no more physical menus at restaurants, always just a QR code on the table that would direct you to a website or maybe a Dropbox folder with a PDF of the menu. It seemed like big QR codes sent out an army of lobbyists overnight during the pandemic while we all weren’t looking. But in reality, it’s most likely due to the desire to have more socially distant interactions. And even though QR codes were first invented in 1994…
[00:11:40] SY: Oh, wow!
[00:11:41] JP: Yeah. And it looks like this latest proliferation might actually stick in the US. According to the National Restaurant Association, half of all full-service restaurants in the United States have added QR code menus.
[00:11:53] SY: Wow! That’s a lot.
[00:11:54] JP: Yeah. That’s really shocking to me. Okay. So you’re ready for the bummer?
[00:11:58] SY: Yes.
[00:11:58] JP: Because it wouldn’t be DevNews without a bummer. While QR codes might be convenient and there’s definitely something to be said about minimizing paper waste by just letting people download your menus and other documents digitally, the sudden proliferation of the technology doesn’t come without some serious security concerns. A recent New York Times story pointed out that QR codes don’t just allow someone to easily access a website link. They can also be used by businesses to track and amass data analytics about those that scan them. Companies with names like Cheqout and Mr. Yum, my favorite…
[00:12:35] SY: Yes.
[00:12:36] JP: Those companies create products which allow restaurants to create QR code menus that customers can order through, but they also could let those companies collect things like a customer’s name, phone number, payment information, and order history. The concern is that by being routed to a place to like put in your order, they know you’re there. They’ve got your phone information and they can start to collect more information about you than if you were to have just ordered off the menu. I do suspect it’s very similar to if you place an online order with a place.
[00:13:05] SY: Right. Right.
[00:13:07] JP: So it’s not so far-fetched to see how this information could be used for really targeted marketing. There’s another really interesting concern brought up and that’s hackers potentially creating their own QR codes and sticking them over the legitimate QR codes in restaurants to easily hack into your phone.
[00:13:24] SY: Oh my gosh!
[00:13:25] JP: Can you imagine? This totally makes sense to me though. It’s basically phishing, but in QR code form.
[00:13:28] SY: It does make sense. That really is phishing. Oh my gosh! That is phishing.
[00:13:32] JP: So say you scan a QR code at a bar. I think bars really seem susceptible to this. You scan a code in a bar and it takes you to a Dropbox with something that looks like the restaurant’s menu file, but it’s not, or maybe it’s a web form to enter your order, but it’s not. Yeah. So that’s the downside of QR codes.
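One way to picture a defense against the swapped-sticker attack described above is a scanner that refuses any link whose host is not on a known list. The following is a minimal sketch, not how any phone camera or menu vendor actually works; the host name, the `TRUSTED_HOSTS` set, and the `is_trusted_qr_url` function are all hypothetical names made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the scanner already knows belong to the venue.
TRUSTED_HOSTS = {"menu.example-restaurant.com"}

def is_trusted_qr_url(url, trusted_hosts=TRUSTED_HOSTS):
    """Return True only for HTTPS links whose exact host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # Compare the whole host name, so a look-alike such as
    # "menu.example-restaurant.com.evil.test" is rejected.
    return parsed.hostname in trusted_hosts
```

Comparing the full host name rather than checking whether the URL merely starts with a trusted string is the design choice that matters here, since a pasted-over code would typically point at a look-alike address.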
[00:13:52] SY: It has to be the easiest phishing attack you could… because you’re going to trust the restaurant, right? Obviously, who’s going to go to a restaurant bar, have the waiter or the bartender say, “Here’s the menu,” and then think twice about, “Oh, are you phishing me?” You know what I mean?
[00:14:09] JP: Right.
[00:14:09] SY: That is fascinating. That is absolutely fascinating. Okay.
[00:14:13] JP: Have you noticed QR codes more in your life?
[00:14:15] SY: Oh, yeah. I was really confused because the first time I went to a restaurant after my quarantine wasn’t that long ago, a couple months, and after I got fully vaccinated. And when I did that, I was like waiting at the table for a really long time thinking the host was going to come back with our menus and then she never did. And I was like, “This is terrible service.” And finally, I flagged someone down. I was like, “Can we get the freaking menus?” I didn’t say it like that. I was very polite. And they were like, “Yeah, it’s just the QR code.” And I was like, “What?” And I did it. And I was like, “Whoa! This is insane!” So yeah. I thought it was pretty cool.
[00:14:49] JP: I’ve been very surprised. I actually was thinking about this article and have been trying to think about all the surprising places I found QR codes in my life lately, and it’s a lot. They really are just everywhere, it seems like.
[00:15:02] SY: Where else have you seen them?
[00:15:03] JP: Well, I’ve seen them in stores and non-ironically as a, “Scan this to go to our website to find out more. Scan this for some more information about an event.” I’ve been seeing them on event posters for different festivals and parades around town. There’ll be a little QR code in the corner. I feel like maybe even three or five years ago, it was kind of a joke, right? QR codes were a joke. Who’s actually going to scan this? You’d see one on a bus or something. You’d be like, “Oh, that’s cute. They think people are going to actually scan it.” And something flipped. This particular New York Times article posits that it’s the pandemic, that it’s a convenient way to keep social distance but still get that information.
[00:15:45] SY: I believe that. Yeah.
[00:15:46] JP: They also pointed out a couple of years ago, Android and iOS, both added native QR code scanning. So all you have to do now is open up your camera app.
[00:15:53] SY: Yep.
[00:15:53] JP: Point at the QR code. There it is. Yeah. Oh, okay. That’s so easy. But I mean, my relatives are using QR codes. It’s bonkers.
[00:16:00] SY: Yeah. Recently I noticed there was a QR code on my water bottle. I don’t know what the brand was. Maybe it was Dasani or something. There’s a QR code on my water bottle. And I don’t know if they just added it for pandemic reasons. I don’t really know why they’d even be relevant. Maybe they just always had QR codes and I never noticed, but I used it. I scanned it and I was like, “Oh, I wonder what this is.” I mean, it was pretty cheesy. It was like this fun, little sort of interactive story about how water is bottled or I don’t know or something. It was cute. And I was like, “Oh, this is nice.” I do not think I would have scanned it if I wasn’t used to seeing QR codes around. You know what I mean?
[00:16:39] JP: Interesting. Yeah.
[00:16:39] SY: I think I would normally have just ignored it and be like, “What the hell is this?” But at this time, I was like, “Oh, well, the last time I used a QR code I got shrimp poppers. Maybe something fun will happen now.” And I scanned it and I was like, “Oh, this is cute. It’s fine.”
[00:16:53] JP: But you can certainly see how it’s like an attack vector. It’s just so juicy. I mean, I think people now are just pulling out their phones and scanning QR codes willy-nilly. It would be an interesting experiment to put up a random QR code, put it on a light post somewhere and see how many people would scan it.
[00:17:07] SY: You scan it. And then when you scan it, it should say, “This could have been a virus. You should not have scanned. Do better.”
[00:17:12] JP: Like, “What’s wrong with you, just scanning random QR codes?”
[00:17:13] SY: Do better.
[00:17:14] JP: Do better. The only other thing I will say is the other takeaway I had in this article, do you know what QR stands for in the QR code?
[00:17:20] SY: I have no idea.
[00:17:22] JP: Quick response.
[00:17:24] SY: What?
[00:17:24] JP: Yeah.
[00:17:25] SY: That doesn’t make sense. I’m not responding. I’m scanning. It should be QS code, Quick Scan.
[00:17:29] JP: I don’t know what I thought it stood for.
[00:17:31] SY: I don’t agree with that name.
[00:17:32] JP: Quick response. Today you learned.
[00:17:35] SY: Now we’re going to get into something that a lot of you developers out there might’ve heard of, GitHub Copilot. So GitHub Copilot is GitHub’s AI pair programmer that helps you correct your code, like an autocorrect suggestion, which was released in June of this year. And after a little over a month of people using this, there’s no doubt about how powerful this tool is. However, what there is doubt and concern about is some potential ethical and legal issues that have been brought up about the data that was used to train its machine learning algorithm. Coming up next, we dive deeper into this topic with Andres Guadamuz, Senior Lecturer in Intellectual Property Law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property after this.
[00:18:43] SY: Joining us is Andres Guadamuz, Senior Lecturer in Intellectual Property Law at the University of Sussex and the Editor in Chief of the Journal of World Intellectual Property. Thank you so much for being here.
[00:18:55] AG: Thank you for having me.
[00:18:56] SY: So tell us about your research and career background.
[00:18:59] AG: I’ve been an academic in the topic of intellectual property for over 20 years now, here in the UK.
[00:19:07] SY: Wow!
[00:19:07] AG: And during that time, I’ve been writing mostly on the interface between technology and intellectual property, particularly copyright. I write about everything from open source software to artificial intelligence. Recently, I’ve gotten very interested in NFTs and cryptocurrencies.
[00:19:26] JP: For anybody in our audience that might not have heard, could you give us a brief overview of Copilot?
[00:19:31] AG: So Copilot is an open source project that was created by GitHub. I’m calling it an open source project because it is mostly trained on open source software. GitHub is a software repository that hosts millions and millions of projects, about 190 million repositories. And they are owned now by Microsoft, which in the open source community hasn’t been particularly welcome. Now Copilot is a program that was created using the OpenAI implementation of GPT-3. GPT-3 is, roughly speaking, a text recognition machine learning program that has been taught to guess what is going to be the next word in a sentence, based on the millions or even billions of inputs, I think, that it has been taught with. Now this works quite interestingly when it comes to text. So generally, it guesses quite well what is going to be the next word in a sentence based on all of its training. What’s interesting about the Copilot system is that they have taken this same idea of training on text and replaced text with code. So what Copilot does, having been trained with all these millions of lines of code, is guess quite accurately what the next line of code should be.
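The "guess the next word" idea described above can be illustrated with a toy model that simply counts which token most often follows another. This is only a sketch of the general principle under heavy simplification: it is not how GPT-3 or Copilot is implemented (those are large neural networks), and the corpus and the function names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_model(lines):
    """Count, for each token, which tokens follow it in the training lines."""
    following = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, token):
    """Return the most frequently seen follower of `token`, or None if unseen."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]

# Hypothetical toy corpus standing in for the code used as training data.
corpus = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( 10 ) :",
]
model = train_bigram_model(corpus)
```

With this corpus, asking the model what follows `in` returns `range`, because that pairing appears twice and `items` only once. Swapping sentences for lines of code changes nothing in the counting, which is the substitution described in the answer above.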
[00:21:05] SY: So let’s get into the legality of all this. The short version, is it legal to train your machine learning algorithm on any publicly available code without the consent of the developers?
[00:21:18] AG: Roughly speaking, the answer is yes. I’m sorry, I’m going to give a very honest answer to that.
[00:21:23] JP: Oh, I love it, I love it. Yes!
[00:21:25] SY: Please.
[00:21:26] AG: It depends. Now in this instance, it depends a lot on the jurisdiction. GitHub is international. So it always helps to have a very international idea about this. From an American perspective and the US perspective, this has now been recognized in a few cases, particularly the Authors Guild versus Google, better known as the Google Books case, in which the Second Circuit found that this is fair use. So it’s generally acceptable that you can train a machine learning algorithm with text or with data. And this is considered fair use, even if you take snippets from texts. In this case it was books, they took the entire snippets from a book, and it was considered to be fair use. So the training itself and the taking of snippets from books is considered fair use in the US. In places like the UK, we have had an exception to copyright, which is specifically created to allow what we call Text and Data Mining, which, roughly speaking, allows people to legally train machine learning algorithms with large amounts of text and data. And this has been in place in the UK since 2013. In the European Union in general, now that we are not part of the European Union, thanks Brexit, the European Union has had a directive since 2019 that also allows for text and data mining, and this can be for lots of reasons. It can be for commercial reasons. It can also be for other reasons. So generally, it’s now accepted internationally that text and data mining is going to be allowed. If we bring this forward a little bit, that means that the training of the artificial intelligence itself is acceptable. We have now legislation across the world, case law across the world that allows trainers or researchers to train their machine learning algorithms using someone else’s data and text. So it’s legal in many countries.
[00:23:41] JP: So even though a lot of the code on GitHub is open sourced, different projects have different licenses. Are you aware of any licenses that would let a project be open source, but might prohibit the use of it in something like training in AI?
[00:23:59] AG: Here, this can get a little bit complicated and depends on the licenses, of course. Now generally, 30% of all code in GitHub is under the more restrictive licenses. The most restrictive is the GPL, the General Public License. This is known as a copyleft license. Copyleft is not the opposite of copyright.
[00:24:23] JP: That’s what I thought.
[00:24:25] SY: It felt a little tricky.
[00:24:26] AG: It is a little tricky. Sometimes people say, “Okay, copyright, copyleft, it’s all for lefty communists and things like that.” It’s not at all. Interestingly, it’s just a license that we like to call a hack. It’s a legal hack in which by using this license, you are creating an ethical mandate on the people that are reusing the software. And this means that, for example, I take some code and I’m using software that is under GPL. So I have an obligation of not only re-sharing the fruits of my changes to the code, but I have to share it with the same terms and conditions of the license with which I received it. So to give an example, I’m coding a new version of a Linux library, and I’m going to receive this under GPL. I can make all the changes that I want and I can use this code and I can reuse it, but I’m going to have to make this available with the same terms and conditions. In other words, I have to make it available under the GPL. And that’s generally what we understand as copyleft. Now this license has imposed this ethical restriction. It’s ethics, but it’s also legal, and it’s been tested in court several times, and people have to comply with the terms of this license or otherwise they are infringing copyright. Some people are arguing that under the terms of the GPL, if the machine learning was trained using code that is under GPL, this would preclude anyone from sharing the code.
[00:26:07] SY: Interesting.
[00:26:09] AG: I happen to disagree with that argument because I don’t think that the terms of the GPL are broken by things like Copilot. The idea is reading the license very restrictively, in my opinion, because, when you boil everything away, a license gives me permission to perform something that I’m otherwise not allowed to do. And this is exactly what a license is. If I have a license to play music, I can play music in a bar. Now this license says, “If you are not sharing the products of your programming or whatever you have done, whatever modifications you have created with my program, I can take you to court and I can sue you for infringement of copyright.” My argument is that actually it doesn’t meet the requirements of copyright infringement. If I was GitHub or if I was Microsoft, I would not require a license because what I’m doing is lawful in the first place. In other words, I can do as much machine learning as I want because the law and the case law tell us that I can, because it’s fair use.
[00:27:23] SY: So are there any past examples of this, past examples that support either the for or against side of this argument?
[00:27:32] AG: We are in completely uncharted territory, honestly.
[00:27:35] SY: Okay.
[00:27:36] AG: There have been lots of cases on the implementation of the GPL and similar open source software licenses. So this is widely recognized in cases both in the US and in Europe. We, however, have nothing related to artificial intelligence itself in this way. We haven’t had a case specifically dealing with the training of an artificial intelligence or a machine learning program that has been using open source software. And therefore, all we can do right now is to have fights on the internet about it. Some people are adamant that this is clearly a breach of the license. It’s copyright infringement. They’re planning already to bring civil suits against Microsoft.
[00:28:30] SY: Wow!
[00:28:30] JP: So outside of the legal question, would you say that the way GitHub went about training this model could be considered ethical?
[00:28:40] AG: Yeah. That’s a different question. Isn’t it? Now from an ethical perspective, I can see their point. I completely see why people that have been using GitHub and uploading their software and sharing it to the world are bothered and are concerned about the outputs that are going to be used by large corporations, sometimes people that they don’t like. I can see the ethical argument there. And the real ethical argument really is it boils down to the question of whether or not it’s fair for all of this code to be used without people’s permission for something that they did not sign up to do. So when people are using open source software and they are sharing their works in a repository of open source software, what they’re doing is sort of trusting that the community is going to do the right thing. They trust the community as they trust the license so they know, “Okay, someone cannot misuse my code because I am protected by a license and this license stands up in court. And if someone does something like that, I can bring them to court.” And this breaks down because the training of the machine learning algorithms in particular to create the code that’s produced by Copilot is something that was not foreseen by anyone. And maybe we have to change the licenses to not allow something like this. Now I have a bit of an ethical problem with that. I can see why people have ethical objections. My problem would be that people are getting too hung up on the fact that this is Microsoft.
[00:30:23] JP: Right.
[00:30:24] AG: And maybe having a system that can generate code in this manner is actually really useful, and it may lower the barriers to entry for a lot of small and medium enterprises, which will still have to hire programmers, but may not have to hire as many or work as hard to create some working software.
[00:30:47] SY: Yeah. That was going to be my other question. It sounds like a very useful tool that would benefit everybody, open source people, proprietary people, kind of everyone who’s doing any kind of code. And so from where you sit, do you see there being an actual issue or problem that would disadvantage the open source repos that were being used to build this feature? Or do you feel like it’s kind of a principle thing where it’s Microsoft and I don’t like the fact that they didn’t ask me? Is it a principle or is there actually some risk or danger or some downside for these open source advocates?
[00:31:23] AG: I think it’s mostly that people object to who is doing the training. I think that OpenAI, even though it’s sort of an open project, it has a lot of large corporations behind it. I think that what they’re doing is fantastic. I have seen GPT-3 in action. I have seen all of these projects. I really think it’s advancing human knowledge. Now that is my personal opinion, of course. I can see why people object to all of these large corporations taking advantage of the small guy and the small programmers that toil away sometimes over weekends in their own time and they produce all this amazing code that is now being seen as being appropriated. That is sort of the negative part of it. The positive is that it’s a public tool. Everyone can use it. I think that the potential for it being a useful tool for everyone is great. I can’t wait to use it myself.
[00:32:26] JP: I played around with it a little bit. It’s kind of shocking what it can do.
[00:32:31] AG: Yeah. Yeah. There are valid concerns, on privacy for example. It’s been shown that apparently it’s leaking people’s keys, private keys. So yeah, there are lots of API keys that are now being reproduced, apparently. I’m guessing that there is going to have to be some cleaning of the code at some point. Also, some people have been arguing that maybe there is going to be some copyright infringement because under some circumstances it reproduces code that has been created by someone else. So people have argued that there is a copyright infringement there.
[00:33:08] JP: So the ethical questions are wide and I think with ethical questions, there’s never really a great resolution to them. But with legal questions, we sometimes get resolution to them. What would it take for the legal arguments or the legal concerns involved with Copilot to be resolved? Is it going to take a court case? Will it take a lawsuit? Do you think we will reach any kind of legal decision or consensus and what would that look like?
[00:33:39] AG: That’s a tough one because it may depend on whether or not people are angry enough to pursue a lawsuit. There has been a lot of grumbling on Twitter about it and people have been saying, “Oh, I'm going to sue. We’re going to start a GoFundMe just to try to fund a lawsuit.” And it’s going to be a class action lawsuit, supposedly. No one has fired their first shot. I think the reason for that is because the legal argument is going to be very thin. I’m guessing that open source software advocates that are thinking about suing may be listening to experts and legal experts and even people that are working on their projects that have a lot of legal experience and they are probably telling them the same thing that I’m telling you right now, which is there is no legal case. And if you bring up a lawsuit, you have to prepare to lose a lot of money. Lawsuits are expensive. Now, unless it’s something like one of those lawsuits that gets funded by someone else, thinking of the Gawker-Peter Thiel type of lawsuit where you have someone funding a lawsuit, or the very famous Oracle versus Google, in which Oracle just kept going and going and going because they had a vendetta, let’s put it that way, against Google, unless something like that happens, I don’t see this ever getting to a court. Open source development is very famous for not going to court. There are very few cases in open source development and the reason for this is generally the community is very, very good at policing itself.
[00:35:19] SY: Well, thank you so much for joining us.
[00:35:21] AG: Thanks very much.
[00:35:32] SY: Coming up next, we talk to Laure Wynants, Assistant Professor at Maastricht University, Department of Epidemiology, about why hundreds of AI predictive models built to aid in the COVID-19 pandemic fell short, after this.
[00:35:57] SY: Joining us is Laure Wynants, Assistant Professor at Maastricht University, Department of Epidemiology. Thank you so much for being here.
[00:36:05] LW: Thanks for having me.
[00:36:06] SY: So tell us a bit about your expertise and research background.
[00:36:10] LW: I was trained as a biostatistician, but for the last few years, I’ve been working at the Department of Epidemiology. And the common thread throughout my research at first has been prediction modeling. So clinical risk prediction modeling. So making models that can assist a doctor in making a diagnosis or prognosis for individual patients.
[00:36:31] JP: So throughout the pandemic, we’ve seen a slew of artificial intelligence projects, hoping to use machine learning to help in this pandemic in various ways. You recently came out with a research paper, looking at the validity of hundreds of different predictive models for Coronavirus. Can you walk us through this research and tell us how it was done?
[00:36:51] LW: So it’s a systematic review, which means that we, in a systematic way, delved into the scientific publications on COVID-19 prediction models. So any diagnostic or prognostic model, including but not limited to machine learning models. So we searched all the literature and then we reviewed relevant papers with a team of experts on the topic and we did standardized, robust risk-of-bias assessments, which give us an indication of the quality of those studies.
[00:37:25] SY: So what were some of your major results? What did you find out?
[00:37:28] LW: The first finding was that there was an overwhelming amount of research on the topic. So we identified over 200 models in the first few months of the pandemic and now that has more than doubled. So we’re still working on the last update. So hundreds of models, no one has an overview and that’s just what’s published in the scientific journals. So we know that also some data science companies have developed algorithms, some doctors or hospitals have developed them in-house and haven’t published them. So there’s more out there. So the main finding, the first finding is that there’s just a tsunami of models for diagnostic and prognostic decision-making out there.
[00:38:11] JP: So you looked at models for diagnosing coronavirus for patients with suspected infection, for prognosis of patients with COVID-19 and for detecting people in the general population at increased risk of COVID-19 infection or being admitted to a hospital with the disease. Were any of these models more effective than the others?
[00:38:32] LW: What’s interesting is that if you define how effective they are in terms of how well they predict, each class of models had excellent predictive performance. So if you quantify that as a C statistic or the area under the receiver operating characteristic curve, then it seems like you have perfect models that can perfectly distinguish between who will get COVID-19 and who will not, or who will die from COVID-19 and who will not, or who has the disease and who doesn’t. But if you delve a little bit deeper, you see that there are severe issues with the quality of how the analysis was done, how the research was set up and how the models were validated or tested. So in terms of effectiveness, we are not sure if we can rely on the reported predictive performance. We think that that is a very severe overestimation.
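For readers unfamiliar with the C statistic mentioned here, the following is a minimal Python sketch of how it can be computed for a binary outcome, where it equals the area under the ROC curve. The function name `c_statistic` and the example risks and outcomes are hypothetical illustrations, not data or code from the review.

```python
from itertools import product


def c_statistic(risks, outcomes):
    """Fraction of (event, non-event) pairs in which the patient with the
    event received the higher predicted risk. Ties count as half."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for event_risk, non_event_risk in product(events, non_events):
        if event_risk > non_event_risk:
            concordant += 1.0
        elif event_risk == non_event_risk:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical predicted risks and observed outcomes (1 = event occurred).
example_risks = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
example_outcomes = [1, 1, 1, 0, 0, 0]
print(c_statistic(example_risks, example_outcomes))
```

A value of 1.0 means every patient with the event was ranked above every patient without it, and 0.5 is what random guessing would give. The number only describes the dataset it was computed on, which is why a high reported value can still be an overestimate.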
[00:39:29] SY: So as technologists, we have tons of tools at our disposal of things that we can use to solve different problems and it feels like a lot of technologists chose machine learning. Why machine learning for something to help this pandemic?
[00:39:42] LW: Well, I think the idea of trying to learn from data is a very obvious choice in this situation. If I think back about the situation when we started this review, which was early March, 2020, doctors in my country and in my neighboring countries, they didn’t know very much about this disease. So they had a lot of questions on how to manage these patients. There were hardly any cases at that moment in our country. So it wasn’t such a crazy idea to see if there were algorithms available that learned from data or that could learn from the available data. For example, data from China. So it would have been an obvious choice.
[00:40:26] JP: So what would you say is the biggest challenge for technologists trying to use these models? And does this show a wider issue or limitation when it comes to machine learning and medicine?
[00:40:40] LW: I think the main challenge is that machine learners, from my review, often seem to be very focused on fitting an algorithm to data and trying to get the best performance in that dataset, but that doesn’t say a lot about how well that model will perform when you use it in actual patients. So there’s many more things to consider. For example, the patients for which you have data, are these actually representative of the type of patients that you would get in the clinics during this crisis? And that was a major issue in a lot of the models that we saw. The second is how the predictors that are inputs for the models are defined. Are they available at the moment that the model is going to be used, for example? That was an issue. Or is the quality of those measurements as good as the quality of the data? That was also an issue. Then there is how the output is defined. So what is it actually predicting? So even that was often very vaguely defined. You didn’t know what you were predicting. For example, when you predicted mortality, not everybody was followed up for the same amount of time. So for one patient, you would be predicting mortality within a day. For another patient, you had follow-up data, for example, for a month because this person is very sick and stayed in the hospital for a very long time. So that was often not clear, that prediction horizon. That is another aspect of outcome definition. The analysis itself often seemed to be very much optimized for the dataset at hand, and it appears to fail. There’s a lot of overfitting going on. So people are asking a lot from datasets that are very limited in size, and you see that when you externally validate the model. When you get a really independent dataset, collected from, for example, another hospital or by another researcher, we see huge drops in performance, even to the extent that a machine learning model that incorporates age performs worse on an external dataset than age alone, which is crazy if you think about it. And that indicates that there was just a lot of overfitting going on, I think.
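The drop in performance on external data can be illustrated with a deliberately contrived Python sketch. It assumes a one-nearest-neighbour "model" that memorizes a tiny development set, using age plus a hypothetical, uninformative `marker` value, and compares it with ranking patients by age alone. All patients, numbers, and function names are made up for illustration and do not come from any study in the review.

```python
def c_statistic(scores, outcomes):
    """Fraction of (event, non-event) pairs ranked correctly; ties count half."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    total = 0.0
    for e in events:
        for n in non_events:
            total += 1.0 if e > n else 0.5 if e == n else 0.0
    return total / (len(events) * len(non_events))


def nearest_neighbour_predict(development, age, marker):
    """Return the outcome of the closest development patient.
    Features are left unscaled, so the noisy marker dominates the distance."""
    closest = min(
        development,
        key=lambda p: (p[0] - age) ** 2 + (p[1] - marker) ** 2,
    )
    return closest[2]


def evaluate(development, data):
    """C statistic of the memorizing model and of age alone on `data`."""
    outcomes = [died for _, _, died in data]
    model_scores = [nearest_neighbour_predict(development, a, m) for a, m, _ in data]
    age_scores = [a for a, _, _ in data]
    return c_statistic(model_scores, outcomes), c_statistic(age_scores, outcomes)


# Hypothetical patients: (age, marker, died).
development = [(82, 10, 1), (75, 400, 1), (68, 900, 1),
               (70, 150, 0), (45, 650, 0), (30, 300, 0)]
external = [(85, 160, 1), (78, 640, 1), (72, 20, 1),
            (40, 390, 0), (35, 880, 0), (74, 310, 0)]

dev_model, dev_age = evaluate(development, development)
ext_model, ext_age = evaluate(development, external)
print("development: model", dev_model, "age alone", dev_age)
print("external:    model", ext_model, "age alone", ext_age)
```

In this toy setup the memorizing model looks perfect on the data it was developed on, because every patient is their own nearest neighbour, but it ranks the external patients worse than age alone does. Real models fail in less extreme ways, but the mechanism is the same one described above.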
[00:42:53] SY: So what would have helped this technology be a better fit for clinical use?
[00:42:58] LW: I think keeping in mind what you are developing it for, keeping the application in mind, keeping the end user in mind, keeping in mind that it’s going to be used in real patients if you are successful. So I think we could have done better if we had access to high quality data, if we had collaborated. We saw a lot of highly advanced technical models, but with a clear mismatch with the clinical situation they were intended to be applied in. The reverse also happens. So we saw models developed by doctors who clearly didn’t have much experience in this type of technology. So multidisciplinary collaborations I think will be key. And the last thing, I’m very much in favor of making your models available in a way that another researcher can test them in their data. And that would have been in this context a huge advantage because we have hundreds of models available. And the problem is no one knows which of these work best. So if you have a database, instead of developing your own, you could potentially compare several existing models that have been developed by others and see whether they work in your database. And if multiple researchers or different centers did that, we could identify very quickly and efficiently the robust models and develop those further instead of everybody starting from square one every time they have a research idea.
[00:44:35] SY: Why do you think a lot of people started from the beginning instead of maybe starting from or building on other people’s existing work?
[00:44:42] LW: Well, if you want to talk about scientific research, there’s clearly an incentive to publish your new, novel work, and validating work that’s out there may be more valuable for the science, but not for your publication list…
[00:45:00] JP: Oh!
[00:45:01] SY: That’s interesting.
[00:45:01] LW: Because it seems to be harder to get those works published because it’s perceived as not very novel, whereas it’s so much more valuable in my opinion, especially in situations like these where you have hundreds of models available and they all differ only to a tiny extent. In terms of data science companies or startups, perhaps the problem is that you need to have a product that you can sell, right? So what are you selling if you didn’t make the model yourself? So that is a challenging discussion.
[00:45:36] JP: Would you say that the problem lies mostly on technologists and potentially doctors for not understanding the limitations of machine learning and modeling? Or does more of the fault lie with perhaps governments for not having adequate data collection systems in place?
[00:45:56] LW: I think they’re equally contributing. I see a lot of calls now for making data lakes, where you have up-to-date data or good quality data freely accessible. But I think that might underestimate how difficult it is to have good quality data and to really understand what is in your data. For some of the projects that I’ve worked on, cleaning the data and understanding what’s in the data and doing initial data analysis, that takes so much more time than developing a model. That can take months, even years, just to really know what’s in your data and to talk to the clinicians to understand what’s going on, to talk to the people who collected that data. I think we need more of that. And I think we need to acknowledge the effort that goes into collecting data and updating it in a very transparent way such that anyone can understand what is in a database when you open it.
[00:47:00] JP: Yeah.
[00:47:00] LW: Because that’s not straightforward.
[00:47:02] SY: So given your expertise and looking at the pandemic in hindsight, what do you think would have been a better use of the time and resources that were put into work that just ended up not working out?
[00:47:14] LW: Well, I think the answer to that question would be very similar to what I said before. Instead of competing and all trying to make very similar models, even on the same datasets, we could’ve been collecting high quality data and we could have been testing available models on higher quality data, rather than everybody getting their own small datasets with their own shortcomings and developing their own model, which has exactly the same issues as one hundred other models available. I think if the goal is really to help, to benefit patients and to help clinical practice, then this validation step is something that you cannot skip, because we know that even in the absence of overfitting, models just don’t always transport very well.
[00:48:10] JP: So where do we go from here? Do you see these models being refined? We’re potentially starting a new phase of the pandemic with the Delta variant starting to come into play. Are you seeing the same mistakes being repeated, or is the community learning from what happened with COVID? Where do you see it going from here?
[00:48:30] LW: We haven’t looked at the most recent models yet. We’re still working on the last updates, so we’re still rushing the most recent papers and I don’t have those results yet, but I think we need temporal validations of the models that are out there. Because as people are getting vaccinated and as new variants emerge and new interventions, public health interventions, are in place to control this pandemic, the underlying reality changes, and you don’t know whether the models that have been proposed, and maybe also those that do have good quality, will still work if things change or if treatments change. So you constantly need to re-evaluate these models through temporal validations, and that’s something I’m really looking forward to. So is it actually robust to these changes? That’s something that makes me very curious.
[00:49:25] SY: Thank you so much for joining us. This was absolutely wonderful.
[00:49:27] LW: Well, thanks for having me.
[00:49:40] SY: Thank you for listening to DevNews. This show is produced and mixed by Levi Sharpe. Editorial oversight is provided by Peter Frank, Ben Halpern, and Jess Lee. Our theme music is by Dan Powell. If you have any questions or comments, dial into our Google Voice at +1 (929) 500-1513 or email us at [email protected]. Please rate and subscribe to this show wherever you get your podcasts.