Data Privacy in the Age of COVID-19
Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Katharine Jarmul, Head of Product at Cape Privacy.
Introducing Katharine Jarmul
Hugo Bowne Anderson: Hey Katharine.
Katharine Jarmul: Hi Hugo.
Hugo Bowne Anderson: How are you?
Katharine Jarmul: Good. How are you?
Hugo Bowne Anderson: Pretty good. So I’m going to read your bio as Ryan read mine so everyone knows who you are. Katharine is doing something that I’m very excited about. She’s Head of Product at Cape Privacy, which is a company that builds out systems to leverage secure, privacy-preserving machine learning and collaborative data science. Katharine’s been using code since 2008 to, in her own words, solve and create problems. She helped form the first PyLadies chapter in LA in 2010 and coauthored an O’Reilly book. She’s a DataCamp instructor. She enjoys following the latest developments in machine learning, adversarial attacks, data privacy and ethics, and workflow automation infrastructure, and is generally chatty. Chatty and crabby, these are her words, once again. Keep up with her latest shenanigans at @kjam, K-J-A-M.
Hugo Bowne Anderson: Katharine, as I just said, is a DataCamp instructor. She has a course, Introduction to Natural Language Processing in Python. And I’d forgotten until I looked this up, you also have a project called Who’s Tweeting? Trump or Trudeau? Where, I presume, people do some NLP to figure out whose tweets are whose… Sorry, go on.
Katharine Jarmul: No, go ahead. Yeah, we collected some tweets from Trump and Trudeau and we basically trained a classifier there. So, if you’ve finished my course and you haven’t done the project, check it out. The cool thing is that at the end you’ll inspect the model, and it might be a little surprising how well it works.
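For a flavor of what a project like that involves, here is a minimal sketch of a tweet-author classifier in scikit-learn. The tweets and labels below are placeholders, not the project's actual dataset, and the real project goes further with model inspection.

```python
# Minimal sketch of a tweet-author classifier, similar in spirit to the
# "Who's Tweeting? Trump or Trudeau?" project. The tweets below are
# placeholders; the real project uses a collected dataset of tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

tweets = [
    "We are going to make trade deals great again",
    "Diversity is Canada's strength",
    "Fake news media working overtime",
    "Proud of our new climate plan for Canadians",
]
authors = ["Trump", "Trudeau", "Trump", "Trudeau"]

X_train, X_test, y_train, y_test = train_test_split(
    tweets, authors, test_size=0.5, random_state=42
)

# TF-IDF features feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.predict(X_test))
```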
Hugo Bowne Anderson: We’re here to talk about data privacy and where we are today. And I’ll just mention we’re all around the globe currently. Thank you so much for all of you tuning in at all hours of the day. Katharine’s currently in Berlin, Ryan’s in Florida, I’m currently in Australia. So, we’re bringing this to you from around the globe, but we’re here today to talk about data privacy. Katharine and I have enjoyed many conversations about data privacy in the past. If you enjoy this conversation, you can go and listen to a conversation we had a couple of years ago about privacy, when GDPR came in, on the DataFramed podcast. We’re here to talk about data privacy in general, and what’s changing now in the age of the coronavirus. But before we jump into that, I thought, you know, everyone has a different origin story in data science, Katharine, so I thought you could just tell us a bit about how you got into it originally.
Katharine Jarmul: Yeah, really interesting story. So, I was really good at math and computers growing up. Did not think it was weird as a little girl to be really advanced in math and computers, and I can thank my parents for that support as well as a bunch of really amazing teachers. However, when I got to university is when I first realized maybe it was a little odd, since less than 1% of my incoming class were women. So, that was a little awkward for me. I ended up shifting and becoming an Econ and Poli Sci double major, and I kind of thought computer science wasn’t for me anymore. Data science at the time didn’t really exist yet, although the statistics I learned, for example, in political science and economics were actually quite helpful. So, this was a while ago now. I graduated in 2004, and then I kind of accidentally fell into data science. I ended up getting a master’s in journalism and I was working at the time at the Washington Post, and I would hack together lots of data visualizations and other things to go alongside stories. Interactives, we used to call them, and I caught the eye of the person who ran their apps team and he said, "Hey, do you want to maybe come help us with some stuff?"
Katharine Jarmul: And then I started building databases for the reporters, different interactives and maps for the reporters. I started diving back into statistics, and in that way I kind of fell into it and have been doing that since. So, that was around 2007 to 2008. It was on and off, but definitely since machine learning really changed and brought a lot of promise in the early 2010s, I’ve been into that. So, I really love DataCamp and people self-teaching and self-learning. I think I learned a lot just by doing and asking and finding good mentors and getting curious. So, I really appreciate that element of the DataCamp mission.
Hugo Bowne Anderson: Awesome. We appreciate that a lot. I am fascinated that you recognized your maths, computational, and CompSci skills really early. And then you went to journalism, and then you were dragged back in by database and sys admin stuff, or whatever. But, and this is going to sound horrible, I was going to say you’re not only a data scientist but you can actually work computers as well. What I really mean is that you also have a bunch of data engineering skills, and actual computer science and software engineering skills, right?
Katharine Jarmul: Yeah. Yeah. I mean, back in 2008, 2009, I think most people thought of data science as, "Can you build a graph to go along with this database that I have?" I’m sure that it was more advanced outside of my circles, but that was about as far as we were in the data journalism field. And then, in some of the startups that I initially went to work with, I kind of became obsessed with data quality and data wrangling and this type of thing, because that was most of the problem at the time: to answer the question properly, we had to make sure that we were even collecting the right data or even looking in the right places.
Hugo Bowne Anderson: And that’s still the problem, right? And of course, your O’Reilly book is on data wrangling. It may even be called Data Wrangling, I can’t remember, but-
Katharine Jarmul: Yeah, it is. Yeah, yeah, yeah, yeah. I mean, it’s an unfortunate case. I think it’s getting better; at least more people are paying attention now to the problems that we have with data quality, which is understanding and testing data as it comes in. I’ve got to give a shout-out to Great Expectations there. They’re doing some good work in that space, as well as around governance and lineage: where does the data come from? Under what circumstances was it collected? I think a lot of these things help drive conversations around quality, and they can drive conversations around privacy too, which is what we’ll chat about today.
Hugo Bowne Anderson: Very much so, and I think data testing, you’re absolutely on point there, and data testing with respect to model drift and concept drift, if you have new data coming in and you want to maintain your models, as opposed to assuming they’re working however you thought they were.
Katharine Jarmul: Yeah.
Cape Privacy
Hugo Bowne Anderson: So, maybe before we jump in, you can also tell us a bit about Cape Privacy. I gave a very short elevator pitch but maybe you could tell us a bit more.
Katharine Jarmul: Yeah, yeah. So, Cape Privacy, I joined about six months ago. It’s an amazing team. They were together before I joined and they were really focused on research around encrypted machine learning. So really determining: is it even possible to use some encryption schemes within machine learning in a reasonable way? As part of that process, they built an open source library, TF Encrypted, which stands for TensorFlow Encrypted and uses an encryption scheme called secure multi-party computation. You might have heard of secure computation before. This is the backing for doing that in TensorFlow, which is pretty darn cool if you ask me. And so I got in touch with them. They were looking for a Head of Product, and so I joined, and really part of the quest has been taking this amazing group, so research-oriented: deep learning specialists, cryptographers, engineers with really deep knowledge of cloud architecture and these types of things, and getting our minds together to create something that people can use, which, yeah, we’re excited about. Our initial launch will be in late June.
Katharine Jarmul: So, stay tuned and it’s really around collaborative data science. So, how can we bring different experts to the table? Maybe security experts or data privacy or legal experts and data scientists altogether to really leverage advanced privacy and security techniques.
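To make secure multi-party computation a little more concrete, here is a toy sketch of additive secret sharing, one of the ideas underlying protocols like the one TF Encrypted uses. This is not the TF Encrypted API and it omits everything a real protocol needs (fixed-point encoding of tensors, communication, security against cheating parties); it only shows how values can be summed without either party seeing the raw inputs.

```python
# Toy additive secret sharing: two parties each hold a random-looking share
# of a value; sums can be computed on shares without either party seeing
# the underlying data. Real protocols (e.g. in TF Encrypted) work over
# fixed-point encodings of tensors; this is just the arithmetic idea.
import random

PRIME = 2**61 - 1  # work modulo a large prime

def share(secret):
    """Split a secret into two shares that sum to it modulo PRIME."""
    share_a = random.randrange(PRIME)
    share_b = (secret - share_a) % PRIME
    return share_a, share_b

def reconstruct(share_a, share_b):
    return (share_a + share_b) % PRIME

# Two users' private salaries
a1, b1 = share(52_000)
a2, b2 = share(61_000)

# Each party adds the shares it holds; neither sees the raw salaries
sum_a = (a1 + a2) % PRIME
sum_b = (b1 + b2) % PRIME

print(reconstruct(sum_a, sum_b))  # 113000, revealed only in aggregate
```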
Hugo Bowne Anderson: That sounds like very exciting work and a very exciting group of people to be doing it with as well. You said TensorFlow, right?
Katharine Jarmul: Mm-hmm (affirmative).
Hugo Bowne Anderson: I don’t want to go down the rabbit hole too quickly, but the idea of doing secure privacy-respecting machine learning on an open source package, which has governance from a company like Google, it’s a web of ideas and possible futures, right?
Katharine Jarmul: Yeah, yeah, yeah, and the team definitely has plans to support PyTorch. There’s already been some work in PyTorch as well. Of course, one could argue that that’s also mainly backed by a larger company. It’s interesting to think about these things: we need to integrate with systems where people are. And we’re also working on some things in the federated space, some of which overlap with other federated learning platforms like TF Federated, or TensorFlow Federated, and some of which might be built on their own. But we’re open core, so there will always be open source versions. We really are looking forward to the community feedback. So, if you want to play around with it, please join.
Hugo Bowne Anderson: That’s awesome. So, before we jump into privacy, I wanted to do this, and I’m probably going to mess it up. I’ve been reading a book recently that you just reminded me of. This is like Reading Time with Hugo. It’s turned into a book club, but it’s actually very relevant. It’s Human Compatible by Stuart Russell, who thinks about a lot of things including privacy. So, he says, "There are multiple layers to the privacy story. First, can a personal assistant really be useful if it knows nothing about you? Probably not." Now he’s being provocative. Then he says, "Second, can personal assistants be really useful if they cannot pull information from multiple users to learn more about people in general and people who are similar to you? Probably not. So, don’t these two things imply that we have to give up privacy to benefit from AI in our daily life?"
Katharine Jarmul: Oh okay. Okay.
Hugo Bowne Anderson: He goes on: "The reason is that learning algorithms can operate on encrypted data using the techniques of secure multi-party computation, so that users can benefit from pooling without compromising privacy in any way."
Katharine Jarmul: Yep. That’s what we build.
Hugo Bowne Anderson: Oh, you were about to explain with a cuss word, which is definitely inappropriate. But this is one of the reasons this is a really exciting part of the conversation. We’re not really going to be talking about how to encrypt data for developing principled machine learning algorithms, but in what I read and what we’ve been talking about, there’s an assumed trade-off between convenience, utility, and privacy, which may be a straw man. Something we’ll be talking about is that we’re constantly sold a trade-off between privacy and security, and that, in times of crisis, in order to be secure, we need to give up privacy. And we saw what happened post-9/11, right? We gave up a lot of privacy, and then, with the Snowden revelations, we learned we’d given up a whole bunch of other privacy, and it didn’t even bloody work, right? There are a lot of things we gave up without knowing that seem useless in the end. So, thinking about whether this is actually a false dichotomy will, I think, be very interesting.
Katharine Jarmul: Yeah. Yeah.
Rundown of the Data Privacy Space
Hugo Bowne Anderson: So, before jumping into COVID-19, perhaps you could just give me a general rundown of the data privacy space. I know that we could spend days talking about that, but just tell us what you see the biggest challenges as being.
Katharine Jarmul: Yeah, one of the biggest challenges, and something that really drew me to the work of Cape Privacy, is productionizing research. There’s an amazing depth of research in the data privacy space. There are people that have been working on computational privacy, or the quantification of privacy, for decades now. Look at, for example, Cynthia Dwork’s work around the initial origins of differential privacy, right? And so, there’s all this amazing theory and quite a lot of deep mathematics around quantifiable privacy, and yet even when we see things like the great panel at NeurIPS on differential privacy in production, a lot of the answer was, "Yeah, kind of." And that’s okay. It’s okay to not be there yet, but I think it’s really time for us to ask, "Can we productionize research?" and "Can we really push research forward with real-world use cases and real-world threats that we know about and that we care about?" And this implies, of course, good relationships between people in research and people in industry. And I think that’s something that the overall machine learning community has been bringing up time and time again.
Katharine Jarmul: The same comes up within ethics research: how do we take what we know from a theoretical standpoint and make it work for this kind of imperfect production system, or imperfect society, and all of these other things?
Hugo Bowne Anderson: And Cynthia Dwork did some of this work at Apple, is that right? Or did I make that up?
Katharine Jarmul: She did some of it at Microsoft, and I don’t remember where she went after. Apple has a differential privacy team that works on productionizing differential privacy. They released a paper, I guess nearly two years ago now, on how they do differential privacy around keyboard data collection. It’s a really good article. So, they have some work there too. And I think one thing that’s really important now, and that I hope continues to be important, is that I’m seeing more people ask questions and be concerned about data privacy. I’m seeing it from news outlets, I’m seeing it from Python data scientists, I’m seeing it from communities at large. As we get deeper into the COVID-19 reactions, I think there’s a lot of really productive public conversation now around some of these topics, and I hope that continues, because asking questions and being informed and being curious is really, really important for something like this to go from theory to reality.
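As a rough illustration of the local differential privacy idea behind on-device data collection, here is a sketch of randomized response. Apple's actual mechanisms are more sophisticated than this; the sketch only shows how per-user noise, added before data leaves the device, can still yield a usable aggregate estimate.

```python
# A minimal sketch of local differential privacy via randomized response.
# Each user randomizes their answer before it leaves the device, yet the
# aggregate statistic remains estimable after de-biasing.
import random

def randomized_response(truth, p_truth=0.75):
    """Report the true bit with probability p_truth, otherwise a coin flip."""
    if random.random() < p_truth:
        return truth
    return random.random() < 0.5

# Simulate 100,000 users, 30% of whom truly have the sensitive attribute
n, true_rate = 100_000, 0.30
reports = [randomized_response(random.random() < true_rate) for _ in range(n)]

# De-bias the aggregate: observed = p * true_rate + (1 - p) * 0.5
observed = sum(reports) / n
p = 0.75
estimate = (observed - (1 - p) * 0.5) / p
print(round(estimate, 3))  # close to 0.30, without trusting any single report
```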
Hugo Bowne Anderson: And I do think it is really important that we’re having the conversation now and there is a window open I think for privacy, for data ethics, for AI ethics, but it isn’t clear how long that window will be open. So, let’s do the work. Right?
Katharine Jarmul: Yeah. Yeah. I hope that it’s not a passing fad. That we really say, "Oh, hm, I’m concerned about this from many aspects. I would like to learn more." And I’m not saying everybody has to become an expert, but I hope some people here today are like, "Oh, hm, I’m curious about this. I want to know more. I want to contribute. I want to try asking some of these questions on my own projects." Just that would help a lot.
Hugo Bowne Anderson: For sure. So, without getting too technical, what do some of the possibilities in the solution space look like for the things we’re talking about?
Katharine Jarmul: Yeah. Yeah. It depends sometimes on the problem. So, there are many different, what we often refer to as, advanced privacy techniques or advanced security techniques, including a variety of secure computation or encryption techniques, techniques like differential privacy, which I believe we’ll dive into a little bit later, k-anonymization or other anonymization schemes, and so on and so forth. There are also some techniques that are maybe more simple and straightforward, like obfuscation, hashing, pseudonymization, and so on. And so, there are many, many of these different tools. The problem, really, or the problem that I’m definitely thinking about daily in my work at Cape Privacy, is: without becoming an expert, how do you know what tool is right for the job, and how do you then assess what the security threat model is, or what you’re actually concerned about from a privacy perspective? And this is really hard, and you shouldn’t have to become an expert. You’re already an expert in whatever part of data science you’ve decided to dive into.
Katharine Jarmul: You can’t really ask everyone to also add a whole other area of expertise. And so some of what we’re working on, and what I think would really help with the solution space, is better awareness, better education, and the ability to empower data scientists to say, "Okay, I can go reference here, I can see what’s useful, I can understand what the actual threat is, and then I can implement something easily without having to become a privacy expert." And some of this is also about getting input from multiple people. So, if you have a legal department, maybe they should be a part of the conversations in a productive way. If you have a really awesome security team, then they should be a part of these conversations. It’s about everybody being able to share their expertise: you all, who are here, as the data science experts, and then others within their own areas. I think together, if you have this cross-collaboration, or cross-disciplinary teams, they can actually come to a whole other level of understanding, where the solution fits the problem.
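Of the simpler techniques mentioned above, pseudonymization is easy to sketch. The snippet below uses a keyed hash (HMAC) with a hypothetical secret key; note this is pseudonymization, not anonymization, since whoever holds the key can still re-identify records, and other quasi-identifiers in the data remain a risk.

```python
# Illustrative pseudonymization: replace direct identifiers with a keyed hash
# (HMAC) so records can still be joined, but raw emails never travel through
# the rest of the pipeline. This is pseudonymization, not anonymization.
import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # hypothetical key management

def pseudonymize(identifier: str) -> str:
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"email": "hugo@example.com", "purchase_total": 42.50}
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
```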
Hugo Bowne Anderson: And do you see? I mean, companies didn’t always have security experts, or legal teams, or these types of things. I wonder if you see a future where most companies will have a data privacy team, or a data security or data risk team, or experts, or something like that?
Katharine Jarmul: I think the overall, let’s say, consumer awareness, the awareness of the average person of the data collection that’s being done regularly, is increasing. A lot of times now, when people just want to say, "Hi, and what do you do?", I say I work on privacy-preserving machine learning.
Katharine Jarmul: And they’re like, "Oh, okay, yeah, I’m worried about that. I have an Alexa at home. Is it okay?"
Katharine Jarmul: And I’m like, "Yeah, hopefully. Here’s some things to think about. Here’s some articles if you want to dive into it." But I think more often than not, I challenge kind of my peers, my fellow data scientists and machine learning folks to say, how much do we need to take in these considerations when we’re thinking of systems?
Katharine Jarmul: And I think the movement in the past, let’s say eight years or so of also pushing a more ethical approach to machine learning and data science, I feel like it goes together well with the data privacy thing too. Because really, to some degree, privacy is about consent. It’s about awareness. It’s about understanding and transparency. And some of these overlap quite well with the debate that we have around ethics.
Hugo Bowne Anderson: I love it. The reason I love that you said what it’s about, and not what it is, is because I think privacy is an amorphous term in a lot of ways now. We’re trying to figure out what privacy actually means.
Hugo Bowne Anderson: So for example, it was around 10 years ago, probably a bit more, that any of us would have been surprised had we been targeted with ads based on keywords in our email correspondence, right? And then a few years later, we were outraged when that happened. And then a year or two after that it was like, "Oh no, that’s just Google."
Katharine Jarmul: It’s totally fine.
Hugo Bowne Anderson: Right? And of course, on that spectrum we have Zuckerberg as well, who’s like, "privacy doesn’t mean, blah…" Whatever he says, right. I actually don’t listen to the words anymore. But he does have a point in the sense that our idea of privacy has changed in the age of Alexa, and in the age of smartphones and these types of things. But we do need to develop, and this is a lot of what you’re working on, these kinds of robust, principled approaches to even how to have these conversations and how to empower individual users, citizens, consumers, whatever we want to call ourselves these days, to make these decisions.
Katharine Jarmul: Yeah, I mean, I think that there’s also something there, obviously, about collective action, right? And there’s also something about not shaming people. Sometimes, let’s say from the more deeply involved privacy or security community, people may say, "Well, if you buy an Alexa, you deserve it," or something like this. I think, actually, similar to the book quote that you shared earlier, people should be open to having or trying new things, but we should create some awareness around potential risks and, especially, create responsibility on behalf of the teams and companies that build these technologies to actually think and ask the right questions, to test things, to innovate, and to try, again, to productionize the research.
Katharine Jarmul: I think that’s not only a really tough technical challenge, something that we’re working on pretty regularly, but also kind of a goal for us to set for the field itself: remember those times when it was hard to figure out how we could reason about data privacy in machine learning systems? Well, that was in the past; now we have a good roadmap of how to make those choices.
Hugo Bowne Anderson: Exactly, and I think that type of thing you hear a lot, like, "Oh, you bought an Alexa, you know what you’re getting," is a form of shaming as well, especially when people use terms like "you deserve it." It reminds me of when people say, "Why do you care about privacy? What do you have to hide?" Right? That can catch you immediately. But to put the burden of proof on someone who wants their own privacy preserved is, I think, the height of absurdity.
Hugo Bowne Anderson: And it doesn’t even acknowledge that people in positions of power, we know historically, time and time again, have abused having access to private things about people. On top of that, we develop our senses of identity and define ourselves through things we only know about ourselves, and a lot of our self is defined through what we consider private and what we consider public, and having control over that. So what happens when that’s lost?
Katharine Jarmul: Yeah, when I think about privacy, I like to reference danah boyd’s definition. She spent some time researching how young women interact with different technologies and reason about private versus public spaces. And it was really about expectation. It’s about the expectation that, let’s say, when I share something with you here, I don’t expect it to be private, because I’m here and we’re all talking together and having a great time, I hope.
Katharine Jarmul: But if we were on a phone call and I’m saying, "Hey," you know, I’m telling you something personal about my life, there’s a shared understanding. I actually usually don’t even have to tell you, "Hey, please don’t share this with anyone." It’s implied given the context. And so the difficult thing is how do we actually implement that type of context and that type of consent in a technological system, right? That’s a good challenge.
Hugo Bowne Anderson: Absolutely. And I can highly recommend, to anybody who’s interested in this type of stuff, danah boyd’s book, It’s Complicated, about the networked lives of teens. It’s a sociological study of how teenagers use social media, and it’s seriously eye-opening. Fantastic. At a time of crisis like this, Katharine, what changes around the conversation about data privacy, and why? What changes have we seen, why are we seeing them, and what do we need to think about?
Katharine Jarmul: Yeah. There’s quite a lot happening right now. I spent a lot of this weekend reading debates in the security community around different types of architecture, this and that, so we can go deeper technically: debates on centralized versus decentralized contact tracing, et cetera, et cetera.
Katharine Jarmul: But what really impressed me, which I read a few weeks ago, was an article by Ross Anderson, who is famous in the security engineering space, called "Contact Tracing in the Real World." If you haven’t read it, I can highly, highly recommend it. He tears down a lot of false arguments around the trade-off that you mentioned, Hugo, which is that the only way we can fight coronavirus is to immediately give up all privacy. He punches a bunch of holes, from a security perspective and just from a citizenship perspective, in why that’s maybe not a great argument to put forth. And we’ve seen this before, the security versus privacy argument… Barr, the Attorney General in the United States, has given this argument a lot as a reason to have encryption backdoors.
Katharine Jarmul: This whole "we’re more secure if we’re less private" is an argument we’ve seen for decades now, and it’s often put forth by people in power who already have quite a lot of information about us. And I think sometimes, in these times, we need to question what the consolidation of more of our data in one place means.
Katharine Jarmul: We can think about this when we see that Google is helping the NHS right now to develop predictive models. On the surface, that seems fine. But from a data privacy perspective, we already know what happened last year with DeepMind and the NHS: a bunch of data was shared without any consent, nobody was alerted, and nobody was given an ability to opt out. So there are these basic steps where I think a lot of people would be willing to give data but are never even asked. And I think that’s a big problem when we see these rash reactions, let alone whether contact tracing is actually the best way for us to handle the COVID-19 crisis right now.
Hugo Bowne Anderson: I think there’s a related question where, you know, sure, maybe we’re fine with giving up some aspects of privacy, but it would be good to see how the data will be used, how the models will be used, whether there are sunset clauses, and whether it will remain in a database that any government official or Google employee will have access to.
Hugo Bowne Anderson: Actually, I told you this story last week. I’m going to tell the brief story of my time in quarantine. So for all our viewers out there, I flew back from New York city to Australia a month ago and I was put in quarantine. It was a government mandated military enforced 14 day quarantine in a hotel room. And there was a phone, there was a telephone in the hotel room as hotel rooms do. This phone was used for two or three purposes.
Hugo Bowne Anderson: One, a guy just rang around from another room and he was like, "G’day, this is Ben from room 1212, rah rah rah." He was like, "I’m setting up a Facebook group so we can all complain about everything that’s happening here." So I jumped on this Facebook group. But the nurse also called daily, and then I got a call from a woman who said she was working with the New South Wales, that’s the state I’m in, the one Sydney’s in, Police Department. And I said, "What do you mean you’re working with the New South Wales Police Department?" She said, "I’m working with them."
Hugo Bowne Anderson: And I said, "Do you work for them?"
Hugo Bowne Anderson: And she said, "No I don’t."
Hugo Bowne Anderson: And I said, "Who do you work for?"
Hugo Bowne Anderson: And she said, "Unfortunately, Hugo, I’m not at liberty to say, I worked for the government, but that’s all I can tell you." And she said, "Look, I’m calling in order to find out where you’ve been? What flight you came in on? Where you’ll be going afterwards? Passport number? Date of birth? These types of things."
Hugo Bowne Anderson: And I said, "Oh, okay, what, what are you going to do with this data?"
Hugo Bowne Anderson: And she said, "I’m putting it into a government database."
Hugo Bowne Anderson: And I said, "What? And who’s going to have access to that?"
Hugo Bowne Anderson: And she said, "Oh no, it’s fine. It’s just for contact tracing. I’m just going to put that in."
Hugo Bowne Anderson: And I was like, "No, but how long is it going to be there? And who’s going to have access to it?"
Hugo Bowne Anderson: And she didn’t say "don’t worry about it." I presume she’d been told not to say "don’t worry about it." But that was the vibe I was getting. I didn’t tell her anything.
Hugo Bowne Anderson: I said, "Look, send a police officer with a badge up to my door and I’m happy to chat with them. But I’m not telling you any of this information over the phone." What I actually realized is that her job also was to build a small relationship with me in that process, and to make me feel good in a lot of ways, in order to get the information she needed, which I presume was for a good purpose. But I have no idea for how long, or in what kind of perpetuity, this information will be available to anybody. And I think those are the types of questions we’re dealing with, right?
Katharine Jarmul: Yeah. I mean, one question that we ask a lot, both within the security field, but also that I’m asking myself pretty regularly in some of the stuff we’re working on at Cape Privacy is what are relationships of trust and how do they interact with our concept of privacy? And I think this is a great example.
Katharine Jarmul: So when you are friendly, there’s a bond of trust, and as humans, especially as trusting, optimistic, happy humans, we want that connection, especially now more than ever before, right? You might be at home, you might only be interacting with, let’s say, your family, you might even live alone. We want those bonds. Definitely when you’re in a hotel room, right? You want those bonds. And again, it’s probably not malicious. She probably also wants to connect with others, right? But what it is about is: do we understand the risk, and can we actually evaluate the risk in relation to those relationships of trust?
Katharine Jarmul: And so they go hand in hand: we want to trust others. What is the risk of trusting others? What is the risk of handing over your personal travel details? Is there a risk? If there’s not, okay, great. But can we actually evaluate that in a meaningful way? As humans, I think we often underestimate privacy risk. I don’t think that we’re trained or taught, or that we learn very much, about how to, let’s say, not give up personal details to people that we connect with.
Katharine Jarmul: And so again, I don’t want people to become untrusting, but it’s about whether we can leverage, let’s say, technology, and whether we can leverage regulation like GDPR and CCPA, for example, which have the right to deletion, and enforce these in a meaningful way. So that when you are on the phone you can say, "Hey, this is a trusting relationship, and I can understand and reason about the risk. And, oh, by the way, they have this confirmation of deletion they can send me once this is all over, and they have a confirmation of what models were used with my data. I can see, and maybe even one day consent to, every data science activity that my data is being used for, and therefore I can actually reason about it, and imagine it, and make my own choices," right?
Katharine Jarmul: Because some of it is about empowerment and self-determination, once we have equal access to information in terms of how data is being used and stored and when it will be deleted, right?
Hugo Bowne Anderson: You mentioned briefly that we’re thinking a lot more about data lineage and data provenance these days. This is almost like a future data provenance: having a view of where your data is going, as opposed to where it’s come from.
Katharine Jarmul: Exactly right. And being able to say, "Here’s where I’m comfortable, and here’s where I’m not." There’s this great concept in the data ethics community, or responsible data community, around data trusts: the idea that we can collectively pool data and create a contract around it of "we pool this health data so that it can be used to, say, fight cancer, or map the genome, or any number of things, but we forbid it from being used for these other things." Now, if we had real governance of the entire system, then we could actually enforce these types of things to some degree.
Data Reproduction
Hugo Bowne Anderson: So I don’t want to go down this path too far, but I honestly can’t help myself. Speaking about how data is used, data trusts, pooling it for certain uses, and talking about the right to deletion as well: how can we even think about these things in the age of data reproduction, the ability to duplicate data? How could I ever believe you telling me that you’ve deleted my data when you could duplicate it a thousand times, right?
Katharine Jarmul: Yeah. That is a fantastic question and something, for example, that we are definitely working on at Cape Privacy. GDPR, for example, also calls for data minimization efforts, right? How can we enforce data minimization, which is essentially trying to avoid this duplication of sensitive data, especially across large organizations, though we see it in smaller organizations too? Some of it is about the ability to define a system and the ability to define the rules at play within that system, right?
Katharine Jarmul: So we get down to systems engineering, or, if anybody here is into security engineering, they’re starting to think, "Oh, okay, I can start to reason about this." One of the things that we’ve been looking into is: can there be rightful attestation of deletion? Can there be an attestation of when data is duplicated? There are some technical ways, of course; they require visibility, they require transparency in the system. They require you to know how your system works, which maybe you do, maybe you don’t. But I encourage, especially my fellow self-taught data science folks on the line, to start thinking about this: how would we even verify that we deleted data? How would I validate that, in the models we’re using, Hugo’s data got removed? How would I even find the models that used Hugo’s data? The more we start actually thinking about how we might do these things, the more of a chance that we will do them.
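There is no standard recipe here, but one way to start reasoning about "which models used whose data" is simple lineage bookkeeping. The sketch below is illustrative only, with hypothetical model and user IDs; it is not what Cape Privacy builds, and a real attestation system would also need tamper-evident, signed, append-only logs.

```python
# A toy lineage log for reasoning about deletion: every time a model is
# trained, record which (pseudonymous) subjects contributed. A deletion
# request then becomes a lookup: which models must be retrained or retired?
from collections import defaultdict

training_log = defaultdict(set)  # model_id -> set of subject ids

def record_training(model_id, subject_ids):
    training_log[model_id].update(subject_ids)

def models_affected_by_deletion(subject_id):
    return [m for m, subjects in training_log.items() if subject_id in subjects]

record_training("churn_model_v3", {"user_17", "user_42", "user_99"})
record_training("fraud_model_v1", {"user_42"})

print(models_affected_by_deletion("user_42"))
# ['churn_model_v3', 'fraud_model_v1'] -> both need retraining or retirement
```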
Hugo Bowne Anderson: That’s actually a great point, thinking about the models that use my data. You could delete my data, but you could have encoded it in a predictive model which makes predictions about me. Or, as we discussed recently, you can enrich my data, and then is that the data you got from me, or, since you put a lot of resources and capital into enriching it, do you then own that enriched data, right?
Katharine Jarmul: Yeah. I mean, this is a huge debate even within the legal field right now for larger regulations like GDPR: what does it mean? And I expect there to be more follow-up on this. I spent the past week reading the Data Ethics Commission report, from a commission that was put together here in Germany to think about AI systems and to reason about ethics within them. They issued that report, which will go to the EU as a whole, and there are likely to be some deep conversations and possibly a proposed AI-centric regulation in the upcoming years.
Hugo Bowne Anderson: Great. There was something you mentioned before that report: that maybe we shouldn’t just be collecting all the data in the first place and assuming it has value. And that this is partly a function of legislation not catching up with the fact that modern JavaScript can ship with a bunch of creepware, essentially.
Hugo Bowne Anderson: I read an article, or I guess an exposé, a blog post if you will, on Stripe tracking recently. If you use the Stripe JavaScript, it actually sends a bunch of navigation information to Stripe, not even to the parent company that’s running the website. They do it for fraud detection, and there’s probably a lot in there in terms of bots moving very differently than humans on pages, this type of thing. So it might be anti-bot. However, there are also a lot of ways to do that without collecting everybody’s navigation movements and history. And part of it was also that the terms and conditions were not very clear that that was what was happening.
Katharine Jarmul: So some of this is about documenting things, being transparent, and some of it is also asking is this the best way for us to solve this problem? I think that actually goes along with the COVID-19 conversation quite well. It’s like is this the best solution for what we need right now or is this just a solution that we can do because we heard about it once and it sounded like a good idea?
Hugo Bowne Anderson: I want to move on to COVID-19 in a second, but you mentioned like it may not have been clear on the terms and conditions. Who reads the terms and conditions?
Katharine Jarmul: I do. Like, yeah, nobody has time for it, but I do.
Hugo Bowne Anderson: The question of the difference between consent and informed consent is a big thing, particularly when multinationals have big legal teams that are good at covering their asses… Myself, as a consumer user, I don’t have access to that type of capital, right? Or that type of labor.
Katharine Jarmul: Yeah. I mean, this was a huge push within GDPR: informed consent, and the idea that you should be able to reason about what you’re agreeing to. And the New York Times, I believe it was the New York Times, had a really amazing graphic on the readability of privacy policies over time. You can see, for example, that Google’s at one point required nearly PhD-level vocabulary to understand, and it took something like 14 minutes to read. Yeah.
Hugo Bowne Anderson: Yeah. It actually compared them to several other texts, including the first volume of Immanuel Kant’s Critique of Pure Reason. And it showed that several terms and conditions, like Airbnb’s or whatever, were more difficult to read than some of the densest philosophical texts humanity has ever produced. Right?
Katharine Jarmul: Yeah. German philosophy is pretty notorious, so yeah.
Hugo Bowne Anderson: But I remember someone on Twitter was like, "To be clear, the first volume of Kant’s Critique of Pure Reason isn’t the most difficult." And I was like, come on. But, like, it’s four or whatever, right?
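For a rough sense of the readability comparison being described, an off-the-shelf readability score can be computed in a couple of lines. The snippet below assumes the third-party textstat package (`pip install textstat`) and uses an invented policy excerpt, so the numbers are purely illustrative, not reproductions of the graphic discussed above.

```python
# Rough readability check on a (made-up) policy snippet, in the spirit of the
# readability comparisons discussed above.
import textstat

policy_excerpt = (
    "We may share aggregated or de-identified information with third parties "
    "for research, analytics, and service improvement purposes, subject to "
    "contractual obligations consistent with this policy."
)

print(textstat.flesch_reading_ease(policy_excerpt))   # lower = harder to read
print(textstat.flesch_kincaid_grade(policy_excerpt))  # approximate US grade level
```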
Hugo Bowne Anderson: So how can tech help us? Can tech save us Katharine?
Katharine Jarmul: So, I mean, tech alone can’t do anything, right? Tech alone is good at computation, maybe at automation and speeding up processes, and I think now is a good time to remind ourselves that we cannot solve societal problems with technology.
Katharine Jarmul: And coronavirus is a health crisis. It’s a pandemic. It’s also a societal crisis. It’s an economic crisis that’s happening right now around the world, and we’re not going to solve it with a better machine learning model. We just need to take a deep breath and say, that’s okay. We can still use our skills around data and machine learning to contribute in positive ways. And that doesn’t mean rushing to download the dataset, saying, "Oh, I figured it out," and publishing it and sending it out to everyone the next day. It means really taking a step back and asking the critical questions.
Katharine Jarmul: There are a lot of problems within COVID-19 that are not about, let’s say, tracking individuals’ movements, or releasing a smart camera that can track coronavirus by telling me whether I have a fever or not. These types of things are, again, rushed solutionism that is not very well thought through.
Katharine Jarmul: Instead, there’s a ton of new reporting on, for example, phishing attacks, spam attacks, coronavirus-related cybersecurity threats. This is one area where machine learning has not really made large inroads and where we could actually help, right? Or capacity planning for hospitals, not in a coronavirus-tracking sense, but just in a "let’s think about the problem" sense… I know that you spoke recently with somebody working on this problem, correct? I think you were speaking with somebody who is doing capacity planning and supply planning.
Hugo Bowne Anderson: Yeah, with Penn Medicine. We had a webinar, and they’ve developed an open source model for figuring out how many beds are needed, how many respirators, how much PPE.
Katharine Jarmul: Yeah. Yeah. Logistics: people that are trained in solving logistics problems, operations research problems, in terms of why we have low supplies, how we can help with that, and what the equitable distribution of supplies is. And again, not just rushing to download, but really taking the time to ask the people in need what would actually be beneficial. So, from a local-first perspective, it’s even talking within your community, getting something set up where people can share what they need online. You can use some NLP that you learned, if you took my course, on those datasets to ask, "Are there needs that we can collectively group together, and can we do one big groupby to minimize the amount of exposure for delivery folks?"
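Here is a minimal sketch of that "big groupby" idea: consolidating neighbourhood requests so fewer delivery trips are needed. The data and column names are hypothetical.

```python
# Consolidate community requests into one shopping list per neighbourhood so
# delivery volunteers make fewer trips. Data and column names are made up.
import pandas as pd

requests = pd.DataFrame({
    "neighbourhood": ["Newtown", "Newtown", "Glebe", "Glebe", "Glebe"],
    "item": ["flour", "pasta", "flour", "soap", "pasta"],
    "quantity": [2, 1, 3, 4, 2],
})

consolidated = (
    requests.groupby(["neighbourhood", "item"], as_index=False)["quantity"].sum()
)
print(consolidated)
```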
Katharine Jarmul: Food banks: there are lots of local food banks that are trying to do things. And there are, of course, many other ways to help frontline workers other than tracing their movements via their cell phones, right?
Katharine Jarmul: And then finally, getting informed on more advanced privacy and security techniques. If you have some extra mental bandwidth, and it’s okay if you don’t, just starting to learn and talk about them and think about them and apply them in your own work is already, I think, a step forward you can take during this time. It avoids, again, this rushing to figure out the easiest solution, the whole hammer approach where, if you have a hammer, everything looks like a nail.
Katharine Jarmul: And so I think there’s some of that and I encourage people to get out of their comfort zone to try thinking in a cross functional space to try thinking with people you wouldn’t normally think about data science problems with and to actually take in their input and feedback and have that drive it rather than, "Oh, I already know what to do. Let me just download this dataset and I’ll have a model to you in the next day."
Contact Tracing
Hugo Bowne Anderson: We’ve mentioned several times something a lot of people are thinking about at the moment: contact tracing apps. Our governments are asking us to go to the app store and put them on our phones, right? So maybe you can give us your take on contact tracing apps. A month ago I was in New York City and now I’m in Sydney, and they’re at very different parts of flattening the curve, with very different numbers and densities. I wonder whether contact tracing is even relevant for all cities at this point in time, and then whether it will work, what we need to keep in mind, these types of things. I know there are a lot of questions in there, but maybe we can just start by talking about contact tracing.
Katharine Jarmul: Yeah. So, contact tracing: we have to think about it in multiple prongs, right? From a technical standpoint, let’s dive into that first because it’s interesting, we have a lot of arguments happening now around centralized versus decentralized collection methods. The argument for centralized is often that you need a central authority, or some validation that the data you have is true and correct. Maybe there’s something to that. But there’s also the argument for decentralized: it’s slightly more privacy-preserving, and perhaps you only centralize the data after certain data privacy mechanisms have been applied.
Katharine Jarmul: This makes me think about something we’re working on at Cape Privacy, for example: a concept of federated analytics, where everybody can send updates to a secure aggregation point, everything is encrypted at that point in time, and then a final result is pushed out to all participants.
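To make the secure aggregation idea more concrete, here is a toy sketch using pairwise masks that cancel in the sum. It is not Cape Privacy's protocol; real secure aggregation schemes also handle participant dropouts, encryption of the masks, and fixed-point arithmetic.

```python
# Toy secure aggregation: each participant adds pairwise random masks to its
# update before sending it; the masks cancel when the server sums the
# submissions, so only the aggregate is revealed.
import random

updates = {"alice": 3.0, "bob": 5.0, "carol": 1.0}
parties = list(updates)

# Pairwise masks: for each pair (i, j) with i before j, i adds +m and j adds -m
masked = dict(updates)
for i, p in enumerate(parties):
    for q in parties[i + 1:]:
        m = random.uniform(-1000, 1000)
        masked[p] += m
        masked[q] -= m

# Individual masked values look like noise, but the sum is preserved
print(masked)
print(round(sum(masked.values()), 6))  # equals 3 + 5 + 1 = 9
```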
Katharine Jarmul: And so there are some ways to think about this that also allow for better visibility and understanding of how the information is being used, plus some level of validation. Of course, the validation becomes mainly a question of whether we can validate data quality. When people think about these coronavirus tracing apps: how many people are going to turn one on and, as a joke, report that they’re sick, or something like this? There will be a non-zero number of those.
Katharine Jarmul: And then another technical problem is that the use of Bluetooth, or even Wi-Fi and GPS, is imprecise. It can have me close to a neighbor who’s actually on the other side of a door. It can have me on a bus going by some place, or next to a person standing at the stoplight. From the more recent scientific studies, it is probably much better to determine whether people are in an enclosed space together or not, and that requires quite different sensors and measurements.
Katharine Jarmul: And then finally, from a technical perspective, when we look at something like the Apple-Google partnership, it doesn’t work on some two billion devices, because they’re older, and older devices more likely belong to people in lower-income groups and older people, who are actually the larger at-risk populations. So if we come up with a tech-only solution where you need the latest and greatest, and yet the most vulnerable do not have the latest and greatest, then what exactly is our solution? Are we going to ship new iPhones to every person? Is that going to be the solution? We have to think of it at a higher level.
Katharine Jarmul: And then there are other parts of the problem, of course. A lot of this contact tracing and flattening the curve is about hospital capacity near you. And the critical question is not "why aren’t you staying home?" and some of the shaming that I see. By the way, if people are not at home, they might be going to work, and they might be going to work because they can’t stay at home and still feed themselves, right? So we need to watch ourselves judging and critiquing others when we are not them and we do not know what their needs are.
Katharine Jarmul: And then secondly, a question I don’t see asked enough: why is it that our hospitals are not prepared for pandemics, even though, from a statistical point of view, we know they are not impossible or hugely unlikely? Some of that is about thinking about the cost models of our hospitals. There was a great quote from an official in France who said, "We pay the firefighters even when a building is not on fire." Maybe we need to think about that for our healthcare systems as well: we need to be, maybe, over-prepared even when there’s not a pandemic.
Hugo Bowne Anderson: There are a lot of things I want to touch on there. You mentioned the demographics that are at higher risk. Related to that, I think we’re seeing, and will see, that there are different demographics who will win or lose with respect to their privacy as well. This is something that we’ve seen play out. There’s a wonderful book called Automating Inequality by Virginia Eubanks, and it’s about three large at-scale models. One decides who, among people who are homeless in LA, will get housing and where it will be. Another decides who’s eligible for welfare, I think that’s in Pennsylvania. And another predicts which children are at risk of abuse, I can’t remember where that is, and flags them for reports.
Hugo Bowne Anderson: Throughout all these cases, Eubanks coins the term "the digital poorhouse," which essentially refers to the massive surveillance system that is able to keep its eye on people in lower socioeconomic classes across the US. I think coining the term "the digital poorhouse" is a wonderful act of naming, because it gives us a historical tether for thinking about what’s happening currently.
Hugo Bowne Anderson: But we do see all our preexisting societal biases reinforced by at-scale inference-and-decision-making systems, right? And I presume that we’ll see similar things happening with respect to privacy around COVID as we are with respect to bodily health. I mean, we’ve seen the rates of infection and fatality for people of color in the US, right?
Katharine Jarmul: Yeah. There’s a lot to dive into there, and I’m not sure we have time to dive into all of it; however, I gave a talk at Chaos Communication Camp this past summer called Privacy as Privilege. I delve into many different areas, and of course Virginia Eubanks’ work is in there, but it’s about the ways privacy has become almost irreversibly tied to privilege, especially in societies that are deeply connected right now, and the ability for AI to function for you versus against you is part of that, right? So there’s the Alexa that helps me: AI is here to help, data collection is here to help.
Katharine Jarmul: Versus the AI that incarcerates me: AI is here to imprison, AI is here to maim or kill, right? And so when we think about that, especially as the data scientists, right, the ones that are building systems, we also have to think about who’s going to benefit from this and who’s going to lose because of this.
Katharine Jarmul: And if we can’t reason about those questions, we should get a cross-disciplinary team. Get friends that work in humanitarian sectors and human services, and start deeper connections and deeper conversations there. You know, thinking two steps ahead is hard. It’s hard. I’m an optimist. I always think, "Oh, it’s going to work out perfectly."
Katharine Jarmul: But we also have to have people around us that remind us, "Hey, it’s not always going to work out perfect." And we should think about, again, these risk factors, think about these trust factors, think about and be prepared for how it might go wrong and how it’s going to interact with larger societal systems like, let’s say, inequality and unequal distribution of resources and oppression, right?
Hugo Bowne Anderson: So, I want to go back to contact tracing for one second and, in particular, talk about… Oh… When I saw that Google and Apple were collaborating on something to do with sensitive, private stuff, I thought about the space of tech companies with respect to how they think about privacy and how comfortable I am with the way they think about it. And none of them are great, right? But when I look at that spectrum, Apple’s on one side and Google’s really, really far on the other side. So it was almost like I had this cognitive dissonance around these companies collaborating. Did you experience something similar, or do you have any thoughts on that?
Katharine Jarmul: So I have to first and foremost say that I know and have met people working in data privacy at both Apple and Google, and I think there are a lot of really thoughtful individuals and researchers at both companies. Facebook also has a really active cluster of folks working on privacy and data science right now, and Snapchat does too. So it’s amazing, when you start to dive into data privacy in data science and machine learning, that there are these people in places you might not expect.
Hugo Bowne Anderson: I agree, but back to your point before.
Katharine Jarmul: [crosstalk] …not to knock those people, because I think that they’re doing really… Yeah.
Hugo Bowne Anderson: Sorry, I was just going to say it’s a question of how it’s deployed as well, right?
Katharine Jarmul: Exactly.
Hugo Bowne Anderson: Is it even aligned with any company’s business incentives?
Katharine Jarmul: Yeah, and my biggest concern with the Apple-Google plan, outside of what we are going to do about older devices, is that to me it seemed like jumping to a solution that we weren’t sure worked and weren’t sure was needed.
Katharine Jarmul: And some of this is because there is so much uncertainty around how it spreads. Are there different strains? Can one get it again after somebody has recovered? All of these things are still really in flux, right? Which means we don’t want to build a solution that everybody thinks will fix something, and then it turns out it’s not even related to the real solution, right?
Katharine Jarmul: That was some of my primary concern around it: what’s going to happen is, you know, people are going to think, "Oh, I downloaded the app, so now it means I’m coronavirus safe," and maybe that’s a false sense of security. And then, if we use that app, say, in a really restrictive way, such that maybe your employer can look at it, how many people are going to lose their jobs or be told to stay at home who are not actually positive for coronavirus?
Katharine Jarmul: And how unpredictable is our access to ubiquitous testing? Have we tested everyone who has symptoms, and have we also tested large groups of people who don’t have symptoms, right? There’s a statistician who works here in Germany alongside the Robert Koch Institute who has been pushing for larger, widespread sampling of tests, because if you just set up testing locations and have people self-report, you can imagine it’s going to be people with symptoms, and it might be people who are, let’s say, overly concerned about their health given the daily news. For example, I haven’t gone to my local testing site, so I don’t know; I may be asymptomatic, right? So there’s all this stuff that we don’t know from a scientific perspective, and we don’t want to jump into a solution without actually consulting the scientists who are working on this problem first, you know?
Hugo Bowne Anderson: I also read that we could have false negative rates of up to 30%, right?
Katharine Jarmul: Exactly. Yeah.
Hugo Bowne Anderson: Also, coming from New York, I don’t know if I told you this, I was tested, and the language they used was fantastic. The doctor said the test "did not detect the presence of" the virus, as opposed to "it came back negative," which I think is actually really, really cool. I’ll tell you what, though, the swab was really full on. She put my head back against the wall and shoved this thing straight down there. I pity the fool who tested the tiger, I tell you that, Katharine.
Katharine Jarmul: Yeah, yeah. Some of what I’ve been thinking about: there was a really great piece from the deeplearning.ai newsletter that was around "do no harm." It was about, let’s say, releasing models based on incomplete data or incomplete scenarios. And what you said also reminds me, there was a recent Data Skeptic podcast with Jessica Hullman-
Katharine Jarmul: and her group studies visualization of error and uncertainty in data science. When we talk about these things, even some of the reporting I’ve seen around COVID-19 in the newspapers or other media I’ve been consuming, maybe I have a leg up because I have some statistical training and background. But think about the average, let’s say, human, and the level of mathematics knowledge that’s there. Can they reason about the risk of infection? Can they reason about the verifiability of these tests? Can they reason about the predictive capacity of a model built on limited or oversampled data from a certain population?
Katharine Jarmul: We have to responsibly report this information and data, because we’re trusted members of a group of people who know what data is about. So we have to really make sure that that’s super clear, and figure out better ways to represent it than, say, the way we as a community would talk about errors or uncertainty among ourselves.
Hugo Bowne Anderson: Absolutely. And I do remember, after the presidential election in 2016, people came out and said, "Oh, the predictions were wrong." And that statement in itself, and I understand it, totally misunderstands what a probabilistic prediction is. If you say something has a one in 10 chance and it still happens, that was just that one-in-10 thing happening, right?
Hugo Bowne Anderson: Actually, Allen Downey had some wonderful posts on this, and when I chatted with him on DataFramed, one idea he had that I really loved was that newspapers could report these probabilistic predictions by running the simulation each day, seeing what it produced, and reporting that simulated future in order to get people accustomed-
Hugo Bowne Anderson: Yeah. And Nate Silver and the people at FiveThirtyEight are using kind of more principled, mindful language around it, saying there’s a one in five chance that this will happen, there’s a four in… As opposed to, "That’s 10% likely," which people may misinterpret in a variety of ways. But we crave certainty as well, right?
Katharine Jarmul: Yeah.
Hugo Bowne Anderson: We need to become more comfortable with uncertainty.
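As a rough sketch of the reporting style Hugo describes above (assumed numbers, not anyone’s actual forecasting model): draw one simulated outcome per day from a forecast probability, and phrase the headline figure in “one in N” terms.

```python
import random

forecast_prob = 0.29  # assumed win probability for Candidate A, for illustration only

# Frequency framing: "about 1 in 3" tends to land better than "29% likely"
odds = round(1 / forecast_prob)
print(f"Candidate A wins in about 1 in {odds} simulated futures.")

# The daily-simulation idea: publish one randomly drawn future each day
todays_world = "Candidate A wins" if random.random() < forecast_prob else "Candidate B wins"
print(f"In today's simulated future: {todays_world}.")
```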
Katharine Jarmul: It’s true, yeah. And make sure that people know, or have, let’s say, the right mental framework to make responsible choices both for themselves and for their communities, right? One of the graphics Professor Hullman mentioned, I don’t know if you saw it, I think it was a New York Times one again, let you trace the probability of somebody staying above or below the poverty line based on race in the United States. And just by creating a graphic that, instead of one big bar chart, actually shows “a person is born, they end up in poverty; a person is born, they end up not,” visualizing it as, let’s say, a visual sampling, it was a lot easier for people to comprehend. And there’s so much more work that we can do around that, around how we allow people to reason about systems. I know some of the work we’re doing is about how we allow people to reason about privacy in systems, right?
Hugo Bowne Anderson: The Times has done a lot of great stuff on that. I mean, they were early doing interactive D3 visualizations and that type of stuff. I remember there was one on the jobs report: it looked like it was going up, but they showed, if the underlying trend were actually stationary, all the different ways it could look based on sampling. And they had a similar one on election polling: it reported that so-and-so was ahead, but it showed, as a function of sampling, all the different ways it could work out. And it showed that interactively.
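A minimal sketch of the idea behind that jobs-report graphic, using made-up numbers: even if the true underlying trend is flat, sampling noise alone can make a year of reports look like steady growth or steady decline.

```python
import numpy as np

rng = np.random.default_rng(0)

true_monthly_change = 0      # assume the real trend is flat
sampling_noise_sd = 65_000   # rough, illustrative stand-in for survey sampling error

# Simulate a few alternative 12-month runs of "reported" monthly job changes
for year in range(1, 4):
    reported = rng.normal(true_monthly_change, sampling_noise_sd, size=12)
    apparent_trend = np.polyfit(np.arange(12), np.cumsum(reported), 1)[0]
    print(f"Simulated year {year}: apparent trend ≈ {apparent_trend:+,.0f} jobs/month")

# Some simulated years look like healthy growth, others like decline,
# purely because of sampling variation around a flat reality.
```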
Katharine Jarmul: Yeah. I encourage people, I know that DataCamp has several visualization courses. If you haven’t taken one, even if you’re like, "Oh, I just want to build ML models," there’s a lot of interesting applications of visualizations even when we think about machine learning. So take a chance to dive into something new.
Difference Between Uncertainty and Risk
Hugo Bowne Anderson: The other thing that I think is important is for people to start to talk more about the difference between uncertainty and risk. You might refer to these as known unknowns in the case of risk and unknown unknowns in the case of uncertainty, but the way I think about it is playing at a casino, let’s say flipping a coin or playing roulette. If you’re playing roulette, you know the odds, right? You’re playing with risk there. You know exactly what the distribution is, you know all the possible permutations and what your chances are.
Hugo Bowne Anderson: What uncertainty would be is if the table was rigged and you didn’t know how. Then you actually don’t have information around that. So thinking about what you do know and what you don’t know. I think one of the truly, truly, deeply challenging things of what we’re going through now globally is that there is so much uncertainty. It isn’t just risk. We actually, we don’t know, right?
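A toy way to see that distinction, with assumed numbers: for a fair European roulette wheel the distribution is fully known, so the expected loss of a bet is computable (risk); for a rigged wheel whose bias you cannot observe, there is nothing to plug into the formula (uncertainty).

```python
# Risk: a fair European roulette wheel has 37 known, equally likely pockets.
p_win = 18 / 37  # probability that a bet on red wins
expected_value = p_win * 1 + (1 - p_win) * (-1)
print(f"Expected value of a $1 bet on red: {expected_value:.4f} (about -2.7 cents)")

# Uncertainty: a rigged wheel whose bias we cannot observe.
# p_win is now some unknown quantity, so no expected value can be computed at all --
# we can only bound it, guess it, or gather more information.
```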
Katharine Jarmul: Yeah, we have a lot of unknowns right now, and thankfully, I think, if anybody’s done the exercise of mapping knowns and unknowns on both sides, a lot of unknown unknowns are starting to move into known unknowns, which is good. This is a good movement. Hopefully eventually they’ll move into known knowns. But yeah, we need to be a little bit patient with our friends in the biological and health sciences and figure out ways to assist them in their research, so we can figure out good responses to the current crisis.
Call to Action
Hugo Bowne Anderson: Exactly. So I want to get to questions from the crowd in a minute. But before that, I’d just love to know if you have a final call to action or something you’d like people to focus on or to think about in the coming weeks and months?
Katharine Jarmul: Yeah, I’d ask that, rather than jumping too soon into tech solutions, you focus on your local community: take time to reach out to local community leaders and to community groups that are working in this space, and donate your time and your thoughtfulness and your energy there first. And maybe part of that is teaching yourself and the people around you something new about privacy, about machine learning, about privacy in machine learning. And let’s try to have some collective action around that, not collective shaming but collective empowerment.
Katharine Jarmul: In the same sense, I also want to say: please focus on looking after yourself. You don’t have to be a hero, and you can’t be a hero if you can’t take care of yourself first. So the refrain I like to offer a lot of my friends right now who feel like they have to do something to help is: have you asked yourself what you need? Make sure you put your own mask on before helping others, to borrow the line from the airlines. But really take care of yourself and the people close to you first, and then start thinking about the larger community.
Q&A
Hugo Bowne Anderson: Thank you so much, Katharine, for taking such a mindful and humanistic and empathetic approach as well. I really appreciate that. We’ve got so many great questions. The first one, we’ve talked about this, but, I mean, I don’t have an answer to it, so I’m going to ask you. And I hope I’m pronouncing your name somewhat correctly, Rughavendra asks, "Is there a solution as of now to check if your data is deleted?"
Katharine Jarmul: No, there is no commonplace solution for this, unfortunately. This is something numerous folks are working on. There are some systems that get at parts of it; what we can debate is whether they’ve traversed the entire problem space. So maybe a system can confirm the data has been deleted from, let’s say, the primary database, but it cannot confirm deletion from, let’s say, artifacts or intermediary data sources, or whether the data set was downloaded to a local data scientist’s computer, and so on. So this is an open problem. Not open in the sense that it can’t be solved, but open in the sense that solutions aren’t yet actively deployed.
Hugo Bowne Anderson: I still come back to this idea of duplication. I was a teenager in the mid ’90s, which meant that, for better or for worse, I’m going to admit to slightly illegal activity. Or I’ll say some friends of mine. Some school buddies of mine-
Katharine Jarmul: Were you on Napster?
Hugo Bowne Anderson: They’d burn compact discs, Katharine. They would buy them, take them home and burn 20. I tried to stop them, but they’d do it every day. And then of course iPods came out, and suddenly BitTorrent and all of that. How do you even think about verifying deletion when you have that kind of duplication? So, a bunch of other good questions. Andrew asks, this is great, “In order to log on to some UK pub or shop wifi…” Firstly, thank you, Andrew. I actually worked in a pub, the Princess Louise on High Holborn in London, and I miss those days dearly. I lived with a bunch of Australian barmen, I shared a room with them. Look, in all honesty, I miss British pubs currently. But Andrew asks, “In order to log on to this wifi, the registration process sometimes demands a date of birth. Is this reasonable? For what purpose is it required? Or is it abuse, collecting information for the sake of it? Stored for how long? Accessible by whom? And for what purposes?” Andrew says, “I believe this is unreasonable but would appreciate your thoughts, please.”
Katharine Jarmul: Yeah, I mean, we see a lot of this consent being… From a regulatory perspective, there’s of course some consent around age, right? There’s specific regulation that applies to minors, and that definition differs from jurisdiction to jurisdiction. So some of that might be, “Oh, we’re not going to track you if we think you’re under 13, but if you’re over 13 then we have a right to track you.” And it’s hidden somewhere in the terms and conditions that, evidently, only I read.
Katharine Jarmul: But yeah, there may be some legal reasons why people are asking this. And so, instead of saying something obtuse like, “Please tell me your age to access,” they should maybe be straightforward and say, “Here’s what our website does, here’s how the data access works. If you agree, you have to confirm you’re over 14,” or whatever it is. And unfortunately, I think this is what I would call a dark pattern in consent screens.
Hugo Bowne Anderson: I totally agree. And to build on that, I think something we see in that question is: okay, let’s just say that access to the internet is a basic human right. That’s something I believe. I know people who do, I know people who don’t. It’s definitely something that’s incredibly important, and you could view someone requiring you to give up information in order to access something you have a right to, in certain situations, as some sort of ransom as well. It does seem unfair in that sense, and it’s something I’d encourage us all to rally against.
Katharine Jarmul: Yeah. I mean, transparency, especially about this type of hidden consent, where the data is going to be used and how, and what third party may or may not be, let’s say, tracking your data access. This is something I also brought up in terms of privacy as privilege: more often than not, when you use free wifi, you’re trading data for a service. But not everybody has at-home wifi. So we need to think about those problems as well.
Hugo Bowne Anderson: We have a question from Orhn, O-R-H-N. Orhn asks, “Hi, Hugo and Katharine. Thanks for the webinar.” That’s lovely. “I’ve been following your Python courses on DataCamp.” That’s awesome. “My question is about the health vertical for privacy. I’ve been consulting for a video consultation startup, and they’ve been struggling to scale their data analysis across different countries, since different countries and health administrations require different sorts of privacy regulations. Is there any collective source of information for data privacy regulations internationally?”
Hugo Bowne Anderson: I’m just going to build on this, and I think you and I have chatted about it over the past several years. With GDPR, for example, if you’re a small startup just trying to figure out what it even means, the legal costs you can incur are a real disincentive in a lot of ways. The legislation’s really important, but for a company, figuring it out on the other side is a huge challenge. So I think this question speaks to that more generally as well.
Katharine Jarmul: Yeah, I mean, a big part of the daily challenge we’re working on right now is: how does one determine which privacy techniques to leverage based on both regulation and other risks, let’s say proprietary information in a model, or data sharing between companies that maybe don’t have perfect trust? So some of this is about modeling: what is the system we want to build, and what are the risks and threats there? In this case, maybe the risks would be regulatory risks, right? So I can completely empathize that this is a difficult space.
Katharine Jarmul: I follow quite a lot of different people within the compliance and regulation space to try to keep up on things. Even though they’re based in the US, the IAPP has a pretty good… I think everything’s in English, and it’s not always as specific as, let’s say, when I read the German reports on it. However, the IAPP has a pretty impressive comparison of a lot of different regulations. It’s an organization mainly for privacy professionals, so I’d recommend you check out what’s there. And definitely stay tuned with what we’re building at Cape Privacy, because part of it is about letting small startups and other teams work on these critical issues while still staying compliant with regulatory requirements, and maybe even within a privacy-by-default or privacy-by-design kind of space. Again, you shouldn’t have to become an expert to make the proper choices here, and unfortunately that’s kind of how it looks right now.
Hugo Bowne Anderson: So we have time for one last question. We haven’t gotten to all of them unfortunately, but please do feel free to reach out to Katharine or myself on Twitter, @KJAM and @hugobowne respectively. Or on LinkedIn. It may take some time for us to get back to you, but please, please do be in touch.
Hugo Bowne Anderson: Okay, the final question is, "Hi, Katharine," this is from Rob, "Hi, Katharine." Hey, Rob. Hey, mate. He’s a government data scientist in the UK. That’s why I just called him mate. This is a great question. Rob says, "Naturally public data trust is a big area of interest. You, Katharine, mentioned that tech alone can’t solve all the problems. In your eyes, what are the non-tech priorities in the space of public data trust?"
Katharine Jarmul: Yeah, excellent question. One of the largest things, or one of the things I know from people working in the space, is, from a legal context: could we actually abide by regulation and set up data trusts at the same time? Because there are of course a lot of precautions, let’s say, especially around electronic health records. As the previous question asker pointed out, they’re not, let’s say, generalized across all jurisdictions, so we quickly run into a problem: if my EHR, my electronic health record data, is stored in a data trust in Canada but I’m a European resident, then the type of consent I have to give for these things gets quite murky. And if, say, the company that wants to use it is in Brazil, then we open up a whole other question of international jurisdiction over data, consent, transparency and so on.
Katharine Jarmul: So from a legal aspect, I know there’s quite a lot to dive into around data trusts. And then I would challenge us, from a humanitarian data science perspective, on the non-technical side: do we understand the problems there? Do we understand the problems of putting a bunch of data together? We’re creating quite a large security risk from that standpoint. And do we understand how to communicate to people how the data is being used? That’s separate from enforcing that it’s actually only used for that purpose, which is of course a huge technical problem and also just a trust problem. So some of it comes back to my challenge: can people reason about privacy risk? Can people reason about machine learning if it’s new to them? Can people reason about uncertainty, errors, statistics? We have to answer those questions alongside the larger technical questions.
Hugo Bowne Anderson: So that was a great question and a wonderful answer. I wish we had more time, but we’re going to have to wrap up now. We’ve still got nearly a thousand people here, and that’s incredible to stick around for an hour and a half for a conversation like this. I’d like to thank Ryan, who’s one of the superstars behind our webinar program here, and everyone at DataCamp who makes these webinars possible.
Hugo Bowne Anderson: Once again, we received lots of questions we weren’t able to answer. So feel free to reach out to us on Twitter and LinkedIn. The recording of the webinar will be sent out via email after this whole week of webinars. And don’t miss tomorrow’s webinar on how to roll out an online first data literacy program, which will be at 11:00 AM ET. Thanks once again for joining. The biggest thanks of course, thank you so much, Katharine for bringing your-
Katharine Jarmul: Thank you, Hugo.
Hugo Bowne Anderson: … expertise and energy. I invited you on for this because I love every conversation we have and it’s always a wild ride, and this just exceeded all my expectations. I got to be honest.
Katharine Jarmul: Likewise. Thanks everybody who joined. Feel free to reach out. I promise some explanations of some topics like differential privacy and federated learning that we didn’t dive too much into. So if you want to dive into this more, ping me on the Twitters.
Hugo Bowne Anderson: So Katharine will send out a tweet per day explaining all these technical concepts, is what I just heard. And I look forward to that in 180, however many characters they increased it to last time, or less.
Katharine Jarmul: 280?
Hugo Bowne Anderson: Whatever it is.
Katharine Jarmul: Thank you, Hugo. As always, it’s such a pleasure chatting with you.
Hugo Bowne Anderson: Amazing. Thank you so much, Katharine. And thank you, everyone. I just said that to the wall, but thank you. All right. I’m out of here.