A discussion with Alfred Spector, Peter Norvig, Chris Wiggins, Jeannette Wing, Ben Fried, and Michael Tingley
Dramatic advances in the ability to gather, store, and process data have led to the rapid growth of data science and its mushrooming impact on nearly all aspects of the economy and society. Data science has also had a huge effect on academic disciplines with new research agendas, new degrees, and organizational entities.
Recognizing the complexity and impact of the field, Alfred Spector, Peter Norvig, Chris Wiggins, and Jeannette Wing have completed a new textbook on data science, Data Science in Context: Foundations, Challenges, Opportunities, published in October 2022.6 With deep and diverse experience in both research and practice, across academia, government, and industry, the authors present a holistic view of what is needed to apply data science well.
Ben Fried, a venture partner at Rally Ventures and formerly Google's CIO for 14 years, and Michael Tingley, a software engineering manager at Meta, gathered the authors together as they were finishing up the manuscript to discuss the motivation for their work and some of its key points.
Norvig is a Distinguished Education Fellow at Stanford HAI (Human-centered Artificial Intelligence) and a research director at Google; Spector is a visiting scholar at MIT with previous positions leading engineering and research organizations; Wiggins is an associate professor of applied mathematics at Columbia University; and Wing is executive vice president for research and professor of computer science at Columbia University. (More biographical detail on the panelists is available at the conclusion of this article.)
Ben Fried You've come to data science from very different backgrounds. Was there a shared inspiration to write the book?
Alfred Spector In one way or another, I think we all saw a deep and growing polarity in data science. On the one hand, it has enormous, unprecedented power for positive impact, which we'd each been lucky enough to contribute to; on the other hand, we had seen serious downsides emerge even with the best of intentions, often for reasons having little to do with the technical skills of the practitioner. There are many excellent texts and courses on the science and engineering of the field, but it seems like there is something in the headlines every day that demonstrates there is an urgent need to educate on what you, Ben, have called the "extrinsics" of the field.
Peter Norvig Throughout the rapid growth in applications of data science, there have been serious issues to confront: click-fraud, the early Google bombs, data leaks, abusive manipulation of applications, amplification of misinformation, overinterpretation of correlations, and so many more—all things we read about daily. Some problems are more serious than others, but we feel education will help us to lessen their frequency and severity, while simultaneously allowing us to understand their significance.
BF Why the word Context in the title of your book?
Chris Wiggins It was our primary motivator. In a nutshell, we wanted to provide some inclusive "context" for the data-science discipline. We felt the term data science is often used too narrowly.
AS We think of context in three ways.
It refers to the topics beyond just the data and the model. These include dependability, clarity of objectives, interpretability, and other things I'm sure we'll get into.
It also refers to the domain in which data science is being applied. What is crucial for certain applications isn't needed for others. Teams practicing data science must be particularly sensitive to the uses to which their work will be put.
Finally, context refers to the societal views and norms that govern the acceptance of data-science results. Just as we have seen changing views and norms regarding privacy and fairness, data science will increasingly be expected to solve challenging problems, where societal views vary by region and over time. Some of these problems are "wicked," in C. West Churchman's2 language, and they are so very different from the problems that computing first addressed.
Jeannette Wing While data science draws from the disciplines of computer science, statistics, and operations research to provide methods, tools, and techniques we can apply, what we do will vary according to whether we're working on a healthcare issue, something related to autonomous driving, or perhaps exploring some particular aspect of climate change. Just as each discipline comes with its own constraints, the same might be said of each of these different problem domains. Which is why the application of data science is largely defined by the nature of the problem we're looking to solve or the task we're trying to complete.
PN Beyond this, I personally wanted to reach a broader audience than I had with my more mathematical and algorithmic textbook. To do data science, we need to know many techniques, but we also need to be conversant with larger, societal issues. We all shared this motivation.
BF All this leads to the question of how you define data science.
JW By the time Alfred and I first started talking about working on a book, I was already writing papers and giving talks where I defined data science as the study of "extracting value from data." But we agreed that this definition was too high level and insufficiently operational.
AS So, we started with "extracting value from data," then added prose to address the two personalities of the field—one where data is used to provide insight to people (as in many uses of statistics) and the other having to do with data science's ability to enable programs to reach conclusions.
CW We also recognized we needed a capacious definition [see sidebar] to respect what people are doing in the name of data science within industry and academia, as well as the rapidity of change in the field.
Definition of Data Science
Data science is the study of extracting value from data—value in the form of insights or conclusions.
A data-derived insight could be:
• An hypothesis, testable with more data.
• An "aha!" that comes from a succinct statistic or an apt visual chart.
• A plausible relationship among variables of interest, uncovered by examining the data and the implications of different scenarios.
A conclusion could be in an analyst's head or in a computer program. To be useful, a conclusion should lead us to make good decisions about how to act in the world, with those actions taken either automatically by a program or by a human who consults with the program. A conclusion may be in the form of a:
• Prediction of a consequence.
• Recommendation of a useful action.
• Clustering that groups similar elements.
• Classification that labels elements in groupings.
• Transformation that converts data to a more useful form.
• Optimization that moves a system to a better state.
Taken from Data Science in Context: Foundations, Challenges, Opportunities.6
BF It's a very fluid definition. Not only does data science mean different things to different people, it also has fuzzy boundaries.
CW Exactly! We're at that time in the creation of a new field where it does have fuzzy boundaries. It touches on many different subjects: privacy/security, resilience, public policy, ethics, etc. But it's also clearly taking form with the creation of job titles, degrees, and departments. We saw an opportunity to take a stab at defining its breadth—starting with the diverse challenges its practitioners must overcome.
Michael Tingley Do you make a distinction between data science and machine learning?
AS As a domain, data science is broader than machine learning, in that machine learning is only one of the techniques it employs. Data science encompasses many techniques from statistics, operations research, visualization, and many more areas: in fact, all the things needed to bring insights and conclusions to a worthwhile end. That being said, the revolutionary growth in machine learning has absolutely catalyzed the most change: incredible successes but some challenges too.
PN One difference is that, in the machine-learning arena, a researcher's focus might be to write a paper that touts some new algorithm or some tweak to an existing algorithm. Whereas, in the data-science sphere, research is more likely to talk about a new dataset and how to apply a collection of techniques to use it.
BF So, you were motivated by the breadth of challenges we face. Where did you end up? Are there approaches that can help?
• Tractable data
• Technical approach
• Clear objectives
• Tolerance of failures
• Ethical, legal, societal implications
PN After lots of give and take, we came up with something we call an analysis rubric, where we enumerate the elements a data scientist needs to take into account.
As Atul Gawande writes in The Checklist Manifesto,3 checklists such as our rubric make for better solutions, and we hope ours might help people avoid some of the mistakes we have made in past projects. But because each project is different, it's hard to come up with one checklist that will work across all of them, so we'll see how well it holds up to the test of time.
AS Let's be specific. The analysis rubric addresses the challenges in seven categories. Some relate more to how we implement or apply data science. The others relate more to the requirements we are trying to satisfy.
PN The rubric starts with data: getting and storing it, wrangling it into a useful form, ensuring privacy, ensuring integrity and consistency, managing sharing and deletion, etc. In some ways, this may be the hardest part of a data-science project.
For me, the first big revelation of data science was that data can be a key asset that offers real value.4 But, the second revelation was that data can be a liability if you're not a good shepherd for it.
BF Are there hidden costs to holding onto data?
PN I've learned something in this regard from the efforts made in recent years to advance federated learning. In earlier days, if a team wanted to build a better speech recognition system, it would import all the data into one location and then train and optimize a model there until it had something it could launch to users. But that meant holding onto all these people's private conversations, with concomitant risks. As a field, we decided it would be best not to hold onto that information but instead to train on each person's data privately, while figuring out some clever way to share the improvements made individually across many people in a federated learning framework. This federated approach seems to be working out pretty well. The privacy concerns have ended up leading to a pretty good scientific advancement.
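The federated idea described above can be illustrated with a minimal sketch of federated averaging. Everything specific here is an assumption made for illustration: the one-parameter model y ~ w * x, the hypothetical function names `local_update`, `federated_round`, and `train`, the learning rate, and the toy data. Production systems add secure aggregation, client sampling, and much more. The sketch shows only the core pattern: each client trains on its own data, and the server sees model weights rather than raw records.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one client


def local_update(w: float, data: Dataset, lr: float = 0.05, steps: int = 20) -> float:
    """Run gradient descent on one client's private data for y ~ w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, clients: List[Dataset]) -> float:
    """One round: each client trains locally; the server averages the weights."""
    total = sum(len(d) for d in clients)
    local_weights = [(local_update(w_global, d), len(d)) for d in clients]
    # Average weighted by client dataset size; raw data never leaves a client.
    return sum(w * n for w, n in local_weights) / total


def train(clients: List[Dataset], rounds: int = 10) -> float:
    """Repeat federated rounds starting from w = 0."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, clients)
    return w


# Hypothetical toy data: every client's points lie on the line y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
    [(1.0, 3.0)],
]
w_final = train(clients)
```

In this toy setup the shared weight converges toward 3, the slope common to all clients, even though no client ever sends its data points to the server.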
AS Our second rubric element is the most obvious. There needs to be a technical approach, which can come from machine learning, statistics, operations research, or visualization. This offers a way to provide valuable insight and conclusions, whether prediction, recommendation, or the others.
It isn't easy to find a model in some situations. Sometimes there is just too much inherent uncertainty, and other times the world may continually change and make modeling efforts ineffective. Some situations are game-theoretic, and a model's conclusions themselves generate feedback that makes the world less predictable.
One example of the limitations of modeling has been to predict what might happen due to Covid-19. For many reasons relating to limitations of data, rapidly changing policy, variations in human behavior, and virus mutations, the ability to make long-term predictions of mortality has been poor.
BF Are you saying data science didn't help at all in the war on Covid?
PN I was involved in a project with an intern and some statisticians at UC Berkeley where we were trying to give hospitals advance notice of how many staffers they would need to bring in three days ahead of time. We couldn't give them accurate predictions 30 days in advance, but we could do useful short-term predictions.
JW And for sure, data science was applied successfully in many other areas, most obviously in the vaccine and therapeutics trials.
BF We could devote our whole time to models, but given the topic's broad coverage, let's move to the next rubric element: dependability.
JW With data science being used in ever more important ways, dependability is of increasing importance, and we include four subtopics under it: Are the privacy implications of data collection, storage, and use acceptable? Are the security ramifications for the application acceptable, given the likelihood that attacks may release data or impair an application's correctness or availability? Is a system resilient in the face of a world that is continually changing and with modeling techniques we may not fully understand? Finally, is the resulting system sufficiently resistant to the abuse that has savaged so many applications?
CW We should note the tensions within the dependability components. The push for privacy versus the need to provide security is an example. End-to-end encryption would reduce risks to privacy and keep providers from seeing private messages, but it would also limit platforms' abilities both to respond to law enforcement requests and to perform content moderation. There definitely are some unresolved tensions here.
MT Getting privacy, security, resilience, and abuse resistance right is a good start and a formidable challenge in itself. Is that enough to allow people to trust the applications of data science?
AS It's probably not enough. Developers, scientists, and users must have sufficient understanding of data-science applications, particularly in increasingly sensitive situations. The general public and policymakers also need to have more understanding, given the pervasive impact.
This leads to the rubric topic of understandability, which has three categories: Must a model's conclusions be interpretable—that is, should the application be able to explain "why?" Must conclusions prove causality, or is correlation sufficient? And must data-science applications, particularly in the realms of science and policy, make their data and models available to others so they can test for reproducibility?
Where data science is employed in research, the tradition is that others must be able to reproduce work so they can test and validate it. This is very hard to accomplish when we're dealing with massive volumes of data and complex models.
PN Understandability has been particularly hard with machine learning, but contemporary research is making progress—for example, with visualization and what-if analysis tools. While causality is difficult to show with only retrospective data, the causal inference work from the statistics community can reduce the amount of additional experimentation needed to demonstrate it.
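A small worked example can show why correlation in retrospective data does not establish causality. The sketch below is not one of the causal inference methods mentioned above; it is a simple partial-correlation check, and the data, the variable names `x`, `y`, and `z`, and the helper functions are all hypothetical. Here `z` drives both `x` and `y`, while `x` has no direct effect on `y`.

```python
from math import sqrt
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


def residuals(y: List[float], z: List[float]) -> List[float]:
    """Residuals of a least-squares line fit of y on z."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    slope = (
        sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
        / sum((zi - mz) ** 2 for zi in z)
    )
    return [yi - (my + slope * (zi - mz)) for yi, zi in zip(y, z)]


# Hypothetical data: z is a common cause; the added terms are fixed "noise".
z = [float(i) for i in range(8)]
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

raw = pearson(x, y)                                   # strong correlation
adjusted = pearson(residuals(x, z), residuals(y, z))  # after removing z's effect
```

The raw correlation between `x` and `y` is strong, yet it largely disappears once the shared dependence on `z` is removed. In real retrospective data the confounder is often unmeasured, which is part of why causal claims are hard to support.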
AS Here's a real-world example from about 10 years ago when I was at Google. Some argued it might be better for societies to measure and then maximize happiness rather than, say, per capita GDP (gross domestic product). Catalyzing this interest, perhaps, was Bhutan's then-recently introduced Gross National Happiness metric. Some believed that Google could glean a happiness score from the collective searches of a population. Before we proceeded too far, we realized there was a big gotcha: The score would be so influential that Google would need to explain to the public how it was calculated. If the mechanism were fully explained, however, people would want to abuse it and so render it invalid. While there was data and (likely) a model, understandability concerns, and then dependability concerns, eventually torpedoed the effort.
MT This naturally leads to the question of setting precise goals. Are the objectives of the system an immutable, external property, or is there also some emergent property in how the system or its context evolves?
AS The next rubric element relates to having clear objectives. Do we really know what we're trying to achieve? Requirements analysis has always been needed in complex systems, but many uses of data science are extremely challenging. They require the balancing of near- and long-term objectives, the needs of different stakeholders, etc. There may not even be societal consensus on what we should achieve. For example, how much fun—or how addictive—should a video game be? Which recommendations to a user are beneficial versus which might prove distracting in the wrong situations? Are some downright harmful?
As already mentioned, a society's norms may change over time. It's hard to anticipate everything, but we should try to think about the downside risks posed by aspects of a particular design. We advocate that these risks be made as explicit as possible.
CW Beyond that, we need to be prepared to monitor the way a data product is used and to mitigate its harms. A video-game maker years ago may not have anticipated that some people now would consider their product to be addictive for young children. Mitigating harms, in this case, may mean design changes that prevent or lessen extended play or other signs of addictive behavior. Even then, not everyone in the company that made the game might agree this is a problem. A company committed to ethical data products, however, takes this seriously.
AS An objectives-related topic unto itself is the incentive structure that data science makes feasible. Given the ability to measure and optimize almost anything, are we optimizing the right things? Which incentives should be built into systems to guide individuals, organizations, and governments in the best way?
BF Where does fairness come into this? It's critically important and very complex. Is there even agreement on what's fair and what isn't? Won't those opinions change over time?
AS Fairness is addressed in two ways in our rubric. First, it's an implementation-oriented topic: Data collection and models need to be built and indeed tested to be sure they work well, not just on average but for subpopulations. Societal priorities proscribe conclusions that are reached based on subgroups' protected attributes.
JW On top of the typical software engineering challenge of making sure the model is working properly, we need to pay great attention to training data. This is pretty new for software engineers.
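The point about testing "not just on average but for subpopulations" can be sketched in a few lines. The records, the group labels "A" and "B", and the function name `accuracy_by_group` are hypothetical and exist only for illustration; a real evaluation would use held-out data and metrics suited to the application.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (subgroup label, true outcome, model prediction).
Record = Tuple[str, int, int]


def accuracy_by_group(records: List[Record]) -> Dict[str, float]:
    """Return accuracy overall (key "overall") and for each subgroup.

    Assumes no subgroup is itself labeled "overall".
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, truth, pred in records:
        for key in ("overall", group):
            total[key] += 1
            correct[key] += int(truth == pred)
    return {key: correct[key] / total[key] for key in total}


# Hypothetical evaluation set: group A is large, group B is small.
records = (
    [("A", 1, 1)] * 45 + [("A", 0, 0)] * 45   # group A: 90 of 90 correct
    + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5   # group B: 5 of 10 correct
)
report = accuracy_by_group(records)
```

With these made-up numbers the overall accuracy is 95 percent, while accuracy for group B is only 50 percent, which an average-only test would hide.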
AS I like to say that when systems learn from data, "the past may imprison the future," thereby perpetuating unwanted behaviors.
Beyond these data and implementation challenges, the second fairness challenge is in goal-setting. There are complex ethical, political, and economic considerations about what constitutes fairness.
CW Ultimately, this comes down to the objective of trying to gain value, which is a key word in our data-science definition, since it comes with both an objective meaning and a subjective meaning. That is, beyond whatever mathematical value we're trying to calculate or optimize, there's what we or our society may value. In part, I think this speaks to the fact we're now making data-science applications that have more and more impact on society. Going back to context, you have to think about what constitutes a success, and that can be complicated.
As Alfred has observed, this involves deciding on the goal or objective function we're trying to optimize while acknowledging what we are omitting. It's very hard to consider all the possible edge cases and human impacts of some data-science applications.
JW On a related topic, in our next rubric element we examine whether the data-science application is innately failure-tolerant, given that the objectives a system meets may not be perfectly defined and may be achieved only with some probability. Self-driving cars, for example, aren't particularly failure-tolerant, whereas advertising would seem more so. But even some advertising applications of data science can be intolerant of failures; for example, it's important to identify foreign sources of election advertising revenue and to abide by regulations governing certain products.
BF What about the last rubric element?
CW With data-science applications affecting individuals and societies, they must take into account ethics, as well as a growing body of regulations. These are covered in the ethical, legal, and societal implications element (shown in table 1).
|TABLE 1: Illustration of the Analysis Rubric Elements|
|-|
|Toleration of Failures|
|Ethical, Legal, Societal Considerations|
Taken from Data Science in Context: Foundations, Challenges, Opportunities.6
AS Indeed, the body of laws governing many data-science uses is already quite large. Furthermore, there are broad societal implications; for example, data science almost certainly is altering the employment landscape and having effects on societal governance.
MT As a practitioner, I think it's wonderful to have some guiding principles like the rubric to think about. In practice, however, it's sometimes difficult to anticipate these issues up front and perform risk assessments or even guess at some of the longer-term outcomes. For example, thinking about all the potential ethical implications of something before you even know where your investigation might lead is really challenging.
My question is: To what extent do we as practitioners bear responsibility for exhaustively analyzing and estimating these sorts of issues in advance? Isn't it inevitable that much of this work is going to end up being guided by retrospective analysis once we've figured out where we've landed?
AS Compounding the challenge you raise, the world might change just because of the launch, meaning the very existence of a data-science application changes the ground rules that guided its development. As an example, the world may become dependent on some application, which would result in increased dependability requirements.
CW Then there's also the matter of maintaining and monitoring a data product. It's not possible to know in advance what all the possible failure modes are before a launch, but there are plenty of opportunities to maintain and monitor a product as the world changes and potential harms are made clear.
JW We hope practitioners will end up using the analysis rubric as a checklist during many stages of a project. Some things ought to be easy enough to consider before building a model, but then further assessment will also be required after the model is built. With data science, it's even less likely that you'll be able to anticipate everything in advance than it is with more traditional software.
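One way a team might use the rubric as a checklist at several project stages is sketched below. The seven element names follow the discussion above; the sub-items are paraphrased from it, and the `RubricReview` class, its methods, and the stage label are hypothetical conveniences rather than anything prescribed by the book.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Element names follow the discussion; sub-items are paraphrased from it.
RUBRIC: Dict[str, List[str]] = {
    "Tractable data": ["Can the data be obtained, wrangled, and governed well?"],
    "Technical approach": ["Is there a workable model or method?"],
    "Dependability": ["Privacy", "Security", "Resilience", "Abuse resistance"],
    "Understandability": ["Interpretability", "Causality", "Reproducibility"],
    "Clear objectives": ["Are the goals and trade-offs explicit?"],
    "Toleration of failures": ["How costly is a wrong conclusion?"],
    "Ethical, legal, societal implications": ["Ethics", "Regulation", "Societal impact"],
}


@dataclass
class RubricReview:
    """Tracks which rubric items a team has reviewed at one project stage."""
    stage: str
    reviewed: Dict[str, Set[str]] = field(default_factory=dict)

    def mark(self, element: str, item: str) -> None:
        """Record that one item under one element has been reviewed."""
        if item not in RUBRIC.get(element, []):
            raise KeyError(f"Unknown rubric item: {element} / {item}")
        self.reviewed.setdefault(element, set()).add(item)

    def open_items(self) -> Dict[str, List[str]]:
        """Return the items not yet reviewed, grouped by element."""
        remaining: Dict[str, List[str]] = {}
        for element, items in RUBRIC.items():
            done = self.reviewed.get(element, set())
            missing = [i for i in items if i not in done]
            if missing:
                remaining[element] = missing
        return remaining


review = RubricReview(stage="before modeling")
review.mark("Dependability", "Privacy")
review.mark("Dependability", "Security")
pending = review.open_items()
```

A team could create one `RubricReview` per stage (for example, before modeling, before launch, and after launch) and compare what remains open at each point, in keeping with the view that not everything can be anticipated up front.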
AS This emphasizes the role for product managers, who are tasked with looking at a project broadly. Their role becomes all the more critical as projects come to be less dominated by technology. In fact, if you talk to many product managers today, you'll hear them say things like, "Our engineers started on this effort, particularly the machine learning, and they did a lot of work without pausing to think about all the other challenges they were likely to encounter. And I really wish they'd talked about that earlier because it would have saved us a lot of rework." That being said, as Chris intimated, we don't think everything should be approached with a waterfall methodology. There's plenty of interaction and adaptation required.
BF Let's spend some more time on your work on ethics.
JW While we could have kept the discussion of ethics implicit in the other rubric elements, such as our discussions of how to set good and fair objectives, Chris and I, in particular, wanted to focus on ethics explicitly. We decided to start with the Belmont principles5 as a basis and see how far they would take us. I'd say they've actually stood up pretty well so far.
BF What are the Belmont principles, and how do you apply them?
CW The Belmont principles were effectively an attempt to create a U.S. government specification for ethics. In response to serious ethical breaches in taxpayer-funded research, Congress in the 1970s created a diverse commission of philosophers, lawyers, policymakers, and researchers to figure out what qualifies as ethical research on human subjects. After years of discussion, the commission announced that its focus would turn to articulating a set of principles that would at least provide a common vocabulary for people who attempt to make a good-faith adjudication as to what qualifies as ethical behavior. The principles themselves are:
Respect for persons, ensuring the freedom of individuals to act autonomously based on their own considered deliberation and judgments.
Beneficence, that researchers should maximize benefits and balance them against risks.
Justice, the consideration of how risks and benefits are distributed, including the notion of a fair distribution.
These principles were ultimately released by the U.S. government in 1978, and they've since been used as a requirement in some federal funding decisions. One exploration in our book is how these principles remain useful for thinking through ethical decisions that researchers and organizations must make in data-science research and in developing data products.
BF Are there any contemporary examples of how the Belmont principles are being applied?
AS Perhaps the intense discussion of Covid-19 vaccination for young children is illustrative of the give and take. While it's currently believed that vaccinating a young child may be of only modest benefit to the child, we have hoped that having fewer infectious children may reduce Covid-19 in elders with whom the child comes in contact.
This pretty explicitly shows the trade-offs: Respect for persons might argue we would not seek to vaccinate the child, since the vaccine is of unclear benefit and the child may be too young to provide informed consent. On the other hand, the principle of beneficence might win the day, given the potential for saving the lives of many grandparents. In a perfect world, this would be informed by good statistics.
In any case, it illustrates the sorts of challenges policymakers and parents face. We all believe that the explicit give and take of the Belmont principles in such situations ultimately provides better, and more transparent, decisions.
BF Do you have an example more related to data science?
AS Earlier in the discussion, Jeannette noted that self-driving cars are not naturally failure-tolerant. Interesting ethical questions—as well as some practical ones—come up around this since it's unlikely a self-driving car will ever be 100 percent safe in all circumstances. We'll face the question of what constitutes an acceptable failure rate as the technology gets closer to mass adoption. That is, how much risk are we willing to accept? Auto accidents currently account for around 40,000 deaths per year in the U.S. alone, but if perfection is required, we probably won't ever be able to deploy the technology.
PN We're quite inconsistent as a society when it comes to what we'll accept and what we won't accept. While the debate over self-driving cars continues to rage on, I happen to know some people who are working on self-flying cars. I find it perplexing that as a society, we have apparently decided that having the 40,000 road deaths a year is OK, while the number of air-travel deaths ought to be zero. Accordingly, the legal requirements imposed by the FAA are far more stringent than those applied to road travel. And we need to wonder if that's really a rational choice for how to run our society or whether we should instead be looking to make some different tradeoffs.
BF The sphere of ethics is inherently qualitative, whereas computing is a highly quantitative practice. I've witnessed discussions that diminish qualitative standards because they can't be measured and have no objective function. Given that, are you worried about uptake of these principles?
CW In my experience, software engineers love to talk about design principles. In fact, Alfred mentioned the waterfall model, yet design methodologies are pretty qualitative. Engineers are already dealing with principles that get debated regularly—and changed with some frequency.
BF Are the Belmont principles sufficient for any ethical question?
AS While we focus on the Belmont principles, we also acknowledge that individual and organizational decision-making will take other frameworks into account. I call out three:
First, there are professional ethics, like the ACM code of ethics.1 Truthfulness, capability, and integrity must be a given as we apply data science.
Second, certain situations have different ethical standards. The war in Ukraine has made stark for us the laws of war, so-called jus in bello, and their implications.
Third, decisions are made in an economic framework, where the economic system exists to channel energy, competition, and self-interest into benefits for individuals and society.
CW We want to remind everyone that it's not enough to have principles. Each individual and organization applying data science needs to come up with organizational structures and approaches to incorporate them into their process.
JW The academic community is taking this seriously. We saw an opportunity to put a stake in the ground by telling students, "If you want to be a data scientist, you're going to learn about ethics along with all this quantitative stuff." The Academic Data Science Alliance, which began a few years ago, emphasizes ethics sufficiently that I believe ethics courses are now integral to most academic programs in the discipline. I'm very encouraged by this since, even as data science is only beginning to emerge in academia, we're already incorporating these qualitative ethical principles as integral to the field.
PN This is just part of being in a field that's finally growing up. When the work you're doing is only theoretical or academic, then you go ahead and publish your papers and it really doesn't matter. But once that field starts to make a genuine impact on the world, you suddenly find you have some serious ethical responsibilities.
BF Looking at the other side of the coin, should an understanding of data science inform a liberal arts education that includes some exposure to ethics?
CW Having taught a class on the history and ethics of data,7 I can tell you that humanities students show a tremendous interest in learning about it. And our engineering students even demand that we focus on the ethical aspects. You can imagine people who would like the topic to be taught as if it dwelled solely in the Platonic realm of pure thought. You can also imagine there are other people who would want us to focus more on the very applied and perhaps even product-driven aspects of the topic. I've found it useful to teach things historically to provide a structure to these different interests.
PN While it's important to raise these issues and to have general principles, it's also important to have case law based on real-world examples. That is, in our legal system we have laws that people take great care to write as clearly as they can, but they can't anticipate all the possibilities that might surface later. We supplement the laws with case law.
It's one thing to say that privacy and personhood are important rights. But then how does that apply to the use of surveillance cameras? You can't really answer that just from general principles. You need to get more precise by specifying the types of uses that are approved and those that aren't. Principles are a good starting point, but we also need the specificity that examples offer.
BF Now I have an engineering question for you: Is scale inherent to data science?
PN Yes. If it hadn't been for big data, we wouldn't today be talking about data science as a separate field. Instead, it would still be part of statistics. While the folks in statistics were focused on whether you needed 30 or 40 samples to achieve statistical significance, there were some other people who were saying, "Well, we've got a billion samples, so we're not going to worry about that. Instead, we've got a few other problems and we're going to focus on them." Those issues became the focus of the new field.
JW However, we can do plenty of data science at a smaller scale with what some people call "artisanal data" or "precious data." There are plenty of challenges to contend with in that space since it often involves working with combined datasets, which means dealing with all the issues that go along with heterogeneous data. So, we still have some fundamental scientific and mathematical questions to address, whether we're working with big data or heterogeneous small data.
AS A side effect of all this data is that we all are regularly confronted with both meaningful—and not-so-meaningful—details that are hard to put into context. Considered within our understandability rubric element, the sheer volume of data and conclusions we get every day is difficult for even experts to understand. In particular, we are often presented with correlations whose meanings are not as far-reaching or conclusive as we are often led to believe. All of the technology for capturing, storing, and locating data makes it far easier to cherry-pick data and use it out of context to advance erroneous points of view.
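As an illustrative aside on the cherry-picking point, the short Python sketch below searches many unrelated random series for the one that best matches a random target. Everything in it is hypothetical (the function names, the sample size of 20, the 1,000 candidates, the seed); it is only meant to suggest how a strong-looking correlation can surface from pure noise when enough candidates are examined.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_spurious_correlation(n_samples=20, n_candidates=1000, seed=0):
    """Largest |r| between a random target and many unrelated random series.

    All series are independent Gaussian noise, so any correlation found
    here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # The "winning" series looks related to the target, yet it is noise.
    print(best_spurious_correlation())
```

Reporting only the winning series, without saying how many were tried, is one way data ends up being used out of context.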
PN Also, whenever data is derived from human interactions with various systems, there is the challenge of determining how much of it is trustworthy. For example, if you're working with a lot of data that comes from observations of what people are clicking on, it might be tempting to assume they're clicking on things they're truly interested in. But we humans have our frailties and biases, meaning our actions don't always reflect our own best interests. We also have lapses, in the sense that people click on things without meaning to. It's important to understand those limitations in order to interpret the data better.
BF Given all of this, what concerns should we have about how data science allows us to derive answers and benefits based on user interactions, especially given how they can change over time without the creator of the model being aware?
PN This certainly presents a big challenge. We need to recognize we're in a game-theory situation where, when you make a move, other people are going to make a move in response, whether they're spammers or legitimate participants in the ecosystem. This sort of runs counter to big data since, even if you've got millions of clicks, you won't have any clicks for what happens after you reach and disseminate a conclusion.
You don't know how people are going to change their strategies. You have no data on that whatsoever. There's this tension between the things for which you can measure everything and know exactly what's going on and the things in the future that may end up messing with your normal business model in unknown ways. Then there's also the possibility you'll have changed the ecosystem in ways you don't understand.
AS This applies to finance as well, of course. If you're applying algorithmic approaches to buying and selling and your activities are having an impact on the market, you can't be certain exactly what effect your purchase or sale might have.
BF Which is why analyses based on historical data have flaws. "Past performance may not be indicative of future results," as all the brokerage houses are quick to remind you.
CW If I can inject one broader aspect of scale, it also has an ethical valence. Big systems that operate at scale can have a far-ranging, global impact.
JW From the engineering perspective, scientists have their own concerns. Often, they are working with massive amounts of data from sophisticated instruments such as the IceCube Neutrino Observatory in Antarctica or the James Webb Space Telescope. And, from what my scientist colleagues tell me, they need new techniques for storing, preserving, and analyzing data.
MT What about the software engineering of data science?
AS It's hard to build quality software under even the very best of circumstances. Data science adds a new level of challenge, because we are now using modules that are learning from data, and they may work well in some contexts and not in others. We may have confidence they are likely to work well for an average case, but we don't know exactly how well they work for certain inputs, and, again, we don't know how well they will work over time.
JW Having once been involved in the formal verification community, let me restate what Alfred said more formally. To show that a program was doing the right thing, we would use a very strong theorem (for all x, P(x)) to overprove the point. Then, once that had been demonstrated, we could be certain the computer would do exactly what we had intended for any valid input.
But for machine-learned models, universal quantification is too strong and unrealistic. We wouldn't say "for all x, P(x)" since we do not intend that a machine-learned model should work for all possible data distributions. Rather than proving for all x, P(x), we could focus on proving the property for all data distributions within a certain class, but then we would need to characterize the class.
For robustness, we might say for all norm-bounded perturbations to characterize the class of data distributions for which a model is robust. But what about a property such as fairness? This soon becomes very tricky to formalize. A practical consequence is that we need to increase testing, recognizing—as in traditional software engineering—we're never going to be able to test everything that's likely to crop up in real life. This illustrates why trustworthiness is an important research frontier.
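To make the idea of a property quantified over norm-bounded perturbations concrete, here is a minimal Python sketch for the simplest case, a linear classifier. It is not taken from the discussion; the function name, weights, and inputs are illustrative assumptions. For a linear model the property can be checked exactly: the prediction sign(w·x + b) is unchanged for every perturbation d with ||d||_2 <= eps precisely when |w·x + b| > eps * ||w||_2, because by the Cauchy-Schwarz inequality a perturbation of that size can shift the score by at most eps * ||w||_2.

```python
import math


def l2_robust(w, b, x, eps):
    """Check robustness of a linear classifier at one input.

    Returns True if sign(w.x + b) is the same for every perturbation d
    with ||d||_2 <= eps. Exact for linear models only.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    # The worst-case perturbation moves the score by eps * ||w||_2.
    return abs(score) > eps * w_norm


if __name__ == "__main__":
    # Hypothetical weights and input: ||w||_2 = 5 and the score is 6.
    w, b = [3.0, 4.0], -1.0
    x = [1.0, 1.0]
    print(l2_robust(w, b, x, eps=1.0))  # True: margin 6 exceeds 1.0 * 5
    print(l2_robust(w, b, x, eps=1.5))  # False: margin 6 is below 1.5 * 5
```

For deep models no such closed form is generally available, which is one reason the passage above turns to increased testing.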
CW Another point has to do with Ops—generalizing beyond just keeping a website up, to making sure a data-science application is continuing to work well. I mean, inputs can fail, abuse can occur, and models may be more brittle than thought. As I alluded to earlier, we need to continue monitoring the model as if it were a living thing. This also means thinking through how you're going to monitor impacts on users, as well as your statistical metrics. There are some real engineering challenges to think about here in terms of how you're going to maintain observability for a data-science model that's deployed, particularly since it will be retrained and refreshed regularly.
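One small piece of that observability can be sketched in Python: comparing the distribution of a model input or score at deployment time against a reference sample. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not one named in the discussion. The function name, bin count, and synthetic data are assumptions, and the 0.25 alert level is only a common rule of thumb.

```python
import math


def psi(reference, live, bins=10, floor=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the reference range; live values outside
    that range are clamped into the first or last bin. Empty bins are
    floored to avoid log(0).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), floor) for c in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    # Synthetic data: the live scores have drifted toward the upper range.
    reference = [i / 100 for i in range(100)]
    drifted = [0.5 + i / 200 for i in range(100)]
    print(psi(reference, reference))  # identical samples give 0.0
    print(psi(reference, drifted))    # well above the 0.25 rule of thumb
```

A statistic like this covers only the statistical-metrics side; monitoring impacts on users, as noted above, needs separate instrumentation.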
BF We've covered a lot of ground today. Any final thoughts you'd like to leave people with?
AS We hope the analysis rubric shows a path toward providing useful structure to data science.
JW All four of us definitely believe in harnessing data for good, whether for a university, a business, or society at large. But there's no escaping the breadth of topics that need consideration. The breadth certainly complicates data-science education.
CW I would emphasize that we are often solving very hard problems—these are sometimes wicked problems—and we need due consideration of many underlying principles. We then need to act on them and do the very best we can to balance sometimes-conflicting goals.
PN As I said earlier, our field is growing up. We are having a genuine impact on the world, and we find that we have to think hard along many dimensions to achieve the best possible goals.
1. ACM Code of Ethics and Professional Conduct; https://www.acm.org/code-of-ethics.
2. Churchman, C. W. 1967. Wicked problems. Management Science 14(4), B141–B142; https://www.jstor.org/stable/2628678.
3. Gawande, A. 2010. The Checklist Manifesto. Penguin Books India.
4. Halevy, A., Norvig, P., Pereira, F. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8–12; https://ieeexplore.ieee.org/document/4804817.
5. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1978. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research; https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html.
6. Spector, A. Z., Norvig, P., Wiggins, C., Wing, J. M. 2022. Data Science in Context: Foundations, Challenges, Opportunities. Cambridge, England: Cambridge University Press.
7. Wiggins, C., Jones, M. L. 2023. How Data Happened: A History from the Age of Reason to the Age of Algorithms. New York, NY: W.W. Norton and Co.
Peter Norvig is a Fellow at Stanford's Human-Centered AI Institute and a researcher at Google Inc. Previously he directed the core search algorithms group and the research group at Google. He is coauthor of Artificial Intelligence: A Modern Approach, the leading textbook in the field, and co-teacher of a class on artificial intelligence that signed up 160,000 students, helping to kick off the current round of MOOCs (massive open online courses). He is a Fellow of the AAAI, ACM, California Academy of Sciences, and American Academy of Arts & Sciences.
Dr. Alfred Spector is a visiting scholar at MIT. His career began with innovation in large-scale, networked computing systems (at Stanford, as a professor at CMU, and as founder of Transarc) and then transitioned to research leadership (as global VP of IBM Software Research, Google Research, and then as CTO of Two Sigma Investments). Dr. Spector has lectured widely on the growing importance of computer science across all disciplines (CS+X), and he just completed Data Science in Context: Foundations, Challenges, Opportunities. He is a Fellow of the ACM, IEEE, the National Academy of Engineering, and the American Academy of Arts & Sciences, where he serves on its council. Dr. Spector was a Hertz Fellow, won the 2001 IEEE Kanai Award for Distributed Computing, was co-awarded the 2016 ACM Software Systems Award, and was a Phi Beta Kappa visiting scholar. He received a Ph.D. from Stanford and an A.B. from Harvard.
Chris Wiggins is an associate professor of applied mathematics at Columbia University and the chief data scientist at the New York Times. At Columbia he is a founding member of the executive committee of the Data Science Institute and the Department of Systems Biology, and he is affiliated faculty in statistics. He is a cofounder and co-organizer of the nonprofit hackNY (http://hackNY.org), which since 2010 has organized once-a-semester student hackathons; and the hackNY Fellows Program, a structured summer internship at New York City startups. Prior to joining the faculty at Columbia, he was a Courant instructor at NYU (1998-2001) and earned his Ph.D. at Princeton University (1993-1998) in theoretical physics. He is a Fellow of the American Physical Society and is a recipient of Columbia's Avanessians Diversity Award.
Jeannette M. Wing is executive vice president for research and professor of computer science at Columbia University. Her current research interests are in trustworthy AI. Wing came to Columbia from Microsoft, where she served as corporate vice president of Microsoft Research, overseeing research labs worldwide. Before joining Microsoft, she was on the faculty at Carnegie Mellon University, where she served as head of the department of computer science and associate dean for academic affairs of the School of Computer Science. She is a Fellow of the American Academy of Arts and Sciences, American Association for the Advancement of Science, ACM, and IEEE. She holds bachelor's, master's, and doctoral degrees from MIT.
Copyright © 2023 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 21, no. 1—