Data Science, Past & Future
Ann Spencer | 2019-07-23 | 56 min read
Paco Nathan presented, "Data Science, Past & Future", at Rev. This blog post provides a concise session summary, a video, and a written transcript.
Session Summary
At Rev’s "Data Science, Past & Future", Paco Nathan provided contextual insight into impactful themes from past decades that also serve as a “lens” to help data scientists, researchers, and leaders consider the future. By looking through this lens, industry can examine current “hot buttons” and infer their potential implications for the decisions, outcomes, and insights that surface during the course of their work. Nathan also shared excerpts and insights from recent surveys to provide additional context.
Key highlights from the session include:
- data science’s emergence as an interdisciplinary field – from industry, not academia
- advances that challenge the status quo follow a similar formula that includes “hardware capabilities evolve in dramatic leaps”; “software layers provide new kinds of control systems and conceptual abstractions in response”; “together, those manifest as very large surges in compute resources and data rates”; and “industry teams leverage increasingly advanced mathematics to solve novel business cases”.
- why data governance, in the context of machine learning, is no longer a “dry topic”, and how the WSJ’s “global reckoning on data governance” is potentially connected to “premiums on leveraging data science teams for novel business cases”
- how a surprise takeaway from the survey was that “the business executives who are seeing the value of data science and being model-informed, they are the ones who are doubling down on their bets now, and they're investing a lot more money.”
Transcript
Paco Nathan:
Thank you, Jon [Rooney]. I really appreciate it. I am honored to be able to present here and thrilled to have been involved in Rev. There's a balance. There's a kind of tone that Rev really struck this year. It's about the mix of talks and discussions. It's about the attendees. It's about the audience and the hallway conversations. There's a really nice comfortable blend here of what's important in business, in engineering, in data science, etc. It's kind of rare to encounter that and we're really thrilled about it. I definitely want to provide some shout-outs. You've seen Jon up here, too, doing fantastic MC-ing. Also, Karina Babcock is our other co-chair and deserves a round of applause for really getting all this put together. I should mention the stats just came in that Karina sent over. We've had almost 700 people registered. We had 668 people registered. Just about 600 checked in and obtained their badges... that's more than twice what we had last year and really good growth on that. Congrats.
Back to my talk. If you want to grab the slides, they'll be on Twitter. If you have a smartphone, use the QR code, you can load it up on your phone. There will be a lot of slides, probably more material than needed, but some background if you want to drill down, and a lot of links. I have a hunch that some of these links you may want to chase down later.
A lot of the work that I do is in developing themes, going out in the industry, finding out what are the interesting projects, what kind of changes do these imply, who are the people who are making change happen, what are the issues that they share in common that they're struggling with, and how can we surface this? That's one of the things that I do working with Domino. I do a monthly column about trying to surface some of these themes that we're seeing.
I'd like to show some of the past of where we've come from. In data science, definitely, there are other people who've talked more about that and we'll point to them. But, I'll provide a lens of how to examine how things have changed over the decades in data science, and then apply that to four burning issues. Some really big hot buttons for what we're contending with now, and what does this indicate for us going forward?
First up, if you haven't seen this, John Tukey...by the way, we have socks for John Tukey. John Tukey was a mathematician at Bell Labs. Back in 1962, he wrote a paper called "The Future of Data Analysis." Just to frame it, back in 1962, even a lot of universities didn't have computers yet. The idea of being able to use machines to crunch data was still relatively new. It was the future. It was almost science fiction for a lot of people. The implications of what was coming up, of being able to do a lot of data analysis, move forward with a lot of number crunching, and what the implications would be, this was all still very new. Tukey did this paper. It's a great read. It was written almost 60 years ago, but he was talking about something that was interdisciplinary. He was saying this doesn't belong just in statistics. If you look into Tukey, why he's on the socks there, he did incredible work in stats.
He also really informed a lot of the early thinking about data visualization. If you read Edward Tufte or if you take this class, you'll hear all about John Tukey. But the point there was what was emerging was interdisciplinary. It involved a lot of work with computing machinery and automation. It involved a lot of interesting work on something new that was data management. It involved a lot of work with applied math, some depth in statistics and visualization, and also a lot of communication skills. Other people have talked about these kinds of histories in much more depth. Also, we'll have Chris Wiggins from The New York Times and Columbia speaking later today. Chris has done fantastic work chronicling a lot of in-depth histories as well as teaching about this now. I definitely recommend Chris as well.
I recently did "Fifty Years of Data Management and Beyond" which looks at roughly the same time period. We see that in each decade, there were challenges that the industry was struggling with in terms of business needs for data management, and there were new types of frameworks that evolved roughly each decade to address these kinds of needs. To borrow a page from Marshall McLuhan, if you've ever heard "the medium is the message," that aphorism, really, that's a very dense statement [inaudible 00:05:09]. What McLuhan was talking about was that when new media is introduced, it does not replace the former media. It merely makes it more complicated. We see this also, in terms of our history of managing data, as new types of data management frameworks came out, they didn't replace what was there before.
Arguably, hierarchical databases from the '60s were some of the most important and are popularly used for transactions right now in terms of credit cards. Relational databases didn't replace those; the thing is, they made them more complicated. The point is understanding how and why to use different types of data frameworks, and what implications that has on the decisions that you make, the outcomes, and the insights that you surface.
Here are a couple of pieces of history. There are plenty of others. The point, though, is that this came out of industry, not academia. To some extent, academia still struggles a lot with how to stick data science into some sort of discipline. Even some of the great programs that I see in universities, they'll tie it to electrical engineering over here or statistics over there, kind of pigeonhole it. But it's interdisciplinary, and we'll talk more about that.
Through the past six decades, since Tukey first described some of this phenomenon, there's been a kind of effect that happens over and over. You can use this as kind of a lens to analyze what's happening, at any given point, during this history or simply going forward. This lens is described in four steps. One, hardware capabilities evolve. They usually go as bursts – they're step functions. That's because hardware capabilities are about really basic work in physics and a lot of work in research and electrical engineering. But also, a lot of work in material science, and that just doesn't happen incrementally. That happens in big steps.
Then software adapts. Software is reactive to what happens in hardware in a lot of ways. Software evolves new kinds of abstraction layers. Speaking as a computer scientist, we've worked with peddling different types of abstraction layers as well as different types of conceptual ways of controlling systems. Then taken together, these lead to big surges in compute capacity, CPU, memory, networking, storage, etc. Along with that, we see jumps in data rates. Those two ratchet up over time.
The way that the industry typically responds to this is to apply increasingly more advanced mathematics for the business use cases. Ultimately, it's the more advanced mathematics that become economically very important to put down into the hardware layer. You know, case in point, if you were to talk about predictive analytics 20 years ago, the main people in the field would have laughed you out of the room. They would've said, "You know what? Predictive analytics, yeah, not so much." My professors, like Brad Efron and others, famously made some real goofs about this, too.
If you were to go out 10 years ago and talk about the importance of machine learning in industry – and I was out there doing that – you'd get a lot of pushback. People would say, "You know, Google, Amazon, maybe. But for most enterprises, using machine learning...not really. It's not going to happen. You know, companies like telecom and insurance, they don't really need machine learning." But that changed.
If you were out five years ago talking in industry about the importance of graphs and graph algorithms and representations of graph data – because most business data ultimately is some form of graph – and you were to talk about the math, the advanced math that's required down in the low layers of the hardware for graphs, you'd get immense pushback. I know, I did that. Because you start talking, not just about matrices and vectors, you talk about something called tensors and tensor decomposition. Five years ago, people laughed. It's like, why on earth would any business people care about anything called a tensor? Now, Google is spending what, 10 figures marketing TensorFlow? I don't know. Sorry. Apologies to Google if I got that wrong. This thing happens over and over.
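To make the "graphs as tensors" idea a bit more concrete, here is a minimal, illustrative sketch (not from the talk): a hypothetical toy multi-relational graph encoded as a 3-way adjacency tensor, then compressed with a truncated SVD of one unfolding, which is the basic linear-algebra building block behind Tucker-style tensor decompositions.

```python
import numpy as np

# Hypothetical toy graph: 4 entities, 2 relation types.
# adjacency[r, i, j] = 1 when relation r links entity i -> entity j.
adjacency = np.zeros((2, 4, 4))
adjacency[0, 0, 1] = 1   # relation 0: entity 0 -> entity 1
adjacency[0, 1, 2] = 1   # relation 0: entity 1 -> entity 2
adjacency[1, 2, 3] = 1   # relation 1: entity 2 -> entity 3
adjacency[1, 0, 3] = 1   # relation 1: entity 0 -> entity 3

# Unfold the 3-way tensor into a matrix (relations x entity pairs), then
# take a truncated SVD -- the kind of dense linear algebra that tensor
# decompositions push down toward the hardware layer.
unfolded = adjacency.reshape(2, -1)                 # shape (2, 16)
u, s, vt = np.linalg.svd(unfolded, full_matrices=False)

rank = 1
low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]  # rank-1 approximation
print("reconstruction error:", np.linalg.norm(unfolded - low_rank))
```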
One application of this is regarding data governance. Also, I did an executive briefing for O'Reilly over the past year. If you want, check out those links, there are more details about data governance. I've learned that data governance is an interesting idea, but there have been a lot of false starts; it's kind of a dry topic. But in the context of machine learning over the past few years, this is suddenly not a dry topic. This is suddenly arguably one of the most important issues, if not the most important, that we're dealing with now.
Let's dig into this a bit. By the way, I got assigned a topic by Ben Lorica from O'Reilly to study what's happening in...what's changing in data governance? We can draw this from some historical context. Back in the '70s, we had big boxes, mainframes. I started programming in the '70s. I don't know if anybody else started programming in the '70s. Please? It’ll make me feel better. Thank you. Great. Okay. I feel much better. We had these big boxes, and they weren't terribly differentiated. You ran applications on these big boxes, and the applications would call some libraries in the operating system. But like every application ran some data, some networking if they needed to, it wasn't really differentiated. They had other terminal devices like card readers and maybe green screens or teletypes. Those were connected over proprietary wires, and that's how things were.
Then going into the '80s, we had the boxes become differentiated. You had different types of servers running around. Some of them began acting as clients, they could make calls over open standards for networking. Now, Ethernet comes up. You've got open standards for protocols where one server can call another and access an API. You get some differentiation, and you get this client/server architecture that was big. When I was in grad school, our department saw the launch of two new ideas. One of them, a little iffy at first, but it turned out big, it was called a SUN workstation. The other one, a little iffy, and they got busted by the university, but it turned out pretty big so the university took a nice equity stake. It was called Cisco. Meanwhile, there was a guy upstairs, Vint Cerf, who created this new thing called TCP/IP. It was kind of fun being a grad student and seeing a lot of these things launch. There was a little bit of data governance because there were now database servers, but still not much.
Going into the '90s, then we moved past networking into internetworking. We had TCP/IP. We had a lot of interesting network protocols, which led to an explosion of things, World Wide Web, etc. Instead of just client/server, now we had even more differentiation. This got more complex. We had three tiers, three layers. The presentation layer was about, say, web browsers, right, what you could do in a web browser. But the business logic kept getting more and more progressively rolled back into the middle layer, also called application servers, web servers, later being called middleware. Then in the bottom tier, you had your data management, your back office, right? Along with your database servers, you had data warehousing and business intelligence. Some of the data governance started to really come in there, but it was really focused more on the warehouse.
Then things changed. Leading up into the 2000s, you can pinpoint a time, Q3 of 1997, there were four teams identified. By that point, they had all reached the same conclusion. They were pursuing pretty much the same solution to it. It's really great to go back, and again, chase the links here on the slides. Greg Linden's article about splitting the website on Amazon. Eric Brewer talking about Inktomi, the origins of Yahoo! Search. Jeff Dean talking about the origins of Google, which, don't even get me started. Then Randy Shoup, a friend, talking about how eBay evolved from just like four servers into many.
What happened was, they all recognized that at the time, when you had database servers, as your business grew, you would get a bigger and bigger hardware box, and you would get a bigger and bigger license from Oracle. They realized that with e-commerce and the growth rates that they were seeing, number one, they couldn't get big enough boxes. Number two, they wouldn't be able to afford the Oracle license. Instead, these four teams, what they did was to say, "Okay. Let's take that big monolithic web app and split it and run it on thousands of commodity hardware servers," Linux mostly. We'd have server farms. Now, the trouble with server farms is that this kind of commodity hardware, they fall over a lot. You want to have a lot of logging on them and just check how they're doing.
By virtue of that, if you take those log files of customer interactions, you aggregate them, then you take that aggregated data, run machine learning models on them, you can produce data products that you feed back into your web apps, and then you get this kind of effect in business. That leads to what Andrew Ng has famously called "the virtuous cycle of data." That was the origin of cloud, the server farms. That was the origin of big data. That was the origin of this rapid increase of data in the world of machine data, and also the business use cases for machine learning.
Now, another thing that happened here was this was the 2000s. This is when cloud was launched. We get much more interesting work in architecture. There were also a lot of interesting threats and an evolution on the security front of what was happening. I was working in network security back then. You had IDS, you had bump-in-the-wire application gateways, you had SIMS and other things coming together. A lot more intelligence being pushed out to the edge, and that's a theme. You also had the launch of smartphones. There were a lot more mobile devices. This landscape starts getting pretty complex. The data governance, however, is still pretty much over on the data warehouse. Then we roll the clock forward. Toward the end of the 2000s is when, as Josh Willis was showing really brilliantly last night, you first started getting some teams in industry identified as "data science" teams. I led a couple of the early teams to be identified as "data science" back then.
Coming into the 2010s, we had data science practice, we had evolution of big data tooling, we had a lot more sophisticated use of the big data and what was going on in the cloud. You started to see point solutions. I went to a meeting at Starbucks with the founder of Alation right before they launched in 2012, drawing on the proverbial back-of-the-napkin. Data governance on big data, that was starting to happen. You also saw much more strategic use of data science. Those workflows would feed back into your business analytics. Security continued to evolve, cloud continued to evolve, more and more mobile devices.
Then we roll the clock up to now, and where we're at. It's a much more complex landscape. Going into the 2020s, what Thomas and Alex were describing on the panel was really brilliant. I loved that. Many important points that they touched on. Really, my talk is more about unpacking some of those themes, and showing a timeline behind it. We've got this complex landscape, tons of data sharing, an economy of data, external data, tons of mobile devices. Now, we have low-power devices and inference running on them. You can take TensorFlow.js and drop your deep learning model resource footprint by 5-6 orders of magnitude and run it on devices that don't even have batteries.
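As a rough illustration of what shrinking a model for edge inference can look like, here is a minimal sketch using TensorFlow Lite post-training quantization (a sibling of the TensorFlow.js path mentioned above, not the specific tooling from the talk); the tiny Keras model is hypothetical.

```python
import tensorflow as tf

# Hypothetical small Keras model standing in for a real deep learning model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Post-training quantization shrinks the serialized model (e.g., float32
# weights toward int8), one of the techniques behind running inference on
# tiny, low-power edge devices.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print("compressed model size (bytes):", len(tflite_model))
```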
Okay, we have a really complex landscape. The data governance parts of it have become more and more important. There's compliance surrounding all this because this really matters now, but yet, the data governance solutions, they're point solutions: you've got some for your mobile devices, some for your edge inference, some for your edge security and CDNs like Cloudflare and AWS Shield and others. Some for big data, some for data warehouses, etc. There's nothing common. There's no tech stack there really. There are some open standards that are evolving, but that part really has to be solved. Arguably, that's one of the biggest problems we have. It's also a driver for data science. The year 2018 was what the Wall Street Journal called "a global reckoning on data governance." If you don't know, there were hundreds of millions of people affected worldwide during 2018 with security breaches and data privacy leakage. It's also the year that GDPR went into effect. I did a video interview with a couple of the biggest firms affected by that, like two days before it went into effect, in London; it was great. Now, we have CCPA coming online next year in California and other states throughout the U.S. following suit with the GDPR style of regulations.
Of course, that was also the year that we had the whole news cycle about Facebook and Cambridge Analytica. Arguably, a lot of interesting platforms are surfacing about ad-based business models and about the corporate surveillance that goes into it. You see Facebook in the news cycle. You see Facebook up in front of Congress. Arguably, Oracle is an even bigger player in this, although they are a little bit more sophisticated in how they do their PR spin. As far as corporate surveillance, Oracle is probably the biggest fish there.
What I'm trying to say is this evolution of system architecture, the hardware driving the software layers, and also, the whole landscape with regard to threats and risks, it changes things. It changes how we have to respond to it. You see these drivers involving risk and cost, but also opportunity.
Again, this is some of what Thomas and Alex were talking about. These are the things that change us and change our industry. Really, this is what puts a premium on how do we have to leverage data science teams. This is why you get the return on investment for data science. That's where we stand now, and these are the kind of challenges that we're facing.
My colleague, Ben Lorica at O'Reilly, he and I did three large surveys about adoption for ABC, that's AI, Big Data, and Cloud in enterprise. Also, these surveys, these are mini books: if you want to grab them, they are free downloads. We've just done three. We've got another two in the pipeline. But what we were trying to do is a contrast study. We were trying to look at, for the big fish, for the enterprise firms who are adopting data science, the ones that are successful, that have been out doing machine learning models in production for five years or more. Let's do a contrast between those and the ones who are just barely even getting started yet. You know, what's the delta? What can we learn from that?
I've tried to represent some of the high-level findings as a survival analysis here. You can think about having three buckets. There are the companies that are basically non-starters. They are not really getting into this yet. Then there are some companies that are a few years into it. They are developing their practices. They are evaluating and adopting. Then you've got other companies that, again, have had five years or more of success with machine learning in production. What is common amongst each bucket?
When you look at the laggards, the nonstarters, number one, they're buried in technical debt. If you look at their data infrastructure, the one thing they complain about, in common, especially, is that they've got too much tech debt to solve in terms of their collaboration tools. Their data infrastructure just doesn't support it. They're fighting with silos of data. Really, that stuff takes years – those kinds of enterprise transformations take years to fix. Then the problem is that even if they got started today, it would take a long time. The trouble is they're not going to get started today because, from the top, the company culture does not recognize the need. It's not a priority. This is coming from the top exec levels, board of directors, etc. If you don't have that company culture, you're not going to get past these hurdles. Even if you do have buy-in from the top, the other problem is, they just don't have enough people in, effectively, product management roles, people who can translate from technology capabilities into business opportunities. If you don't have those people in the line units, it's not going to happen. If you add up those three challenges, that knocks out more than 50% of the enterprise. More than 50% of the enterprise is years away from being competitive in this area. Moreover, not only just being competitive, but as the past panel was talking about, think about the business efficiency. How many billions of dollars do they have to invest, and how efficient is that compared with the ones that have first-mover advantage? They're years away from being up to that point.
If you look into the middle bucket, they have three things that they report in common. One is data quality, cleaning up data, the lack of labeled data. You know what? They should be concerned about that; that's the big hot button. That's great, they're working on it. They also complain about the talent crunch. They simply cannot find enough people with the right skills to hire, and they're having trouble reskilling and upskilling their existing workforce; that's perpetually a problem. The other thing, another phenomenon that we reported last year was, once you do start to get your ducks in a row, once you've covered the table stakes, and you start to have collaborative infrastructure for surfacing data insights, there are some things that come to the fore. You start to notice, "Hey, we've got some security problems." "Hey, we've got some data privacy problems." "Hey, we've got some fairness and bias problems." "Hey, we got some other ethics and compliance issues, and we're going to have to take care of that." You get into a lot of competing priorities for capital and for time.
Now, working down to the mature part of this, they report two things in common. One is about workflow reproducibility. It's very important in terms of machine learning because you're dealing with stochastic systems, so reproducible workflows are a very hard problem. Another thing is about hyperparameter tuning. There are companies out there like Determined AI that are doing really fantastic work on this. What this means is the mature practices do not want to spend all of their money on the cloud vendors. They'd like to do something more efficient when they're training a lot of deep learning models. The problems down in the mature bucket, those are optimizations, they aren't showstoppers. Those are nice problems to have.
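As one minimal sketch of capping that tuning spend (this is just budgeted random search with scikit-learn on a synthetic dataset, not the approach of any vendor mentioned here), the idea is to fix a compute budget up front rather than sweep every configuration:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical toy dataset standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random search samples a budgeted number of configurations instead of
# exhaustively sweeping a grid -- one simple way to cap tuning cost.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20,          # fixed compute budget
    cv=3,
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_, "score:", round(search.best_score_, 3))
```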
So this is a way of looking at companies and evaluating, where are they at? What are the struggles they're dealing with? Also, a good way of looking at vendors and seeing what are the pain points that they are addressing. How big is this opportunity? Another way to look at it is you can break it into different segments and look at liabilities and assets. Frankly, the companies that are in the middle, that aren't necessarily the tech unicorns, they're really interesting because they tend to have a lot of people with domain expertise. They also tend to have a lot of access to interesting datasets, data exhaust that isn't monetized yet. They have ways, they have a path, if they want to take on and compete against the first-movers, the tech giants, because typically the tech giants do not have the domain expertise.
There have been studies – O'Reilly is doing these studies, MIT Sloan has done some fantastic studies – some of which I've summarized here, with links. Also, McKinsey Global Institute, Michael Chui, and the others have an incredible study which is slightly terrifying. The thing is that these studies are converging, and they tend to show the same thing. Basically, there's a gap between the haves and the have-nots in terms of companies that are using data, are data-informed, correctly model-informed, etc., and that gap is widening.
Now, in terms of our contrast study, what are the companies that are doing it right? What do they report in common? Number one, they use specialized roles. They've embraced the use of data science teams and data engineering teams in lieu of just saying, "No, no, that's a business analyst." That's one marker. Another one, to Nick's point in his keynote yesterday, which was excellent, is about using experts – internal data science teams. This is what the winners in the game do. The ones that don't fare well tend to rely on external consultants.
Another thing that really surprised us was that the more sophisticated companies have practices about using robust checklists. They've made mistakes before, they're not going to make those same mistakes again. They are curating checklists of what to do and what not to do when they are rolling out machine learning models in production. That's good. I'm really glad to see that. We were actually surprised that they are being proactive. The fourth point is something that surprised us completely. You know, typically, when you think about running projects, running teams, in terms of setting the priorities for projects, in terms of describing what are the key metrics for success for a project, that usually falls on product management. But in terms of the sophisticated companies using data science, they're leaning more on their data science leads in lieu of their product managers.
Again, drilling down into that. What that means is, if you take product management and just try to drag it across from web app development and game development, things like that, and try to use that kind of embedded institution? It doesn't work. The book really isn't written yet about product management in this area, although Pete Skomoroch will be talking later today, and definitely listen to what he's saying. What he's talking about is not out in the line units yet, but it needs to be.
All right – where did this happen? Where did this gap start to emerge? I can point to the year 2001. It was very interesting. For one thing, there'd been something...a small matter called the .com bust that happened, just a small matter. Then also, it was the heyday for data warehousing and BI. I mean, arguably, notions of BI come from the late '50s. They were really articulated well in the late '80s, but they got traction throughout the '90s. That became a kind of an embedded institution. Frankly, leading data science teams early on, you almost always had to struggle against the BI teams. It was also the year, 2001, when the "Agile Manifesto" was published. This became another kind of embedded institution, one that we're still struggling with. The thing is...how many people here have read the "Agile Manifesto?" Okay. I'll say like 20 or 30% maybe. The word “data” doesn't appear, and it wasn't a priority for them. They weren't thinking about data. That was an afterthought. They were thinking about iterating on a code base. A generation of developers came of age equating database with relational, which is not true. By virtue of that, their level of sophistication with data was to assume that the legibility of systems – part of the selling point for relational – equates to legibility of data, and that is really not true. That's a way to tank a company.
But the companies with first-mover advantage, they made this sharp left turn toward NoSQL and what we do in data science; they really abandoned a lot of previous practices. You know, they moved fast, and they broke things. The real chronicle on that... The one that I love is Leo Breiman's paper called "Two Cultures" from 2001, where he really chronicled this kind of sea change and really talked about the rise of predictive analytics, even though a lot of his colleagues were deriding him for it.
The way that I would characterize this is that we've had a generation of mainstream developers who've been taught that coding is preeminent and data is secondary, and that's pervasive in the industry still. But yet, the first-mover companies have changed to think that learning is preeminent and data is a competitive differentiator. Now, those two statements don't reconcile. There are a lot of people who believe the former and won't adopt the latter. You can't just retrain them if they don't believe it. This is a fundamental disconnect that we're struggling with. It may be that for people in the former category, if they don't level up to it, well, there are some good construction jobs.
Here's why, because the business executives know better. The business executives who are seeing the value of data science and being model-informed, they are the ones who are doubling down on their bets now, and they're investing a lot more money. This is our biggest surprise out of our surveys. We had buckets at like 5%, 10%, 15%, and then we had a catch-all for 20% or more of IT budgets being invested in machine learning. In the mature bucket, 43% reported that they were putting 20% or more of their total IT budget toward machine learning and data science. That was a big, big surprise for us. What I'm saying is, that gap is widening. If you don't recognize that, and you don't understand the drivers of why this is changing, the industry is going to move ahead.
Now I want to shift into how we can take some of this tooling and apply it. I've got four scenarios, somewhat controversial topics, but hopefully providing somewhat of a lens based on where we've come from, looking toward how it's resolving, and where it's going in the future.
The first one is about company culture. Again, talking about executives... In December last year, I was on a workshop for the World Economic Forum. We were establishing the AI agenda for the Davos conference in Switzerland that happened earlier this year. A lot of our operating principles in this were based on the finding that, when you go out to that 50% or more of enterprises that just don't get it about machine learning, the problem is at the top: people who are on exec staff, or really even more, it's the board of directors. I mentioned that I started programming in the '70s but these folks are probably older than me. The thing is that they have learned about Six Sigma, they've built their careers on it. They learned about Lean. They learned about a lot of process that requires that you get rid of uncertainty. But now, they're being told that these younger tech unicorns are coming after them, that they're using machine learning, and the board of directors doesn't understand it. You know, these are probabilistic systems. How could that make sense? They're being told they have to embrace uncertainty. That doesn't make sense. But the thing is that if they don't act decisively, their competitors will, and certainly the regulators will.
There are some great voices talking about this. Jacob Ward, he was Senior Editor for "Popular Science." He's now at Stanford. Jacob Ward was the first person...along with Pete Skomoroch, those are the first two people who alerted me to the fact that Daniel Kahneman and his colleagues were the ones who've really encountered this problem and described it. I definitely recommend Jacob Ward talking about the impact of behavioral economics on decisions based on data. Also, another person who's really interesting is Cassie Kozyrkov. She's Chief Decision Scientist at Google Cloud, and she has some great talks about lessons learned, mistakes made at Google deploying machine learning. What can Google learn from their mistakes in machine learning? How can that apply in other companies in terms of decisions? The best summary on this is from Ajay Agrawal writing on behalf of McKinsey, talking about the unbundling of decision-making, and how you can have teams of people and machines that collaborate toward large-scale decisions. The takeaways there are that we have a lot of problems with the unbundling of decision-making. Behavioral economics is the North Star for this.
We have a lot of work to do in terms of corporate governance regarding this. The good news, for those of us who do some business development, you always want to sell on the upside. The good news is half the total available market has not even started yet. That's great. The bad news is that for those of us who do some future scenarios – futurism work – using our tools like GPT and J Curves and the rest, when you start to put the data and the studies together, there's really a convergence point. It's only about four or five years out, where the gap between the haves and the have-nots becomes critical. The companies that haven't started yet, they're really too far behind to really be worth investing in. They become... well, it probably indicates a lot of M&A activity. You know, other larger companies who are more progressive would gobble them up to get their customers. There's a kind of point-of-no-return coming. Another point is about demand, like what David Donoho talks about in his history of data science as far as the demand meme, the jobs meme for data science. That's in good position. We'll have jobs going into the future.
Okay. Next up, if I can make it through here … a different scenario. One of the other four, even more controversial. It's about hardware. What's happening with hardware? This is the one where I probably get the most pushback, but it's also being picked up by other people who are much more notable speakers than I am. With hardware, for the past 20 years, we've been taught in software engineering, the form of hardware for processors doesn't really change. Because of Moore's Law, we're just going to get processors that are better, faster, cheaper. Don't even bother looking at that. We've got Java and JVM languages. We don't need to probe the hardware at all. It's not really an issue. And we have virtualization. You know, it's just...don't care about it. The problem is this. Hardware is moving faster than software. Software is moving faster than process. You know, we've been taught for the past 20 years that in software engineering, process is this big umbrella thing. You can apply it to a lot of different kinds of projects. But that and the fact about hardware, that's all changed.
Now, hardware is evolving more rapidly. Certainly, on the processors side, you see GPUs... Nvidia got really lucky having GPUs out there; they didn't really understand why they were suddenly becoming popular for machine learning, but it worked in their favor. But not just GPUs, you've got TPUs, IPUs, DWPUs, and a whole range of ASICs. You've got some really interesting companies coming up doing this. It's not just about the processors, it's also about what's going on in the switch fabric.
If you look at some of the origins of TensorFlow, Jeff Dean's early talks about that, they are talking about using new kinds of networking gear and not necessarily using TCP/IP, going beyond that. You can get into sub-millisecond latencies for streaming in real time. That's also being driven a lot, as Alex was just talking about in the last talk. It's also about memory fabric, what Intel is doing with Project Arrow and using FPGAs as intelligent front ends on large memory fabric, and just the incredible efficiencies that can be gained by that. Moore's Law is dead, but there's hope. There are projects, like Project Jupyter – as far as open source, the protocol, the open standard part of Jupyter is really accounting for this, more than many others. Apache Arrow is my favorite project at Apache, and it's really in the driver's seat there. But also look toward UC Berkeley RISELab and the whole constellation of Ray and what they're doing. These projects are very savvy about this change, of hardware moving faster than software, moving faster than process. Okay, I've got a little bit more time.
In 2005, a colleague had moved to Seattle, and he was on a new project, and he kept calling me with these really weird questions about a new kind of service. I was befuddled about it, but I tried to work as a guinea pig for this thing. Then in 2006, they told me to go look at a website and sign up for a thing. I did, and inadvertently became one of the first three people outside of Amazon to do 100% cloud architecture. My teams were guinea pigs for a lot of the early AWS services coming out. I signed a lot of NDAs. I've had a long relationship with Amazon, but that's about all I can say.
Roll the clock out. The end of 2006, we were having trouble managing some of our NLP workflows that we were doing. One of the engineers suggested a new open source project that had just come out, and we became one of the early users of it. It was called Hadoop. Then we ran into some bottlenecks running Hadoop in the cloud. In 2008, there was a JIRA ticket, and as an engineering manager, I wrote a $3,000 check to a young engineer in London named Tom White who pushed a fix. We were able to get efficiencies of running Hadoop in the cloud. Then our friends at Amazon called up and said, "Hey, you've got the biggest Hadoop cluster in the cloud." We became a case study for what...at the time it was called “Project 157” – You'll still see that in the docs, but it was renamed Elastic MapReduce.
About a year later, Berkeley... one of my heroes, growing up in this field, was Dave Patterson. Dave had led his grad students to interview a lot of people who were involved with cloud, a lot of different competitors, and try to understand what's going on. They wrote this paper, this paper was prescient. It spelled out what would happen over the next 5 to 10 years in cloud. It just nailed it. I got invited to critique it and then gave a guest lecture at Berkeley. You can see a video of me getting eviscerated by Dave Patterson after I critiqued his paper. But in the audience, there were grad students, first-year Ph.D. students, who were the founding teams for Apache Mesos and a spinoff called Apache Spark. I did a lot of work with both of these teams over the years.
Dave led his current crop of grad students to publish a follow-up study, 10 years to the day afterward, and it's called "A Berkeley View on Serverless." If you want to be working in this field, if you haven't read this paper, stop what you're doing, grab the paper and read it. It's worth that investment of time. The point is, it's really dealing a lot with work from Eric Jonas on the economics of cloud, and how things have shifted. Eric is going – it seems like a lot of the interesting people at Berkeley are – off to U. Chicago next year to join Michael Franklin.
The point that they make is really, you can think of it... When AWS first launched, they kind of dumbed it down. They had very sophisticated services inside, but they kind of dumbed it down to make it recognizable to sysadmins and enterprises who were used to using VMware, so they could slice their apps over. But it's 10 years later. Now, there's demand for much higher-level functions, and there's this whole umbrella of what's being called serverless. It has a lot of import, and they're really spelling out here why and how, what the risks are, and some of the limitations. But this is what's happening. If that last paper was prescient for the next five years, seriously, this one is even more so. Part of what they're pointing to is how there's a continued decoupling of compute and storage. To translate that, not to get into too much of the electrical engineering, what's happening there is basically that the drivers for Hadoop and Spark have reversed. That's why this lab has Ray, which is basically the designated Spark killer.
Another thing about hardware evolving...we did an article on Domino about this. I'll point to this. Alasdair Allan and Pete Warden talking about edge inference and running huge machine learning models on very low power, small footprint devices. There are probably some of these devices in the wall units here in this room. It's kind of terrifying what they're saying, but it's what's happening. This is even bigger than the scope of the other changes I was talking about. Looking forward, Moore's Law is over, but Koomey's Law and with it, Landauer's Principle, those are in effect. We're going to see vast efficiencies because in some ways, Moore's Law allowed us to be extremely sloppy. Also, look toward Ray. That is...when you break down the use cases, that's the thing that's getting rid of Spark. Really look at Ray and Modin and the others that go along with that, how they're leveraging services, how they're leveraging contemporary hardware.
Also, increasingly, data science is less about business analytics and more about edge inference. If your team isn't thinking that way, you should evaluate it. The next scenario is shared infrastructure, and government meeting enterprise. I was chair for JupyterCon, and Brian Granger and I noticed a lot of enterprise coming into Jupyter in the lead-up to the conference. I'll say it from the perspective of a friend, David Schaaf, who's Director of Data Engineering at Capital One. You know, David says, "Hey, look, on the one hand, I can buy proprietary systems for data infrastructure, and then I have to train my people up, maybe six months. On the other hand, I can hire grad students who know how to do machine learning. They can use Python and Jupyter to deploy apps in machine learning that the bank needs, on day one. Why on earth would I spend the money to get a proprietary system and then derail my people for six months and have it not even be as effective?" We're seeing a lot of open source, especially Jupyter, coming into organizations like Capital One. Definitely, Bloomberg has made big bets and a lot of contributions, Amazon and others. Also, DoD is doing a lot of work with infrastructure based on Jupyter. A working thesis there for me is that the hard problems in data science are no longer in Silicon Valley. They're out in the field, especially in large organizations, especially in regulated environments. Increasingly, open source projects are looking toward regulated environments for which features to prioritize. It's also where enterprise and government find a lot of common ground.
We had Julia Lane talking about the Coleridge Initiative and the work on Project Jupyter to support metadata and data governance and lineage. I'm involved with that as well, consulting for NYU. Also, look at talks from Dave Stuart, talking about nbgallery and how DoD is using large-scale infrastructure based on Jupyter, and how DoD has pushed that source code open on GitHub [https://nbgallery.github.io/]. I think we'll see a lot more non-vendor contributions in open source, and fewer committer wars. Certainly, that's coming out of government and enterprise.
Okay. I know that I'm out of time, but real quick, I'll go through this. Last but not least is model interpretation, and it's not what you think. It's a really super hard problem. I will point toward the column last month for Domino, which goes into detail. Ben Lorica did a great podcast interview with Forough Poursabzi Sangdeh. Forough did her Ph.D. on model interpretation and then realized, "Hey, wait a minute. There are some big problems here." I was on a panel a couple of months ago with Zach Lipton at CMU, who is also another one of the people going, "Hey, wait a minute. Model interpretation, explainability. Some of this is really wrong." There are important reasons and appeals for making models more interpretable and explainable. There are needs for data science teams to embrace this and use these tools and reflect and understand what's going on. But in terms of putting this kind of tooling in front of stakeholders right now, it's extremely problematic. Just to paraphrase what Forough is saying...I definitely recommend her interview.
The gist is this. There's a joke going around on Twitter, "If it's written in Python, it's machine learning. If it's written in PowerPoint, it's AI." I mean, there's a grain of truth to that, but it's also harmful. I think that's really the wrong perspective. Machine learning is a subset of mathematical optimization. We could draw equations for loss functions and regularization terms and all that. We could say that ML is about tools and technology, but the uses of ML... To paraphrase Forough, the use of machine learning ultimately involves a lot of HCI (human-computer interaction). There are a lot of social systems involved. If you just think that machine learning is about engineering, you lose the other half of the equation.
Going forward, what's the application for it? If you try to interpret what's going on in machine learning and you're only looking at half the picture, you're going to get it wrong. I think a definition would be to say that AI is about the impact on social systems. To illustrate this, if we look at complex workflows to prepare data training sets and then create models and evaluate them...if we look at the business risk of deploying machine learning models and how to understand what's going on, if we just simply look at the artifact, the model, and try to dissect it, that's a peephole analysis. We're staring through a peephole at a really complex problem. Instead, if we're going to mitigate risk and try to understand and explain models and what's going on, we have to look at the information that's throughout the entire workflow, all the way back to collecting the data initially from the business process. There are people who are working on better ways not to throw away the information and the human input at many, many steps leading up to that point.
This is extremely crucial. This is where data governance comes in, because this is the essence of lineage. There are great people talking about this much better than me. Chris Ré from Stanford has that whole project DeepDive, and I'm not going to even get into Lattice. But Snorkel is the open source project, and Chris is talking about, again, how to leverage lineage. What they are talking about with Snorkel is weak supervision, something they call data programming. How can you make mathematical functions to describe the experts who are providing labels, and where they're good or bad? How can you trace that all the way back into the data collection? Percy Liang, also at Stanford, has the math behind influence functions. We'll see more and more of this being worked into the machine learning process itself. There's some post-hoc analysis from Microsoft called InterpretML using EBMs, Explainable Boosting Machines, that came out last week, and it's pretty good.
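As a rough sketch of the data programming idea (plain Python, not the Snorkel API itself; Snorkel actually learns a weight for each labeling function rather than the simple majority vote used here, and these labeling functions are hypothetical): each heuristic stands in for a noisy "expert" providing labels, and their votes are combined into weak training labels that remain traceable back to the functions that produced them.

```python
ABSTAIN, SPAM, NOT_SPAM = -1, 1, 0

def lf_contains_link(text):
    # Heuristic "expert": messages with links look like spam.
    return SPAM if "http" in text else ABSTAIN

def lf_long_message(text):
    # Heuristic "expert": longer messages tend not to be spam.
    return NOT_SPAM if len(text.split()) > 5 else ABSTAIN

def lf_shouting(text):
    # Heuristic "expert": all-caps messages look like spam.
    return SPAM if text.isupper() else ABSTAIN

documents = [
    "CLICK NOW http://example.com",
    "see you at the meeting tomorrow afternoon, usual room",
    "FREE MONEY",
]

labeling_functions = [lf_contains_link, lf_long_message, lf_shouting]
votes = [[lf(doc) for lf in labeling_functions] for doc in documents]

def combine(row):
    # Combine the noisy votes; a simple majority over non-abstains shows
    # the basic idea of turning heuristics into weak training labels.
    valid = [v for v in row if v != ABSTAIN]
    if not valid:
        return ABSTAIN
    return SPAM if valid.count(SPAM) >= valid.count(NOT_SPAM) else NOT_SPAM

print([combine(row) for row in votes])   # weak labels for a training set
```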
But the point here is that data science teams need to rethink how they're managing the data workflows, all the way up to the point of having very rigorous ways to put together training sets, and not throwing away information at every step. When the auditors come knocking, that's going to be the issue. Chris Ré has a point about lineage, and just where the rubber hits the road on model management and lineage, and how data governance comes into play with the hardest problems that we have currently with machine learning. I have an exec briefing on active learning – semi-supervised learning – which would help out on some of this, but I won't go into detail about it.
This really points to the fact that, number one, product management for AI, that book isn't written yet. We're learning about it. Pete Skomoroch will cover more. In the larger scope, if anybody comes to you and says, "Machine learning is just engineering," walk them out the door, seriously. I mean, this is the hardest problem we have to deal with right now. That's part of the disconnects we've seen, and we won't get past the regulatory problems and governance problems without it. Through the six decades, we've had this kind of lens. This is how things evolved and changed, and it's still ongoing.
Thank you very much. I look forward to talking to people afterwards. If you want to get a hold of me, here are some places. Thank you.
Editorial note: The transcript has been edited for readability.
Ann Spencer is the former Head of Content for Domino where she provided a high degree of value, density, and analytical rigor that sparks respectful candid public discourse from multiple perspectives, discourse that’s anchored in the intention of helping accelerate data science work. Previously, she was the data editor at O’Reilly, focusing on data science and data engineering.