Succeeding with Alternative Data and Machine Learning
Grigoriy | 2017-05-08 | 50 min read
Perhaps the biggest insight in feature engineering in the last decade was the realization that you could predict a person's behavior by understanding the behavior of their social network. This use of non-traditional social data has driven a significant amount of economic growth as it revolutionized the accuracy and applicability of models.
In the future, accessing alternative data such as foot traffic through stores, satellite images of parking lots, or the information locked inside textual data is likely to allow models to unlock the next great innovative feature.
In this talk, Kathryn Hume provides insights into the use of non-traditional data in finance. She covers macro trends about the explosion in choices available to practitioners, the confusion as to true advancements versus hype, advancements in crowdsourcing approaches, and other topics. The talk is a thorough and actionable overview of how data science organizations can look beyond tabular data and traditional statistical approaches to find opportunities in unexpected data sources.
Succeeding with Alternative Data and Machine Learning
Video Transcript
Some of what I'm going to say confirms a lot of the stuff we heard on the last panel; I'm just going to go a little bit deeper.
I have insights into this across both the buy side and the sell side. A little bit less with the ratings agencies. But I, hopefully, will be able to offer some of these things with a very broad perspective.
I'll also talk about some of the new emerging capabilities coming out of the academic community, and I have a slightly different answer to the question that you posed regarding how one might stay on top of this while still maintaining practical efficiency and productivity.
So first picture here. Ansel Adams, Yosemite Valley Bridge, taken in 1934. The quality is a little bad; that just has to do with the projector. Second picture: Dorothea Lange's Migrant Mother, 1936. What sticks out here is that these pictures are both in black and white. So is the following image.
This is by Richard Zhang. He's at UC Berkeley. They developed a technique to take black-and-white images and automatically present them in color. The same picture from Dorothea Lange, recolored—the result is so-so. I bring this up as a very brief intro to think about some of the cool stuff that's going on in the world of machine learning these days.
Zhang and his team at Berkeley used a technique called deep learning. This is the architecture of a convolutional neural network. There are different types of neural networks, different architectures to process data that either comes as a single package or as a sequence, a time series.
Convolutional networks work very well with single packages. They process the image through the network and are able to identify correlations between the parts of the image. So the beak on our flamingo or the nose on our Migrant Mother gets affiliated statistically with a certain color pattern, and then the network uses that to automatically colorize these images.
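To make that idea a little more tangible, here's a toy sketch in PyTorch of what a colorization network might look like. This is not Zhang et al.'s actual architecture—the layers and sizes are made-up assumptions for illustration—but it shows the shape of the approach: convolutions take the grayscale lightness channel in and predict the two color channels out.

```python
# A toy sketch of the colorization idea (not Zhang et al.'s actual model):
# a small convolutional network that takes the grayscale lightness channel
# of an image and predicts the two color channels, learning which shapes
# and textures statistically go with which colors.
import torch
import torch.nn as nn

class TinyColorizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # grayscale channel in
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # local color/texture correlations
            nn.ReLU(),
            nn.Conv2d(64, 2, kernel_size=3, padding=1),   # predict the two color (ab) channels
            nn.Tanh(),                                     # keep outputs in a bounded color range
        )

    def forward(self, lightness):
        # lightness: (batch, 1, H, W) grayscale in; (batch, 2, H, W) color channels out
        return self.net(lightness)

model = TinyColorizer()
fake_gray = torch.randn(1, 1, 64, 64)
print(model(fake_gray).shape)  # torch.Size([1, 2, 64, 64])
```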
I wanted to start with that, sort of shifting from the world of black and white into the world of color, and how new technologies are enabling us to do this. Because my take on the new data trend, as Alexander just described, is that we're shifting from considering the economy through quarterly statements, 10-Qs and 10-Ks, and all of the information you can download from EDGAR and the SEC, all of the traditional fundamental data sets, to something that's now vibrant and in color.
We have here metaphorical images of traffic data in retail stores collected by Foursquare, which is now licensing that data to hedge funds and marketers, so that they can get insights into how many people are in given locations at a given time to try to predict retail profits.
Or, as was talked about in the last panel, satellite image data. This has been very popular these days via companies like Orbital Insight out in Silicon Valley or RS Metrics here in New York City. Very different underlying technical capabilities, but the moral of the story for these two companies is: if we want to know how many people are coming into Walmart, let's count how many cars there are in the parking lot.
For me, it's really interesting when we think about a mosaic theory of investing. Because the story that you're going to tell if you look at one data set will be different than the story you might tell if you look at those two data sets in conjunction.
If we take our foot traffic data and our satellite data, we might say, gosh, a ton more people are coming to Walmart, tons more people are coming to Macy's. If we correlate that and combine it with credit card data, we might see that transactions actually went down relative to the number of people who came in.
If our story is that it looks like a lot more people are coming, we might say, let's go long, but it actually might be a lower conversion rate.
If we put these two things together, that can impact the story that we tell. I think the art of using alternative data well is not just collecting as many datasets as possible but collecting the right ones—the ones that not everybody has—and then combining them to really figure out what the moral of the story may be.
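As a toy illustration of that mosaic point—every number and column name below is hypothetical, not real Foursquare or credit card panel data—the combined view tells a different story than foot traffic alone:

```python
# Toy illustration: foot traffic alone suggests going long, but combining it
# with credit card transactions shows the conversion rate actually fell.
# All numbers are made up.
import pandas as pd

data = pd.DataFrame({
    "quarter": ["Q1", "Q2"],
    "store_visits": [1_000_000, 1_400_000],   # e.g. from foot traffic / satellite counts
    "card_transactions": [220_000, 230_000],  # e.g. from a credit card panel
})

data["conversion_rate"] = data["card_transactions"] / data["store_visits"]
print(data)
# Visits are up 40%, but conversion drops from 22% to ~16% -- a different story.
```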
In thinking about it—this is a little bit oriented towards a buy side investor, and I'm not sure what the demographic is of the people in the room—but when you're thinking about this alternative data space, there's sort of a Maslow's hierarchy of utility depending on how processed the data is, right?
At the top lies just raw streams of data. There are very few firms out there that I've encountered that have the internal capabilities and capacity to clean this, process it, and then do all of the downstream analysis and machine learning to turn it into something useful.
What you're seeing on the market—and Matt Turck from FirstMark has a decent article about this that he put up in January—is startups that give off data exhaust in the course of their activity—these traces of economic activity—and they're starting to understand that in order to make money off of this stuff, they have to make it easier to use.
They're packaging it. Not all the way down to the point where they're actually predicting what position you should take, but they might get to the point where, as with the RS Metrics satellite data example, it's basically a red, yellow, or green, dummy-proofed, analyzed, packaged insight about how many cars were in the parking lot. That goes from the data that's collected, through a convolutional neural network layer to identify how many cars there are, through statistics to normalize that vis-a-vis different trends, down to something you can incorporate in your pipeline.
What's interesting, of course, is that, especially in financial services and what's unique here vis-a-vis other industries is the value of that data goes down the more it's processed, right?
It becomes a commodity if everybody can just buy it and use it. So this is a little bit more of sort of advice to startups that are trying to make money off of this stuff: your market is going to increase the more you package it, but the utility just drastically decreases.
Similarly for hedge funds that are trying to actually go out and use this stuff effectively, at an advantage. One thing that I find interesting is that the Maslow's hierarchy of alternative data is actually quite similar to the Maslow's hierarchy of the entire machine learning ecosystem.
At our baseline, we have the ability to collect, store, and process data. As was mentioned in the last panel, while that might not be the sexiest thing, it's by far the most important thing and the core capability of anybody's machine learning efforts.
That then passes up through data analytics—counting things, what happened. Data science—potentially doing prediction, a little bit of analysis. Machine learning—doing data science with feedback loops. And then, at the top, predicting our position, whatever super smart AI might be.
I put this up as a question, not an answer. I don't know if that market value—that inverse curve—is the same for any and all data science startups as it is for the alternative data world.
And I put that question there because most verticals don't have the constraint of needing to be the only people who actually have this insight in order for it to work. And as we go down the data hierarchy, right, there's a ton of value in just having the ability to process data, as opposed to going all the way through to insights.
There's still a lot of market value there. So this is a question for Bloomberg Beta or FirstMark or whatever, but I'd love to know what that curve looks like. So with that as a preface, I'm going to talk about three things. Just some macro trends in the financial services space.
Some new capabilities that we're excited about from our vantage point at Fast Forward Labs. And then, a little bit to reiterate the last panel, processes to sort of make this work. So, macro trends. Choice. This is an alternative data landscape produced by CB Insights, so it's a little bit old. I'm sure there are more players in it today. But the moral of the story is there's a lot out there.
When a firm decides that they're going to pursue some alternative data strategy, it's not as simple as, all right, let's go find this. There really is an art to understanding what these various providers provide, and to understanding what the competition has access to as well. Confusion.
We work with one of the many large sell side banks. When we went in to think about the way in which sell side research firms can better serve the buy side, they said: today, we carpet bomb our clients with research, KPIs, and product opportunities.
There is little to no personalization based on the buy side's trading history, preferences, and the research that they like to use and read. There is, to date, a lack of a Netflix-style personalization-algorithm approach to sell side sales. They just sort of throw things at the buy side. The buy side, then, is flooded and inundated by all the things they've been carpet bombed with.
Their problem is the inverse, where they have to find the stuff that actually makes the most sense and is the most relevant for the decisions they're trying to make because they're flooded with so much information.
The other thing that we see is new things. In thinking about machine learning, there are today many, many, many startups that come and say they're the new hot toy on the block, the next best thing. This is a landscape produced by Shivon Zilis, who's an investor at Bloomberg Beta.
When I look at this, it's cool, but it's also a total mess. One of the big trends that we hear is that people who are in the position to actually make decisions on what they should use have trouble separating the wheat from the chaff.
They don't know the types of questions they should be asking these various software vendors and startups to determine if they really have the right tool for the job. Skepticism—and this was talked about in the last session—where you're going to have some people who are doing just fine.
They're following traditional fundamental research processes, things are working well, they're making lots of money, and they really don't understand why you need to instill the fear of God in them to convince them to use alternative data and shift around the way they've been behaving, successfully, for the past 25 years.
We're seeing a lot of new ways of thinking about finance. These three are all crowdsourced models. The one on the left, Quantopian, is the sort of traditional quantitative, market-data approach to posing a problem and then letting a bunch of data scientists try to solve it.
The one in the middle is a little bit newer. Same type of model, but focused on alternative data rather than quantitative finance methods. The one on the right is a relatively new company out of Singapore. They're building a platform—I'll talk about some of the techniques on later slides—
where you can basically weight risk across an entire structured portfolio based upon input from individual portfolio managers, an outsourced model that firms can incorporate.
The final thing that's unique about—I don't know if anybody is a Get Smart fan. This might be a little old for a lot of people in the audience. But I grew up with Get Smart. The Cone of Silence is this fantastic metaphor for Chinese walls where they're supposed to be able to have a secret conversation that no one else hears, but the engineering is all messed up so everybody can hear what they're saying except for them.
If we think about recommendation algorithms—and we talked about collaboration and science throughout this conference today—data science often depends on collaboration. It depends on being able to collect data across a stream of activity and maybe use a metric in order to see if there are similarities between person A and person B's behavior.
This is tough in this vertical. Directly importing the type of algorithms that Netflix may be using to recommend products just doesn't work. There are certain types of tactical and technical questions that need to be answered in order to sort of make this work in this field.
OK, so that's sort of what I see as like the state of the nation right now and the types of questions that I hear again and again and again from companies.
Shifting gears—there was a question after the last panel. There's so much noise in this space. There are so many new developments coming out of academia on a regular basis. How do we keep on track? How do we pay attention to what's interesting?
I work for a business that is trying to solve that problem. Our main mission is to help companies innovate with new data and machine learning. We do that by writing reports that give a deep dive. They're oriented for both a leadership audience that has some technical acumen as well as an actual data scientist audience.
They go through and describe, for each new algorithmic capability coming on the scene, what it is, how it works, what the history is, and then how you build products with it. The goal there is for us to try to help people stay on top of all of this noise and mess, but in a way that really gives you a concrete, tactical, clear understanding of what it is, how it works, and what's possible with these tools.
I would love it if we could make money only doing this, but we can't. So we also complement this research-oriented model with advising and consulting, where experts will come in and analyze and evaluate companies' machine learning capabilities, or even go as far as to build products for them.
I'm going to talk about three of these research projects that we've done and areas we're excited about this being applied in the financial services vertical in particular.
The first one here is relatively old. We did this work in 2014. Natural language generation is the inverse of natural language processing.
As opposed to having a bunch of messy text and trying to find structure so that we can render that text computable and analyzable, this stuff lets us start with structured spreadsheets, structured data, and then automatically write text that indicates what's interesting in that data.
In our methodology, we always start—we don't just read papers, we actually build product because our team is interested in really understanding what you can do with this and how things work on the technical level. For this particular technique, we built a system that automatically generates real estate ads.
The structured data on the top is: how many bedrooms, how many bathrooms, where is it located, does it have a washer or dryer. Then you press the button and it uses a generative technique to write 1,350 listings in under 20 seconds.
We tried to affiliate the style of each listing with the given jurisdiction. This is all from New York City data; we trained it on the StreetEasy data set. If it's up on the Upper East Side, it'll be a little bit more Upper East Side-y. And if it's in Williamsburg, it will be a little more Williamsburg-y.
We noticed when we were training the models that there were some interesting correlations in the way real estate agents tend to use language when they're describing their apartments.
One of the things we noticed is that if an apartment is mentioned as cozy or small, that normally means it's around 400 square feet, under the standard for the jurisdiction. If size is mentioned at all, that means that it's at least somewhat under standard size. So the next time you are looking for an apartment, if size is mentioned, don't go.
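For a feel of what structured-data-to-text looks like, here's a much-simplified sketch. The real Fast Forward Labs prototype used a learned generative model; this template filler, with made-up field names and phrasing, just shows the shape of the problem: listing fields in, prose out.

```python
# A much-simplified sketch of natural language generation from structured
# data. The actual prototype learned its style from the StreetEasy corpus;
# this hand-written template only illustrates the input/output shape.
def write_listing(listing):
    openers = {
        "Upper East Side": "Classic elegance on the Upper East Side:",
        "Williamsburg": "Sun-drenched Williamsburg loft vibes:",
    }
    opener = openers.get(listing["neighborhood"], "Charming home:")
    size_phrase = "cozy " if listing["sqft"] < 500 else ""        # "cozy" == small, as noted above
    laundry = " Washer/dryer in unit." if listing["has_laundry"] else ""
    return (
        f"{opener} {size_phrase}{listing['bedrooms']}BR/{listing['bathrooms']}BA "
        f"apartment, approximately {listing['sqft']} sq ft.{laundry}"
    )

print(write_listing({
    "neighborhood": "Williamsburg", "bedrooms": 2, "bathrooms": 1,
    "sqft": 450, "has_laundry": True,
}))
```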
That's fun but we're not in real estate. So areas where we're seeing this applied in financial services—you may have seen on Forbes or the Associated Press, a lot of company earnings reports are now written by robots.
There's a company called Analytics Insight that serves sell side firms, writing earnings reports and certain research for the buy side. We're seeing it often in compliance reporting. I know that JPMorgan Chase is thinking about trying to basically automate a lot of the work that their compliance department is doing by automatically writing some of these reports.
I know that they're facing some cultural difficulties in executing on that project, given people's fear that they're going to be replaced. We touched a little bit on democratization of data in the last session.
To use the trendy, jargony term, I think the real value of this stuff is actually in democratization, insofar as there are going to be few people in your organization who like this format. There will be more people, but not all, who like data presented as visualizations and charts. And there are certain types of information that lend themselves well to visualization and other types that lend themselves well to text.
Where all the software vendors are making the most money is by having this be an add-on to their BI platforms, so that whenever you're counting stuff and communicating the output of that counting, you can do it as an automated, personalized email depending on someone's role in your organization, as opposed to just presenting them with charts.
The next area. This has seen much more interest than the first, I think, and is more exciting. And again, not to counteract the main thesis of the last panel: when one is practically executing data science in an organization, you don't start with word embeddings. You start with regular expressions, diagrams, parsing language, et cetera.
There really has been a really interesting development in the capabilities of language processing over the past couple of years.
If you think back to traditional NLP, this was inspired by philosophers and linguists like Noam Chomsky, who really thought that language is platonic. There are these sort of structures that exist in the world, they exist in our brain, and we're all born with these ingrained, inborn capabilities to discern grammar.
A lot of the early efforts in the '50s and '60s through the '80s were to assume that language had this structure and then to break down the messy, yucky world of language into trees to discern that, which lends itself to structure and hence, computability. That got us so far. But as we can imagine, there's all sorts of problems that type of methodology can't really solve.
In the 2000s, with the advent of big data and lots of statistical methods at companies like Google, we said, well, why don't we turn a problem like translation into a lookup problem.
If we're going to try to translate from English to French, we'll have tons of examples of a phrase in English and tons of examples of a phrase in French, and then we'll find where the two columns line up, and we'll assume that that's supposed to be the translation.
The "big" in big, big data is emphasized, and we turned this into a statistical problem where we're looking for n-grams—one-, two-, three-word expressions—that correlate with one another as a proxy or index for what something might mean. Meaning doesn't really work that way.
The new, really exciting developments in NLP these days are in the realm of word embeddings, which is affiliated with using deep learning to try to process meaning, where we'll take a word or a sentence or a document and we'll make a vector.
We'll make a computational representation of that. We'll turn it into a string of numbers. And then we do math on those numbers. We can do things like plot a word on this three dimensional plane.
We can say, all right, let's plot woman as well and measure the distance between those two, and measure the direction of the vector. And then we say, well, let's try to do an analogy: as man is to king, woman is to what? We just follow the line of the vector that leads to that output, and we can find and tease meaning out of this.
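Here's a minimal sketch of that analogy trick using gensim and pretrained word2vec vectors. The model name and the gensim downloader call are assumptions for illustration; any pretrained embedding would show the same behavior.

```python
# A minimal sketch of the word-analogy trick with pretrained word vectors.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

# "man is to king as woman is to ?"
# vector arithmetic: king - man + woman ~= queen
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]
```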
We thought this was cool and wanted to see what we could do with it. So we said, why don't we pose the problem of automatically summarizing text. Automatic summarization is going to be way harder; it's not going to align well with the simplified structure that we find here.
It works better if we assume our default ontology is a mess and then try to turn it into numbers and do stuff with it. What we did here was take long articles—New Yorker, Atlantic, you name it—and turn them into a series of vectors. Then we said, let's score each sentence, compare where the meaning of the whole lies in our mathematical space with the meaning of each sentence, and extract out the one or two sentences that lie closest in that vector space to the meaning of the whole.
We used that as our proxy for saying these are the sentences that we think really capture the meaning of this document. This has potential significance for trying to manage all the information that financial services deal with these days.
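A minimal sketch of that scoring idea, assuming spaCy's medium English model as the embedding (the actual Fast Forward Labs system used different models): embed the document and each sentence, then keep the sentences whose vectors sit closest to the document vector.

```python
# Extractive summarization sketch: score each sentence by its similarity to
# the whole document's vector and keep the closest ones.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

def summarize(text, n_sentences=2):
    doc = nlp(text)
    doc_vec = doc.vector  # average word vector as a crude document embedding
    scored = []
    for sent in doc.sents:
        # cosine similarity between the sentence vector and the document vector
        denom = (np.linalg.norm(sent.vector) * np.linalg.norm(doc_vec)) or 1.0
        score = float(np.dot(sent.vector, doc_vec) / denom)
        scored.append((score, sent.text))
    # the sentences closest in vector space to the meaning of the whole
    return [text for _, text in sorted(scored, reverse=True)[:n_sentences]]
```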
So, Mohamed over there—he knows a lot about this, so if you're interested, you can ask him some questions. When they first offered their product onto the market, they were focused a lot on identifying breaking news. They could parse all of this text very quickly so as to discern that something had happened up to 12 hours faster than some of the other news agencies that are out there, and at least 15 minutes faster than online news agencies.
I had a request from a sell side firm for content derivatives. So you're going to have a ton of research that's generated. And there's the marketing and creative teams who are then tasked to turn this into emails, tweets, some sort of offering to try to pitch new clients.
They said, can we take this and can you help us automatically generate the abstract, the tweet, the newsletter clip, and the personalized email version of this? This is hard, for a lot of reasons. On the people side, the technical term here, the data scientists will say, is lossy compression. What that means is we've got our full document and we're going to take a summary, and we're going to lose stuff in the process.
We're taking a non-perfect smaller version of it in the attempt to try to gain efficiencies and read stuff faster. Most portfolio managers, research analysts that are coming from a fundamental discretionary background, don't like that idea.
They feel like the most important information is the silver bullet that sort of lies at the end of the document. I think there are some lessons here, in terms of the political challenges that the industry will face, that one can borrow from the legal vertical, which has been trying to use machine learning to facilitate the process of discovery—automatically identifying information that's relevant for a lawsuit. The lawyers all have the same complaint.
In all of these white collar, information-governed industries, you're going to have not only the fear of, are you replacing me? are you augmenting me? But also: this is not how I think. This technology doesn't really facilitate the way I process information myself.
The second is back to the political side: setting expectations. This is a relatively new capability. It's really hard to rewrite the document in our own words; it's easier to just do that scoring function that I described and extract out the sentences that count.
More technically and practically, this stuff requires a lot of data. While you might assume you're drowning in data, most of the companies that I see on both sides—sell and buy—haven't actually organized and architected their data so that they can train machine learning classifiers.
This stuff involves what we call a supervised learning approach, where it's not "throw me data and the machine learning will come in and discern patterns automatically." Rather, we start with the answers we're supposed to get and build out a model from there. Often, while this is theoretically possible, products fall flat because we just don't have the data to train them.
Similarly, if you want to do this in the future, it's OK if you don't have enough because there are certain types of techniques to bootstrap your way to progress by keeping a human in the loop, but that takes some thought.
Then the final thing is the sort of pie-in-the-sky perfect solution to the problem here—there's so much information, and we want to find the stuff that's most important and most relevant at the right time—which is a super complex, let's say, fusion of search techniques and recommendation techniques. That's hard, and just becoming possible to solve.
The final thing that we're super excited about is in the realm of probabilistic programming. These are new computer languages that don't denote procedure—it's not, we're going to do x, then we're going to do y, et cetera, like standard deterministic programming languages—but denote inference.
Here, as a programmer, you can come in and basically state what you think your distribution looks like. Is it normal, is it Poisson, is it binomial—whatever that may be. The back end of these languages has a lot of sampling power baked in, so they make it really efficient to just sample.
You're not sampling at random; you're focusing your sampling energy based on the slope of the curve. And then the output is a posterior distribution, if you're familiar with Bayesian terminology. You come in and you say, I think it looks like this.
You put the data into the model to update your parameters, and then it outputs a full distribution that says: based on what you told me you think this looks like, and what we've learned from the data, we think this is the final output.
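Here's a minimal sketch of that workflow using PyMC3, a Python probabilistic-programming library. The prior choices and the fake return series are illustrative assumptions: you state what you think the distribution looks like, feed in data, and get back full posteriors rather than point estimates.

```python
# A minimal probabilistic-programming sketch: priors in, data in, full
# posterior distributions out (not single numbers).
import numpy as np
import pymc3 as pm

observed_returns = np.random.normal(0.001, 0.02, size=250)  # fake daily returns

with pm.Model():
    # Priors: what we think the parameters look like before seeing data
    mu = pm.Normal("mu", mu=0.0, sd=0.05)
    sigma = pm.HalfNormal("sigma", sd=0.05)
    # Likelihood: the observed data updates those parameters
    pm.Normal("returns", mu=mu, sd=sigma, observed=observed_returns)
    # The language's back end handles the sampling
    trace = pm.sample(2000, tune=1000)

# Full posteriors for mu and sigma, which we can summarize however we like
print(trace["mu"].mean(), trace["sigma"].mean())
```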
When we studied this, we built a system that helps predict real estate prices going into the future. We're using color as a visual metaphor for a probability distribution. If it's low probability that you can find a house for less than $1,600,000—welcome to New York—we're going to give you a white color. The likelihood of your being able to find that scales up as the colors go to dark purple.
One of the key powers of Bayesian inference as a whole—and of probabilistic programming as a means to use Bayesian inference fast—is that when you're doing a prediction of the future, you're not just producing a point estimate.
If you're trying to think about a trade or manage weighted risk across a portfolio, it's not just going to be, this is where we think it's going to be in 2019. But rather, based on the uncertainty in our input model, this is the relative confidence of our output at various points in the future.
I did a webinar on this with some folks at the Stan Group. They're a group of statisticians coming out of Columbia. And this is a slide from Eric Novik, who's the CEO of a company called Stan Fit, the Stan Consulting Group or whatever, that is trying to sort of just go out and help the world use this kind of stuff in practice.
What I really liked about this slide is that it helps—when we are thinking about data science lifecycle and the data science pipeline—it helps you really understand if you move into a Bayesian framework, it looks different, right?
We're not approaching data from, let's take a bunch of data, throw some machine learning at it, and get our point predictions. Rather, let's start with a model, as I said, import our data into that model, update it, and then use that to, quote, "inform more rational decisions."
I think from a process perspective, it really is quite different from what you see in a standard data science pipeline. Using this in practice. So these two lovely gentlemen. This is Scott. This is Thomas Wiecki, head of data science at Quantopian. These guys both love probabilistic programming. This fellow here just put out a language, which is a Python front end interface to use this stuff. Stan is a little bit more cumbersome because it's coming out of academia.
They said, all right, here's my problem. I've got daily trading strategy returns with unknown distributions and dependence between them. I've got a risk that the strategy is overfit to historical data.
Often, when you're trying to use machine learning to predict what you should do, the market today might not be like the market in the past that you used to train your models.
They say, I want to do the best in achieving returns across an entire portfolio, not just one strategy. And I'd like to understand the risk and uncertainty in this weighted portfolio.
These guys find that probabilistic programming is a great tool to answer this particular question, which is: how much capital should I allocate to the different sub-strategies in my overall strategy so that I can fully understand my risk and returns with a full distribution?
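Here's a rough sketch of what that kind of allocation question can look like in a probabilistic-programming setting. This is not Quantopian's actual model—the data, priors, and weights are all made up—but it shows the payoff: for a candidate allocation, you get a full posterior distribution of portfolio returns rather than a single number.

```python
# Sketch: fit each sub-strategy's return distribution, then look at the
# posterior distribution of portfolio returns for a candidate allocation.
import numpy as np
import pymc3 as pm

# Fake daily returns for three sub-strategies (illustrative data only)
rng = np.random.RandomState(42)
returns = np.column_stack([
    rng.normal(0.0010, 0.010, 500),
    rng.normal(0.0005, 0.005, 500),
    rng.normal(0.0015, 0.020, 500),
])

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sd=0.01, shape=3)       # mean return per strategy
    sigma = pm.HalfNormal("sigma", sd=0.05, shape=3)     # volatility per strategy
    nu = pm.Exponential("nu", 1 / 10.0)                  # fat-tail parameter
    pm.StudentT("obs", nu=nu, mu=mu, sd=sigma, observed=returns)
    trace = pm.sample(1000, tune=1000)

weights = np.array([0.5, 0.3, 0.2])  # candidate capital allocation
# Posterior distribution of the portfolio's expected daily return
portfolio_mu = trace["mu"] @ weights
print("mean:", portfolio_mu.mean(),
      "5th-95th percentile:", np.percentile(portfolio_mu, [5, 95]))
```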
Which I think is cool. So moral of the story. When we're thinking about using machine learning to make the trade, it's possible to move beyond prediction. It's possible for us to, using this kind of stuff, incorporate subject matter expertise into our models. I'll talk about what that means for collaboration between subject matter experts and data science in a minute.
It's possible for us to update our old assumptions in light of new data, in the Bayesian manner. And then, most importantly, as opposed to just saying, this is what's going to happen, we can actually quantify our confidence in future predictions to make more rational decisions.
OK. So, final stuff. This is a little bit of overlap with the last panel. Every time I talk to folks here, my main takeaway is that there are reasons why data is relatively sophisticated in financial services.
It's the industry that, besides insurance, is probably the best poised to succeed, because the business people are comfortable with decisions under uncertainty. Other industries are getting nailed, and they have a ton of cultural work to do, because data has historically been under the function of the CFO.
It's used for P&L and basically counting things that have happened in the past. There's a ton of certainty affiliated with it, and people assume that data is supposed to be oracular. And when the data scientists come in and they do an experiment and it doesn't work, or there's a 72% confidence rate, the people in the business lines are like, I don't know what to do with this, your product doesn't work. And a lot of projects get stymied.
The great thing about being in finance is that you're used to that, right? So there's less work to do to help people understand that this stuff is probabilistic. As we said in the last panel, you need to lay a solid foundation. Janitorial work and making sure the data is clean is kind of the most important part of the game. This is more on the alternative data side.
As I mentioned in the beginning of the presentation, you don't just buy any and all data that's out there. You get the stuff that's the right data for you. And this is from Quandl. It's like a marketplace for alternative data for financial services these days.
If you're somebody in the audience who's actually the procurer of alternative data sets, I would look at the detail of it. History: is this a brand new startup that has six months' worth of data, or do you have five years of history to actually build out some good regressions?
Breadth: one sector or many sectors. And then rarity: how many people are going to have this? How processed is it?
When you're using it, think about a spectrum of more systematic versus more fundamental. We'll put Two Sigma at super systematic and Fidelity at more fundamental. You can incorporate alternative data into a financial services context in different ways.
If you're Two Sigma, you don't have a lot of people there. You're basically going to put a lot of stuff together. Weight the importance of each data set using parameters. You might have a tool like SigOpt, which is an optimization tool, so that at the end of the day, it's going to re-weight things so that you make the decision tomorrow in your trade.
On the other side, you're going to combine alternative data with your traditional financial data. You're going to have some of these types of parameters that are moving around. But most of the time, the final decision maker is a person. And then that person is going to make a trade.
These two models are very different. It's not just, we have an alternative data strategy. It's: where are we today on this spectrum? And how do we orient our processes so that these things fit together nicely? Which leads to how you structure your teams.
In the last panel, it seemed like there was a little bit of variation, actually. These are slides from Dan, who led data science at LinkedIn for a while, who proposed that there's three ways to think about where to put your data scientists.
They can stand alone as an independent group. They can be embedded, which means you have sort of a central data science team and they go off and they do projects as the business need comes up. Or they can report to the line of business owners. So they can actually sit out with the various business teams, be they investment teams, be they actuarial teams, whatever that may be.
I'm not going to go through. You can read the slides for the pluses and minuses. Each of these has pluses and minuses, as you can imagine in any company where you've got either a standalone unit who has its own mandates, its own political stuff. Or you've got people that are supporting business lines. You get different outputs depending on what you do.
I've mostly found, especially in a firm that's starting as fundamental and then moving towards alternative data, I think it's much better to have this integrated type team, where you do have some liaison into the business line.
But as they said in the former panels, it's great to have that be your first—you've got your subject matter experts to help build a prototype. But then you want to build out some sort of centralized ability to scale and build products that you're not just spinning around in one off consulting projects.
A big confusion that I see when companies are starting to think about data is they want to go from like zero to 100 overnight. When they pose problems—say it's working with the sell side and they want to do personalized sales recommendations—they view it in their minds as a technical problem, where we go to full automation overnight.
I'll come in and say, all right, that's a massive, three-year endeavor that will require a ton of infrastructure investment, a ton of back end work, et cetera. But if we put a human in the loop, we can take what would be an impossible problem and turn it into a three month fun prototyping project.
This picture here comes from Stitch Fix, which is a personal shopping company based out in San Francisco. They have this product where you can come in—it's not like Amazon where you shop. You basically just sign up.
I, as a woman, would put in my size, my clothing tastes, whatever it may be. And they put it through this pipeline where the first round of recommendations is all algorithmic. So they're going to take the initial data and output a series of recommendations on what might be interesting for me.
Then they push it out to a bunch of 1099, Uber-like workers—their personal stylists—who take the algorithmic output and apply their human judgment, then pass that down to the inventory pipeline. And then it eventually gets to the consumer.
Basically, they've taken the Amazon model and put a middleman in to do the first choosing. But that choosing is updating the performance of their recommenders so that they can actually make their business model work.
I just find that for a lot of the current data infrastructure that companies have—in particular for the supervised learning models—it's helpful to keep a human in the loop. This is a big one.
Back to the—you're going to get hit by a ton of companies trying to sell you software. I think most people assume that you can get more headway if you buy a tool than if you build something.
When one works in data science, it's a lot harder to solve the general version of your problem than the particular version of your problem. Which means there really is an opportunity to get a competitive edge if you astutely and selectively pick the projects where you want to invest the time to build it internally.
You can solve it on your data and get some headway with your capabilities, as opposed to waiting six months to a year for a vendor, which would have to solve the general problem in order to actually have market share. So it's a little counterintuitive.
This, I think, a lot of the other people talked about. I can tell you—I lead all of this kind of development for my company—they talked on the last panel about having to do the internal sales work in order to convince an organization that experimentation is OK.
It might not all work. That's like 300 times harder when you're a vendor. So I'm coming in and I'm trying to have people pay us money to do things that might not work and it's very hard. That said, I agree with what was said in the last panel regarding process, where you start out with a business problem.
We spend a lot of time working with people who are actually out in the field to figure out if there will be real business impact if we build a product. Use that to define a set of metrics to evaluate what we're doing.
The second phase is always experimentation. We try a couple of different types of models, different types of features that we might be weighting in our models, to see if what we think will work actually works. And then we pass into the hardening and engineering stage to actually turn this into a product.
There's some overlap, right? Design can happen in the front where you're trying to figure out which features actually matter. You can do design and data science in parallel. You just might not have the right particular outputs in a feature at the time. Then there is always an engineering data requirement on the back end.
The final point regards—this is a little bit on the compliance side—something that we've learned at our company over the past year, where we're working on a topic right now called interpretability.
If I've got a black box model, I've got my input, it goes through magic. I get my output. Often, when one is working with some of these more complex models like deep learning, we really don't know why because the functions underlying them aren't linear. They're non-linear.
We don't know what input is leading to what output. Which, on the surface, seems to be a big deal for regulators, because if you're working in, say, granting loans, and you're unable to account for why a particular person—of a given race or at a given income level—is not receiving a product, that can be problematic.
But what we've seen in finance is that, from the information barrier perspective, this can often be more of an asset than a liability. Let's say we're on the research side: you can potentially have insights from the more research-oriented side of the bank impact what the sales team can do, if there is this black box between them that actually prohibits people from knowing why a certain output has been created.
I think there's creative ways to work with the fact that these models aren't always understandable. But it's something that always needs to be considered if you're going to shift—if you're getting used to working with regressions and you're trying to experiment with something new. Make sure that you can do it before you try it. And that's it. Any questions?
Q: On build versus buy, you spoke about methodology and building products. How do you guide your clients through the data? You can buy data licenses and just consume APIs, or you can build proprietary data sets. How do you guide clients through making that decision?
Kathryn Hume: Depends on the size of the client. So if we're working with like these banks or insurance companies—we have one that has data that dates back to the 1890s. It's just always been in paper form so they're now trying to sort of turn it and digitize it and make it meaningful.
I think almost always, first party data is not enough. So there's value to then having some sort of new third party interface. I would start with the question you're trying to solve as opposed to just arbitrarily buying data.
We find it's most useful to ask: what do we have today? What do we want to do? What can we imagine based upon where we are? If we imagine that and we know that we don't have the data to actually inform whatever project we'd like to undertake, how can we then explore what's out there to fill that in? And then go from there. So that's for the large company.
When it's smaller—when we do all of the research that we're doing, we don't have any first party proprietary data. There's a reason why it's always real estate, right? There's a lot of data publicly available. We use data sets that are out there.
We're working on churn right now, and we're fortunate enough to have had one of our customers be interested in having an interpretable churn model, so we're working with them. But yes, I think it depends on your size.
Q: Pretty distinct presentation. But two things. First, you said that the value of the data analysis decreases very rapidly, so it's very important that you pick up on the original idea quickly and harness it. One of the ideas that you mentioned was the concept of distance. Distance has always existed for numeric data but has only recently been extended to text. There is huge potential to use that to extract what people are thinking. What is their reaction to a particular asset?
Kathryn Hume: So there's a whole gold mine which can be explored now using that particular technology. The only thing I would say is, be aware—I think one of the initial impulses in developing an alternative data strategy would be to go for social media data.
It's not always as valuable as you would think. I've seen a lot of people that are disappointed when they license APIs from companies that mine Twitter data, as an example. And I think one of the things—I mean, we talked about sentiment earlier on—is that if there's no ground truth—if it's very hard for two people to agree on whether something should be positive or negative—it often isn't a good data problem.
While sentiment seems like a great avenue because it can be binary—positive or negative—it's not quite clear that emotions map cleanly onto those categories. It can lead to results that are not as useful as we'd like.
The other area where it's very important is improving customer satisfaction through conversations during customer calls. There's a whole lot of data there, and analyzing it quickly and addressing the real issues has real value.
I didn't talk too much about retail banks—this was a little focused on investment management. But the chatbot storm—chatbots are taking retail insurance companies and banks by storm. I think it's safe to say that in the next five years, the customer service functions will be drastically changed.
Again, the risk is that a lot of the people championing these products don't actually understand the technical difficulty of building a good bot. It goes back to vendors: there are a lot of vendors that use a lot of big rhetoric, and their products suck. There are also people who want to do the fancy work. They're attracted to the shiny conversational problem, whereas it's the back end data problem that actually leads to the chatbot having meaningful things to say as opposed to just saying junk.
I think the most interesting chatbot problems are actually just a step up from a tree—like when you call the bank and you get options on your phone—where you've got a lot of recorded historical questions and responses, right? You can use that to train a not-so-smart but decently cogent customer service agent. It's a huge trend and I actually think it's really meaningful.
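A minimal sketch of that "not so smart but decently cogent" agent: retrieval over historical question/response pairs rather than open-ended generation. The data and the scikit-learn TF-IDF approach are illustrative assumptions, not how any particular bank built theirs.

```python
# Retrieval-based customer service agent: find the most similar historical
# question and reuse its recorded response. The history here is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    ("How do I reset my online banking password?", "You can reset it under Settings > Security."),
    ("What is the wire transfer cutoff time?", "Wires submitted before 5pm ET settle same day."),
    ("How do I dispute a card charge?", "Open the transaction and choose 'Dispute charge'."),
]

questions = [q for q, _ in history]
vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def respond(user_message):
    # Nearest historical question by cosine similarity of TF-IDF vectors
    sims = cosine_similarity(vectorizer.transform([user_message]), question_vectors)
    best = sims.argmax()
    return history[best][1]

print(respond("I forgot my password, how do I reset it?"))
```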