Summertime Analytics: Predicting E. Coli and West Nile Virus
Domino | 2017-12-14 | 31 min read
Gene Leynes (Senior Data Scientist) and Nick Lucius (Advanced Analytics) from the City of Chicago discussed two predictive analytics projects that forecasted potential risk involved with E. coli in Lake Michigan and West Nile Virus from mosquitos.
At a recent Data Science PopUp, Gene Leynes and Nick Lucius, from the City of Chicago Advanced Analytics Team, provided insight into the data collection, analysis, and models involved with two predictive analytics projects.
Session highlights include:
- Use of a generalized linear mixed-effects model that incorporates seasonal and regional bias. The model caught 78% of true West Nile Virus positives one week in advance, and 65% of its positive predictions were correct.
- The team found that a linear model would not work for predicting E. coli in Lake Michigan because there was no clear correlation. Instead, the team implemented rapid testing of volatile beaches, strategic selection of predictor beaches, and use of a “k-means clustering algorithm in order to cluster the beaches into different groups and then use that information to decide our predictors for the model.”
Interested in the West Nile project? Review the West Nile project on GitHub and download the West Nile data. How about the E. coli project? Then review the Clear Water project on GitHub and download the data on the data portal. For more insights from the presentation, review the video and the presentation transcript.
Hello. I'm Gene Leynes.
I'm a data scientist at the city of Chicago. We're going to get started with the next presentation. We've had a nice mix of technical and more high-level talks. Today I'm going to talk about a couple of projects that we've been working on at the city that were debuted this summer. At any given time, we have quite a few different initiatives going on in the city of Chicago. We're always doing different data science projects to optimize operations and try to make more from less. We're just like every other company. We deal with shrinking budgets, and we're always trying to find ways to do things smarter to extend our resources further. Within our data science team, we have DBAs, and we have business intelligence professionals. Nick and I are representative of the advanced analytics team, and we report directly to the Chief Data Officer Tom Schenk, who, if you do anything with data and you live in Chicago, you've probably seen Tom because he's omnipresent. The two projects we're going to talk about today are the West Nile virus predictions and the water quality predictions.
I'm going to start off with a slide of-- this is horse brain tissue from a horse that died from West Nile virus and… this is the enemy, this is the mosquito. This is how humans, horses, babies, old people…this is how we get West Nile.
This was a really interesting project. I learned a lot working on it. I'm not a big fan of pestilence, but at the same time it was a pretty enjoyable project. It was really surprising and interesting.
Who'd have a guess where West Nile is in the United States or where it started? Well, what were some states you'd think of? Florida, California, North Carolina, that's a good one. It is a little bit in these places. It started in Queens. It's actually a very urban problem. It's a very unusual sort of thing. It came to Queens in 1999, spread very quickly throughout the United States. And by 2001 or 2002, it was already in Illinois. And we are the fifth-most contaged-- I don't know the right word for that-- state in the Union. We're actually right there in the top as far as West Nile virus cases. So, that was very surprising to me.
And the thing-- there's some good news about that, though. Even though it's everywhere, it's not usually that bad. Most of the people who are infected with West Nile don't even know that they ever were infected. About 80% of people show absolutely no symptoms at all. And of the 20% of the people who do show symptoms, they have flu-like symptoms, and they rarely even go to the hospital or to the emergency room. Of those people, 1% have the severe symptoms that are neural-invasive diseases, and this is where it gets bad. This is where people become paralyzed and have chronic pain and die. But it's really a pretty small number.
This year, for example, we had, I think, final count-- and they're still coming in, because there is about a 60-day lag between incubation and the testing, 30 days for incubation, 30 days for testing. The numbers are actually still coming in, but there's only about, I think it was three or four cases throughout Illinois this whole year. But the weird thing, and the reason that it's actually a pretty important public health issue, is because outbreaks can happen anywhere, and they're very unpredictable.
For example, one year in Colorado, there were something like 2,500 cases. I don't know why. I don't think Colorado knows why. But it's very important to get ahead of these cases and to take action to reduce the West Nile spread and to reduce the mosquito populations.
The other important thing to know is that not every mosquito transmits West Nile virus; it's Culex restuans and Culex pipiens. These are not the nuisance mosquitoes that are normally biting you at backyard barbecues. A lot of people in Chicago-- it's funny… the south side of Chicago really wants us to spray for mosquitoes, and the north side really doesn't want us to spray because they want their beehives, and they're more naturalist focused. You've kind of got this dichotomy of like who wants to be sprayed and who doesn't. It doesn't actually matter because we're not spraying for the things that bother you, for the most part anyway.
This is kind of important to understand. This is the life cycle of the disease basically. West Nile is mostly transmitted between mosquitoes and birds. Birds migrate and pass it around the country. That's why it spread so quickly throughout the United States. The mosquitoes transmit it from bird to bird. The mosquitoes that infect the birds actually don't prefer humans. The human and horse cases are pretty much just spillover.
So enough of being a downer (I'm going to use the same language that I used from the Chi Hack Night. I'm sorry if you were there but that's what I'm going to do). I'd like to talk about what we do to prevent West Nile virus at the city of Chicago; we really do three things.
The first thing is we larvicide stormwater drains and it's an unbelievable number. I still kind of can't believe it. I'd have to see it to believe it. But we larvicide 150,000 storm drains around the city of Chicago. And they get interns or whoever to drop pellets. [LAUGHTER] So they drop these pellets into the catch basins in the storm drain, and this basically just prevents the mosquitoes from breeding in the storm water.
The second thing we do is we do DNA testing. We have these gravid traps that have this specially formulated sugar water that attracts just the mosquito species that we want. There's a little fan, because mosquitoes are terrible fliers, and it just blows them up into the net. We catch the mosquitoes alive. We have-- I should know the number…. it's like 40 or so traps around the city. We harvest these, shake them out, and grind up the mosquitoes in batches of no more than 50, because over 50 the West Nile DNA would become too diluted to measure. Then we test those batches for West Nile using DNA tests at the CDC lab on the west side.
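The batching rule described above can be sketched in a few lines of Python. This is only an illustration, not the lab's actual procedure: the function name `split_into_pools` and the example catch size are hypothetical, and the only detail taken from the talk is the cap of 50 mosquitoes per batch.

```python
MAX_POOL_SIZE = 50  # cap from the talk: larger batches dilute the DNA too much


def split_into_pools(mosquito_count, max_pool_size=MAX_POOL_SIZE):
    """Split one trap's catch into test batches of at most max_pool_size."""
    if mosquito_count < 0:
        raise ValueError("mosquito_count must be non-negative")
    full_pools, remainder = divmod(mosquito_count, max_pool_size)
    pools = [max_pool_size] * full_pools
    if remainder:
        pools.append(remainder)  # last, partially filled batch
    return pools


# Hypothetical catch of 120 mosquitoes from a single trap
print(split_into_pools(120))  # [50, 50, 20]
```

Each batch then yields one positive-or-negative lab result, which is why the published data is per batch rather than per mosquito.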
Then the third thing that we do, if West Nile is present in a particular region where we're testing, and we have almost complete city coverage, if we see it two weeks in a row, we spray for it. The whole point of this project was to reduce that time from two weeks to one week to really nip it in the bud when we do have these problems and immediately knock down the mosquito population. So this is what the data looks like. This is not a fancy ESRI map… This is just my map. [LAUGHTER]
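As a rough sketch, the two-weeks-in-a-row spray trigger can be expressed as a small check over a region's weekly results. The function name and the sample data are hypothetical, and the sketch assumes one positive-or-negative result per region per week, which is a simplification of the real lab data.

```python
def has_two_consecutive_positives(weekly_positive):
    """Return True if any two back-to-back weeks both tested positive."""
    return any(
        this_week and next_week
        for this_week, next_week in zip(weekly_positive, weekly_positive[1:])
    )


# Hypothetical weekly results for one region (True = West Nile detected)
print(has_two_consecutive_positives([False, True, False, True]))  # False
print(has_two_consecutive_positives([False, True, True, False]))  # True
```

The prediction project aims to replace the second observed week with a forecast, so that action can follow the first positive week.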
So the-- oh, and also, this would be a good time. I should point out we use a lot of open-source tools. A lot of our projects, including this project, are on GitHub. The data itself is on the open-data portal. The data that I used in the model is exactly the same data that you can use, at least as far as the test results go.
I actually did have some secret data that we couldn't make public. That's the precise trap location because we can't make that publicly available. These are approximate trap locations because we don't want people tampering with the traps. But you can get pretty close. We have the lat-longs of the approximate trap locations for all the traps published on the open-data portal. I guess I was wrong in my numbers. There's about 60. I wrote these slides, but I forgot because it's been awhile. We have about 60 of these traps located throughout the city collecting these mosquitoes. They're collected usually once a week, maybe twice a week. We publish the data in terms of the actual lab results on the data portal.
This is what the shape of the mosquito season looks like in the city of Chicago. The blue line represents the mean number of mosquitoes. Wait, let me just read this. Actually, I think it's the total. Anyway, no, this is the total number of mosquitoes captured per trap. Then the orange line is the average number that are infected with West Nile. You can see in May, we started doing the collecting. We don't put all the traps out yet because there's never any West Nile in May. June, it starts to pick up. We start to get our very first positive results. July, it's ramping up. August is the really heavy month. And by the end of the year or by October, it's gone. There's a little bit there, but it's really gone by the end of October.
The other thing that has been really reinforced by working on this project with me personally is that we can really see the effects of climate change. A lot of these vector-borne illnesses, and by vector, I mean mosquito vector or tick vectors, are really increasing throughout the United States in different climates. The seasons are getting longer because we're missing the really cold winters that kill off the vectors. This is a problem that's going to continue to be a problem or continue to get worse. But sorry about that. But that's certainly something that was reinforced with me.
So, the guy before me-- I'm sorry-- could probably explain the model better than me. We used a generalized mixed-effect model and it does use Bayesian optimization to correct for the bias of the shape of the season, as well as the bias on a trap-by-trap basis. Some traps are just more likely to have positive results. The other variables that we feed into the model include things like weather and whether or not the trap had positive results last week, as well as the cumulative number of results that the trap has had throughout the season. We try to get an idea of whether or not it's a bad season, what happened last week, what the weather is, and then incorporate the overall shape of the season.
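To make the trap-level inputs concrete, here is a minimal Python sketch of two of the variables mentioned above: whether the trap was positive last week, and its cumulative positives so far in the season. It is not the project's code. The function name, field names, and sample data are hypothetical, weather is left out, and the sketch assumes one result per trap per consecutive week.

```python
def build_trap_features(weekly_results):
    """Build per-week features for one trap in one season.

    weekly_results: list of (week_number, is_positive) tuples.
    Each row uses only results from earlier weeks, so it could be
    computed before that week's lab result is known.
    """
    features = []
    positive_last_week = False
    cumulative_positives = 0
    for week, is_positive in sorted(weekly_results):
        features.append({
            "week": week,
            "positive_last_week": positive_last_week,
            "cumulative_positives": cumulative_positives,
        })
        # Update the running state only after the row is emitted,
        # so no row ever sees its own outcome.
        positive_last_week = bool(is_positive)
        cumulative_positives += int(bool(is_positive))
    return features


# Hypothetical season for a single trap
rows = build_trap_features([(27, False), (28, True), (29, True), (30, False)])
for row in rows:
    print(row)
```

Holding back the current week's result is the design choice that matters here: it is what lets the same features be built for next week's prediction.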
We tried a lot of different models. We tried gradient boosted [models] / GBMs, we tried random forests, and we actually-- to back up again, this whole thing started with a Kaggle competition that used really sophisticated Bayesian models to calculate the entire season. It's funny. The results from the Kaggle competition weren't particularly useful for us because they won the competition, but it was tuned for each season. And you don't know the season until after the season happens. So, they weren't cheating, but it wasn't something that we could immediately just take off the shelf and use for predictions for next week.
And by the way, in all these data science problems, the hardest thing for me is usually figuring out -- it's easy to model stuff -- it's hard to figure out how to project it out for your t-plus-one time step, and how to put things back in your model so that you're making a prediction for next week. This is the thing that's not in a textbook, and this is the thing where the rubber meets the road, where it's always the tricky part.
The outcome from our model was a number between 0 and 1. Let me think. For the most part, the results were around 14. I think that was the average. We chose a cut off of 0.39. Anything over 0.39 we said “this is a positive”. With that cut off we were able to predict 78% of the true positives, and of our positives we were correct 65% of the time. I don't have the f-score handy. You can certainly find it in the GitHub page. It really is all there. Some of the more machine learning people might enjoy seeing some of those statistics and seeing an old-fashioned confusion matrix, but this is something that makes it a lot easier to communicate to the public and to management and to epidemiologists and to other people within the city of Chicago.
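The cutoff and the two rates quoted above correspond to recall (78% of true positives found) and precision (65% of positive calls correct). The Python sketch below shows how those quantities and an F-score relate. The scores and outcomes are made up for illustration, the function names are hypothetical, and the F-score at the end is simply the harmonic mean of the two rounded figures from the talk, not a number taken from the project repository.

```python
CUTOFF = 0.39  # cutoff quoted in the talk


def classify(scores, cutoff=CUTOFF):
    """Call a score positive when it is strictly above the cutoff."""
    return [score > cutoff for score in scores]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from paired boolean lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical model scores and lab outcomes, for illustration only
scores = [0.05, 0.14, 0.40, 0.62, 0.30, 0.80]
actual = [False, False, True, True, True, False]
precision, recall, f1 = precision_recall_f1(classify(scores), actual)

# F-score implied by the rounded figures quoted in the talk
quoted_f1 = f1_from(precision=0.65, recall=0.78)
print(round(quoted_f1, 3))  # 0.709
```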
Once we have these predictions we put them into our situational awareness program. I'm going to give you an idea of what that looks like. We have this thing where you can find the data, and you select the data set that you want. This is a preloaded query and we basically said anything that's over 0.39, color it red. Anything that's under, color it green. And this, for one particular week in the middle of the summer, was what our map looked like. And these are some of the details of what actually happened in that trap. And underneath, I don't think I show this part, but there's a little thing down here where you can look at the raw data and download that if you want it. So unfortunately, this particular data set, because of the trap locations, is only available internally in Windy Grid, which is our situational awareness program. But we also have something called Open Grid, which has most of our other data sets. I think it does have the West Nile test results, but it doesn't have the predictions per se because the predictions also have the top secret locations.
I hope this gives you a sense of one of our projects. And I'm going to let Nick take over and tell you about another pretty cool project that we've been doing. Thanks.
I'm looking forward to telling you about another project here in the city of Chicago. It's involving similar conditions, E.coli in the lake water. So again, we're looking at a pathogen, something that can cause illness, and a way to use predictive analytics in order to help let people know about the problem and try to mitigate the diseases. I'm really sorry about this slide. I thought about not including it, given the time of the year. But it's really impossible to tell the story of E.coli and the Chicago beaches without this because this is why this project exists.
Chicago's beaches are a very large source of enjoyment for residents and visitors during the summers. People go to swim. People go to have picnics. People go to ride their bikes. It's such an amazing amenity that we have. So a little bit more about it-- I didn't realize this when I started the project, but over 20 million people each year visit Chicago beaches. I think that's just an astounding number, given that there's not even three million residents in the city of Chicago. I think this is also a timely number, given that news just came out, I think, yesterday or the day before that Chicago's hitting all-time tourism records. Over 55 million people visit the city of Chicago on an annual basis. And so, yeah, that's amazing. 20 million of them hit the beaches.
Now, each year at the 27 Chicago beaches that we have, there's about 150 water quality exceedances. What that means is that the bacteria, the E. coli that's in the water, hit a level that research has shown that people can get sick if they go swimming when the water's at that level. To put that 150 number into context, there's about 2,000 beach days each year. When you take the number of beaches, you multiply that by the number of swimming days that are out there, there's around 2,000. So it's really a handful of times that this happens. But when it does happen, it's really important to get notifications to the public, accurate notifications, so that people can make a decision about whether they want to swim. The beaches don't close. Normally, when this happens, there's an advisory that's sent out. If a person has a weak immune system or is a child or is elderly, they can make that decision whether to go into the water.
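The arithmetic behind "a handful of times" can be checked directly. This small Python sketch uses only the approximate figures quoted above (27 beaches, about 150 exceedances, about 2,000 beach days). The variable names are hypothetical, and the derived values inherit the roughness of those inputs.

```python
beaches = 27
exceedances_per_year = 150   # approximate, as quoted in the talk
beach_days_per_year = 2000   # approximate, as quoted in the talk

# Share of beach days with an exceedance: 150 / 2000 = 0.075
exceedance_rate = exceedances_per_year / beach_days_per_year

# Implied swimming days per beach: 2000 / 27, roughly 74
swimming_days_per_beach = beach_days_per_year / beaches

print(f"{exceedance_rate:.1%} of beach days")
print(f"about {swimming_days_per_beach:.0f} swimming days per beach")
```

An event that occurs on well under a tenth of beach days is a rare-event prediction problem, which helps explain why plain accuracy would be a misleading yardstick for any model here.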
Now, finally, the state of beach technology for actually checking the water quality, it's like this. There's these traditional culture tests, where they actually grow the bacteria in a Petri dish and come back after about 18 hours and check how many bacteria grew. These are slow. These are slow tests. And with the rate that E. coli count changes during a day, once you get those results, it no longer reflects what's going on in the water at that time.
Because of the 18-hour lag time, models that have been built with those test results have just never really accurately notified people. I know it can be disconcerting. But for the prior years, if you ever went to the beach and you saw an advisory and it said, “hey, there's a problem here,” it was telling you about yesterday. And there might not have been a problem at the beach that day.
Another brand-new way for testing for E.coli on beaches is rapid DNA tests. Now, these have just been researched and developed over the last few years. This year, this past summer was the first year in Chicago where they were used at each beach. And a lot of municipalities over the world are looking at what Chicago is doing with rapid testing to be able to potentially take it to their communities and use it for beach monitoring.
But the one main weakness of these rapid DNA testing methods is that they're quite expensive. It's mostly the machinery that it takes. These samples are picked up along the lake each morning, driven to UIC, where they're put into the machines to do the tests. And while that's happening over the summer, that's taking up all the capacity in the regional area to even do these kinds of tests. If somebody wanted to say, “hey, I want to do one of these tests as well at 11:00 AM,” on any day during the summer, well you got to wait. So, that's the kind of supply and demand that's out there right now.
All this motivates using predictive modeling to be able to get a cost-effective, accurate read-out on what's the water quality at any beach at any given time. Predictive models have the capacity to prevent illnesses and, bottom line, save governments millions of dollars while notifying the public of whether or not they should go into the water.
Now, this project that I'm going to tell you about, it's really interesting how it came to be. It actually came from Chi Hack Night, where a group of people noticed that there was beach water quality online on the data portal and thought that maybe they can make a predictive model, and approached the city. The city of Chicago worked with these developers, worked with these data scientists at Chi Hack Night-- I've actually seen some in the crowd here tonight-- and on a volunteer basis in order to develop the model and also worked with students from DePaul University in order to develop data visualizations and do some model refining. I was actually one of the volunteers. I did not work for the city of Chicago at the time that this was going on at Chi Hack Night. So it was really cool. I got to be a volunteer at Chi Hack Night, work on my data science skills, and then afterwards end up working at the city of Chicago as a data scientist. It's been an awesome experience.
I'll tell you a little bit about the model now. The originally developed model used water sensors. That's what's on the top left. There is a water sensor in the water that was just reading out things like water cloudiness, wave height, temperature of the water on an ongoing basis and then sending it to the city of Chicago's data portal. The team used weather sensor data. It also used the results of the E. coli tests from prior days, from the day before, from the week before in order to power the model.
Then there were a lot of one-off data sets that were interesting. Like when the locks opened, when the sewage water was put into Lake Michigan, which, in case you don't know, actually happens a few times a year, usually when there's huge rainstorms and the whole city of Chicago area is flooded. They'll open up those floodgates and let it out into the lake. In those instances, the beaches are closed immediately until that dissipates. But the thought was there might be some effects in the following days that might lead to a better model.
Then down on the bottom here, what you're looking at is the E. coli level for a single beach during a year. I think it shows you how rare of an event a bad water quality day might be. This particular beach in 2015 only had one single day and there's really not a lot of warning. It comes out of nowhere. It goes away right away. These rare, anomalous events can be very difficult to predict, and that's what the team noticed right away.
All the modeling that the volunteers put all their hours into, the conclusion at the end was that no matter how much environmental data we might be able to get our hands on, we can't seem to pin down what causes E.coli and so what in nature you can use to predict E.coli. Accuracy rates in the models just never got over a certain threshold. It was a frustrating experience. But what that work did and what all those individuals' contributions to the group did was get a discussion going around what other ways we might be able to look at the problem.
We ended up developing a new way to model beach water quality, which shows some great promise so far. The way that it works is, instead of using these environmental variables to try to figure out what's going on in the water versus what's going on in the air, the idea is this. Pay for a few of those expensive tests today and then predict what's going on at other beaches in the area today with those tests. It really becomes more of a missing value problem. You've got some beaches that you're testing. Then you've got some beaches that you're inferring. And there are regional effects in Lake Michigan where some beaches do tend to move with other beaches. And so you can predict one beach's E.coli level versus another's. But it still is not-- it's not something you can use a linear model for. There's no clear correlation.
What we've done is we've used a k-means clustering algorithm in order to cluster the beaches into different groups and then use that information to decide our predictors for the model. What we do is we say, OK, these are the beaches that we're going to pay to test. Then these are the beaches that we're going to predict.
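For readers unfamiliar with k-means, here is a small, self-contained Python sketch of the clustering step. It is not the Clear Water code (the talk mentions an R script behind the project's app, and the details are in the project's paper). The beach names, the readings, the choice of two clusters, and the use of the first k points as starting centroids are all assumptions made so the toy example is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(points, k, max_iter=100):
    """Plain k-means. Starts from the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments, centroids


# Hypothetical log-scale E. coli readings for four beaches over four days
beach_readings = {
    "Beach A": [2.0, 2.1, 5.0, 2.2],
    "Beach B": [4.0, 4.2, 4.1, 4.3],
    "Beach C": [2.1, 2.0, 4.9, 2.3],
    "Beach D": [4.1, 4.0, 4.2, 4.4],
}
names = list(beach_readings)
assignments, _ = k_means([beach_readings[n] for n in names], k=2)

clusters = {}
for name, cluster_id in zip(names, assignments):
    clusters.setdefault(cluster_id, []).append(name)
print(clusters)  # {0: ['Beach A', 'Beach C'], 1: ['Beach B', 'Beach D']}
```

Beaches that land in the same group tend to move together, which is the property the talk relies on when testing some beaches and inferring the others.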
Now, for the finer details on that, I'll refer you to a paper that's forthcoming. There's a draft version on our GitHub page already. It really goes into the details of all the modeling. So, I won't go into that here.
What I do want to do is show a public website that we made that allows people to learn about the model, to learn about this project, but also to create their own model. Because one of the key structures in this model is the choice of beaches that you use, the ones that you choose to test. So, a person can go on here and pick-- they see a list of beaches. You can choose whatever beaches you want to say, OK, I'm going to test these ones out, build a model.
In the background in this Shiny app, there's an R script running that's going to build the model. Maybe I should have done a demo on this right before we got started. I think I just need to reload it. Let me choose a couple of beaches again here. The background is going to build and validate a model. You'll be able to see how your model did and put it up against the city of Chicago's model.
Let me just give it a few seconds here to get going. There's this dotted line showing the true-positive rate of the city of Chicago's model. Then that bar shows that the one that I just made really doesn't do-- it did a 15% true-positive rate versus the city's model here, which was at about a 38%. People can go and try to build their own, see if they can come up with a beach combination that might be useful.
You can also mess with the false-positive rate down here to see. I increased the false-positive rate quite a bit, which means that now the model is going to issue advisories when there's really no problem at the lake. But it gets a better true-positive rate because of that. When we went to evaluate the model, we decided to put it into production. Because, like Gene was saying, one of the hardest parts that we face is the operationalization of the model and getting it to actually work on real data in real time.
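The trade-off described above, where allowing more false alarms buys a higher true-positive rate, can be sketched in Python as follows. The scores and outcomes are invented for illustration, the function names are hypothetical, and this is not how the public app is implemented.

```python
def rates_at_threshold(scores, actual, threshold):
    """Return (true-positive rate, false-positive rate) at a threshold."""
    tp = fp = fn = tn = 0
    for score, is_bad_day in zip(scores, actual):
        flagged = score >= threshold
        if flagged and is_bad_day:
            tp += 1
        elif flagged and not is_bad_day:
            fp += 1
        elif is_bad_day:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def threshold_for_fpr(scores, actual, max_fpr):
    """Lowest threshold whose false-positive rate stays within max_fpr.

    Returns (threshold, tpr, fpr), or None if no threshold qualifies.
    """
    best = None
    # Lowering the threshold can only keep or raise the false-positive
    # rate, so we can stop at the first threshold that exceeds the limit.
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates_at_threshold(scores, actual, threshold)
        if fpr > max_fpr:
            break
        best = (threshold, tpr, fpr)
    return best


# Hypothetical predicted risk scores and whether the day was truly bad
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [True, False, True, False, False, True, False, False, False, False]

for max_fpr in (0.05, 0.15, 0.45):
    threshold, tpr, fpr = threshold_for_fpr(scores, actual, max_fpr)
    print(f"max FPR {max_fpr}: threshold {threshold}, TPR {tpr:.2f}")
```

On this toy data, loosening the false-positive limit from 0.05 to 0.45 moves the true-positive rate from one in three bad days caught to all of them, at the cost of more advisories on days with no problem.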
In 2017, this past summer, even though the city of Chicago was rapid testing every single beach, we created a model that selected a few beaches and then predicted other beaches so that we can see how it did. In doing this hybrid method, we saw an increase of about 60 different days that the public would have been notified with our process, whereas with the old model, the public would not have known about the problem. A lot of those do come from the fact that our process would actually do rapid tests. You get some wins that way. But the model does a better job itself, too.
The predictive model itself, stacked up against the prior predictive model, is about three times as accurate. I have a slide for that, three times the accuracy of the prior model. What we've done is, like I said, we've got a paper to publish the model. I've gone and talked with water quality experts, the people who actually are doing the science on a regular basis, to show them, to make sure that everything makes sense to them, and, hopefully, to get this into the hands of beach water quality monitors wherever they may need it. So that's the end of the Clear Water and the West Nile virus part of the presentation.
But we just wanted to tell you a little bit more about the city of Chicago and what we do. These projects actually fulfill-- they're part of fulfilling some core pledges that the city has called a tech plan. The mayor of the city issued a tech plan right at the beginning of when he took office. And it pledges to the city of Chicago, among other things, that we will work with civic technology innovators to develop creative solutions to city challenges. This project here did exactly that. Not only did we work together, but individual volunteers as a group ended up donating 1,000 hours of their time on this. It was such a great thing to see people on Tuesday nights. Now, everybody's got jobs. Everybody's got busy lives, sitting there together in a room here in this building just working on this, trying to figure out a way to make the city better for everybody. I've got proof, in case you didn't want to take my word. We looked at the history on GitHub and were able to put this together so that you can see when people were actually working. I don't know what's going on here on Saturday at 2:00 in the morning. But Tuesday nights, you can see a big uptick in the work that was being done. When you look at the whole history of this project, the Chi Hack Night volunteers really, really did it. Another pledge in the city's plan that both these projects meet is to leverage data and new technology to make government more efficient, effective, and open.
And finally, I'll mention something else about Gene's project, which was really cool. He put Open Grid, the city's situational awareness mapping platform, up, ran a demo for you. But his project actually enhanced that platform and created a new development for it. Some of the maps that were built out for the West Nile project actually got into Open Grid for not only Chicago residents to use, but any other municipality that uses Open Grid will now have these capabilities built in. So that's it for us. If you have any questions, I'm sure Gene can come up, and we can answer anything you might have.