
Data Quality Analytics

Domino | 2018-01-03 | 17 min read


Scott Murdoch, PhD, Director of Data Science at HealthJoy, presents how data scientists can use distribution and modeling techniques to understand the pitfalls in their data and avoid making decisions based on dirty data.

Session Summary

At a recent Data Science PopUp in Chicago, Scott Murdoch discussed how data scientists can understand their data, implement data quality analytics, and then move toward partnering with IT. Murdoch advocates that “IT and data science should actually be very well integrated… that's one thing I've learned over my tenure.”

Session highlights

  • definition of data quality analytics and the cost of dirty data
  • identifying the key fields (unique key, crucial fields, etc.) needed within the data to address the problem being solved
  • coverage of OLS regression, k-means clustering, and neural network modeling techniques as advanced methods for tracking quality
  • practical tips, including keeping the coefficients (the seeds and the weights) for future use when creating a forecast
  • integration with IT to create marginal error ranges for benchmark metrics (percent zero, frequency distributions, etc.), build scripts, and automate these procedures

Transcript of Presentation

All right, first of all, I'd like to thank the team for having me here today. I'm very excited to talk to you. I'm going to talk to you about data quality analytics, which I'm sure some of you will glaze your eyes over, because who wants to talk about data quality. But if anybody does, it's this group, right?

While it's maybe not the most fascinating topic to some people, it is a very important one. What is data quality analytics? Why do we need it? What's the cost of dirty data? We're going to look at how you can understand your data, implement data quality analytics, and then integrate with IT, which is a very important piece of it. Dirty data can cost the company a lot of things. It can cost productivity. Why? I probably don't need to tell you this, but I'm up here, so I will.

If you get something wrong, you're doing an analysis, and all of a sudden you find something in your data: a missing field, a portion of the data missing. And you did all this analysis, you showed your boss. It sucks, right? It's gotta go up the chain. You might have to change results.

If you're at a larger company, and I've worked for two, and you roll it out, it can affect your brand perception if you use that data to drive decisions, which is the reason you're doing it. And most importantly, it can cost revenue.

So, without further ado, what is data quality analytics? Basically, it is using distribution and modeling techniques to understand the pitfalls in your data. The cost is fairly minimal compared to the savings. The cost is your time, which, in a big corporation, would be what they call an FTE [Full-Time Equivalent], or your physical time. If you're doing something on your own, it's, pardon the language, I'm an economist, the opportunity cost of your time, which means the best thing you could do with your time if you weren't doing this. The savings is possibly saving yourself from embarrassment, damage to your reputation, and a percent of revenue.

So, why do we care again? When I found this number, it didn't shock me, but it is kind of weird to see it in person. Making decisions off dirty data, or data that's not properly coded, costs about $3 trillion per year in the US.

I come from the health care industry. For us, that's about $340 billion a year. It's a lot of money. When you look at these three steps, your eyes may glaze over a little bit, because they seem straightforward. But I'm going to caution not to think about it lightly.

When you start looking at your data, I want you to think about what problem you're trying to solve. I know everybody is like, well, of course. That's why we're doing it, right?

But I want you to think about the fields that you're using for it. Then I want you to think about what type of data you have. And more specifically, I want you to think about which data you don't have. All right? And then, what do you really know about your data? What I mean by that is, what is available in your data? Where did the data come from? What's the source? How many people have touched it before you? Did it come from IT? Did it come from an online source? For people in startups, that may be a fairly easy question. For somebody that comes from a big corporation, you'd be surprised how many people touch the data before you even touch it.

As we're going through the next couple of slides, I want you to think about the last project you did. I want you to think about any predispositions you had about the data. What assumptions did you make about the data? Did you check it? Did you validate those assumptions? And if you didn't, I'm not blaming you. Sometimes it's a time thing. But if you did, were your predispositions right?

So, just by a show of hands, how many people in this room, when they get a new data source, know what the unique key of their data is? Somewhat? I'm seeing some hands. It's an IT thing that a lot of data scientists don't always look into, at least from what I've seen. The unique key is what gives you a unique row. I'll talk about health care data, but you can apply it to anything.

One of the things I'm very familiar with is claims data. If you've never seen claims data: you go to the hospital, right? For some reason, something unfortunate happens. You break a bone. I had that happen once. You have to go to the ER, right? They have all these different procedures, called CPT codes. When you go into the ER, one is an X-ray. One is the doctor's time. One is the nurse's time. One is the drug they give you, right? So, you have all these different rows.

Well, when I give claims data to somebody, they say, “OK, it's at the claim level.” And once again, what does that actually mean, right? It's the claims data, and people will just take it: OK, it's at the claim level. Generally, what makes a unique row is the member, the date of service, the claim number, because they can have multiple claims for one visit, and the claim line.

Now, this is something I've seen both in financial data and in health data. The claim line is, for anybody on the IT side of an organization, something you've probably seen before. It's basically like an update sequence. You'll have 10 lines of the same claim. They might even have the same what we call allowed amount, which is basically the amount that somebody has spent on this procedure. Nine of them will have the same amount, and then the 10th one might be different. What happened is, every time the claim comes in from the provider, they update it. But they don't delete the old records, right? So, you have to pick the last one. If you don't know that unique key, and you just assume it's at the claim level and do a sum on a column, you'll get about 10 times more than you should.
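As a concrete sketch of that update-sequence problem, here is what the fix looks like in pandas. The column names and values here are made up for illustration, not a real claims schema:

```python
import pandas as pd

# Toy claims data: the same claim arrives three times as the provider
# updates it; only the highest claim_line is current. Column names
# (member_id, service_date, claim_no, claim_line, allowed_amt) are
# illustrative, not a real claims feed.
claims = pd.DataFrame({
    "member_id":    [1, 1, 1, 2],
    "service_date": ["2017-06-01"] * 3 + ["2017-06-02"],
    "claim_no":     ["A100", "A100", "A100", "B200"],
    "claim_line":   [1, 2, 3, 1],
    "allowed_amt":  [50.0, 50.0, 55.0, 120.0],
})

# Naive sum counts every update sequence and overstates spend.
naive_total = claims["allowed_amt"].sum()

# Correct: keep only the latest claim_line per unique key
# (member + date of service + claim number).
latest = (claims.sort_values("claim_line")
                .groupby(["member_id", "service_date", "claim_no"],
                         as_index=False)
                .last())
true_total = latest["allowed_amt"].sum()
```

With three update lines on one claim, the naive sum comes out at $275 instead of the true $175; on a feed with ten update lines per claim, the overstatement is roughly tenfold, as described above.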

The next thing is to understand the critical fields. Going forward, in a few examples I'll show, I'm going to try to model how much somebody pays for a given procedure. That's what they call the allowed payment, or the allowed amount. It basically means how much the insurer is allowing the participant to pay for the procedure; the rest is covered by the individual. Then there's the provider, which is also identified by the NPI. Then there's the covered amount, which is how much is covered by the employee. And what they call the CPT code, which is the actual procedure.

That's what I talked about before: the X-ray, an individual doctor taking a look at you, the, what do they call it, Vicodin they give you for pain, that would be under a different one. Then the provider's specialty, which will be tied to the provider, and the member's ZIP code. Some other fields they'll have, for anybody in health care, are what they call PHI: member care ID, last name, first name. OK. So, whenever you're looking at data, and you're doing this systematically, you should think about a set of criteria against which you want to look at the data. I have a tendency to start with the top three here. I know a lot of people start with the last two.

By a show of hands, how many people create a distribution before they use the data? OK, quite a few. How many people check how much is missing? Even more people. Impressed. How many people check the percent zero? How many people check the proportion of values, so, the top 20 most frequent values? Fewer people, all right.
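Those checks are easy to fold into one profiling pass. A minimal sketch, assuming pandas; the function name, thresholds, and sample values are illustrative:

```python
import pandas as pd

def profile_column(s: pd.Series, top_n: int = 3) -> dict:
    """Basic data-quality profile for one field: percent missing,
    percent zero, min/max, and the share of rows taken by the most
    frequent values."""
    return {
        "pct_missing": s.isna().mean(),
        "pct_zero": (s == 0).mean(),
        "min": s.min(),
        "max": s.max(),
        # top_n most frequent values as a share of non-null rows
        "top_values": s.value_counts(normalize=True).head(top_n).to_dict(),
    }

# Made-up allowed amounts, with one suspiciously frequent value.
allowed = pd.Series([5.14, 5.14, 5.14, 0.0, 120.0, None,
                     87.5, 5.14, 42.0, 5.14])
report = profile_column(allowed)
```

Run over every critical field on each data load, a report like this becomes the benchmark you compare future loads against.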

Say you have around 300 gigs coming in a month. That's too much data to spot check, right? They assume the data is submitted correctly from the sources, but they have no way of verifying that. What they do is what they call their schema checks. Does it have the right column name? Does it have the right encoding? Is it a string versus a varchar, or something like that? But they don't actually go into the field and say, is this what we expect to see based on what we've done in the past? And that's what I'm trying to get across to you here today, all right?

Anyway, I was doing this analysis on that allowed amount. Imagine you have around half a billion claims, and I ran this frequency. One number stood out: somebody had charged $5.14 on 5% of those half a billion records. Does that seem reasonable to you? Think about all the providers all over the entire United States. One provider has charged the same amount on 5% of 500 million claims. It's a big number, right? It's a little too much for coincidence. Well, it was right. But I wouldn't have known that if I didn't look into it.

What it was is people going in for a drug test at a very large (I can't say which) drug testing facility all through the United States. So it was right, even though at first it looked wrong. Then I did find some things that were wrong with the data. It wasn't what we used to call fat fingering, a human error; it was encoded wrong by one of the places that submitted it. So that's why we do this.

For things you're doing on your own, those last two, the histogram and the max and the min, are very efficient, great ways to do it. If you're getting a very large data submission every month, or every day, or every week, and you want to build a process around it so you have some basis for benchmarking, I would do all of these and any others you can think of. Any signal you can get makes it that much better.

OK, so here are some of the advanced techniques for data quality that I'm sure every single person in this room has used in some form or fashion for modeling. But what I'm going to talk about here is using these techniques to check the quality of your data. All right? I'm going to go through two examples next, the regression and the clustering. Neural networks work similarly to the regression case.

How many people here have used neural networks? That's very good. I went through graduate school about eight years ago, and I could not get my advisors to let me use a neural network. I finally got them to let me use it. But nowadays, everybody uses it, it seems like. But it was such a hard sell back in the day. And that makes me sound old.

OK, so for regression, what I'm going to try to do is try to predict the allowed amount, which is how much somebody pays for a specific procedure, right? All the fields that I want to use are the ones up here. What kind of doctor is it? Where was the doctor? What type of procedure was it? How much was it covered by the individual? Then yes, that is tied to the allowed amount.

I want to build a model based off of this, and you get a set of coefficients, just like any other regression, right? You get a coefficient for every variable you have in there. Keep that in mind. I can do the same thing for k-means clustering. The difference here is, I use the same fields to keep it simple, but say you want to cluster on the actual provider, where you don't necessarily have a clear dependent variable you're trying to predict, but you want to group things. That's when you use this. The same thing applies whether you're using clustering, regression, neural networks, or any others you can think of.
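A minimal sketch of the regression step, using numpy's least squares on synthetic data (the fields, the true relationship, and the output filename are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the claims fields: covered amount and an
# encoded procedure code predicting the allowed amount. The true
# coefficients (1.5 and 20.0) are invented for this example.
n = 500
covered = rng.uniform(50, 500, n)
proc_code = rng.integers(0, 3, n).astype(float)
allowed = 1.5 * covered + 20.0 * proc_code + rng.normal(0, 5, n)

# OLS via least squares; the column of ones gives the intercept.
X = np.column_stack([np.ones(n), covered, proc_code])
coefs, *_ = np.linalg.lstsq(X, allowed, rcond=None)

# Keep the fitted coefficients: they become the benchmark that
# future data loads are scored against.
np.save("allowed_amt_coefs.npy", coefs)
```

The same pattern applies to clustering: persist the centroids (and the random seed) instead of the coefficients, so new data can be scored against the saved model rather than a fresh fit.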

Most important thing is keep the coefficients, the seeds, the weights. Basically, what you're going to be doing is you're going to be trying to set those coefficients, and basically creating somewhat of a forecast for the future. I'll show that in the next step.

Now we get to the part about integrating with IT. By a show of hands-- because, once again, I came from a big company-- how many people in the data science group have regular meetings, let's say, of at least once a month with their information technology organization? OK, so I see-- keep them up. Out of the 100 people here, I see about 10. That's normal, but it shouldn't be. IT and data science should actually be very well integrated. That's one thing I've learned over my tenure.

You can work with IT to take those benchmarks you found two slides ago, the percent zero, the frequency distributions, and write a Perl script or Python script, whatever you want, to automate this procedure. Basically, what you do is, just like anything else a lot of data scientists do, you look at a distribution. You can create a random sample of 1,000 records. Find the mean, median, and standard deviation, and take a cutoff wherever you're comfortable, say plus or minus 2 standard deviations. Take the allowed amount. It could be anywhere from $0 to, let's see, how much can a claim go for? The highest claim I think I've ever seen is around $100,000 for one visit. So, let's say $0 to $100,000, all right? But those are the tail ends. Most of them are probably somewhere between $100 and $1,000, maybe $1,500. So, take anywhere between those two ranges. When your new data comes in, you run the mean, you run the standard deviation, and you compare those. If it's outside the threshold, you set off an alarm, and that sends it to IT. All of a sudden, you have an automatic check of your data. It's one less step you have to do when you do your analysis.

You can do the same thing for modeling, which is what I call step two. It's the more advanced thing.
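The alarm logic itself is small. A sketch with numpy, assuming made-up historical and incoming allowed amounts; in practice the alert would notify IT rather than just return a flag:

```python
import numpy as np

def make_benchmark(values, n_sd=2.0):
    """Benchmark from historical data: the mean plus an alarm band of
    +/- n_sd standard deviations, as described above."""
    values = np.asarray(values, dtype=float)
    mu, sd = values.mean(), values.std()
    return {"mean": mu, "low": mu - n_sd * sd, "high": mu + n_sd * sd}

def check_load(new_values, benchmark):
    """True if the new load's mean falls inside the band; False would
    trigger the alarm to IT."""
    m = float(np.mean(new_values))
    return bool(benchmark["low"] <= m <= benchmark["high"])

# Synthetic history of allowed amounts, and two incoming loads.
history = np.random.default_rng(1).normal(400, 150, 1000)
bench = make_benchmark(history)

ok_load = np.random.default_rng(2).normal(410, 150, 200)    # looks like history
bad_load = np.random.default_rng(3).normal(4000, 150, 200)  # decimal slipped upstream
```

Here the bad load's mean lands far outside the band, so the alarm fires, while the consistent load passes silently.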

Basically, what you do is you did that modeling, you saved the coefficients, and you run the new data as, like, an out-of-sample test. Then you see what your prediction rate is. Similarly, you can run a new model and see what the delta is between the coefficients. They're basically getting at the same thing: you want to see how stable the relationship is, say between the allowed amount I showed and all that provider information, for the new data versus the old data. And that's a lot of what I've got. I can give more examples. But are there any questions that people have?
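The coefficient-delta comparison can be sketched in a few lines of numpy; the coefficient values and the 25% tolerance below are invented for illustration:

```python
import numpy as np

def coefficient_drift(old_coefs, new_coefs, tol=0.25):
    """Relative delta between the saved coefficients and those refit on
    the new load. A large drift means the relationship in the data has
    changed, or the data is bad. Returns the deltas and a pass flag."""
    old = np.asarray(old_coefs, dtype=float)
    new = np.asarray(new_coefs, dtype=float)
    delta = np.abs(new - old) / np.maximum(np.abs(old), 1e-9)
    return delta, bool(np.all(delta <= tol))

# Coefficients saved from last month's fit vs. refits on new loads.
old = [1.5, 20.0]
stable = [1.45, 21.0]   # small drift: data looks consistent
broken = [0.15, 20.0]   # first relationship off by 10x in the new feed

_, stable_ok = coefficient_drift(old, stable)
_, broken_ok = coefficient_drift(old, broken)
```

The stable refit passes; the broken one, where one coefficient has collapsed by an order of magnitude, would set off the same kind of alarm as the distribution check.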

All right, we got a few. And if you need me to explain anything again, I went a little fast because I thought I was actually later on time, so.

