Collaboration Between Data Science and Data Engineering: True or False?
Domino | 2018-11-19 | 32 min read
This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript. Additional content on this topic will be forthcoming from additional industry experts.
Introduction
Over the past five years, we have heard many stories from data science teams about their successes and challenges when building, deploying, and monitoring models. Unfortunately, we have also heard that many companies have internalized the model myth, or the misconception that data science should be treated like software development or data assets. This misconception is completely understandable. Data science involves code and data. Yet, people leverage data science to discover answers to previously unsolvable questions. As a result, data science work is more experimental, iterative, and exploratory than software development. Data science work involves computationally intensive algorithms that benefit from scalable compute and sometimes requires specialized hardware like GPUs. Data science work also requires data, a lot more data than typical software products require. All of these needs (and more) highlight how data science work differs from software development. These needs also highlight the vital importance of collaboration between data science and engineering, particularly for innovative model-driven companies seeking to maintain or grow their competitive advantage.
Yet, collaboration between data science and engineering is a known challenge.
As “Seek Truth, Speak Truth” is one of Domino’s core values, Domino’s Head of Content sat down with Don Miner and Marshall Presser to have a respectful and candid conversation about differing priorities, known collaboration challenges, and potential ways to address these challenges. Both Miner and Presser have extensive practical experience within data science and engineering. Miner, a founding partner of a data science and AI firm, represents the data science perspective. Presser, who is on the data engineering team at Pivotal, represents the data engineering perspective. This blog post covers distilled highlights, key excerpted quotes, audio clips, as well as a full playback and transcript from the conversation. Additional content on this topic from other industry experts is forthcoming. The purpose of this blog post and future content is to contribute to the public dialogue around the collaboration challenge, a dialogue that has lacked in-depth analytical discourse from multiple perspectives.
Data Science vs Data Engineering: How did we even get here?
The candid discussion kicked off with an examination of the current state and how we, within data science, arrived at it. There is a seemingly endless array of terms to describe people who interact with models. Just a few terms currently in use include researchers, data scientists, machine learning researchers, machine learning engineers, data engineers, infrastructure engineers, DataOps, and DevOps. Both Miner and Presser agreed that the work itself existed before any term was assigned to it. Presser defines data engineering as embodying the skills to obtain data, build data stores, manage data flows including ETL, and provide the data to data scientists for analysis. Presser also indicated that data engineers at large enterprise organizations have to be well versed in “cajoling” data from departments that may not, at first glance, provide it. Miner agreed and indicated that there is more thought leadership around the definition of data science than around data engineering, which contributes to the ambiguity within the market.
Marshall Presser: “We started with data engineers before we had data scientists, I think. And data engineers did such things as build data warehouses from which people did kind of rudimentary business intelligence, and slice and dice, and kind of analysis of past state, and what the business, whatever that business is, looked like yesterday with some kind of minimum analysis of a future state. And then people came to the bright conclusion that we could actually do more with data than just report on the past. And so from my perspective, data scientists kind of entered into modern analytic thinking, if you like, 10, 15 years ago I don't think I even heard the term data scientist. I can't remember when I first heard it, but it was a while back. I'm not even sure I heard the word data engineer back then. We evolved these two different specialties and, Don, I'm going to pass to you in a second, but my sense of what the difference is, is that data engineers have the job of acquiring data from various sources. Massaging it, getting it into a place where then data scientists can do interesting machine learning with it. So, that's what I think the current state looks like.”
Don Miner: “I agree with pretty much everything that Marshall said…just because we coin these terms, data engineer, data scientist onto these things doesn't mean it didn't exist before. [These terms reached] a critical mass at a certain point where people were, "You know, we should probably call that something." There's enough data scientists running around, "Oh, you know what? That should have a name." Or there's enough data engineers running around now that, that should have a name…. people spend a lot more time defining what data science means and not so much defining what data engineering means. All the way from university curriculums… I've never heard of anybody having a data engineering undergrad class, but you're starting to hear data science classes pop up. … I have some ideas about why that is, but I think where we're at right now is data science is a pretty fairly well defined career path and profession. People generally know what that means.…there's a lot of impact from hype still that's starting to wear down a little bit. But on the data engineering side….has really been left alone from the typical opinionated people that would be helping define these things, talking about it at conferences. That has still left a lot of ambiguity in the market there. So, I think that's where we're at right now.”
How do these differences translate in real life? e.g., recruiting Data Scientists and Data Engineers?
As both Miner’s and Presser’s perspectives are grounded in practical experience, the discussion turned to how the differences between data scientists and data engineers translate into which measures and skills are prioritized in hiring and recruiting. Miner relayed that when he recruits data scientists, he looks for technical ability (e.g., machine learning) as well as potential domain expertise. By contrast, when he recruits data engineers, he often looks for software engineers who happen to have database experience, signals of technical versatility (e.g., working with Kafka), as well as a “certain type of attitude”:
“This is different for different organizations but… data engineers need to be really versatile, they need to have the ability to work in lots of different kinds of roles. They need to be able to write software, they need to be able to work with databases, they need to be able to do DBA things, they need to care about security, they need to care about networking. It's a very interdisciplinary role, and so really my number one facet when I'm looking for a good data engineer is flexibility and versatility in their technical skills. And also, like you kind of mentioned, you need to have a certain type of attitude in order to succeed in working in the bowels of a data organization. They need to be very resilient in dealing with frustrating issues. Meanwhile, a data scientist, typically I'm looking for technical skills like machine learning experience. A specific skillset that I'm looking for, maybe certain domains that they've worked in, in the past. So, actually I would say right now for data scientists usually I'm looking for specific technical abilities. With a data engineer it's more about attitude and versatility than it is about their specific technical skills.”
Both Presser and Miner agreed that the function of data engineering is important, particularly the navigation skills needed to obtain data. Miner, in particular, noted:
“in our consulting engagements, and also two other data science consulting companies that I know and work with, if we have a pure play data science project, meaning that the data engineering's not in scope, the customer said that they were going to take care of it, we won’t start work until we have proof that the data's been loaded. We've been burned so many times by them saying like, "Oh, you know what? You guys can start on Monday. We'll get the data loaded sometimes next week." We're not even going to start until that data's there….that's the other issue too with the data engineer. I actually ran into this issue….on the younger side of the data engineers, one of the issues that we run into is that they don't have the seniority to stand up to some ancient Oracle DBA that's not willing to play nice. …it's a really hard role to fill because, you're right,… the interpersonal skills, and the political navigation skills are really important for the data engineer.”
Current state of collaboration: candid insights
After exploring the differences in skills, technical abilities, and workflow priorities, the conversation moved toward very candid insights about collaboration between data science and data engineering. The challenges raised during the conversation include communication in general, a lack of two-way respect, a potential lack of good project management, and the expectation that data science workflow should resemble software development workflow. When asked “What is the current state of collaboration, given that aspects are emerging and may differ depending on the organization?”, Miner indicated:
“I have two answers to this. One is that I don't think that data scientists and data engineers at most organizations that I'm working with have figured out how to communicate with anybody. So, not even with each other, but how does a data scientist and a data engineer fit into, a modern one, that's building some new systems, how are they interacting with different lines of business? How are they interacting with marketing, sales? How are they interacting with product design? ….even this at a fundamental level, there's major problems in the industry. And how they're interacting with each other? … it's hard to say because I can't really say that, at least in the past couple of years that I've had very many interactions with like, that guy's a data scientist, that guy's a data engineer. Our roles are clearly defined and they're communicating. So I guess I'm going to give a non answer and say that, I don't know, it's too early to tell…..[from] my perspective, I can say some things about different people playing different roles in different scenarios and how they're communicating.
But overall, I don't think the roles are very clearly defined yet to be able to really say how they're communicating…
In a couple of places where I have seen it be pretty functional, and you have had a functional data engineer that had responsibility for the data, and you have had the data scientist…In a lot of cases what's not being seen enough is respect in both directions…the data engineer is like, "This data scientist doesn't know what he's doing. He doesn't know how to work with data. The data scientist doesn't know how hard this data engineering stuff is." And on the same side, the data scientist is frustrated that the data engineer is not getting things done fast enough. Not getting it done in the format that they want it in…. the best data people I've worked with in both directions have had empathy for the other person's situation. The data engineer has intuition about what the data scientist is looking for, and what they need. And the data scientist has intuition about what's hard for the data engineer, and what's unreasonable for that person to do…. that's the best scenario that I've seen. The worst scenario, which is the one that I see typically, is the data engineers are just processing data and not being worried about things like duplicates, or like things encoded in the wrong way, and tables being laid out in ways that aren't appropriate for data science. And then the data scientists see this stuff and they're just like, "This is garbage. What are you doing? I'm just going to do it myself now." And they're going to run into a whole bunch of problems because they don’t know how to access the data and stuff. I think really what it comes down to is understanding each other's situations and understanding that they are both hard, and working through that.”
Presser added that aligning people at the beginning of a project is an important way to build empathy and address collaboration tensions, and that the situation Miner described
“is not the least bit uncommon, [it] is a symptom of really bad project management. It seems to me that the way to solve this problem is to have everybody in the room when the project is being designed ... It's sort of like life insurance. You know, you don't really need it until you need it, but you've got to keep having it, even when you don't need it. The projects that I've seen that have been most successful are the projects in which the data scientists, the data engineers, and… the application developers are all there in the room from the beginning, with the customer talking about what the problem is they want to solve, what a minimal product is, what the final solution should be, what the users expect out of this. And if you start from that place you're much more likely to get empathy. …That's the first thing.
The second thing is that, I find such difficulties that Don described don't exist, at least in many of the projects I've been working on, between the data scientists and the data engineers as much as between the data scientists and the data engineers and the applications developers. Because the application developers have, I don't want to say contempt for data, that's way too strong, but what I would say is they don't have as much experience and love of data that Don and I do.
To them, a database is a database, data is data, oil is oil. You know, it's all the same. They're not interested in thinking about, in general, the kinds of data collection and issues that they're going to need to solve the problem. They're sort of, "Let me come out with a minimal, viable application really quickly." And, by the way, I've actually heard a project manager say, "You know, any line of code that my developers write to audit what they're doing, to put stuff in a database, is a line of code that they're not putting in developing the application." And so they frequently encourage a huge technical debt as they've got this great application now, but when it comes time for phase two of the project, to do something interesting with the data that this application should have stored somewhere but didn't, we're kind of left holding the bag because the application developers were kind of short sighted. And to my mind this is the kind of short term thinking that hinders really good data science.”
Another potential point of tension is that organizations treat data science like software development. Miner noted:
“something that we advise our clients on all the time, and is a major portion that I think takes people by surprise sometimes, is that most organizations is that their default is to treat their data science projects like software engineering projects that they're currently running at the organization. So if they want their data scientists to be filling out Jira tickets and have Sprints. Not only the data scientists, but data engineering is not a similar task like that either. And the platform architecture too, is similar. They all share something in common. in data science, data engineering, and platform architecture, it's one of those things where you can spend forever on something and it won't be done. So, it's all about, "When do I feel like stopping?" Or, "When do I run out of money?" Rather than, "Okay, this application is done. I'll ship it, it's in a box. It's all good to go. We release it to the world and we sell it. It's great." On the data science side it's hard to tell how long something's going to take until you do it. So there's this chicken and egg problem. I can't write the Jira ticket it's going to take two weeks, until I actually spend the two weeks to do it, and realize it's actually going to take four weeks. And so when you try to apply these traditional software engineering project management things on these projects it doesn't work. It actually causes harm in a lot of cases….there's actually a new discipline that needs to arise.”
Addressing the collaboration challenges
Collaboration between data science and data engineering is a hard problem to solve. While there was consensus that the difficulty of the problem has contributed to a lack of extensive public discourse, Miner and Presser dove into aspects that have the potential to ease the tension points around collaboration. Earlier in the conversation, aspects that support collaboration arose naturally, including early stakeholder alignment as well as mutual respect and intuition regarding various responsibilities. When asked directly for potential ways to address the collaboration tension points that create barriers to developing and deploying models, they offered additional suggestions about corporate culture, collaboration tools, and a “data liaison”.
Presser noted that corporate culture contributes to collaboration:
“I think it's, in many ways, a corporate culture kind of thing. There are organizations that work well together, and there are others that don’t, and I work a lot in the federal government space where this project consists of people from various organizations that are not part of the federal government. Outsourced project management, outsourced database management, outsourced this, outsourced that. And there's a little fighting over fiefdom here, and a customer either can't do anything about it for contractual reasons, or chooses not to do anything about it. But that's the opposite of what Don was talking about in terms of empathy and respect, and it's driven by, in many ways, where the revenue dollars are coming from. So, I find that some organizations I like working with, some organizations I don't like working with because the corporate culture is not one of sharing, of empathy and respect. So choose your partners well.”
Miner agreed that corporate culture is a contributing factor:
“I think the best organizations that I've worked for have been the ones that fostered open communication, no competition within. Not very many egos…you can get away with it in a lot of other things, and data projects are not one of them. That's the issue, … an organization that has maybe found that [many egos] successful for other types of work that they've done, in this case it's not very successful…my answer to the question of what I've seen work well, I think one of the big ones to me is, everybody having a good energy, and knowing what the goals are. And I think that also ties into corporate culture as well. A corporate culture that has very clear goals, or a leader that has very clear goals, that's being very transparent about what those goals are, allows everybody to align themselves, their little micro interactions throughout the day, to be part of those goals. Also, goals in data science are often weird. Sometimes they're not straightforward.”
While Presser advocates prioritizing in-person collaboration to accelerate work and address collaboration tensions, Miner advocates for having a “data liaison” person as well as collaboration tools due to the nature of data science work:
“The other thing that I wanted to add onto what Marshall said about fundamental communication, because I do agree that too often not all the stakeholders and not all the people are going to be the different identifies are going to be involved early on in the discussions. This is actually where a lead data science liaison type role fits in a company where you don't necessarily need your data scientist, like at a large organization, being involved in every decision, but having a data science leader, that's a chief data officer, or chief data scientist, or whatever the title is, I don't think it's really nailed down, is involved in these scoping meetings. We've seen that be successful. Maybe another thing too on the communication standpoint, I'm actually going to provide a vote for real time remote collaboration tools in working…..I agree that in the beginning of the project it's really good to get everybody in the room due to the amount of communication that needs to happen. But also, too, email feels almost too slow for these projects. Data scientists are kind of trickling in on insights, and data engineers too, are running into different problems in an ad hoc way as they're actively working. So we use Slack a lot, I think a lot of people do right now and it's been pretty successful, because you don't have to bunch up a bunch of stuff to put into an email like, "Here's my list of problems today." Maybe you may have two data scientists talking about an issue and the data engineer is eavesdropping and saying, "Oh hey, by the way, this is how I designed it," or like, "Oh hey, yeah I can fix that for you real quick. Not going to take me much time at all." So this more real time communication is good, and I think also too, it's almost better than in a physical office in some cases too. Even if you're sitting at a desk, three desks away from the data engineer, you still have to get up and go bother that person. 
Here, I think I'm actually making the argument that I think Slack and other things like it, may actually be one of the best tools for this thing right now, as the project's going on.”
When asked to unpack the idea of a “data liaison” further, and whether this person could be a “project manager”, Miner indicated:
“…in a consulting construct, that both myself and Niels [co-founder] provides in some of our larger projects. And it's a really necessary role and some of the other customers we work with, we've made this recommendation for them to do this, it's actually two reasons. One is that, data science requires a lot of focus. When you're working on data science problem and you're fumbling with some machine learning thing, you're messing with the data, an interruption can break down a house of cards in your head that you've been building for multiple hours and if you're responsible for going around to random meetings to discuss use cases and things, you're never going to get anything done…what you need to do, is you need to kind of pick somebody. I mean honestly, these are some personality types that are better than others, but really it needs to be somebody that could do it if they had to, that understands the real problems, that can represent the data scientists that are actually going to do the work in these meetings. But due to the focus requirement you kind of need to pick somebody to be the sacrificial person to do it, that's okay going around and talking from experience so that the others can focus. It's a really important role… in a large organization with a large team.”
Reflections on the potential future state
After working through potential ways to ease collaboration tensions, the discussion moved to what the future state of collaboration could look like. The scenarios discussed include increasing specialization of roles as well as the need for a discipline or process to help manage collaboration.
Marshall Presser: “…from a future state perspective, I think specialization of roles is only going to increase. We're going to get people are purely data scientists, people who are purely application developers, people who are purely data engineers, people who are purely platform architects, people who are purely liaison, people who are purely project management that may tie in to the liaison role, and keeping these people coordinated and so that they can, one, speak a common language and they have sympathy and respect for one another, I think that's a challenge going forward. But once we solve that problem, it'll be great.”
Don Miner: “I think on Marshall's point, … the biggest problem here about this lack of process around management, around data engineering, the communication between data engineering and data science, this lack of management, if you want to specialize, you want to have a data liaison...do you want to have a data engineer specialist, because the earliest data science project, like the smallest one, data scientist is doing the data engineering work too. And probably the platform architecture work too, and the application development.
Once you start specializing, which is why we have data engineers and data scientists now, these two people need to have a process to communicate.
When you have an application developer, now they need a process to communicate and work together.
You have the platform architecture, you got management, you got the advisory liaison person, you got the rest of the business, all is about process and, honestly, I don't think anybody really knows what they're doing. I think the number one thing that's holding us back in this industry, is building large data science teams and organization. The most successful data science teams I see right now are like three people… it could be a massive organization, but those three people are getting a lot of work done, and if they wanted to scale up to 20 people, 40 people, it's not going to work. I actually have a specific anomaly that I saw the other day, where I'm hiring a new data scientist in Denver. Particularly wanted a senior data scientist in Denver, so I posted a job opening on LinkedIn for a Denver data scientist. I got something like 30 applications in a few days. 11 were from one company…. I ask some of my colleagues that are in Denver, saying "What's wrong with company X? I just got 11 applications from data scientists from this company." First of all I didn't even know they had a lot of data scientists, and they said ... because [they] are data scientists, and they [said] "Yeah, their job openings are all over the place. They hired a crazy number of ... hundreds of data scientists over the past two years.” … now obviously they're hemorrhaging, because they probably didn't actually think about how to communicate. I think that's where I would like to see the world go, is if we had better processes, just like we got through on the software engineering side, continuous integration and testing, good UX principles and things like that. We can build really scalable software teams now.
Data science isn't there yet…... the topic of the data engineer and data science thing though, is the tip of that spear.”
Managing data science: hard, but not impossible
Don Miner: “There's not really very many practitioners out there saying, "How do I manage a data science project well?”…. Somebody's going to have to talk about it at some point.”
Ann Spencer: “Why do you think that is? Why do you think that people aren't talking about it, or aren't addressing it?”
Marshall Presser: “Well, for one, it's hard.”
One of Domino’s core values is “Seek Truth, Speak Truth”. We leverage this core value in our content to support people tackling hard, and perhaps previously unsolvable, problems within data science. This blog post covered distilled insights, audio clips, and excerpted quotes from a candid discussion about tension points that arise when people collaborate around the development and deployment of models. If you are interested in more in-depth insights, consider listening to the audio recording, which runs over 45 minutes, or reading through the full transcript. Both are provided below. We also realize that there are additional situations, nuances, and textures regarding collaboration that were not covered in this blog post, and we are working with additional industry experts to amplify different perspectives. We will continue to provide content that covers collaboration between data science and engineering. If you are interested in contributing to this public discourse, contact us at writeforus(at)dominodatalab(dot)com.
Full audio recording
This section provides over 45 minutes of the discussion.