How Your Data Science Team Can Improve Knowledge Management—And Why It Matters

Domino Data Lab | 2018-11-27 | 7 min read


Data scientists often greet the topic of knowledge management with a sense of dread. Some see it as a time-sucking distraction from their “real” jobs; others don’t fully grasp what it means. Even many who see the concept’s value find the process painful.

But knowledge management capabilities will become a key source of competitive advantage for companies, according to Matthew Granade, chief market intelligence officer at Point72, and Mac Steele, director of product at Domino Data Lab. At the 2018 Rev summit, the pair laid out why knowledge management matters and how businesses should make it a priority.

What is Knowledge Management?

The goal of knowledge management, according to Granade, is to capture insight, which he defines as “better understanding.” Insight is thus relative—it’s about constantly improving upon previous ideas. From Einstein to Freud, insight is often seen as the purview of the “lone genius.” In reality, Granade argues, most insight comes from collaborating with others and expanding on existing ideas. “When I build data science organizations and team of quants, I’m trying to enable them to stand on the shoulders of everybody who’s come before them,” Granade says.

Creating that kind of “compounding machine” requires a way to capture knowledge, a framework for users to follow and mechanisms to improve through feedback. Increasingly, companies’ futures will be determined by how well they do this, Granade argues. With more algorithms and infrastructure widely available, the pool of data science talent growing and requirements to share data expanding, the ability to capture and augment unique insights will become a key differentiator.

Why is Knowledge Management so Hard?

Some knowledge management challenges plague every industry, Steele points out:

Organizing knowledge in advance is difficult. Classifications are often too rigid, since you don’t know what will matter in the future.

There are few incentives to participate. As one data scientist told Steele, “I get paid for what I build this year, not maintaining what I built last year.”

It’s a classic collective action problem. No one wants to be the first to spend time on documentation. And even when knowledge is captured, it can be hard to know how to act on it.

Systems always lag behind reality. If knowledge management takes extra time and is done in a different system from the core work, its quality will suffer.

Other obstacles are unique to data science teams:

People use different tools. Knowledge management is tougher when some team members work in R and others in Python, and when some store code on GitHub and others in email. Training people to use the same systems is difficult because of high turnover.

The components of a single project are scattered. Artifacts and insights may be spread across a Docker store, a wiki, a PowerPoint presentation, etc.

Having code doesn’t mean you can rerun it. A meta-analysis of 600 computational research papers found that only 20 percent of the code could be rerun, and even within that share, many reruns yielded slightly different results.

How Can You Improve Knowledge Management?

Granade and Steele identify four steps that can help data science leaders improve knowledge management in their organizations, and share practical tests to help you gauge how well you’re doing:

1. Capture as much knowledge as possible in one place.

“The more things are in there, the more connections you have across them, and the value grows that way,” Steele says. “You don’t want people operating on the fringes.” A common platform that encompasses both the core work and knowledge management is key to ensuring that it gets done while minimizing the burden. If you can’t capture everything, start with the most valuable model or knowledge, and build a system around that.

Test: Ask five data scientists in your company, separately, “How many projects do you think this team is doing right now?” They’ll probably have different answers.

2. Choose a knowledge management system that allows for:

Discovery: Data scientists spend much of their time searching for information, cutting into productivity. Teams have to decide whether to curate knowledge (the Yahoo approach) or index it (the Google approach). “Curation makes sense when the domain is relatively stable,” Steele says. Indexing and searching is best when the “domain is fluid, and I can’t possibly know beforehand what the taxonomy should look like.”

Test: Ask a new hire to work on a topic, and time how long it takes them to collect the right artifacts. If it’s weeks or months, that’s a red flag.
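To make the indexing approach concrete, here is a minimal sketch of an inverted index over project artifacts. It is only an illustration of the idea, not a tool the speakers described; the artifact names and descriptions are hypothetical, and a real system would index file contents and metadata rather than one-line summaries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(artifacts):
    """Map each token to the set of artifact names whose text contains it."""
    index = defaultdict(set)
    for name, text in artifacts.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted names of artifacts that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


# Hypothetical artifacts, for illustration only.
artifacts = {
    "churn_model.py": "Gradient boosting model for customer churn",
    "churn_notes.md": "Notes on churn features and data leakage",
    "pricing_deck.pptx": "Pricing experiment results for customer segments",
}
index = build_index(artifacts)
print(search(index, "customer churn"))
```

No taxonomy is defined up front here: any word that appears in an artifact becomes searchable, which is why indexing suits a domain that keeps changing.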

Provenance: Let people focus on the aspects of knowledge management that matter. Use a platform that lets people synthesize their work rather than track which software version they used.

Test: Write down beforehand what percentage of time you think your team members should spend on documentation. Then ask a few how long they actually spend. This could be eye-opening.
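One way to take version tracking off people’s plates is to record the environment automatically whenever work runs. The sketch below uses only the Python standard library; the function name `capture_environment` and the package list are assumptions made for illustration, and a platform would typically handle this behind the scenes.

```python
import platform
from datetime import datetime, timezone
from importlib import metadata


def capture_environment(packages):
    """Record the interpreter, operating system, and package versions for a run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical package list; substitute whatever the project depends on.
record = capture_environment(["numpy", "pandas"])
print(record)
```

Storing a record like this next to each result means nobody has to remember which versions were in use.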

Reuse: “If it won’t run, it won’t get reused,” Steele says. That requires access to not only code, but also historical versions of datasets.

Test: Ask a new hire to reproduce the work that another data scientist did six months ago, preferably one who has left the team or organization. Then ask the new hire to update it with the most recent data. If it takes a week or a month, that’s troubling.

Decompose and Modularize: Ensure that people have the incentives and tools to create building blocks that can be reused and built upon.

Test: Ask two teams that have worked on similar projects to do a post mortem and identify overlapping work.

3. Identify the right unit of knowledge.

Compounding systems rely on units of knowledge, Granade says. In academia, those units are books and papers; in software, they are code. “In data science, our view is that the model is the right thing to organize around, because it’s the thing data scientists make…it’s a fully operable unit,” he says. The model includes the data, code, parameters and results.
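As a rough sketch of what treating the model as the unit of knowledge could look like in code, the record below bundles the four components named above: data, code, parameters and results. The class name, field names and example values are all hypothetical placeholders, not a schema from the talk or from any product.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


def hash_bytes(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelRecord:
    """One unit of knowledge: the data, code, parameters and results of a model."""

    name: str
    data_sha256: str   # fingerprint of the exact dataset used
    code_version: str  # for example, a version-control commit identifier
    parameters: dict
    results: dict

    def fingerprint(self):
        """Hash the whole record so any change to a component is detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values, for illustration only.
record = ModelRecord(
    name="churn-v1",
    data_sha256=hash_bytes(b"id,churned\n1,0\n2,1\n"),
    code_version="abc123",
    parameters={"max_depth": 3},
    results={"auc": 0.5},
)
variant = replace(record, parameters={"max_depth": 4})
print(record.fingerprint() != variant.fingerprint())
```

Because the fingerprint covers every component, two records match only if the data, code, parameters and results all match, which gives later work a stable thing to cite and build on.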

4. Think beyond technology.

Changes at the people and process levels are also important. Granade recommends reframing how people see their jobs: They should spend less time doing and more time codifying and learning. He also suggests making collaboration a priority in hiring (by focusing on it during interviews) and compensation (by tracking and rewarding those who record their work and build modules others can use). Finally, while knowledge management should be seen as everyone’s job, Granade suggests some organizations create new roles for curating or facilitating knowledge.

