Advice for Aspiring Chief Data Scientists: The People You Need
By Nick Kolegraff | 2017-10-30 | 8 min read
Nick Kolegraff is the founder of Whiteboard, a strategic innovation company focused on machine learning and AI. Previously, Kolegraff was the Chief Data Scientist at Rackspace and a Principal Data Scientist at Accenture. As a part of Domino’s “Data Science Leaders at Work” guest blogger series, Kolegraff provides advice for data scientists and data science managers to consider when, or if, they decide to take a “chief data scientist” role. This advice includes insights on the mindset you need to have, the types of problems you need to solve, and the people you need to hire. There are three posts in total. This third post focuses on the people you need to hire.
The people you need
After you get a handle on the types of problems you're solving as a Chief Data Scientist, that insight will help you find and organize the people you need to solve them. My perspective in this post comes from an enterprise context, where headcount budgets are outside the individual team’s scope. The challenges and the investment behind initiatives are also different in the enterprise than in a direct-to-consumer startup. Each problem is vastly different and depends on many factors, but the types of data you are working with will determine the skills you should look for in the people who will actually solve it. I also assume that the people you need will have software engineering skills. These skills are a must when you are building products and autonomous systems. Breaking this down properly would take an entire book, so I’ve handpicked a few high-level examples that cover the majority of cases. The skills break down into four categories based on the types of data involved: time series, textual, operational, and recognition data.
The first is time series data. Time series data is data in which the time or sequence of events carries specific meaning for the dataset itself. You might want to build an API that takes a stock ticker as input, returns in real time whether an anomaly in its options pricing has occurred, and then pushes the ticker and the anomaly to a subscription feed. In addition to engineering skills, this type of talent will have strong skills in ARIMA modeling, stabilizing volatility, modeling periodicity, and moving averages, while understanding the nuances of each in a batch vs. streaming computation world.
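To make the moving-average idea concrete, here is a minimal sketch of an anomaly check that scores each new price against a rolling window. It is only an illustration under my own assumptions: the class name, the window size, and the z-score threshold are hypothetical, and a real options-pricing system would need far more care (volatility, periodicity, and so on). The `update` method is the streaming view, and `detect` is the same logic applied in batch.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags a value that deviates strongly from a moving average (illustrative only)."""

    def __init__(self, window=20, threshold=3.0):
        self.window = deque(maxlen=window)  # keeps only the most recent observations
        self.threshold = threshold          # assumed z-score cutoff; tune per series

    def update(self, value):
        """Streaming view: score the new value, then add it to the window."""
        is_anomaly = False
        if len(self.window) == self.window.maxlen:
            avg = mean(self.window)
            spread = pstdev(self.window)
            if spread > 0:  # a flat window gives no scale to compare against
                is_anomaly = abs(value - avg) / spread > self.threshold
        self.window.append(value)
        return is_anomaly


def detect(prices, window=20, threshold=3.0):
    """Batch view: return the indices of anomalous prices in a full sequence."""
    detector = RollingAnomalyDetector(window, threshold)
    return [i for i, price in enumerate(prices) if detector.update(price)]
```

Note that an anomalous value still enters the window after it is scored, so it inflates the spread for the next few observations. Whether to exclude flagged points is exactly the kind of nuance this talent should be able to reason about.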
Next is textual data. This category breaks down into far too many subcategories to write about here. Yet, we can cast a net and say anything that contains words is textual data: archives of web pages, chat logs, document repositories, plain textbooks, etc. You might want to build an API that takes a document as input (a PDF file), reads the text, and maps it to a category automatically for a search index. You might want to build an API that, given a sentence, identifies the nouns and verbs in it and then automatically fixes poor sentence structure. In addition to engineering skills, this type of talent will have strong skills in NLP (natural language processing), linguistics, semantic techniques, name and entity resolution, disambiguation techniques, and topic modeling approaches, while understanding the nuances of each in a batch vs. streaming computation world.
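As a rough illustration of mapping a document to a category, the sketch below compares word counts in a document against a short example text per category using cosine similarity. Everything here is an assumption for demonstration: the function names, the `EXAMPLES` categories, and their keyword lists are hypothetical, and the text is assumed to have already been extracted from the PDF. A production system would more likely use the topic modeling and NLP techniques named above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize(document, examples):
    """Return the category whose example text is most similar to the document."""
    doc_vec = Counter(tokenize(document))
    scores = {
        category: cosine(doc_vec, Counter(tokenize(text)))
        for category, text in examples.items()
    }
    return max(scores, key=scores.get)


# Hypothetical categories and example vocabulary, for illustration only.
EXAMPLES = {
    "finance": "invoice payment balance account interest loan",
    "legal": "contract clause liability agreement party court",
}
```

For example, `categorize("The loan payment is due on the account", EXAMPLES)` picks `"finance"` because only that category shares words with the document.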
Next is operational data. Operational data consists of data that enables a business to run and operate. This could be collections of purchase transactions, outcomes of marketing campaigns, sales and leads, supply chain, inventory, etc. You might want to look for new ways to cut costs and be more efficient in how you stock raw materials to meet demand for your product. Or you might want to look for cross-selling opportunities across different product portfolios that sales teams can use to develop leads. Talent in this area will generally have a strong background in operations research, economics, and statistics, as well as a diverse set of modeling techniques, typically in the parametric modeling category. They will also be comfortable getting at the data they need on a variety of platforms.
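A simple starting point for the cross-selling example is to count how often pairs of products show up in the same purchase transaction. The sketch below is a baseline under my own assumptions, not a parametric model: the function name, the `min_count` cutoff, and the product names in the usage note are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cross_sell_pairs(transactions, min_count=2):
    """Count product pairs bought together and keep those seen at least min_count times."""
    pair_counts = Counter()
    for basket in transactions:
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_count]
```

Given transactions such as `[["server", "backup"], ["server", "backup", "monitoring"], ["server", "monitoring"]]`, the pairs `("backup", "server")` and `("monitoring", "server")` each appear twice and would be returned as candidate leads for a sales team to review.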
Finally, we have data where recognition needs to happen. This could be a collection of audio files, an archive of images, a collection of videos, etc. You might want to build a system that crawls an archive of images and automatically organizes them into folders based on the image and ONLY the image, no metadata. Recognition problems are the basis of what AI is today, and neural network techniques are the favorite of most practitioners because they have proven to be robust, broadly applicable, and a good fit for a variety of problems. Talent in this area will have a strong understanding of the differences between neural network variants, such as convolutional neural networks and deep belief networks, and of how they are trained with backpropagation, while also understanding how each of these maps to different computing paradigms.
A collection of people without common goals and something to work towards will struggle to get into a consistent rhythm of delivering results, so how you give the team structure and create communication channels is not something to take lightly.
You might think a resource pool is a good approach initially. Everyone has a consistent manager and like-minded peers; they get placed inside other organizations for a project or product and then go back to home base when it is done. This is great if all departments are solving uniform types of problems, but that is often not the case. Each department has different demands, and each of those demands comes with a different set of expectations. Embedded teams (i.e., just let everyone hire whoever they want, with you acting as a consultant on who they hire) can also work well. However, there is no consistency across the board, creating standards to control quality becomes a nightmare, and the problems quickly start to snowball. Teams also struggle with who they want to hire because the objective isn’t well-defined beforehand. A more flexible model I suggest exploring is to think in terms of discover, establish, optimize, and maintain functions. This sets you up for rigor when tackling many different types of problems, gives you some flexibility in how you assemble teams depending on the objective you need to accomplish, and also lets you create consistency and control quality in what you build and deploy.
Your mindset will help pave a path for you, as a Chief Data Scientist, and for others in your organization to continuously improve their skills while developing a relentless, consistent drive toward unquestionable results and innovation that moves the needle forward. By defining the problems you solve (suggestion, recognition, decisive action) and their states (optimizing, establishing, or maintaining), you develop an execution model that gives clarity to your organization and to the people you need to carry out that plan. Your job is to move the needle forward. Your responsibility as a human is to maintain the livelihood of your technology, your community, and, most importantly, your people.
Nick Kolegraff is the founder of True Footage and Whiteboard, a strategic innovation company focused on machine learning and AI. Previously, Kolegraff was the Chief Data Scientist at Rackspace and a Principal Data Scientist at Accenture.