Reflections on "Buy vs Build" for Data Science Tools

“Buy vs build”, “not-invented-here syndrome” and even “invented-here-syndrome” have been written about extensively. I want to share a few reflections on the topic, based on my observations both as an engineering manager (where I had to decide whether to build or buy solutions) and more recently as a founder selling a platform to other companies.

As a shorthand, I'll use the term “buy” to mean “using a third-party solution” (including something free or open-source).

It’s about your business, not your technology

I’ve seen countless blog posts that describe the pros and cons of building vs buying in terms of features and engineering effort required: What’s the quality of the third-party solution you’re looking at? Does it have all the features you’ll need? How much time will it take to learn and integrate another solution vs building your own?

In my opinion, this type of calculus misses the most important question: is the capability you’re trying to provide core to your company’s business and competitive advantage, or not? If you are delivering a capability that is core to your business or what differentiates you, you should have a strong bias to build, so you can control your own destiny. If you are providing a capability that is peripheral to, or merely supportive of, your core business, you should have a bias toward buying, so that you can focus your precious engineering resources on your differentiated capabilities.

This framing represents a subtle but important shift in mindset from how most people approach the buy-vs-build question. Consider an argument like “a third-party product has all the features we need, so we’d be wasting time building it ourselves” — if those features are key to your competitive advantage, you may want to build your own so you can evolve them precisely how you need to. Conversely, if a product has only 80% of what you want, but it’s serving a function that won’t make or break your success as a business, it may be better to take the 80% so you can focus on your must-do functionality.

As an example, a hedge fund might build its own time series database, even though there are plenty of powerful database solutions, including many with great time series support. This makes sense because representation and manipulation of time series are critical to the firm’s competitive advantage. On the other hand, that same firm would be crazy to manage its own web hosting infrastructure. Conversely, a high-volume consumer web app might need to build and manage its own hosting infrastructure, but might use an off-the-shelf database for storing metrics (as time series). The buy-vs-build choice should be made with a deep, thoughtful understanding of what’s most important to your business.

A pitfall with this line of thinking is that it's easy to conflate two separate capabilities, instead of recognizing that one is core to your business and another isn't. For example, you may have predictive models that are critical to your competitive advantage — but a grid infrastructure to compute those models is probably a generic problem that many companies have.
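
The core-vs-peripheral framing, including the decomposition step, can be written down as a rough sketch. Everything here is illustrative: the `Capability` fields, the `ACCEPTABLE_COVERAGE` cutoff, and the labels returned by `default_bias` are assumptions made up for this example, not a formula. The output is a starting bias for a conversation, not a decision.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; the article does not prescribe a number.
ACCEPTABLE_COVERAGE = 0.8


@dataclass
class Capability:
    name: str
    is_core: bool           # does this differentiate the business?
    vendor_coverage: float  # share of desired features a third-party option covers (0.0 to 1.0)


def default_bias(cap: Capability) -> str:
    """Return a starting bias for discussion, not a final decision."""
    if cap.is_core:
        # Core capability: lean toward building, even when a vendor covers everything.
        return "lean build"
    if cap.vendor_coverage >= ACCEPTABLE_COVERAGE:
        # Peripheral capability with acceptable coverage: take it and move on.
        return "lean buy"
    # Peripheral, but the gaps are large enough to deserve a closer look.
    return "review gaps"


# Decompose first: the models and the grid that computes them are separate capabilities.
capabilities = [
    Capability("predictive models", is_core=True, vendor_coverage=1.0),
    Capability("compute grid for training", is_core=False, vendor_coverage=0.8),
]
for cap in capabilities:
    print(f"{cap.name}: {default_bias(cap)}")
```

Note that feature coverage is only consulted after the core question has been answered, which mirrors the order of the argument above.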

Most engineers are inherently biased

Most engineers are engineers because they like to build things. So if you ask an engineer whether it will be better to build or buy, in most cases, her default posture will be to build. It's not bad that engineers like building things — that's great, in fact — but it does mean you need to keep that in mind when asking an engineer's opinion.

Here’s an actual email I received from a data scientist who had seen Domino and thought it would help his team:

Hey Nick,
Sorry for the delay in response. I have chatted with a few of the developers here and they have decided to tailor build our platform in house. I am not entirely sure why, but this has been a common trend with [company]. Thank you for your time though.

Or another email:

Hi Nick,
Congrats! Seems like an awesome product, my team was definitely impressed. I don't think it's a fit for us as we prefer to homegrow and err on the side of needing control but would love to let you know if that changes.

Unfortunately, this bias often extends to technology management positions as well. I know one engineer who was building his own message queue because, according to him, “if I keep suggesting existing solutions instead of building my own, I’ll lose credibility with my CTO.”

When technologists are divorced from an understanding of the business, the effect of their natural inclinations becomes amplified. If your engineers believe their raison d'être is to build things, rather than something like “to optimally use technology to achieve our key strategic business goals,” then your engineering organization, and your business as a whole, will suffer.

"We build everything ourselves"

I’ve heard dozens of engineers and technology managers say that they build all their technology in house. But of course, these companies have not built their own operating systems, programming languages, database systems, email clients, text editors, etc. And I suspect they would never consider building an in-house solution for any of those capabilities.

I think this bit of cognitive dissonance highlights the challenge of applying my earlier advice (considering your core/differentiated capabilities) in practice. Specifically, I suspect many companies have inflated views of what is actually core to their business and, at the same time, underestimate the cost of building their own solutions.

Questions to consider

None of my observations above are meant to suggest that buy-vs-build is always a simple decision. A variety of factors — often in conflict with each other — are involved. A good technology manager — with a visceral understanding of the relevant business considerations — will be able to cut to the heart of the issue, identifying which of these questions are most relevant in any particular circumstance.

- Most important, enumerate the capabilities you need from a solution and identify which of them are core to your business, i.e., which of them you need to control in order to ensure you stay differentiated compared to your competitors. Think critically about decomposing those capabilities to identify which are specific and which are generic. For example, your proprietary software may be a differentiator, but that doesn’t mean you need a proprietary programming language; your predictive models may be a differentiator, but that doesn’t mean you need your own grid solution to train your models.
- Consider the opportunity cost of building your own solution: what could your engineers be doing instead?
- Visualize ongoing maintenance (fixes and improvements):
  - If you’re building a solution, remember that 40-80% of the “cost” of building a solution comes from the ongoing support and maintenance. Can you commit to that, or do you want a third party to be doing that?
  - Conversely, if you’re buying a solution, will your requirements for improvements be aligned with the interests of the vendor or provider of the solution? Will you benefit from updates, or will the solution evolve in a direction at odds with your goals?
- Of course, consider the functionality, matched against your requirements. Will a third-party solution give you 50, 80, or 100% of what you want? And if there are gaps, are they must-haves or nice-to-haves? (And again, your tolerance for gaps should depend on how central the functionality is to what differentiates you as a business.)
  - Of the functionality you’re getting, how complicated is it? Would it be easy to build and maintain yourself, or actually quite subtle and involved?
  - Consider the cost of integrating a third-party solution. Even if it has all the functionality you need, how much work will it be to integrate with your existing environment?
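
To make the maintenance point concrete, here is a back-of-the-envelope sketch. It assumes the 40-80% figure is read as maintenance's share of the total lifetime cost, which is one interpretation rather than something the figure itself guarantees. The function name `lifetime_cost` and the 10 engineer-week starting estimate are made up for illustration. If maintenance is a share m of the total T and the initial build costs B, then B = (1 - m) * T, so T = B / (1 - m).

```python
def lifetime_cost(initial_build_cost: float, maintenance_share: float) -> float:
    """Total cost implied by an initial build cost, when maintenance makes up
    `maintenance_share` of that total (assumed interpretation of the 40-80% figure)."""
    if not 0.0 <= maintenance_share < 1.0:
        raise ValueError("maintenance_share must be in [0, 1)")
    return initial_build_cost / (1.0 - maintenance_share)


initial = 10.0  # hypothetical initial build estimate, in engineer-weeks
for share in (0.4, 0.8):
    total = lifetime_cost(initial, share)
    print(f"maintenance share {share:.0%}: total of about {total:.1f} engineer-weeks")
```

Under this reading, a 10-week build implies roughly 17 to 50 engineer-weeks over the life of the solution, which is the number to weigh against a vendor's price and your opportunity cost.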

Our own examples

Here are some concrete examples of decisions we’ve made, when building Domino, about what to buy and what to build.

- Buy: We used third-party tools (Mixpanel) for user analytics. This is a generic problem, not at all specific to us or central to our competitive advantage. It was easy to integrate, and though it didn’t do everything we ideally wanted, the gaps weren’t a big deal.
- Build: We built our own job scheduler and “auto-scaler” that, together, spin EC2 machines up and down and assign users’ jobs to different machines in our cluster. This strikes some people as a crazy decision, since AWS has auto-scaling features, after all. But we found that:
  - This functionality is core to what our product does, so we wanted fine-grained control over how we evolved it.
  - The basic set of functionality we required wasn’t that complex to implement. (We had a first version, deployed and working, in about 4 weeks.)
  - Integrating a third-party job scheduler would have been awkward, because we would have needed to map our domain objects and vocabulary onto a more generic domain. The specificity of our implementation, tailored precisely to our domain, has made us more efficient working in the codebase.
- Hybrid: For our revisioned file store (which supports large files), we took a hybrid approach. We use git under the hood, but we built support for large files on top of that. We wanted to use as much as we could off the shelf, because version control is a notoriously complex problem, and we wanted to benefit from all the engineering that has gone into solving it well. But git couldn’t handle large files.

We ended up building something very similar to GitHub’s Large File Storage, but we did this almost two years ago, before GitHub’s solution existed. (If we were doing it now, we’d probably use their solution.) We looked at git-annex and a couple of similar solutions, but none of them seemed to give us the control we wanted, and/or the integration complexity would have been awkward. So we opted to mostly use git, with our own custom development on top, carved out to be as minimal as possible.
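
For readers unfamiliar with the general idea behind this kind of hybrid, here is a minimal sketch of the pointer-file technique: the large file's bytes go into a separate content-addressed store, and only a small text pointer would be committed to git. This is not our actual implementation, and it does not reproduce any real tool's pointer format. The function names, the pointer layout, and the local `blob_dir` store are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def store_large_file(src: Path, blob_dir: Path) -> str:
    """Copy `src` into a content-addressed store and return the pointer text
    that would be committed to git in place of the file itself."""
    digest = hashlib.sha256()
    with src.open("rb") as f:
        # Hash in 1 MiB chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    oid = digest.hexdigest()

    blob_dir.mkdir(parents=True, exist_ok=True)
    blob_path = blob_dir / oid
    if not blob_path.exists():  # identical content is stored only once
        shutil.copyfile(src, blob_path)

    # Hypothetical pointer layout: one "key value" pair per line.
    return f"oid sha256:{oid}\nsize {src.stat().st_size}\n"


def resolve_pointer(pointer_text: str, blob_dir: Path) -> bytes:
    """Read back the original bytes that a pointer refers to."""
    fields = dict(line.split(" ", 1) for line in pointer_text.splitlines() if line)
    algorithm, oid = fields["oid"].split(":", 1)
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm}")
    return (blob_dir / oid).read_bytes()
```

The appeal of this split is that git keeps doing what it is good at (history, branching, merging of small text files), while the custom part stays small: hashing, copying, and resolving pointers.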