Eight Considerations When Choosing a Data Store for Data Science

Nikolay Manchev | 2022-07-27 | 18 min read

Selecting the right technology that enables data scientists to focus on data science and not IT infrastructure will help enterprises reap the benefits of their investments in data science and machine learning.

Here’s a paradox that has tripped up more than a few enterprises scaling their data science operations: By the time they realize they need to be proactive about assembling the right infrastructure to position data science at the core of as many business operations as possible, it’s already too late to make new infrastructure investments without incurring a bit of pain.

But whether your organization is just starting to scale its data science operations or you’re well on your way to putting data science at the core of as many business processes as possible, you need to think through the requirements for supporting a robust data science practice. For all businesses, this is going to be the foundation of your organization’s future survival.

For this discussion, we’re going to focus on the data store and the functionality that will support data science at scale across your enterprise. If you want to set your company on a course to success, you need to think about the optimal way to collect and organize data. And, you need to leverage all the data that will be useful to your MLOps program; you don’t want to work with a data set compromised by your data store’s inability to capture or manage certain data feeds.

Remember, in machine learning and artificial intelligence, the garbage-in/garbage-out (GIGO) effect is extremely strong. And you can’t have high-quality data with reliable access and usable performance if your storage platform is lacking.

So, eight factors to consider:

Scalability

More data beats better algorithms, and the evidence for this is convincing. In The Unreasonable Effectiveness of Data, Halevy, Norvig, and Pereira contend that “…invariably, simple models and a lot of data trump more elaborate models based on less data.” So if your data store can’t scale to accommodate more data and, in the midst of a project, you find that you need to migrate to a new data platform because of this limitation, your project is doomed.

Today’s neural networks are capable of amazing things! That’s not because the architectures have improved since they were developed in the 1960s. Rather, it’s because we now have the capacity to store and process immense volumes of data. There’s no compensating for a lack of data with very clever algorithms. It’s simply not an option.

Every industry has become data-driven. But it’s mind-boggling to think about the lost opportunities that stem from the inability to scale data storage. Take, for example, the healthcare industry: Statista reports that in 2020, healthcare data generation outstripped data storage capabilities by more than 2 to 1. When over half the data generated gets discarded because organizations don’t have the means to store it, that represents a tremendous amount of lost intelligence that otherwise could have been put to good use to improve patient outcomes. Think of it this way: The answer to the question, “Should we store it or not?” should be based on whether the data is useful, not based on capacity limitations.

Further, scalability means more than just the capability to accommodate large amounts of data. The system must also maintain a good level of performance and responsiveness as you work with that data. Many machine learning algorithms are trained iteratively, which means going over the data hundreds or thousands of times. If you are operating on a large dataset, one that can’t be pinned in memory, the access times and throughput of the data store become a significant bottleneck.
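To make this concrete, here is a minimal sketch of out-of-core training, where the data is streamed from the store in chunks and the model is updated incrementally, so the full dataset never has to sit in memory. It assumes pandas, SQLAlchemy and scikit-learn; the connection string, table and column names are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import SGDClassifier

engine = create_engine("postgresql://user:password@db-host:5432/analytics")  # hypothetical DSN
model = SGDClassifier(loss="log_loss")

for epoch in range(10):  # iterative algorithms re-read the data many times
    # Stream the table in chunks instead of loading it all at once
    for chunk in pd.read_sql("SELECT * FROM training_data", engine, chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        model.partial_fit(X, y, classes=[0, 1])  # update the model one chunk at a time
```

Notice that every epoch re-reads the entire table, so slow scan performance in the data store is multiplied by the number of passes the algorithm makes.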

Data scientists generally love open source tools and gravitate toward options like PostgreSQL and MongoDB. These are attractive for small-scale data operations, but they fall far short of the Massively Parallel Processing (MPP) capabilities of a platform like Snowflake. When you make the move to enterprise-wide MLOps initiatives, you’ll quickly find that relying on these open source solutions and maintaining them in-house becomes infeasible.

Flexibility

The days when the relational database was king are long gone. Relational databases are great for processing structured data. However, in many of the machine learning domains, like computer vision or natural language processing (NLP), support for semi-structured and unstructured data is imperative.

Traditional enterprise databases have limited support for unstructured data; it is not what they were engineered for, so their role in MLOps is more limited. When selecting a data store for your data operations, you need to consider not only your present needs, but also, as best you can, what kind of models you will be running in two, three or five years. A data storage system that supports machine learning workloads must be flexible in terms of format: it should handle tabular data just as easily as formats like JSON, Avro, ORC, Parquet and XML.
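As a quick illustration of that flexibility, the sketch below pulls tabular, semi-structured and columnar sources into the same DataFrame-centric workflow. It assumes pandas (with pyarrow installed for Parquet); the file names are placeholders.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")           # classic tabular data
events = pd.read_json("events.json", lines=True)   # semi-structured JSON, one record per line
features = pd.read_parquet("features.parquet")     # columnar Parquet (needs pyarrow or fastparquet)

# Nested JSON fields can be flattened into columns before modeling
flat_events = pd.json_normalize(events.to_dict(orient="records"))
```

A data store that can serve all of these shapes through one interface saves your team from stitching together format-specific pipelines.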

Hybrid workloads

Some data stores are good at processing transactional workloads where they need to perform calculations on only one or two data rows and execute with sub-second response times. Others are adept at managing analytical workloads that require processing millions and millions of rows with somewhat relaxed response times. For the purposes of data operations, data scientists need data stores that excel at both.

Snowflake’s architecture is a good example. It stores data in compressed, columnar micro-partitions, which allows for performant processing of both analytical and transactional workloads at scale: individual columns can be scanned efficiently, supporting the execution of both query types. On the compute side, Snowflake’s “virtual warehouses,” which are essentially MPP compute clusters, add flexibility in that they can be created, resized and deleted dynamically as resource needs change.
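As a rough sketch of that elasticity, the following shows a warehouse being created, scaled up for a heavy job and then dropped, using the snowflake-connector-python package. The account credentials and warehouse name are placeholders; the statements themselves are standard Snowflake SQL.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical account identifier
    user="my_user",
    password="my_password",
)
cur = conn.cursor()

# Spin up a small warehouse for ad hoc exploration
cur.execute("CREATE WAREHOUSE IF NOT EXISTS ds_wh WITH WAREHOUSE_SIZE = 'XSMALL'")

# Scale it up for a heavy feature-engineering or training job...
cur.execute("ALTER WAREHOUSE ds_wh SET WAREHOUSE_SIZE = 'LARGE'")

# ...and drop it when the job is done so it stops accruing cost
cur.execute("DROP WAREHOUSE IF EXISTS ds_wh")
conn.close()
```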

Reliability

When an outage makes a company’s data unavailable, all data operations come to an abrupt stop; there’s nothing for your data scientists to do but wait out the situation until it gets rectified by the IT team. Having your data operations idle for a period of time is bad enough from a financial perspective, but the problem is most likely compounded by not being able to ingest new data over the period of the outage.

Data is the biggest asset an organization can have. Don’t get me wrong — human capital is incredibly important, too. Data science talent is in high demand. There are numerous reports available that talk about the persistent shortage of data scientists in a field that continues to grow rapidly. If you lose your entire data science/machine learning team, you will be in for a painful journey. But new employees can be hired and existing employees retrained. Yes, it will be costly and will be a setback to the business, but the company can still recover. If you lose all your data, on the other hand, chances are that in a few months’ time your company won’t exist anymore.

Reliability of your data store is absolutely essential. Running a single instance of a database in-house is easy, but if you need to run multiple servers in clusters with load balancing, disaster recovery and other services, that’s a much more complex and expensive operation to manage. In most situations, it’s preferable to enlist the services of a cloud provider for your data storage operations. Providers like Snowflake, GCP, AWS and others have dedicated teams to look after the reliability of the infrastructure, both from an availability and a security perspective. This is what they do 24/7, and the cumulative experience they bring to bear gives them an advantage over in-house teams.

Cloud-native storage

This is, in part, an extension of the Reliability consideration discussed above. Tapping cloud-native storage resources unlocks several important benefits:

  • Flexible capacity: Utilizing cloud resources allows your data science teams to tweak the allocated resources according to demand. If your system is exclusively on-premises, you’ll have to invest in costly hardware to scale up. And then, once the demand has subsided, scaling down is even more problematic.
  • Maintenance and security: As mentioned, offloading these services to the vendor is almost always advantageous. They’ll typically do it better than an internal team because this is their main focus; they do it for a living, and responding to threats on a daily basis gives the service providers much more experience than your internal team.
  • Geographic resilience: This is not only about cluster reliability through redundancy, but also about the ability to sequester data in specific regions to remain in compliance with regional mandates.

It’s typical for companies that work with sensitive data to have multiple networks that are prohibited from connecting with each other. They may have dedicated on-premises environments for containing sensitive data in order to comply with local or regional data governance requirements. Other, non-restricted data can be stored anywhere, so it is often sent to the cloud. The question for data scientists is, how can we collaborate when we’re dealing with two (or more) distinct, isolated environments?

When data can’t cross borders, you need to take the computation to the data, not the other way around. But if you have distributed data operations supported by a cloud data store, you can develop and experiment with models wherever your best data scientists reside (so long as you are only using sample/anonymized data). With cloud-native storage, you can build and train these models anywhere in the world. And once the remote data scientists have tuned the models, they can easily export them to the region where the sensitive data resides and where local data scientists can deploy the models and execute against the actual sensitive data.
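Here is a minimal sketch of that pattern: train where your data scientists are, then take the model to the data. It assumes scikit-learn and joblib for serialization; the datasets, file paths and column names are placeholders.

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# --- Region A: develop and tune on anonymized sample data ---
sample = pd.read_parquet("anonymized_sample.parquet")
model = RandomForestClassifier(n_estimators=200)
model.fit(sample.drop(columns=["target"]), sample["target"])
joblib.dump(model, "credit_model.joblib")  # artifact shipped to the restricted region

# --- Region B: load the exported model and score the real, sensitive data ---
model = joblib.load("credit_model.joblib")
sensitive = pd.read_parquet("restricted/customer_data.parquet")
scores = model.predict_proba(sensitive[model.feature_names_in_])
```

Only the serialized model crosses the border; the sensitive records never leave the region where they are stored.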

Ease of use

The key idea here is to let data scientists do data science. The data store must be easily accessible from whatever languages and tools (Python, R, MATLAB, etc.) your data scientists prefer. It is also important to have a data science platform that is open and integrates easily with a wide range of data stores. There’s no point in going through an extensive data store selection process only to realize that the winner is not supported by your current data science stack.

Furthermore, the learning curve for using the data store should not be too steep. If your data science team has a time budget for self-improvement, they should be spending it on learning about new algorithms and techniques, acquiring more domain knowledge or getting more proficient with the languages and frameworks they use on a daily basis. What they should NOT be doing is spending that time learning an obscure proprietary API or SQL dialect just to be able to get the data they need. If a given solution requires users to develop a significant volume of proprietary knowledge — for example, a developer certification path that features five training courses (3–5 days each) and five certification exams — maybe that’s not the ideal system for your data science team to be relying on.

The technology tools you employ should be as close to invisible to your data scientists as possible. Your data scientists were hired and are being compensated based on the deep and specialized expertise they bring. As such, their time is highly valuable; they should be spending it on developing, training, deploying and monitoring models, not messing around with IT infrastructure.

Consider a hospital setting: We wouldn’t want to see surgeons spending their time making sure the notes that primary care physicians or emergency room personnel have captured in electronic health records (EHRs) are accessible and complete. No doubt a surgeon’s access to this information is vital to patient outcomes, but the surgeons need to spend their time and energy on the activities for which they are uniquely qualified. In data science operations, we don’t want data scientists to have to manage the infrastructure they are using to develop, train and deploy models. Or, worse yet, we don’t want to see them adapting their processes to accommodate or work around the limitations their technology tools impose.

Data versioning

Reproducibility in data science is extremely important. If you don’t want to handle data versioning manually, a database that supports this capability out of the box can make things really easy. Snowflake’s Time Travel and zero-copy cloning are great examples of this vital capability.

Take the following hypothetical situation: Data scientists working for a financial institution develop and train a scoring model using machine learning to approve or reject applicants for a loan. The model is deployed and runs well across the business for the first two years. But then, the bank is hit by allegations that the model is biased. The first question the data science team has to answer during an audit is how they trained the model. They’ll need access to the training data they used two years ago.

For reproducibility, companies need their original data. Some systems automate this nicely: Oracle, for example, has Flashback technology that returns data as of a specific point in time; it automatically tracks changes to the data over time and serves up the state requested in the query. Not all data stores work this way; some instead require data scientists to manually track changes or take snapshots at different points in time.
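Below is a sketch of what the out-of-the-box approach can look like with Snowflake, again using snowflake-connector-python: a Time Travel query retrieves a table as it existed at a given timestamp, and a zero-copy clone preserves that state for the audit. The table names, timestamp and credentials are placeholders, and Time Travel only reaches back as far as the table’s configured retention period, which is why long-lived snapshots are usually preserved as clones.

```python
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Query the loan applications table as it looked on a past date
cur.execute("""
    SELECT * FROM loan_applications
    AT (TIMESTAMP => '2020-07-27 00:00:00'::TIMESTAMP_LTZ)
""")
historical_rows = cur.fetchall()

# Preserve that historical state as a named, zero-copy clone for the auditors
cur.execute("""
    CREATE TABLE loan_applications_audit
    CLONE loan_applications
    AT (TIMESTAMP => '2020-07-27 00:00:00'::TIMESTAMP_LTZ)
""")
conn.close()
```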

Future focus

This speaks to a longer-term data operations strategy. Is the potential vendor for your data store forward looking? What is their roadmap? Do they understand machine learning and data science? Do they provide (or consider) support for in-database execution of Python/R? Can they leverage GPUs? Can they deploy models for in-database scoring? Do they think about Kubernetes support? Questions such as these will help you suss out how good a fit the data store provider will be in the years to come.

It’s important to see how much the tools and products used in your data operations align with the vendor’s product strategy and what part of their roadmap is machine learning and data science oriented. Simply put, look for a data store vendor that understands data science and machine learning and whose vision of where the discipline will be in five years meshes with the vision of your data science team.

Conclusion

In almost all cases, the purchasing decision on a data store will lie with a buying committee within the enterprise. The maturity of the organization’s data operations will determine who comprises the committee. Those organizations that are not as far into the process of putting data models at the core of their business operations may leave the decision in the hands of IT leadership. Those that are further down the path of becoming model-driven businesses may have a chief data analytics officer (CDAO) or equivalent who will make the final decision. Ultimately, though, IT personnel and data science leaders and practitioners should come together to decide on a solution that will provide the best blend of functionality according to the eight considerations above.

Whether your organization is realizing a little later in the game than you’d like that the functionality of your data store is vital to the success of your data operations, or you’re in a position to choose the ideal data store around which to grow those operations, everyone involved needs to understand how important it is to select the best solution for your needs today and tomorrow.

Nikolay Manchev is a former Principal Data Scientist for EMEA at Domino Data Lab. In this role, Nikolay helped clients from a wide range of industries tackle challenging machine learning use-cases and successfully integrate predictive analytics in their domain-specific workflows. He holds an MSc in Software Technologies, an MSc in Data Science, and is currently undertaking postgraduate research at King's College London. His area of expertise is Machine Learning and Data Science, and his research interests are in neural networks and computational neurobiology.
