Providing fine-grained, trusted access to enterprise datasets with Okera and Domino
David Bloch | 2020-10-01 | 8 min read
Domino and Okera - Provide data scientists access to trusted datasets within reproducible and instantly provisioned computational environments.
In the last few years, we’ve seen the acceleration of two trends: the increasing amounts of data stored and utilized by organizations, and the subsequent need for data scientists to help make sense of that data for critical business decisions. This explosion in both the amount of data and the number of users who need access to it has created new challenges, chief among them how to provide secure access to this data at scale and how to give data scientists consistent, repeatable, and convenient access to the computational tools they need.
These patterns play out in multiple industries and use cases. For example, in the pharmaceutical world, there is a great deal of data produced for clinical trials and the commercial production of new drugs and treatments, and this has only accelerated since the emergence of COVID-19. This data supports all kinds of use cases within organizations, from helping production analysts understand how production is progressing, to allowing research scientists to look at the results of a set of treatments across different trials and cross-sections of the population.
Domino Data Lab, the world’s leading data science platform, allows data scientists easy access to reproducible and easily provisioned computational environments. They can work with data without worrying about setting up Apache Spark clusters or getting the right version of libraries. They can easily share results with other users and create recurring jobs to produce new results over time as well.
In today's increasingly privacy-aware environment, more and more types of data are considered sensitive. Those datasets must be protected in accordance with industry-specific regulations such as HIPAA, or the slew of emerging consumer data privacy regulations including GDPR, CCPA and other regulations in different jurisdictions. This can serve as a roadblock to data consumers; although Domino Data Lab makes it easy to access computational resources, gaining access to all the data they need is a real challenge.
Traditionally, this problem has been solved either by denying access to this data altogether (a not infrequent outcome), or by creating and maintaining multiple copies of each dataset, one per use case, each omitting the data that a particular user is not allowed to see (e.g. PII or PHI). Creating these duplicates not only takes a lot of time (typically months) and increases storage costs (which quickly add up when talking about petabytes of data), but also becomes a management nightmare. Data managers need to keep track of all these copies and the purposes for which they were created, and keep them up to date with new data or, even worse, with future redactions and transformations as new types of data are deemed sensitive.
Okera, the leading provider of secure data access and data governance, allows you to define fine-grained data access control using attribute-based access policies. When you combine Domino Data Lab with Okera, your data scientists get access only to the columns, rows, and cells they are allowed to see, and sensitive data such as PII and PHI that is not relevant to training models is removed or redacted. Additionally, Okera connects to a company's existing technical and business metadata catalogs (such as Collibra), making it easy for data scientists to discover, access and utilize new, approved sources of information.
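To make the idea of attribute-based, fine-grained access concrete, here is a minimal sketch in plain Python. It is not Okera's policy engine or policy syntax; the tag names, the `apply_policy` helper, and the sample rows are all made up for illustration. Columns carry sensitivity tags, a user is cleared for some set of tags, and a row filter restricts which records are visible at all.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]

# Hypothetical sensitivity tags that a data steward has attached to each column.
COLUMN_TAGS: Dict[str, Set[str]] = {
    "participant_name": {"pii"},
    "country": set(),
    "dosage_mg": set(),
    "outcome": set(),
}


def apply_policy(
    rows: List[Row],
    cleared_tags: Set[str],
    row_filter: Callable[[Row], bool],
) -> List[Row]:
    """Return only the rows and cell values the caller is allowed to see."""
    visible = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level control: drop records outside the user's scope
        visible.append({
            # Column-level control: redact any column carrying a tag the user
            # is not cleared for. Unknown columns are treated as sensitive.
            col: value
            if COLUMN_TAGS.get(col, {"unclassified"}) <= cleared_tags
            else "REDACTED"
            for col, value in row.items()
        })
    return visible


# Made-up sample data.
trial_rows = [
    {"participant_name": "Ann Lee", "country": "US", "dosage_mg": 50, "outcome": "improved"},
    {"participant_name": "Bo Chen", "country": "DE", "dosage_mg": 75, "outcome": "no change"},
]

# An analyst with no PII clearance who may only see US participants.
analyst_view = apply_policy(
    trial_rows,
    cleared_tags=set(),
    row_filter=lambda r: r["country"] == "US",
)
print(analyst_view)
```

The point of the sketch is that one policy definition replaces the per-use-case copies described earlier: the same underlying rows yield different views depending on who is asking.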
For the compliance team, the combination of Okera and Domino Data Lab is extremely powerful. It allows compliance not only to govern what information can be accessed, but also to audit how the data is actually being accessed: when, by whom, through which tools, and how much data was viewed. This visibility can help identify data breaches and show where data access should be further reduced, for example by removing access to infrequently used data to lower the risk of exposure.
So what does this look like? Consider an example where a data scientist wants to load a CSV file from Amazon S3 into a pandas dataframe for further analysis, such as building a model for a downstream ML process. In Domino Data Lab, the user would use one of the Environments they have access to, and have some code that might look like this:
import io
import boto3
import pandas as pd
# Fetch the CSV object from S3 and load it into a pandas DataFrame.
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='clinical-trials', Key='drug-xyz/trial-july2020/data.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
A critical detail embedded in the above snippet is the question of how the data scientist gets permission to access the file. This can be done via IAM permissions, either by storing user credentials in secure environment variables inside Domino or by using Keycloak to propagate credentials between Domino and AWS.
Finally, if the data scientist were not allowed to see certain columns, rows, or cells within the CSV file, there would be no way to grant access to just the permitted parts: access to the file is all-or-nothing.
When Domino Data Lab is integrated with Okera, the same code simply looks like this:
import os
from okera.integration import domino
ctx = domino.context()
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')
The identity of the current user in Domino Data Lab is automatically and transparently propagated to Okera, with all the requisite fine-grained access control policies applied. This means that if the executing user was only allowed to see certain rows (e.g. trial results from participants in the US, to adhere to data locality regulations) or see certain columns but without exposing PII (e.g. by not exposing a participant’s name but still being able to meaningfully create aggregations), this will be reflected in the result of the query that gets returned, without ever exposing the data scientist to the underlying sensitive data. Finally, this data access is also audited, and that audit log is made available as a dataset for querying and inspection.
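Because the audit log is itself a dataset, it can be analyzed like any other. The following is a rough sketch using pandas on made-up audit records. The column names (`user`, `dataset`, `rows_returned`) and the second dataset name are assumptions for illustration, not Okera's actual audit schema. It summarizes usage per dataset and flags datasets that are rarely accessed, which are candidates for tighter access.

```python
import pandas as pd

# Made-up audit records; the column names are assumptions, not Okera's audit schema.
audit = pd.DataFrame({
    "user": ["alice", "alice", "bob", "carol", "alice"],
    "dataset": [
        "drug_xyz.trial_july2020",
        "drug_xyz.trial_july2020",
        "drug_xyz.trial_july2020",
        "drug_xyz.trial_may2020",  # hypothetical second dataset
        "drug_xyz.trial_july2020",
    ],
    "rows_returned": [1200, 1200, 300, 50, 1200],
})

# Summarize how often each dataset is read, by how many people, and how much data left it.
usage = (
    audit.groupby("dataset")
    .agg(
        accesses=("user", "size"),
        distinct_users=("user", "nunique"),
        rows_returned=("rows_returned", "sum"),
    )
    .sort_values("accesses")
)

# Flag datasets accessed fewer than twice in this window as candidates for review.
rarely_used = usage[usage["accesses"] < 2].index.tolist()

print(usage)
print(rarely_used)
```

The threshold of two accesses is arbitrary; in practice it would depend on the time window covered by the log and on the organization's own risk tolerance.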
In addition to the benefits of being able to access data securely while maintaining fine-grained access control policies, it is now much easier for data scientists to find the data that they need. Previously, this involved sifting through object storage such as Amazon S3 or Azure Data Lake Storage (ADLS). With the combination of Okera and Domino Data Lab, data scientists can inspect and search Okera's metadata registry to find data they have access to that has been validated, qualified and documented by subject matter experts, preview it, and get simple instructions on how to access it in their Domino Data Lab environments.
As your organization's investment in its data and the productivity of your data scientists increase, it's critical that they have the right tools and access to the right data. With the combination of Okera and Domino Data Lab, the whole is more than the sum of its parts. If you’re already leveraging Domino Data Lab, adding Okera can allow you to unlock data for analysis that was previously off-limits due to privacy and security concerns. If you're already using Okera, adding Domino Data Lab can increase the productivity of your data scientists by giving them easy access to reproducible and easily provisioned computational environments.
For more information about Okera and their partnership with Domino, see Okera's blog post or the documentation for integrating Okera with Domino.