Dask is an open-source Python library for flexible parallel computing, first released in 2015 and reaching version 1.0 in 2018. It is designed to scale Python workflows from a single laptop to a large cluster, and it is lighter weight and easier to integrate than frameworks like Apache Spark.

What are the two main components of Dask?

Dask has two main components: 1) a dynamic task scheduler optimized for computation and interactive workloads, which coordinates worker processes across multiple machines and handles concurrent requests from multiple clients; and 2) "Big Data" collections (parallel arrays, dataframes, and lists) that extend familiar Python interfaces such as NumPy and Pandas to distributed environments and run on top of the task scheduler.

How does Dask integrate with existing Python data science tools?

Dask provides parallel counterparts to Pandas dataframes and NumPy arrays, so data scientists can keep using familiar tools. It also integrates tightly with Joblib, the parallel computing library used by Scikit-learn, which enables parallel processing of Scikit-learn code with minimal modifications.

Dask

What is Dask?

Dask was first released in 2015, and reached version 1.0 in 2018, with the goal of giving Python users a powerful parallel computing framework that is easy to use and runs well on either a single laptop or a cluster. Dask is lighter weight than Apache Spark and easier to integrate into existing code and hardware.

Dask is a flexible library for parallel computing in Python. Dask is composed of two parts:

Dynamic task scheduling optimized for computation and interactive computational workloads. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.
Big Data collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to distributed environments. These parallel collections run on top of dynamic task schedulers.
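
As a rough illustration of how a collection sits on top of a scheduler, the sketch below wraps a NumPy array in a Dask array and computes a mean. The array size and chunk size are arbitrary values chosen for the example, and it assumes the default local scheduler rather than a dask-scheduler process on a cluster.

```python
import numpy as np
import dask.array as da

# Illustrative data: one million floats held in an ordinary NumPy array.
data = np.arange(1_000_000, dtype="float64")

# Wrap it as a Dask array split into four chunks of 250,000 elements.
x = da.from_array(data, chunks=250_000)

# NumPy-style operations only build a lazy task graph at this point.
lazy_mean = (x ** 2).mean()

# compute() hands the graph to a scheduler, which runs the per-chunk
# tasks (in parallel where possible) and combines the partial results.
result = lazy_mean.compute()
print(result)
```

The same code can be pointed at a cluster by creating a distributed client first; the collection code itself would not need to change.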

Internally, Dask encodes algorithms in a simple format involving Python dicts, tuples, and functions. This graph format can be used in isolation from the Dask collections. Working directly with task graphs is rare, unless you intend to develop new modules with Dask.
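
A minimal sketch of that graph format follows. The key names ("x", "y", "total", "scaled") are made up for illustration; each value is either a literal or a tuple whose first element is a function and whose remaining elements are arguments or other keys.

```python
from operator import add, mul

from dask.threaded import get

# A hand-written task graph: a plain dict mapping keys to data or tasks.
graph = {
    "x": 1,
    "y": 2,
    "total": (add, "x", "y"),       # total = x + y
    "scaled": (mul, "total", 10),   # scaled = total * 10
}

# Ask the threaded scheduler for one key; it resolves the dependencies.
result = get(graph, "scaled")
print(result)  # 30
```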

Source: Dask Documentation

Since Dask supports Pandas dataframes and NumPy array data structures, data scientists can continue using the tools they know and love. Dask also integrates tightly with Joblib, the parallel computing library used by Scikit-learn, which enables parallel processing of Scikit-learn code with minimal code changes.

Additional Resources

Code

Parallel computing with Dask: a step-by-step tutorial

Ray

Spark, Dask, and Ray: choosing the right framework

Perspective

Considerations for Using Spark in Your Data Science Stack