  • Apache Spark

    What is Apache Spark?

    Apache Spark is an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. Spark was created in 2009 at UC Berkeley to address many of Apache Hadoop’s shortcomings. For analytic workloads it can be much faster than Hadoop MapReduce because it keeps working data in memory (RAM) rather than writing intermediate results to disk.

    Spark has many built-in libraries that implement machine learning algorithms as parallel processing jobs, so they scale across many compute resources without extra work from the user. Spark is one of the most actively developed open-source frameworks for large-scale data processing.

    Spark applications consist of a driver process and a set of executor processes. The driver process is responsible for three things:

    1. maintaining information about the Spark application;
    2. responding to a user’s program or input; and
    3. analyzing, distributing, and scheduling work across the executors.

    Each executor is responsible for executing the code assigned to it by the driver and for reporting the state of its computation back to the driver node.
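
    The division of labor between driver and executors can be illustrated with a minimal sketch in plain Python. This is an analogy only: it uses a standard-library thread pool rather than Spark, and the names `split_into_partitions`, `executor_task`, and `run_driver` are hypothetical, chosen for illustration. The "driver" splits the data, schedules one task per partition, and combines the status reports that each "executor" sends back.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver-side step: divide the input into roughly equal chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def executor_task(partition_id, partition):
    """Executor-side step: run the assigned code and report state back."""
    partial_sum = sum(x * x for x in partition)
    return {"partition": partition_id, "status": "done", "result": partial_sum}


def run_driver(data, num_executors=3):
    """Driver: plan the work, hand it to executors, and combine their reports."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        reports = list(pool.map(executor_task, range(len(partitions)), partitions))
    total = sum(report["result"] for report in reports)
    return total, reports


if __name__ == "__main__":
    total, reports = run_driver(list(range(10)))
    for report in reports:
        print(report)
    print("sum of squares:", total)
```

    In a real Spark application the executors are separate processes, usually on different machines, and the driver also handles concerns this sketch ignores, such as retrying failed tasks.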

    Apache Spark Diagram

    Source: Apache Spark

    In general, Spark is most appropriate when your data cannot fit into memory on a single machine, i.e., data sets larger than hundreds of gigabytes. Some of the most popular use cases for Spark include:

    • Streaming data: Spark Streaming unifies disparate data processing capabilities, allowing developers to use a single framework to continually clean and aggregate data before it is pushed into data stores. Spark Streaming also supports trigger event detection, data enrichment, and complex session analysis.
    • Interactive analysis: Spark is fast enough to perform exploratory queries on very large data sets without sampling. By combining Spark with visualization tools, complex data sets can be processed and visualized interactively.
    • Machine learning: Spark comes with an integrated framework for performing advanced analytics that helps users run repeated queries on sets of data. Among the components found in this framework is Spark’s scalable Machine Learning Library (MLlib). MLlib can work in areas such as clustering, classification, and dimensionality reduction.
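
    The streaming use case above, "continually clean and aggregate data" before it reaches a data store, can be sketched in plain Python. Spark Streaming's classic API treats a stream as a sequence of small batches, and the sketch below mimics only that idea. It does not use Spark, and the record format (`"key,value"` strings) and the function names `clean` and `process_stream` are assumptions made for illustration.

```python
from collections import defaultdict


def clean(record):
    """Normalize one raw 'key,value' record; return None if it is unusable."""
    key, _, raw_value = record.partition(",")
    key = key.strip().lower()
    try:
        value = float(raw_value)
    except ValueError:
        return None
    return (key, value) if key else None


def process_stream(batches):
    """Clean each micro-batch, fold it into running totals, and yield a snapshot."""
    totals = defaultdict(float)
    for batch in batches:
        for record in batch:
            cleaned = clean(record)
            if cleaned is not None:
                key, value = cleaned
                totals[key] += value
        # The snapshot stands in for what would be pushed to a data store.
        yield dict(totals)


if __name__ == "__main__":
    incoming = [
        ["sensor-a, 1.5", "Sensor-B,2", "bad row"],
        ["sensor-a,0.5", ",3", "sensor-b, oops"],
    ]
    for snapshot in process_stream(incoming):
        print(snapshot)
```

    Malformed records are dropped during cleaning, so only valid values contribute to the running totals emitted after each batch.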

    Spark involves more processing overhead and a more complicated set-up than other data processing options. Alternatives such as Ray and Dask have recently emerged.