Data Science Dictionary


Apache Airflow

Apache Airflow is a powerful open-source tool that helps you author, schedule, and monitor workflows. Airbnb created Airflow in 2014 to help manage its data processing needs, and the project has since become a far-reaching tool for data scientists across the industry.


Anaconda

Anaconda is an open-source distribution of the Python and R programming languages for data science that aims to simplify package management and deployment. Package versions in Anaconda are managed by the package management system, conda, which analyzes the current environment before executing an installation to avoid disrupting other frameworks and packages.

Apache Spark

Apache Spark is an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. Spark was created in 2009 at UC Berkeley to address many of Apache Hadoop’s shortcomings, and is much faster than Hadoop for analytic workloads because it stores data in-memory (RAM) rather than on disk.

Artificial Intelligence

Artificial Intelligence (AI) is a class of solutions that is able to perform tasks that ordinarily require human intelligence. If you speak with Siri on your phone, play a competitive game against a computer, or ride in a self-driving car, you are interacting with AI.


Data Clustering

Data clustering is a widely used machine learning technique for segregating a set of objects into classes (clusters) of similar objects.
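
A minimal sketch with k-means, one common clustering algorithm, assuming scikit-learn and NumPy are installed; the data is synthetic and purely illustrative:

```python
# Minimal clustering sketch with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# Ask for 2 clusters; fit_predict returns one cluster label per point.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Because the two groups are well separated, all points from the same group end up with the same cluster label.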


Dask

Dask is a powerful parallel computing framework, first released in 2015, that is highly usable for Python users and runs well on anything from a single laptop to a cluster. Dask is lighter weight and easier to integrate into existing code and hardware than Apache Spark.

Data Science

Data science is a discipline that looks for patterns in complex datasets to build models that predict what may happen in the future and/or explain systems. Data science combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.

Density-Based Clustering

Density-Based Clustering refers to unsupervised machine learning methods that identify distinctive clusters in the data, based on the idea that a cluster/group in a data space is a contiguous region of high point density, separated from other clusters by sparse regions. The data points in the separating, sparse regions are typically considered noise/outliers.
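
A sketch with DBSCAN, a classic density-based method, assuming scikit-learn and NumPy are installed; the points are synthetic:

```python
# Density-based clustering sketch with DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense_a = rng.normal(loc=0.0, scale=0.2, size=(30, 2))   # one dense region
dense_b = rng.normal(loc=4.0, scale=0.2, size=(30, 2))   # another dense region
outlier = np.array([[10.0, 10.0]])                       # isolated point in a sparse region
X = np.vstack([dense_a, dense_b, outlier])

# A point needs at least min_samples neighbors within eps to seed a cluster;
# points stranded in sparse regions receive the noise label -1.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
```

The two dense regions each form a cluster, while the isolated point is labeled noise.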


dplyr

dplyr (pronounced “dee-ply-er”) is the preeminent tool for data wrangling in R. Learning and using dplyr helps data scientists make the data preparation and management process faster and easier to understand. Data scientists typically use dplyr to transform existing datasets into a format better suited for some particular type of analysis or data visualization.

Factor Analysis

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables.
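
The six-variables/two-factors scenario can be sketched with scikit-learn's FactorAnalysis (assumed installed); the data is synthetic, generated so that two latent factors drive six observed variables:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: six observed variables driven by two latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))           # 2 unobserved factors
loadings = rng.normal(size=(2, 6))           # how each factor influences each variable
X = latent @ loadings + 0.1 * rng.normal(size=(500, 6))

# Recover a two-dimensional factor representation of the six observed variables.
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
scores = fa.transform(X)                     # estimated factor values per observation
```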


Feature Engineering

Feature engineering refers to manipulation — addition, deletion, combination, mutation — of your data set to improve machine learning model training, leading to better performance and greater accuracy. Effective feature engineering is based on sound knowledge of the business problem and the available data sources.
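
A sketch of those manipulations with pandas (assumed installed); the dataset and column names are hypothetical:

```python
# Feature engineering sketch on a toy housing table.
import pandas as pd

df = pd.DataFrame({
    "price": [250000, 180000, 420000],
    "sqft": [2000, 1500, 3000],
    "sale_date": pd.to_datetime(["2021-06-01", "2021-12-15", "2022-03-10"]),
})

# Combination: price per square foot can be more informative than raw price.
df["price_per_sqft"] = df["price"] / df["sqft"]
# Mutation: derive the sale month as a seasonal signal.
df["sale_month"] = df["sale_date"].dt.month
# Deletion: drop the raw timestamp once the derived feature exists.
df = df.drop(columns=["sale_date"])
```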

Feature Extraction

Feature extraction is a process in machine learning and data analysis that involves identifying and extracting relevant features from raw data. These features form a more informative, often lower-dimensional representation of the data that can then be used for downstream tasks such as classification, regression, or clustering.

Feature Selection

Feature selection is the process by which we select a subset of input features from the data for a model to reduce noise.
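
A sketch using a univariate filter, assuming scikit-learn and NumPy are installed; the data is synthetic, with one informative feature hidden among four noise features:

```python
# Feature selection sketch with SelectKBest.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                              # binary labels
signal = y.reshape(-1, 1) + 0.1 * rng.normal(size=(200, 1))   # tracks the label
noise = rng.normal(size=(200, 4))                             # four pure-noise features
X = np.hstack([signal, noise])

# Keep the single feature with the strongest relationship to the label.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
kept = selector.get_support()          # boolean mask over the 5 input features
```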


Folium

Folium is a powerful Python library that helps you create several types of Leaflet maps. By default, Folium creates a map in a separate HTML file. Since Folium results are interactive, this library is very useful for dashboard building. You can also create inline Jupyter maps in Folium.


GenomicRanges

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. This R package lays the groundwork for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges.


ggmap

ggmap is an R package that makes it easy to retrieve raster map tiles from popular online mapping services such as Google Maps and Stamen Maps, and plot them using the ggplot2 framework. The result is an easy, consistent and modular framework for spatial graphics with several tools for spatial data analysis.


ggplot2

ggplot2 is a data visualization package for the statistical programming language R. ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics—a scheme for data visualization which breaks up graphs into semantic components such as scales and layers. ggplot2 is an alternative to the base graphics in R, and contains a number of plotting defaults.


Graphics Processing Unit (GPU)

A graphics processing unit (GPU) is a specialized circuit designed to rapidly manipulate and alter memory to accelerate computer graphics and image processing. Modern GPUs’ highly parallel structure makes them more efficient than central processing units (CPUs) for algorithms that process large blocks of data in parallel.

Ground Truth

Ground truth in machine learning refers to the reality you want to model with your supervised machine learning algorithm. Ground truth serves as the target when training or validating a model against a labeled dataset.
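
A tiny plain-Python illustration with made-up labels: model predictions are scored against the ground-truth labels of a labeled dataset.

```python
# Scoring predictions against ground-truth labels (illustrative values).
ground_truth = [1, 0, 1, 1, 0]   # labels from a human-annotated dataset
predictions = [1, 0, 0, 1, 0]    # model output

correct = sum(p == t for p, t in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)
```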

Hash table

Hash tables are a type of data structure in which the address/index of a data element is generated from a hash function. This enables very fast data access, because the index value acts as a key for the data value.
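
A minimal sketch of the mechanism in plain Python, with collisions handled by chaining (a list per bucket):

```python
# Minimal hash table: the hash function maps a key to a bucket index.
class HashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # key -> bucket address

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("alice", 30)
table.put("bob", 25)
```

Python's built-in dict is a production-grade hash table; the class above only sketches the underlying idea.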

Hyperparameter Tuning

Hyperparameter tuning is the process of finding the optimal hyperparameters for any given machine learning algorithm.
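
One standard approach is an exhaustive grid search with cross-validation; a sketch assuming scikit-learn is installed:

```python
# Hyperparameter tuning sketch: grid search over n_neighbors for a KNN classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is a hyperparameter: it is chosen before training, not learned from data.
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
best_k = search.best_params_["n_neighbors"]
```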


Interpretable Machine Learning

Interpretable machine learning means humans can capture relevant knowledge from a model concerning relationships either contained in the data or learned by the model. Machine learning algorithms have historically been "black boxes" that provided no way to understand their inner processes, which made it difficult to explain resulting insights to regulatory agencies and stakeholders.

Jupyter Notebook

Jupyter Notebook (formerly known as IPython Notebook) is an interactive web application for creating and sharing computational documents.


Kubernetes

Kubernetes is an open source container-orchestration system for automating application deployment, scaling, and management. Kubernetes (aka, K8s) was developed to manage the complex architecture of multiple containers (e.g., Docker) and hosts running in production environments. K8s is quickly becoming essential to IT departments as they move towards containerized applications and microservices.


LLMOps

LLMOps, short for Large Language Model Operations, is a specialized discipline within the broader field of MLOps. It covers the practices and tooling used to develop, deploy, and maintain large language models in production.

Machine Learning

Machine learning (ML) is the application of computer algorithms that improve automatically through experience. Machine learning algorithms build a model based on sample data, known as "training data," in order to make predictions or decisions without being explicitly programmed to do so.

Machine Learning Algorithms

Machine Learning algorithms are computational procedures aimed at solving a problem. In the realm of data science, an algorithm is a precisely defined and logically structured set of computational instructions designed to process and analyze data, extracting meaningful insights or making predictions.


Machine Learning Operations (MLOps)

Machine Learning Operations (MLOps) is a set of technologies and best practices that streamline the management, development, deployment, and monitoring of data science models at scale across a diverse enterprise.

Model Drift

Model drift is the decay of a model's predictive power as a result of changes in real-world environments.

Model Evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses. Model evaluation is important to assess the efficacy of a model during initial research phases, and it also plays a role in model monitoring.
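
A sketch with three common classification metrics, assuming scikit-learn is installed; the labels are illustrative:

```python
# Model evaluation sketch: accuracy, precision, and recall.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (one missed positive)

accuracy = accuracy_score(y_true, y_pred)     # fraction of correct predictions
precision = precision_score(y_true, y_pred)   # of predicted positives, how many are real
recall = recall_score(y_true, y_pred)         # of real positives, how many were found
```

Here the model never raises a false alarm (precision 1.0) but misses one of the four real positives (recall 0.75), a distinction raw accuracy hides.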

Model Monitoring

Model Monitoring is an operational stage in the machine learning lifecycle that comes after model deployment. It entails monitoring your ML models for changes such as model degradation, data drift, and concept drift, and ensuring that your model is maintaining an acceptable level of performance.

Model Selection

Model selection is the process of selecting the best model from all the available models for a particular business problem, on the basis of criteria such as robustness and model complexity.

Model Tuning

Model tuning is the experimental process of finding the optimal values of hyperparameters to maximize model performance.


Overfitting

Overfitting describes the phenomenon in which a model becomes too sensitive to the noise in its training set, leading it to not generalize, or to generalize poorly, to new and previously unseen data.
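
A small sketch with NumPy (assumed installed): a degree-9 polynomial has enough capacity to pass through all 10 noisy training points, but its near-zero training error does not carry over to unseen points.

```python
# Overfitting sketch: high-capacity polynomial fit to noisy samples of a sine wave.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=10)  # noisy samples
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test)                                # clean targets

def errors(degree):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train3, test3 = errors(3)   # modest model
train9, test9 = errors(9)   # enough capacity to memorize the noise
```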

Plotly

Plotly is an interactive, open-source, and browser-based graphing library.


PySpark

PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you’re already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.


PyTorch

PyTorch is an open source machine learning library, released by Facebook's AI Research lab in 2016. It can be used across a range of tasks, but is particularly focused on training and inference for deep learning applications such as computer vision and natural language processing.

Shiny (in R)

Shiny is an R package that enables building interactive web applications that can execute R code on the backend. With Shiny, you can host standalone applications on a webpage, embed interactive charts in R Markdown documents, or build dashboards. You can also extend your Shiny applications with CSS themes, HTML widgets, and JavaScript actions.


Scikit-learn

Scikit-learn, also known as sklearn, is an open-source machine learning and data modeling library for Python. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python libraries NumPy and SciPy.
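
A sketch of the typical scikit-learn workflow (assumed installed): split the data, fit an estimator, and score it. Every estimator shares the same fit/predict/score interface.

```python
# Typical scikit-learn workflow on the bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)   # accuracy on the held-out split
```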


spaCy

spaCy is a free, open-source Python library that provides advanced capabilities to conduct natural language processing (NLP) on large volumes of text at high speed. It helps you build models and production applications that can underpin document analysis, chatbot capabilities, and all other forms of text analysis.


TensorFlow

TensorFlow is an open source framework for machine learning. It has a comprehensive ecosystem of tools, libraries, and community resources that lets developers easily build and deploy ML-powered applications, and researchers innovate in ML. It can be used across a range of tasks, but is particularly focused on training and inference of deep neural networks.


Underfitting

Underfitting describes a model which does not capture the underlying relationship in the dataset on which it’s trained.


XGBoost

XGBoost is an open source, ensemble machine learning algorithm that utilizes a high-performance implementation of gradient boosted decision trees. An underlying C++ codebase combined with a Python interface sitting on top makes XGBoost a very fast, scalable, and highly usable library.