Supervised vs. Unsupervised Learning: What’s the Difference?
By David Weedmark | 2021-12-06 | 8 min read
Of all the thousands of algorithms available for machine learning, or ML, the vast majority use one of three main branches of learning techniques.
3 Primary Types of Learning in Machine Learning
Supervised learning uses labeled data during training to point the algorithm to the right answers. Unsupervised learning uses no such labels, so the algorithm must work out its answers on its own. In reinforcement learning, the algorithm is steered toward the right answers by a series of rewards and penalties defined by the model designers. Each of these types of learning can also be accomplished using deep learning techniques.
Supervised vs. Unsupervised Learning
When a human is learning in a supervised environment, a teacher is present with the answers in hand. With supervised learning, ML models don’t have teachers, but they have access to the answers in the form of labeled data in the training dataset. Labeled data means that the training data already contains the answers the algorithm should find.
Unsupervised learning is fundamentally different. When a student without a teacher must solve a complex problem, it’s often a matter of trial and error, and the student must determine for herself what the correct answer might be. In machine learning, unsupervised learning involves unlabeled data, without clear answers, so the algorithm must find patterns among the data points on its own and arrive at answers that were not defined at the outset.
What Is Supervised Learning?
Supervised learning is an ML technique that uses labeled datasets to train algorithms. It can only be used when the training data is labeled, and it is typically applied to classification or regression problems.
To solve a classification problem, the algorithm predicts a discrete value, identifying the input data as part of a class. If the algorithm should distinguish between photos of dogs and cats, for example, the images are all labeled correctly and the algorithm will compare each of its answers to the labels after it has made its prediction. When it is given a new set of data, it can compare the new data to the lessons learned from labeled training data to more accurately make its new predictions.
To solve a regression problem, the algorithm predicts a continuous value. Linear regression is a common example, in which a y-value is predicted from a given x-value using a line fitted to the dataset. However, for an ML model’s development to be justified, there are usually many input variables, and the algorithm must determine the relationships among them, and their respective weights, to arrive at a prediction.
Semi-supervised learning is similar to supervised learning, except that not all of the data is labeled. This is a preferable alternative when labeling examples takes too much time, or when extracting features from the data is too difficult. Semi-supervised learning is often used in medical imagery, such as analyzing MRI scans.
Supervised Learning Process
Because supervised learning requires labeled training data, processing the training dataset can require a great deal of time and effort. Once the model has been trained with that data, it is then given a new set of data to test its predictions. The algorithms available to use depend on whether you need to solve a classification problem or a regression problem.
To solve classification problems, the algorithm is trained on labeled data and then tested to see whether it can correctly classify the entities in a new test dataset. Examples of classification algorithms include:
- Logistic regression
- Support vector machines
- Decision trees
- K-nearest neighbor
- Random forest
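As a rough illustration of one of the classification algorithms listed above, here is a minimal K-nearest neighbor sketch in plain Python. The training data (animal weight and height), the labels and the function name `knn_predict` are hypothetical and chosen only for this example; a production model would normally use an ML library and far more data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest labeled points."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled training data: (weight in kg, height in cm) -> species.
train_points = [(4.0, 25.0), (5.0, 23.0), (3.5, 24.0),
                (20.0, 55.0), (30.0, 60.0), (25.0, 58.0)]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

# New, unlabeled inputs are classified by comparing them to the labeled examples.
prediction_small = knn_predict(train_points, train_labels, (4.5, 26.0))
prediction_large = knn_predict(train_points, train_labels, (22.0, 57.0))
print(prediction_small, prediction_large)
```

The labels play the role of the "answers" in supervised learning: without them, the vote in the last step of `knn_predict` would have nothing to count.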
When the model needs to solve a problem using regression, it makes projections based on the relationships between dependent and independent variables. Examples of regression include linear regression and polynomial regression. Other regression algorithms include:
- Neural networks with real-valued outputs
- Lasso regression
- Support vector regression
- Random forest regressor
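To make the idea of regression concrete, the following sketch fits a simple linear regression with ordinary least squares in plain Python. The data values and the function name `fit_line` are hypothetical, and the sketch assumes a single input variable; real models usually involve many variables.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled data: each x-value comes with its known y-value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)

# Predict a continuous y-value for an x-value that was not in the training data.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```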
What Is Unsupervised Learning?
Quite often, clean, labeled data is not available, or researchers may need an algorithm to answer questions for which there are no obvious answers, even during training. In these cases, unsupervised learning is used. Unsupervised learning is an ML technique that uses algorithms to analyze unlabeled, and often unstructured, data.
Complex models, like neural networks, can determine patterns in the data by analyzing its structure and extracting useful features. In these cases, the datasets are usually complex, as are the algorithms and the problems that need to be solved. Examples of unsupervised learning include:
- Anomaly detection: The model searches for unusual patterns, like detecting credit-card fraud by unusual locations for charges.
- Association: The model identifies key attributes of data points to create associations, like recommending accessories to online consumers after they place an item in a shopping cart.
- Clustering: The model searches for similarities in the training data and groups them together, like identifying groups of market demographics.
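As a very simplified illustration of the anomaly detection case above, the sketch below flags values that sit far from the rest of the data using a z-score. It assumes charge amounts rather than charge locations, the data and the threshold of 2.0 are hypothetical, and real fraud models are considerably more sophisticated.

```python
import statistics

def find_outliers(values, z_threshold=2.0):
    """Return the values whose distance from the mean exceeds z_threshold standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # All values are identical, so nothing stands out.
    return [v for v in values if abs(v - mean) / spread > z_threshold]

# Hypothetical, unlabeled charge amounts: nothing marks any of them as fraudulent.
charges = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]

unusual = find_outliers(charges)
print(unusual)
```

No labels are involved: the function decides what counts as unusual purely from the distribution of the data itself.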
The term “semi-unsupervised” is sometimes used to describe cases where labeled data is sparse and the data in one or more classes is not labeled at all. This method is actually a subset of semi-supervised learning, but with similarities to zero-shot learning, as well as transfer learning, where deep generative models are being used.
Unsupervised Learning Process
In an unsupervised learning environment, an algorithm is given unlabeled data and tasked with discovering patterns within variables on its own. There are three primary ways of doing this: clustering, association rules, and dimensionality reduction.
Clustering involves grouping data points based on their similarities or differences. Exclusive clustering, used in K-means clustering, assigns each data point to exactly one group. Overlapping clustering, by contrast, allows a data point to belong to more than one group, with a different degree of membership in each. Other clustering approaches include hierarchical clustering and probabilistic clustering.
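The K-means approach mentioned above can be sketched in a few lines of plain Python. This is a minimal version under stated assumptions: the points are hypothetical, the starting centroids are picked by hand rather than at random, and the loop runs for a fixed number of iterations instead of checking for convergence. In this sketch, each point is assigned to the single nearest centroid.

```python
import math

def nearest_index(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, initial_centroids, iterations=10):
    """Return (centroids, assignments) after a fixed number of K-means iterations."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    assignments = [nearest_index(point, centroids) for point in points]
    return centroids, assignments

# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])
print(centroids, assignments)
```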
Association rules are used to discover relationships between variables, and are often used in marketing for recommending products or understanding consumer behavior. The Apriori algorithm, for example, is commonly used to identify items that some customers buy together so that they can be recommended to others.
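Association rules are usually scored with two measures: support (how often a set of items appears together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both in plain Python for a handful of hypothetical shopping carts. It is not a full Apriori implementation, since it omits the candidate generation and pruning steps; the function names are chosen for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    matches = sum(1 for cart in transactions if itemset <= cart)
    return matches / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Hypothetical shopping carts, each represented as a set of items.
transactions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"phone", "case", "screen protector"},
]

# Score the rule "customers who buy a phone also buy a case".
pair_support = support(transactions, {"phone", "case"})
rule_confidence = confidence(transactions, {"phone"}, {"case"})
print(pair_support, rule_confidence)
```

A rule with high support and high confidence is a candidate for a recommendation, such as suggesting a case when a phone is placed in the cart.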
Dimensionality reduction is used to reduce the number of data inputs when the number of dimensions or features in a dataset is too high. Examples of dimensionality reduction include principal component analysis, singular value decomposition and autoencoders. These techniques are useful when mitigating issues related to the curse of dimensionality.
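As an illustration of dimensionality reduction, the sketch below finds the first principal component of a small dataset and projects each two-feature row onto it, reducing two features to one. It is a simplified stand-in for principal component analysis: it uses power iteration on the covariance matrix rather than a full eigendecomposition, and it assumes the starting vector is not orthogonal to the dominant direction. The data is hypothetical and lies exactly on the line y = 2x so the result is easy to check.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the dominant principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[a][b] * direction[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    # Each row is reduced to a single coordinate along the principal direction.
    projections = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
    return direction, projections

# Hypothetical data with two perfectly correlated features.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

direction, projections = first_principal_component(rows)
print(direction, projections)
```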
Supervised and Unsupervised Learning Techniques
It’s not always possible to predict which will be the best learning technique when you’re about to develop a machine-learning model. The learning curve in making this determination can be reduced with experience, provided that you have easy access to the data, tools and documentation used for previous models.
This is why model-driven organizations rely on the Domino Data Lab MLOps platform, which provides data science teams with the tools and resources they need while also fostering a collaborative, documented process for each project. To begin exploring the advantages of Domino’s Enterprise MLOps platform, sign up for a free 14-day trial.
David Weedmark is a published author who has worked as a project manager, software developer and as a network security consultant.