Ground Truth
What is ground truth?
In machine learning, the term ground truth refers to the reality you want to model with your supervised machine learning algorithm. Ground truth is also known as the target for training or validating the model with a labeled dataset. During inference, a classification model predicts a label, which can be compared with the ground truth label, if it is available.
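As a minimal sketch of that comparison, the snippet below scores a set of predictions against ground truth labels by counting exact matches. The function name accuracy and the example labels are hypothetical and chosen only for illustration; real projects usually rely on an evaluation library instead.

```python
def accuracy(predicted, ground_truth):
    """Return the fraction of predictions that match the ground truth labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("cannot score an empty dataset")
    matches = sum(p == t for p, t in zip(predicted, ground_truth))
    return matches / len(ground_truth)


# Hypothetical labels for five images (illustrative data only).
ground_truth = ["cat", "dog", "dog", "cat", "bird"]
predicted = ["cat", "dog", "cat", "cat", "bird"]

print(accuracy(predicted, ground_truth))  # 4 of 5 labels match, prints 0.8
```

This only works when ground truth labels are available for the examples being scored, which is typically the case for validation and test data but not for live inference.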
Developing a ground truth dataset often requires major tasks such as model design, data labeling, classifier design, and training and testing. Ground truth labels are mostly annotated manually by a group of annotators, and the annotations are then compared using different techniques to set the target labels for the dataset. Larger annotated datasets increase data variety, which helps supervised learning and deep learning algorithms learn better patterns.
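One common way to compare annotations is a majority vote. The sketch below assumes each item was labeled by several annotators and keeps a label only when more than half of them agree. Majority voting is just one of many possible techniques, and the function name majority_vote, the min_agreement threshold, and the sample data are hypothetical.

```python
from collections import Counter


def majority_vote(annotations, min_agreement=0.5):
    """Return the most common label, or None when agreement is too low.

    A label is accepted only if its share of the votes is strictly greater
    than min_agreement, so ties and scattered votes are left unresolved.
    """
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    if votes / len(annotations) <= min_agreement:
        return None
    return label


# Hypothetical annotations from three annotators per image.
annotations_per_item = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

targets = {item: majority_vote(labels) for item, labels in annotations_per_item.items()}
print(targets)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': None}
```

Items that come back as None have no clear consensus and would normally be sent back for review or another round of annotation.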
Defining a goal with your model
It is the responsibility of humans to define the objective for a machine learning algorithm, and with it the ground truth. In machine learning, the objective is always subjective. Decision-makers often disagree when setting it, because in most cases there are no hard-and-fast rules that define the objective or the ground truth label in all situations.
All the individual attributes that can influence the predefined objective or target label are chosen as the feature set of the dataset. It is important to ensure that none of these features causes data leakage. Data leakage happens when a model learns a relationship between its target and some data that would not normally be available during inference. It can result in a model that performs very well on the training and validation data but fails badly on real-world test data.
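One simple guard against leakage is to keep an explicit list of the features that will actually exist at inference time and drop everything else before training. The sketch below illustrates the idea on a made-up loan dataset; the function keep_inference_features and all column names are hypothetical, and an allowlist is only one of several possible safeguards.

```python
def keep_inference_features(rows, available_at_inference):
    """Drop every column that is not on the allowlist of inference-time features."""
    allowed = set(available_at_inference)
    return [
        {name: value for name, value in row.items() if name in allowed}
        for row in rows
    ]


# Hypothetical loan records. "collection_calls" is only recorded after a
# default has happened, so using it as a feature would leak the target.
rows = [
    {"income": 52000, "loan_amount": 10000, "collection_calls": 0, "defaulted": 0},
    {"income": 31000, "loan_amount": 15000, "collection_calls": 4, "defaulted": 1},
]
available_at_inference = ["income", "loan_amount"]

features = keep_inference_features(rows, available_at_inference)
targets = [row["defaulted"] for row in rows]

print(features)  # only income and loan_amount remain in each row
print(targets)   # ground truth labels kept separately: [0, 1]
```

Keeping the target column out of the feature set, as done here, avoids the most direct form of leakage, while the allowlist handles columns that are recorded only after the outcome is known.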
Labeling ground truth data
Once the training objectives are clearly defined, you need to get your data labeled accordingly. Several third-party platforms provide data labeling services. Labelbox allows users to invite team members and collaborate on workflows, and to import and export several different annotation formats. Other popular platforms include Scale AI and Clarifai, which are used for labeling computer vision, NLP, and audio data.