The 80/20 rule in machine learning is a strategy for splitting datasets: you divide the data so that your model receives enough examples for training while reserving a sufficient held-out portion for validation, allowing you to obtain an accurate assessment of model performance.
These splits can be 90/10, 80/20, 75/25, 70/30, 66.6/33.3, 65/35, 55/45, and many others. However, the most commonly used ratio is 80/20, in which 80% of the available data is allocated to the training set and 20% to the test set.
If you join a good Machine Learning Bootcamp, this is the splitting ratio that you will most probably learn. In this article, let’s examine why it is important to use the 80/20 rule in Machine Learning and how it might alter your data-processing workflow.
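As a minimal sketch in plain Python (the function name here is illustrative; libraries such as scikit-learn provide an equivalent `train_test_split` helper), an 80/20 split is just a shuffle followed by a slice:

```python
import random

def split_80_20(data, test_fraction=0.2, seed=42):
    """Shuffle a dataset and split it into training and test portions."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(data)          # copy so the original order stays intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

samples = list(range(50))          # 50 toy examples
train, test = split_80_20(samples)
print(len(train), len(test))       # 40 10
```

Shuffling before slicing matters: if the data is ordered (for example, by class label), a plain slice would put systematically different examples in each portion.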
Martin Doyle proposed the 80:20 Rule to assess the costs of Jim Barker’s Type I and Type II data quality issues.
Briefly stated, Type I data quality issues are those that come under the completeness, consistency, uniqueness, and validity data quality dimensions and require “know what” to recognize. Data quality software can quickly identify and even fix Type I data quality issues.
Type II data quality issues are “semantic” issues that call for domain expertise and “know-how” to identify and address. These issues are harder to spot: the information appears to be in order and is generally usable, yet the harm such issues do to businesses can be considerably worse.
According to Martin Doyle’s 80:20 Rule, 80% of data quality issues are Type I issues and only 20% are Type II issues. However, tackling Type II issues consumes 80% of the time and money spent on data quality, while Type I issues consume only 20%.
Benefits of the 80/20 Rule
Although there is little scientific investigation that either supports or refutes its validity, a great deal of anecdotal evidence suggests the 80-20 rule is essentially valid, if not mathematically exact.
The 80-20 rule has been successfully implemented by salespeople in a variety of industries, as evidenced by their performance results. Additionally, external consultants who employ Six Sigma and other management techniques have successfully incorporated the 80-20 rule into their procedures.
Machine learning aims to mimic human learning by using data and algorithms to gradually increase a system’s accuracy.
Data science is a fast-expanding area, and machine learning is a key component. Algorithms are trained using statistical techniques to offer classifications or predictions and unearth critical insights in data mining operations.
The decisions made as a result of these insights should ideally have an impact on crucial growth indicators in applications and enterprises.
Data scientists will be in higher demand as big data grows. They will be needed to help identify the most important business questions and the data required to answer them.
Most often, accelerated solution development frameworks like TensorFlow and PyTorch are used to create machine learning algorithms.
How The 80/20 Rule and Machine Learning Change Everything
Consider how a machine learning engineer spends his or her time: studying, creating, and selecting ML projects; planning and evaluating sprints; coding and fixing bugs; and many other activities. From this alone, it is obvious how important it is to follow the Pareto principle.
When splitting your datasets for modelling and training, the general rule is to keep the larger share on the training side: the model must be trained before it can be validated, so the training fraction should exceed the validation fraction.
To generate the validation set for a machine learning model, split the dataset at a ratio of 80% to 20%: 80% of the data is designated for training and 20% for testing. The ratio could instead be 90/10, 70/30, or 60/40, but those ratios are generally less preferable.
How to divide a dataset into training and validation sets using the 80/20 principle?
There are two competing concerns: with less training data, your parameter estimates have higher variance; with less testing data, your performance metric has higher variance.
It is generally recommended to divide the data so that neither variance is excessive; this depends more on the absolute number of instances in each group than on the percentage split.
With only 100 examples in total, no single split will give you estimates with acceptably low variance, so cross-validation is likely your only option.
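A sketch of how k-fold cross-validation puts every one of those 100 examples to work (the function name is illustrative; scikit-learn offers `KFold` for the same purpose):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    fold_size = n // k
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

# On 100 examples, 5 folds give an 80/20 train/validation split per fold,
# and every example is validated on exactly once across the folds.
for train, val in kfold_indices(100, k=5):
    print(len(train), len(val))    # 80 20 on every fold
```

Averaging the score over the five folds gives a far steadier estimate than any single 80/20 split of so small a dataset.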
It doesn’t really matter if you choose an 80:20 split or a 90:10 split if you have 100,000 instances (indeed, you may choose to use less training data if your method is particularly computationally intensive).
The following is a helpful technique to understand variances, assuming you have enough data to perform proper held-out test data (instead of cross-validation):
- Divide your data between training and testing (80/20 is a decent place to start)
- Divide the training data into training and validation (again, an 80/20 split is reasonable)
- Sample random subsets of your training data, train the classifier on them, and then track how well it does on the validation set.
Try a series of runs with different amounts of training data: for example, randomly sample 20% of it ten times, evaluating performance on the validation data each time, then repeat the process with 40%, 60%, and 80%.
With additional data, you should observe both improved performance and decreased variance between the different random samples.
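The steps above can be sketched with a toy stand-in for "training a model": here the fitted parameter is simply the mean of the sampled training values (the article names no specific classifier, so this substitutes for the validation score; the shrinking spread across repeated random subsets is the point):

```python
import random
import statistics

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Step 1: hold out 20% as the test set.
test, rest = data[:200], data[200:]
# Step 2: split the remainder 80/20 into validation and training.
val, train = rest[:160], rest[160:]    # 160 validation, 640 training

def estimate_spread(fraction, repeats=20, seed=7):
    """Std-dev of the fitted estimate across random training subsets."""
    r = random.Random(seed)
    size = int(len(train) * fraction)
    estimates = [statistics.mean(r.sample(train, size)) for _ in range(repeats)]
    return statistics.stdev(estimates)

for fraction in (0.2, 0.4, 0.6, 0.8):
    print(f"{fraction:.0%} of training data: spread {estimate_spread(fraction):.4f}")
```

As the fraction grows, the estimates across the twenty random subsets cluster more tightly, mirroring the decreased variance described above.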
Use the same method in reverse to get a handle on the variance resulting from the size of the test data: train on all of your training data, then evaluate performance on a series of random samples drawn from your validation data.
You should now observe that the mean performance on small samples of your validation data is about the same as the mean performance on all of it, but the variance is significantly higher with fewer test samples.
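A sketch of that reverse direction, using simulated per-example correctness scores for one fixed, fully trained model (the 85% accuracy figure is invented for illustration):

```python
import random
import statistics

# Simulated per-example results on the full validation set:
# 1.0 = correct prediction, 0.0 = miss (850 of 1000 correct, i.e. 85%).
rng = random.Random(1)
val_results = [1.0] * 850 + [0.0] * 150
rng.shuffle(val_results)

def accuracy_over_subsets(size, repeats=20, seed=3):
    """Accuracy measured on `repeats` random subsets of the validation results."""
    r = random.Random(seed)
    return [statistics.mean(r.sample(val_results, size)) for _ in range(repeats)]

for size in (20, 100, 500):
    runs = accuracy_over_subsets(size)
    print(f"n={size}: mean accuracy {statistics.mean(runs):.3f}, "
          f"spread {statistics.stdev(runs):.3f}")
```

The mean accuracy stays close to 85% at every subset size, while the spread between runs shrinks sharply as the test sample grows.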
A cross-validation technique that provides you with the most accurate estimate of your model’s performance is the best course of action.
Also, keep in mind that cross-validation is a method for estimating model performance, not for training your final model. Once you have your estimate, train your model on all of the data (100%) and use that model for deployment.
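A toy sketch of that workflow, using a trivial "model" whose only fitted parameter is the mean of the labels (purely illustrative; any real estimator follows the same pattern):

```python
import statistics

def fit_mean(labels):
    """Toy 'training': the fitted model parameter is just the label mean."""
    return statistics.mean(labels)

labels = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# Cross-validation: hold out one point at a time, purely to ESTIMATE error.
errors = []
for i in range(len(labels)):
    held_out = labels[i]
    model = fit_mean(labels[:i] + labels[i + 1:])
    errors.append(abs(model - held_out))
print(f"cross-validated error estimate: {statistics.mean(errors):.2f}")  # 3.60

# Deployment: discard the fold models and refit on 100% of the data.
final_model = fit_mean(labels)
print(f"deployed model parameter: {final_model}")  # 7.0
```

The fold-by-fold models exist only to produce the error estimate; the model you actually ship is the one refit on the entire dataset.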