In the world of machine learning, the ultimate goal is generalisation. We want our model to learn from its training data so well that it can make accurate predictions on new, unseen data. However, the two most common pitfalls on this journey are underfitting and overfitting.
Getting this balance right is one of the most critical skills in machine learning. This guide will break down exactly what underfitting and overfitting are, why they happen, and the modern techniques you can use to find that “just right” balance for a high-performing model.

The Core Concept: An Analogy for Model Fitting

Imagine you are a student preparing for your final exam. The “training data” is your textbook and the practice problems you’ve been given. The final exam is the “test data”: new questions you’ve never seen before.

  • Underfitting: You only read the chapter titles. You learn the high-level concepts but have no depth. When you take the exam, you fail because you can’t answer any specific questions. Your model is too simple.
  • Overfitting: You memorise every single word and punctuation mark in the textbook. You can answer the practice problems perfectly. But when the final exam asks a slightly different question that requires you to apply a concept, you are lost. Your model is too complex and has memorised the noise, not the signal.
  • Good Fit: You study the textbook to understand the underlying principles and concepts. You work through the practice problems to learn how to apply them. You can now answer both the practice questions and the new exam questions with high accuracy. Your model generalises well.

What is Underfitting? The “Too Simple” Model in ML

Underfitting occurs when a model is too simple to capture the underlying patterns in the training data. It fails to learn the relationships between the input and output variables, resulting in poor performance on both the training data and the test data.

An underfit model has high bias and low variance. “Bias” is the error introduced by approximating a real-world problem, which may be complex, with a model that is too simplistic.

Figure: a straight line trying to fit a complex, curving set of data points.
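To make this concrete, here is a minimal sketch, assuming scikit-learn and synthetic sine-shaped data (both choices are ours, purely for illustration): a straight-line model fitted to a curved relationship scores poorly on the training set and the test set alike.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, non-linear data: y follows a sine curve plus a little noise
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A straight line is too simple for this curve: high bias
linear = LinearRegression().fit(X_train, y_train)
print("Train R^2:", linear.score(X_train, y_train))  # low
print("Test  R^2:", linear.score(X_test, y_test))    # similarly low
```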

Key Causes of Underfitting:

  • Oversimplified Model: Using a linear model (like Linear Regression) for complex, non-linear data.
  • Insufficient Features: The data used for training lacks the key features that would allow the model to detect patterns.
  • Excessive Regularisation: Techniques used to prevent overfitting are too aggressive, overly simplifying the model.
  • Inadequate Training: The model hasn’t been trained for enough epochs (cycles) to learn the patterns.

How to Fix Underfitting:

  1. Increase Model Complexity: Switch from a simple model to a more complex one (e.g., from Linear Regression to Polynomial Regression or a deep neural network); see the sketch after this list.
  2. Add More Features (Feature Engineering): Create new features from the existing data that might have a stronger relationship with the output.
  3. Reduce Regularisation: Decrease the penalty for complexity in your model.
  4. Increase Training Time: Allow the model to train for longer on the data.
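
Here is the sketch referenced above, illustrating points 1 and 2 on the same kind of synthetic sine-shaped data: adding polynomial features gives the model enough flexibility to follow the curve (the degree of 5 is an arbitrary choice for this illustration, not a recommendation).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same kind of synthetic sine-shaped data as in the earlier sketch
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Feature engineering + a more flexible model: polynomial terms let the
# "linear" model bend to follow the curve
poly_model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
poly_model.fit(X_train, y_train)

print("Train R^2:", poly_model.score(X_train, y_train))  # much higher than the straight line
print("Test  R^2:", poly_model.score(X_test, y_test))    # generalises far better
```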

What is Overfitting? The “Too Complex” Model in ML

Overfitting occurs when a model learns the training data too well, capturing not only the underlying patterns but also the noise and random fluctuations. It essentially “memorises” the training set, leading to excellent performance on training data but very poor performance on new, unseen test data.

An overfit model has low bias but high variance. “Variance” is the model’s sensitivity to small fluctuations in the training data. High variance means even a small change in the training set could cause the model to change significantly.

Figure: a squiggly line that passes through every single data point perfectly, including the outliers.
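
As a rough illustration (scikit-learn and synthetic noisy data, both assumptions on our part), the sketch below lets a decision tree grow without any depth limit. It reproduces the training labels almost perfectly but does noticeably worse on held-out data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic data: the tree will happily memorise the noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree can carve out a leaf for every training point
deep_tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_train, y_train)
print("Train R^2:", deep_tree.score(X_train, y_train))  # close to 1.0 (memorised)
print("Test  R^2:", deep_tree.score(X_test, y_test))    # noticeably lower
```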

Key Causes of Overfitting:

  • Overly Complex Model: Using a model with too many parameters or layers, like a very deep decision tree or a neural network with too many neurons.
  • Insufficient Training Data: The model doesn’t have enough data to learn the true underlying patterns, so it starts memorising the few examples it has.
  • Noisy Data: If the training data is full of errors or irrelevant information (noise), the model may learn this noise instead of the signal.

How to Fix Overfitting (Modern Techniques):

  1. Use More Data: The most effective way to combat overfitting. More data provides a clearer signal of the true underlying pattern.
  2. Data Augmentation: If you can’t collect more data, artificially create it. For image data, this involves rotating, flipping, or cropping existing images. This teaches the model to be robust to variations.
  3. Regularisation: This is a core technique that introduces a penalty for model complexity (see the first code sketch after this list). The two most common types are:
    • L1 Regularisation (Lasso): Adds a penalty equal to the absolute value of the magnitude of coefficients. It can shrink some coefficients to zero, effectively performing feature selection.
    • L2 Regularisation (Ridge): Adds a penalty equal to the square of the magnitude of coefficients. It forces weights to be small but rarely zero.
  4. Cross-Validation: Use techniques like k-fold cross-validation, where the data is split into ‘k’ subsets. The model is trained on k-1 folds and validated on the remaining one, rotating through all folds. This gives a more reliable estimate of the model’s performance on unseen data.
  5. Simplify the Model:
    • For Neural Networks: Use Dropout, a technique where a random percentage of neurons are “dropped out” or ignored during each training step. This forces the remaining neurons to learn more robust features (see the second code sketch after this list).
    • For Decision Trees: Use Pruning, which involves removing branches that have low importance. This reduces complexity and helps the tree generalise better.
  6. Early Stopping: Monitor the model’s performance on a validation set during training and stop the training process when performance on the validation set begins to degrade, even if performance on the training set is still improving.
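
Points 3 and 4 can be combined in practice. Below is a minimal sketch, assuming scikit-learn and a synthetic regression problem (both assumptions, chosen purely for illustration), that compares an unregularised linear model with L2-regularised Ridge and L1-regularised Lasso, each scored with 5-fold cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: many features, only a few truly informative
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

models = {
    "No regularisation": LinearRegression(),
    "L2 (Ridge)": Ridge(alpha=1.0),
    "L1 (Lasso)": Lasso(alpha=1.0),
}

# 5-fold cross-validation gives a more reliable estimate of generalisation
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

With 50 features and only 100 samples, the unregularised model tends to overfit, and the penalised models usually cross-validate better; the alpha values here are illustrative, not tuned.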
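For points 5 and 6, here is a rough Keras sketch (the dataset, layer sizes, and dropout rate are assumptions chosen purely for illustration) that combines Dropout layers with an EarlyStopping callback watching the validation loss.

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic binary-classification dataset, just for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly ignore 50% of these neurons each step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once validation loss stops improving, and keep the best weights seen
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(X, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```

Dropout is only active during training; at prediction time Keras disables it automatically, so no extra code is needed there.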

The Bias-Variance Tradeoff: Finding the Sweet Spot

The concepts of underfitting and overfitting are governed by the bias-variance tradeoff. This is the fundamental challenge of machine learning.

  • High Bias (Underfitting): The model is simple and makes strong assumptions, leading to high errors on both training and test data.
  • High Variance (Overfitting): The model is complex and highly sensitive to the training data, leading to low training error but high test error.

The Goal: Find the optimal balance. A good model has enough complexity to capture the underlying patterns (low bias) but is not so complex that it memorises the noise (low variance). This is the point where the error on the test set is at its minimum.
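
One way to see this sweet spot empirically is a validation curve. The sketch below is an assumption-laden illustration (scikit-learn, synthetic sine-shaped data, and decision-tree depth as the complexity knob): the training score keeps rising with complexity, while the cross-validated score peaks and then falls.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, as in the earlier sketches
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

depths = range(1, 15)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train R^2={tr:.2f}  validation R^2={va:.2f}")
# The training score keeps climbing with depth, but the validation score
# peaks and then falls: that peak is the bias-variance sweet spot.
```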

Figure: the bias-variance tradeoff.

Summary: A Quick Reference Table

| Feature | Underfitting | Overfitting | Good Fit |
| --- | --- | --- | --- |
| Performance | Poor on train & test | Great on train, poor on test | Great on train & test |
| Model Complexity | Too simple | Too complex | Balanced |
| Bias | High | Low | Low |
| Variance | Low | High | Low |
| Analogy | Knows only chapter titles | Memorised the whole book | Understands the concepts |
| Primary Fix | Increase complexity/features | Add more data / regularise | You’re there! |

By understanding this crucial balance, you can diagnose your models effectively and apply the right techniques to build robust, accurate, and truly intelligent systems.

Why Choose Aya Data for Your ML Needs?

High-quality, representative data is foundational to avoiding both overfitting and underfitting. At Aya Data, we offer end-to-end services: data acquisition, annotation for text, images, and 3D data, and custom ML model development. Our partnerships provide domain-specific datasets for healthcare, agriculture, and geospatial applications.

Contact us to discuss how we can optimise your ML pipelines for 2025’s demands. For more insights, explore our blog on AI training data.