Common Challenges in ML Still Exist
While Large Language Models (LLMs) have captured all the attention with their recent improvements and enchanted us with their promising capabilities, the fundamental challenges of ML are still there, and they are worth revisiting. In this article, we dive into the challenges faced by all of us working on data science projects, inspired by the book Machine Learning Design Patterns. As tempting as it is to impress your C-level stakeholders with the latest AI gadgets at the next company event, you might want to address the following (less sexy) challenges with them first:
Data quality
Reproducibility
Data drift
Scalability
Unaligned objectives
Data quality
To prevent “garbage in, garbage out”, it’s important to ensure data accuracy, completeness, consistency, and timeliness.
Accurate
Accurate data means that the data holds the correct values for the real-world entities it represents. ML models are highly impacted by the data they are trained on: more accurate and reliable data leads to a more reliable ML model. Make sure that the ground-truth labels actually correspond to the features.
Complete
Data completeness means that the dataset includes all the necessary and expected information, without missing any class or relevant data points. Incomplete data might lead to unreliable and biased predictions. For example, suppose you are implementing a model to detect and classify the type of a vehicle, and you have two labels in your dataset: hatchback and sedan. But end users are also expected to upload images of SUVs, which are not included in the training set, so your model would not identify them correctly. Make sure that the training data covers the various data points of each label; in this example, that would be photos taken from all possible angles of hatchback, sedan, and SUV cars.
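One simple guardrail is to check label coverage before training. The sketch below, with a hypothetical vehicle dataset, flags expected classes that never appear in the training labels (class names and the check itself are illustrative, not from the book):

```python
from collections import Counter

# Hypothetical set of classes the deployed model must handle.
EXPECTED_CLASSES = {"hatchback", "sedan", "suv"}

def check_label_coverage(labels, expected=EXPECTED_CLASSES):
    """Return the expected classes missing from the labels, plus per-class counts."""
    counts = Counter(labels)
    missing = expected - counts.keys()
    return missing, counts

# Training labels with no SUV examples at all:
labels = ["hatchback", "sedan", "hatchback", "sedan"]
missing, counts = check_label_coverage(labels)
# 'suv' never appears, so the model cannot learn to recognise it.
```

In practice you would also check per-class counts against a minimum, since a class with only a handful of examples is nearly as problematic as a missing one.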
Consistent
Consistent data means that data is collected and labeled by following a standard process, without any biases involved. For example, suppose you are creating a dataset to classify topics of customer reviews, and the labels are “product quality”, “price and value”, “discounts and promotions”, “shipping & delivery”, and “customer experience”. Annotators might disagree when assigning labels: some might file the review “it’s a great website to shop, I received my order in 1 day” under “customer experience”, while others would pick “shipping & delivery”.
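Annotator disagreement like this can be quantified before training. A common measure is Cohen's kappa, which corrects raw agreement for chance; a minimal pure-Python version (the review-labeling example is illustrative) might look like this:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators' labels, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["customer experience", "shipping & delivery", "price and value"]
annotator_2 = ["shipping & delivery", "shipping & delivery", "price and value"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A low kappa on a labeling pilot is a signal to tighten the annotation guidelines before scaling up the dataset.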
Timely
Timeliness means the data is up to date and represents the current state of the information. Outdated data can lead to wrong predictions and incorrect insights. For example, in a customer database, the purchase history should include the most recent updates in order to provide relevant product recommendations.
Reproducibility
Reproducibility in ML is the ability to recreate the same results from the same ML experiment. This is different from traditional programming. For instance, a Python function that calculates the sum of two numbers will always return the same output for the inputs 2 and 3, namely 5. ML algorithms, however, involve randomness and data, which makes exact recreation difficult to guarantee.
It’s important to implement reproducible ML models for several reasons: validation and verification, comparison and evaluation, detecting errors and bugs, etc.
To achieve reproducibility, the following components should be kept the same:
Code and algorithm
Training data
Randomness configuration (for example the seed value used in the model for randomness should be set)
Environment and dependencies (libraries, requirements for both training and serving environments should be clearly defined)
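The randomness-configuration point above can be sketched as a single helper that pins the common sources of randomness in a Python project (the seed value 42 is arbitrary; framework-specific seeds such as torch.manual_seed would be added the same way):

```python
import os
import random

import numpy as np

SEED = 42  # arbitrary, but fixed and recorded with the experiment

def set_seed(seed=SEED):
    """Pin the common sources of randomness so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    # Note: to affect Python's string hashing, PYTHONHASHSEED must be set
    # before the interpreter starts; setting it here only documents intent.
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed()
a = np.random.rand(3)
set_seed()
b = np.random.rand(3)
# With the same seed, both draws produce identical values.
```

Pinning the seed covers the randomness component; the training data, code, and dependency versions still have to be versioned separately (for example via data snapshots and a lock file).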
Data drift
Data drift means a deviation between the statistical properties of the training data and the data used for inference. It leads to a mismatch between training and inference data, which impacts the reliability of models in production. Let’s say you are implementing a recommender system for e-commerce customers using historical data. Over time, new product categories are added and customer behavior changes. The model trained on old data no longer represents the current customer behavior.
Tips to prevent this kind of drift:
Monitor the input data, statistical properties, distributions
Retrain the ML model regularly so it adapts to the current data distribution
Create robust features that are less dependent on raw values and instead capture the underlying pattern.
Build a feedback loop from end users to get insights on the model performance.
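The first tip, monitoring the statistical properties of input data, can be made concrete with a drift score. One widely used choice is the Population Stability Index (PSI), sketched below for a single numeric feature (the bin count and thresholds are conventional rules of thumb, not from the book):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample (expected) and a serving sample (actual).

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift.
    """
    # Bin edges come from the training data, then both samples are bucketed.
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) in empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

Running this per feature on a schedule, and alerting when the score crosses a threshold, is a lightweight way to catch drift before model quality visibly degrades.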
Scalability
Scaling in ML means adapting the current solution to changes along several dimensions: an increase in data volume, an increase in model complexity, or an increase in demand on the underlying serving resources. Often ML engineers are expected to address these challenges and decide what is needed for scaling up.
Distributed storage systems and efficient preprocessing techniques can be used for handling large datasets.
When model complexity increases (for instance, you start with a collaborative filtering algorithm for your recommendation system and later, as the data grows, switch to a deep learning model), distributed computing resources such as GPUs or distributed clusters might be needed to accelerate training and inference.
To handle increased demand for real-time predictions, say scaling your real-time recommendation model from 100K customers to 1 million, common techniques are scalable infrastructure (cloud-based or serverless services), load balancing to distribute incoming inference requests, and caching frequently requested data.
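The caching technique mentioned above can be sketched with Python's built-in functools.lru_cache; the model call here is a hypothetical stand-in for real inference:

```python
from functools import lru_cache

def run_model(customer_id: int) -> list[int]:
    """Hypothetical stand-in for an expensive model inference call."""
    return [customer_id % 7, customer_id % 11]

@lru_cache(maxsize=100_000)
def recommend(customer_id: int) -> tuple[int, ...]:
    """Serve repeated requests for the same customer from an in-process cache."""
    # lru_cache requires hashable return values, hence the tuple.
    return tuple(run_model(customer_id))

recommend(42)  # first request: model is actually invoked
recommend(42)  # repeat request: served from the cache
```

In a multi-instance deployment the same idea is usually implemented with a shared cache such as Redis, with a TTL so recommendations do not go stale.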
Unaligned objectives
In a data science project, it is important to align objectives across all the teams involved. What data scientists are aiming to achieve may not be the same as what business stakeholders are expecting to get. For example, suppose you are building a classifier model to identify customer churn (0: not likely to churn, 1: likely to churn) and give relevant promotions to those who are likely to churn. As a data scientist, your goal would be to achieve the highest F1-score, whereas the business does not want risks and thus will not want to give an incentive to all customers who are likely to churn. That would be casting too big of a net, possibly leading to huge incentive costs and eventually a low ROI. So the best-performing model from a data science perspective should not be the objective here in itself. Some additional business rules or even a custom model evaluation metric might be needed to achieve the business goals.
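As a sketch of what such a custom, business-aware evaluation could look like: instead of maximising F1, pick the probability threshold that maximises expected campaign profit. The cost, value, and save-rate figures below are invented for illustration and would come from the business in practice:

```python
# Assumed business figures (illustrative, not real):
INCENTIVE_COST = 10.0   # cost of sending one promotion
RETAINED_VALUE = 120.0  # value of keeping one churning customer
SAVE_RATE = 0.3         # assumed fraction of targeted churners actually retained

def expected_profit(y_true, y_prob, threshold):
    """Expected profit of targeting every customer scored above the threshold."""
    profit = 0.0
    for truth, prob in zip(y_true, y_prob):
        if prob >= threshold:            # we send this customer an incentive
            profit -= INCENTIVE_COST
            if truth == 1:               # a real churner we have a chance to retain
                profit += SAVE_RATE * RETAINED_VALUE
    return profit

def best_threshold(y_true, y_prob):
    """Choose the score cutoff that maximises expected profit, not F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: expected_profit(y_true, y_prob, t))
```

With this framing, the threshold (and even the choice between two models) is driven by ROI, which is exactly the alignment the business stakeholders are asking for.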
To avoid misalignments later in the process, make sure to establish a clear project goal with the stakeholders before starting the development, and identify the key deliverables and KPIs.