Everyday Challenges in Machine Learning

Saniya Parveez · Published in CodeX · Mar 3, 2022

Introduction

Machine learning is the process of building models that learn from data. It comprises a set of algorithms that are applied to data. A traditional, rule-based approach can also solve problems with a minimal set of variables by applying explicit rules, but it becomes complicated as the number of variables increases. Machine learning models, regardless of how they are represented visually, are mathematical functions and can consequently be implemented from scratch using a numerical software package.
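As a minimal illustration of that last point (a sketch on made-up synthetic data, not code from the article), a linear model y = w·x + b can be trained from scratch with nothing but NumPy:

```python
# A minimal sketch: a linear model is just a mathematical function y = w*x + b,
# trainable from scratch with NumPy and plain gradient descent.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)   # synthetic data

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_pred = w * x + b
    grad_w = ((y_pred - y) * x).mean()   # gradient of the squared-error loss (up to a constant)
    grad_b = (y_pred - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")   # close to the true 3.0 and 2.0
```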

Figure 1: Different types of machine learning

Machine learning continues to become more accessible, and one interesting development is the broad availability of machine learning models. Data is at the heart of any machine learning problem: it is used for the training, validation, and testing of models. Performance reports for a machine learning model need to be calculated on independent test data, rather than on the training or validation sets. The data should also be split in such a way that all three datasets (training, validation, and test) have similar statistical characteristics.
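As a quick illustration (a sketch using scikit-learn, which is an assumption rather than something the article prescribes), a three-way split can be kept statistically similar for classification data by stratifying on the label:

```python
# Sketch: split a labeled dataset into train/validation/test sets whose label
# distributions are similar, by stratifying on y (assumes a classification task).
from sklearn.model_selection import train_test_split

def three_way_split(X, y, test_size=0.2, val_size=0.2, seed=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    # carve the validation set out of the remaining training data
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size, stratify=y_train, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Stratification keeps the label proportions comparable across the three splits; for regression data, comparing summary statistics of the splits is a reasonable substitute.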

The first step in a standard machine learning workflow is training: the process of passing training data to a model so that it can learn to identify patterns. After training, the next step is testing how the model performs on data outside of the training set. This is known as model evaluation. You might run training and evaluation multiple times, performing additional feature engineering and tweaking the model architecture. Once the model's performance is good during evaluation, the model is served so that others can access it to make predictions.

Figure 2: Machine learning model development process
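A hedged Keras sketch of that train, evaluate, and serve loop; the dataset variables (x_train, y_train, x_test, y_test) and layer sizes are placeholders, not taken from the article:

```python
# Sketch of the workflow in Figure 2: train, evaluate, then export for serving.
# Assumes a tabular dataset with 10 numeric features and a binary label.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, validation_split=0.2, epochs=10)   # training
loss, acc = model.evaluate(x_test, y_test)                     # model evaluation
model.save("served_model")   # export the model so it can be served for predictions
```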

As a data scientist, it is important to translate the product team's needs into the context of a model, for example by stating that false negatives are five times more costly than false positives. To satisfy this, the model should be optimized for recall over precision when it is designed. It is also important to find a balance between this product goal and the goal of minimizing the model's loss.
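One way to act on such a requirement, sketched here with scikit-learn as an assumption rather than the article's own code, is to lower the decision threshold and inspect the resulting precision/recall trade-off:

```python
# Sketch: favor recall over precision by lowering the decision threshold.
# `model` is any fitted binary classifier exposing predict_proba.
from sklearn.metrics import precision_score, recall_score

def evaluate_at_threshold(model, X_val, y_val, threshold=0.3):
    probs = model.predict_proba(X_val)[:, 1]
    preds = (probs >= threshold).astype(int)
    return {
        "threshold": threshold,
        "precision": precision_score(y_val, preds),
        "recall": recall_score(y_val, preds),
    }

# Lowering the threshold (e.g. 0.3 instead of the default 0.5) flags more
# positives, trading some precision for higher recall.
```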

Tools for the Data and Model

Various products are available that provide tooling for solving data and machine learning problems. Below are a few tools:

BigQuery

It is an enterprise data warehouse designed for analyzing large datasets quickly with SQL. Data in BigQuery is organized by Datasets, and a Dataset can have multiple Tables.

Figure: BigQuery
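For example (a sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders), a SQL query can be run like this:

```python
# Sketch: run a SQL query against BigQuery from Python.
# `my_project.my_dataset.my_table` is a hypothetical table, not a real one.
from google.cloud import bigquery

client = bigquery.Client()   # uses your default project and credentials
query = """
    SELECT column_a, COUNT(*) AS n
    FROM `my_project.my_dataset.my_table`
    GROUP BY column_a
    ORDER BY n DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.column_a, row.n)
```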

BigQuery ML

BigQuery ML is a tool for building models from data stored in BigQuery. With BigQuery ML, we can train, evaluate, and generate predictions on our models using SQL. It supports classification and regression models, along with unsupervised clustering models. It’s also possible to import previously trained TensorFlow models to BigQuery ML for prediction.

Figure: BigQuery ML
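As a hedged illustration (all dataset, table, and column names below are hypothetical), a BigQuery ML model can be trained and used for prediction entirely in SQL, submitted here through the same Python client:

```python
# Sketch: train a logistic regression model in BigQuery ML, then predict with it.
from google.cloud import bigquery

client = bigquery.Client()

# Train the model with a CREATE MODEL statement.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT age, plan_type, monthly_spend, churned
    FROM `my_dataset.customers`
""").result()   # wait for training to finish

# Generate predictions with ML.PREDICT.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                    (SELECT age, plan_type, monthly_spend
                     FROM `my_dataset.new_customers`))
""").result()
for row in rows:
    print(dict(row))
```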

Challenges in Machine Learning

The process of building ML systems presents many different challenges that influence ML architecture design. Identifying these challenges early makes it possible to address or eliminate them.

Below are a few important challenges in machine learning:

Data Quality

A machine learning model is only reliable if it is well trained and generalizes; it should be neither overfitted nor underfitted. Data is a very important factor in the reliability of any model. If a model is trained on a deficient dataset, on data with badly selected features, or on data that doesn't accurately represent the population using the model, the model's predictions will be a direct reflection of that data. Data should be of high quality, and its quality should be assessed in terms of accuracy, completeness, consistency, and timeliness.

Data Accuracy

Data accuracy refers to both the training data's features and the ground truth labels corresponding to those features. As noted above, if a machine learning model is trained on a deficient dataset, on data with inadequately selected features, or on data that doesn't accurately represent the population using the model, the model's predictions will be a direct reflection of that data, and the model will end up either overfitted or underfitted.

Figure 3: Underfitting and overfitting

Duplicates in the training dataset, for example, can cause the ML model to inaccurately assign more weight to these data points.

Operations that help maintain data quality (a pandas sketch follows the list):

  • Understand where the data came from and any potential errors in the data collection steps to help ensure feature accuracy.
  • Screen for typos.
  • Identify duplicate entries.
  • Measure inconsistencies in tabular data.
  • Analyze missing features.
  • Identify any other errors that may affect data quality.
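A minimal pandas sketch of some of these checks; the DataFrame df and its columns are whatever your dataset provides:

```python
# Sketch: basic data-quality checks on a pandas DataFrame `df`.
import pandas as pd

def quality_report(df: pd.DataFrame) -> None:
    print("duplicate rows:", df.duplicated().sum())   # duplicate entries
    print("missing values per column:")
    print(df.isna().sum())                            # missing features
    # crude typo screen for text columns: very rare spellings stand out
    for col in df.select_dtypes(include="object"):
        counts = df[col].value_counts()
        print(f"rare values in {col}:", counts[counts < 5].index.tolist())
```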

Accurate data labels are just as important as feature accuracy. The model relies solely on the ground truth labels in the training data to update its weights and minimize loss, so incorrectly labeled training examples can produce misleading model accuracy.

Example:

Let's say you are developing a sentiment analysis model and 25% of your "positive" training examples have been incorrectly labeled as "negative." Your model will have an inaccurate picture of what should count as negative sentiment, and this will be directly reflected in its predictions.

Data Completeness

It is easy to understand data completeness by taking an example.

Figure 4: Incomplete data

Let’s take an example of a model that is being trained to identify cat breeds.

You train the model on a large dataset of cat images, and the resulting model is able to classify images into 1 of 10 possible cat breeds (Bengal, Siamese, etc.) with 99% accuracy.

Now you deploy this model to production and find that, in addition to uploading cat photos for classification, many of your users are uploading photos of dogs and are frustrated with the model's results.

This happens because the model was trained only to identify 10 distinct cat breeds. No matter what you feed the model, you can expect it to slot the input into one of these 10 categories, and it may even do so with high confidence for an image that looks nothing like a cat. There is no way for the model to say "not a cat" if that data and label weren't included in the training dataset.

An important aspect of data completeness is ensuring that the training data contains a diverse representation of each label. For example, if you are developing a model to predict the price of real estate in a particular city but your training examples only cover houses larger than 3,000 square feet, the resulting model will perform poorly on smaller houses. A simple coverage check is sketched below.
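A small, hedged sketch of checking label and feature coverage before training; the column names are placeholders for whatever your dataset uses:

```python
# Sketch: check label coverage and numeric feature ranges before training, so
# gaps like "no houses under 3,000 sq ft" or a missing "not a cat" class are
# caught early.
import pandas as pd

def coverage_report(df: pd.DataFrame, label_col: str) -> None:
    print("examples per label:")
    print(df[label_col].value_counts())
    print("numeric feature ranges:")
    print(df.select_dtypes("number").agg(["min", "max"]))
```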

Data Consistency

Data inconsistencies can be observed in both data features and labels, and there should be standards in place to help ensure consistency across datasets. Consider the following example.

Let's say a government agency is collecting atmospheric data from temperature sensors. If each sensor has been calibrated to a different standard, this will lead to inaccurate and misleading model predictions. Typical inconsistencies in such data include (a normalization sketch follows the figure below):

  • Differences in measurement units, such as miles versus kilometers.
  • Inconsistent location data, where some people write out a full street name as "Main Street" and others abbreviate it as "Main St."

Figure 5: Data inconsistency
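A hedged sketch of normalizing such inconsistencies before training; the column names, unit values, and abbreviation map are assumptions:

```python
# Sketch: normalize inconsistent units and street abbreviations in a DataFrame.
import pandas as pd

MILES_TO_KM = 1.60934
ABBREVIATIONS = {" St.": " Street", " Ave.": " Avenue"}   # illustrative map

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # convert rows recorded in miles to kilometers
    in_miles = df["distance_unit"] == "miles"
    df.loc[in_miles, "distance"] *= MILES_TO_KM
    df.loc[in_miles, "distance_unit"] = "km"
    # expand street abbreviations so "Main St." and "Main Street" match
    for abbr, full in ABBREVIATIONS.items():
        df["address"] = df["address"].str.replace(abbr, full, regex=False)
    return df
```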

Data Timeliness

Timeliness in data refers to the latency between when an event occurred and when it was added to the database.

For example, in a dataset capturing credit card transactions, it might take one day from when a transaction occurred until it is reported in the system. To deal with timeliness, it is helpful to record as much information as possible about a particular data point and make sure that information is reflected when you transform your data into features for a machine learning model.

Figure 6: Timeliness of data
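A short sketch of measuring that latency, assuming the dataset records both an event timestamp and an ingestion timestamp (the column names are hypothetical):

```python
# Sketch: measure the lag between when an event happened and when it was
# recorded, assuming the dataset keeps both timestamps.
import pandas as pd

def ingestion_lag(df: pd.DataFrame) -> pd.Series:
    event = pd.to_datetime(df["transaction_time"])
    recorded = pd.to_datetime(df["ingestion_time"])
    lag = recorded - event
    print("median ingestion lag:", lag.median())
    return lag
```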

Data Reproducibility

Machine learning models have an inherent element of randomness. When training starts, model weights are initialized to random values, and these weights then converge during training as the model iterates and learns from the data. Because of this, the same model code given the same training data can produce different results across training runs. This introduces a challenge of reproducibility: if you train a model to 98.1% accuracy, a repeated training run is not guaranteed to reach the same result, which can make it hard to run comparisons across experiments.

To address this problem of repeatability, it is common to set the random seed value used by the model, which guarantees that the same randomness is applied each time training is run.
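A minimal sketch of setting those seeds, assuming a TensorFlow/Keras stack (an assumption, since the article mentions TensorFlow elsewhere):

```python
# Sketch: pin the sources of randomness so repeated training runs match.
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy (shuffling, initialization helpers)
tf.random.set_seed(SEED)  # TensorFlow weight initialization and ops
```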

Beyond the seed, the following aspects of training an ML model also need to be fixed to ensure reproducibility:

  • The data used
  • The splitting mechanism used to generate datasets for training and validation
  • Data preparation and model hyperparameters
  • Training variables such as the batch size
  • Learning rate schedule

Data Drift

Machine learning models typically represent a static relationship between inputs and outputs, but data can change significantly over time. Data drift creates the challenge of ensuring that machine learning models stay relevant and that model predictions are an accurate representation of the environment in which they are being used.

Example:

Suppose a model is trained to classify news article headlines into categories like "politics," "business," and "technology." If you train and evaluate your model on historical news articles from the 20th century, it likely won't perform as well on current data. Today, it is known that an article with the word "smartphone" in the headline is probably about technology, but a model trained on historical data would not know this word. This is data drift.

Ways to address data drift (a drift-detection sketch follows the figure below):

  • Continually update your training dataset
  • Retrain model
  • Modify the weight of the model assigned to particular groups of input data

Figure 7: Model with data drift
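One hedged way to detect drift, sketched with a two-sample Kolmogorov-Smirnov test on a single numeric feature (an approach chosen for illustration, not taken from the article):

```python
# Sketch: flag drift in a numeric feature by comparing its distribution at
# training time with the distribution seen in production.
from scipy.stats import ks_2samp

def has_drifted(train_values, production_values, p_threshold=0.01) -> bool:
    statistic, p_value = ks_2samp(train_values, production_values)
    return p_value < p_threshold   # small p-value: the distributions differ
```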

Scale

When ingesting and preparing data for a machine learning model, the size of the dataset will dictate the tooling required for your solution. It is frequently the job of data engineers to build out data pipelines that can scale to handle datasets with millions of rows.

For model training, ML engineers are accountable for managing the necessary infrastructure for a specific training job. Depending on the type and size of the dataset, model training can be time-consuming and computationally expensive, requiring infrastructure (like GPUs) designed specifically for ML workloads. Image models, for example, typically require much more training infrastructure than models trained entirely on tabular data.

Lack of feature scaling also influences the efficacy of L1 or L2 regularization, because the magnitude of the weights for a feature depends on the magnitude of that feature's values, and so different features will be affected differently by regularization. By scaling all features to lie within [-1, 1], we ensure that there is not much of a difference in the relative magnitudes of different features.

Developers and ML engineers are typically accountable for handling the scaling challenges associated with model deployment and serving prediction requests.

Scaling can be further categorized as follows (a sketch of both follows the list):

  • Linear Scaling
  • Non-linear Transformation
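A short sketch of both categories; which features to scale linearly and which to log-transform is an assumption that depends on your data:

```python
# Sketch: linear scaling of a feature to [-1, 1] and a non-linear (log)
# transformation for a heavily skewed feature.
import numpy as np

def linear_scale(x: np.ndarray) -> np.ndarray:
    x_min, x_max = x.min(), x.max()
    return 2 * (x - x_min) / (x_max - x_min) - 1   # maps values to [-1, 1]

def log_transform(x: np.ndarray) -> np.ndarray:
    return np.log1p(x)                             # compresses long tails
```

The minimum and maximum used for linear scaling should be computed on the training set only and then reused for the validation, test, and serving data to avoid leakage.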

Summary

Designing, building, and deploying machine learning systems are important steps. Building production machine learning models is increasingly an engineering discipline, taking advantage of ML methods that have been proven in research environments and applying them to business problems. As machine learning becomes more mainstream, practitioners should take advantage of tried-and-proven methods to address recurring problems. We are lucky to work with the TensorFlow, Keras, BigQuery ML, TPU, and Cloud AI Platform teams that are driving the democratization of machine learning research and infrastructure.

Once you have collected your dataset and determined the features for your model, data validation is the process of computing statistics on your data, understanding your schema, and evaluating the dataset to identify problems like drift and training-serving skew. At the core of any machine learning model is a mathematical function that is defined to work on particular types of data only, yet real-world machine learning models need to run on data that may not be directly pluggable into that mathematical function. Most modern, large-scale machine learning models (random forests, support vector machines, neural networks, etc.) work on numerical values, so if our input is numeric, we can pass it through to the model unchanged. Another reason it is important to scale features is that some machine learning algorithms and techniques are very sensitive to the relative magnitudes of the different features. For example, a k-means clustering algorithm that uses Euclidean distance as its closeness measure will end up relying heavily on features with larger magnitudes.
