Machine learning offers a powerful toolkit for building complex prediction systems quickly.
This article summarizes “Hidden Technical Debt in Machine Learning Systems”, a research paper authored by Google researchers and published in December 2015.
As Machine Learning gains traction and practitioners gain experience with live systems, new issues are emerging.
Developing and deploying Machine Learning systems is relatively fast and cost-effective, but maintaining them over time is difficult and expensive.
To reason about the long-term cost of quick implementations in Machine Learning, we can borrow the notion of technical debt from software engineering.
Machine Learning systems have the maintenance problems of traditional code plus their own set of specific issues, so they are especially prone to incurring technical debt.
If you like this content, you will never miss any new article by subscribing to my free Private Email Newsletter.
Sources Of Technical Debts in Machine Learning Systems
Due to the complexity of the models at hand, different problems can often emerge and lead to technical debts in Machine Learning systems.
For instance, entanglement can be a source of technical debt in Machine Learning systems.
Machine learning systems often mix signals, entangling them and making isolation of improvements impossible. Because no feature is ever totally independent, removing, adding, or modifying one feature can influence how the model uses all the others. This is known as the CACE principle: Changing Anything Changes Everything.
One possible strategy to reduce the severity of this issue is to isolate models and serve ensembles. Note, however, that improving one component model can make the whole system worse if the remaining errors become more strongly correlated with those of the other components.
A second strategy is to focus on detecting changes in prediction behavior as they occur, which makes constant monitoring important for catching concept and data drift early.
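Correction Cascades are another source of technical debt. They arise when, instead of building a new model for a slightly different problem, a model is trained to take an existing model's output as input and learn a small correction on top of it. While correction cascades may seem like a quick way to reduce errors on individual components, they create dependencies between models that make the Machine Learning system as a whole harder to improve. One remedy is to make the original model learn the corrections directly by adding features that distinguish the different cases.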
Undeclared Consumers can also incur technical debts. This situation occurs when a prediction from a machine learning model is made widely accessible, either at runtime or by writing to files or logs. In this case, outputs of the systems may later be consumed by other ones. Without access controls, some of these consumers may be undeclared, silently using the output of a given model as an input to another system. In more classical software engineering, these issues are referred to as visibility debt.
Undeclared consumers are expensive because they silently couple the model to other parts of the stack: a change to the model can affect them in unintended ways, and because they remain hidden, the impact is difficult to detect.
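One way to keep consumers declared is to put an access check in front of the prediction output. The sketch below is a minimal illustration under that assumption, not a production design; `PredictionGateway`, `register_consumer`, and the consumer names are hypothetical.

```python
from typing import Callable, Dict, List


class UndeclaredConsumerError(PermissionError):
    """Raised when an unregistered system asks for predictions."""


class PredictionGateway:
    """Serves predictions only to consumers that declared themselves first."""

    def __init__(self, model: Callable[[List[float]], float]) -> None:
        self._model = model
        # consumer name -> number of predictions served, which helps when
        # assessing who is affected by a model change
        self._usage: Dict[str, int] = {}

    def register_consumer(self, name: str) -> None:
        self._usage.setdefault(name, 0)

    def predict(self, consumer: str, features: List[float]) -> float:
        if consumer not in self._usage:
            raise UndeclaredConsumerError(f"{consumer!r} is not a declared consumer")
        self._usage[consumer] += 1
        return self._model(features)

    def declared_consumers(self) -> List[str]:
        return sorted(self._usage)


# Hypothetical model: the mean of the input features.
gateway = PredictionGateway(lambda xs: sum(xs) / len(xs))
gateway.register_consumer("ranking-service")
score = gateway.predict("ranking-service", [0.2, 0.4])
```

With a gateway like this, the list of declared consumers doubles as the list of systems to notify before a model change.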
The Cost Of Data Dependencies in Machine Learning Systems
Similar to dependency debts in Software Engineering, in Machine Learning, data dependencies can be responsible for building technical debt.
Data dependencies can be unstable because input signals change over time, reflecting the dynamic nature of Machine Learning systems in the real world. When an input signal changes, for better or worse, the model can suffer because it was fitted to the signal's previous behavior. Even an improvement to the signal, such as a recalibration, can lead to poorer model performance.
One common way to mitigate the issue of unstable data dependencies is to create a versioned copy of a given signal. Note, however, that versioning comes at the cost of maintaining multiple versions of the same signal over time.
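A versioned signal can be sketched as a registry in which a published version is frozen and consumers pin the version they were trained on. The class and signal names below are hypothetical and only illustrate the idea.

```python
from typing import Callable, Dict, Tuple

# A signal is a function that computes a value from a raw record.
SignalFn = Callable[[dict], float]


class SignalRegistry:
    def __init__(self) -> None:
        # (signal name, version) -> implementation
        self._signals: Dict[Tuple[str, int], SignalFn] = {}

    def publish(self, name: str, version: int, fn: SignalFn) -> None:
        key = (name, version)
        if key in self._signals:
            # A published version is frozen; changes require a new version.
            raise ValueError(f"{name} v{version} is already published")
        self._signals[key] = fn

    def get(self, name: str, version: int) -> SignalFn:
        return self._signals[(name, version)]


registry = SignalRegistry()
registry.publish("word_count", 1, lambda rec: float(len(rec["text"].split(" "))))
# v2 changes the tokenization; consumers pinned to v1 are unaffected.
registry.publish("word_count", 2, lambda rec: float(len(rec["text"].split())))

record = {"text": "hello  world"}  # note the double space
v1 = registry.get("word_count", 1)(record)
v2 = registry.get("word_count", 2)(record)
```

Here the two versions disagree on the same record, which is exactly the kind of silent change a pinned version protects a model from.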
Another issue arises from underutilized data dependencies: input signals that provide little incremental modeling benefit. They make a model unnecessarily vulnerable to change.
Underutilized data dependencies can creep into a model in several ways.
The most common scenario is when a feature is included in a model early in its development. Over time, the feature is made redundant by new ones, but this goes undetected and penalizes the whole Machine Learning system.
Sometimes a group of features is evaluated and found to be beneficial. Because of deadline pressures or other issues arising during product development, all the features in the bundle are then added to the model together, possibly including features that add little or no value.
Machine learning practitioners are also often tempted to add features that improve model accuracy only marginally, even when they add considerable complexity. In a production setting, this can be damaging and reduce the value of the project.
In a case where two features are strongly correlated, one is often more directly causal. Many Machine Learning methods have difficulty detecting this and equally credit the two features or pick the non-causal one. This can break the model if the correlations change later due to the dynamic nature of the data.
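The original paper suggests detecting underutilized dependencies with leave-one-feature-out evaluations. The sketch below illustrates the idea with a toy nearest-centroid classifier that is scored on its own training rows for brevity; a real evaluation would use held-out data and the production model. All function names and the data are assumptions made for this example.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]


def fit_centroids(rows: Sequence[Row], labels: Sequence[int],
                  features: Sequence[str]) -> Dict[int, List[float]]:
    """Toy model: one centroid per class over the selected features."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, name in enumerate(features):
            acc[i] += row[name]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def accuracy(rows: Sequence[Row], labels: Sequence[int],
             features: Sequence[str]) -> float:
    centroids = fit_centroids(rows, labels, features)
    correct = 0
    for row, label in zip(rows, labels):
        distances = {
            cls: sum((row[f] - c[i]) ** 2 for i, f in enumerate(features))
            for cls, c in centroids.items()
        }
        predicted = min(distances, key=distances.get)  # ties go to the first class
        correct += int(predicted == label)
    return correct / len(rows)


def leave_one_feature_out(rows: Sequence[Row], labels: Sequence[int],
                          features: Sequence[str]) -> Dict[str, float]:
    """Accuracy lost when each feature is removed in turn."""
    baseline = accuracy(rows, labels, features)
    return {
        dropped: baseline - accuracy(rows, labels, [f for f in features if f != dropped])
        for dropped in features
    }


# "legacy" is constant, so it carries no information about the label.
rows = [{"signal": s, "legacy": 0.5} for s in (0.0, 0.1, 0.2, 0.8, 0.9, 1.0)]
labels = [0, 0, 0, 1, 1, 1]
impact = leave_one_feature_out(rows, labels, ["signal", "legacy"])
```

A feature whose removal costs nothing, like `legacy` here, is a candidate for deletion.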
The figure above shows that the Machine Learning code is only a small fraction of a Machine Learning system, as illustrated by the small black box in the middle. By contrast, the required surrounding infrastructure is vast and complex.
This is why tools for static analysis of data dependencies are essential for error checking, migrations, and updates. This process can be automated and improves the Machine Learning workflow.
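As a rough illustration of what such a tool could compute, the sketch below assumes that every model and derived signal is annotated with its direct inputs, and then answers two questions: what does a model ultimately depend on, and what is affected if a source changes? The graph contents and function names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical annotations: each model or derived signal lists its direct inputs.
DEPENDS_ON: Dict[str, Set[str]] = {
    "spam_model": {"sender_reputation", "word_count"},
    "sender_reputation": {"raw_mail_log"},
    "word_count": {"raw_mail_log"},
    "raw_mail_log": set(),
}


def transitive_dependencies(node: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """All direct and indirect inputs of `node` (iterative depth-first search)."""
    seen: Set[str] = set()
    stack = list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, ()))
    return seen


def consumers_of(source: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Everything that would be affected if `source` changed or was removed."""
    return {node for node in graph if source in transitive_dependencies(node, graph)}


model_inputs = transitive_dependencies("spam_model", DEPENDS_ON)
affected = consumers_of("raw_mail_log", DEPENDS_ON)
```

The same closure can be checked automatically before a migration or a deletion of an upstream signal.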
The Cost of Analysis Debt in Machine Learning Systems
Machine Learning systems that update over time can end up influencing their own behavior through feedback loops. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released.
Furthermore, feedback loops are difficult to detect and address, especially when they develop gradually, as can happen when the model is updated infrequently.
In some cases, a model can directly influence the selection of its own future training data.
Fortunately, it is possible to reduce the effects of such direct feedback loops by using some amount of randomization or by isolating specific parts of the data from being influenced by a given model.
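The randomization idea can be sketched as follows, assuming a system that shows one item per request: with a small probability it ignores the model's ranking and picks uniformly at random, and it logs which choices were random so that this slice of data stays free of the model's influence. The function name, the `epsilon` value, and the item names are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def choose_item(ranked_items: Sequence[str], epsilon: float,
                rng: random.Random) -> Tuple[str, bool]:
    """Return (item, was_random).

    With probability `epsilon` a uniformly random item is shown instead of the
    model's top choice, so some logged outcomes are not shaped by the model.
    """
    if rng.random() < epsilon:
        return rng.choice(list(ranked_items)), True
    return ranked_items[0], False


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
log: List[Tuple[str, bool]] = [
    choose_item(["a", "b", "c"], epsilon=0.1, rng=rng) for _ in range(1000)
]
# Rows flagged True form a model-independent slice for training or evaluation.
unbiased_slice = [item for item, was_random in log if was_random]
```

The price of this approach is that a fraction of requests receives a deliberately non-optimal choice, so `epsilon` is a trade-off rather than a free parameter.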
In other cases, two systems can influence each other indirectly through the world. This happens, for example, when two Machine Learning systems act as independent participants in the same market, such as the stock market.
Another example is when two systems independently determine facets of a web page, such as selecting products to show and selecting related reviews. Improving one system can change the behavior of the other, because each model's output shapes the data the other one observes.
The Cost Of Machine Learning Systems Anti-Patterns
It is unfortunately common for systems that incorporate Machine Learning methods to end up with high-debt design patterns. Below are some examples of such patterns to avoid.
One high-debt design pattern or anti-pattern is the use of generic packages. It often leads to a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
Glue code is costly in the long term because it tends to freeze a system to the characteristics of a specific package and make testing alternative solutions expensive.
For these reasons, developing a clean native solution can be less costly in the long term than reusing a generic package.
Another strategy to reduce glue code is to wrap black-box packages into common APIs. This makes the supporting infrastructure more reusable and reduces the cost of changing packages.
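A minimal sketch of such a wrapper is shown below. The two "vendor" classes stand in for third-party packages with incompatible calling conventions; they, and the `Scorer` interface, are hypothetical and exist only for this example.

```python
from typing import List, Protocol, Sequence


class Scorer(Protocol):
    """The single interface the rest of the system is allowed to depend on."""

    def score(self, features: Sequence[float]) -> float: ...


# Stand-ins for two black-box packages (hypothetical).
class VendorALinearModel:
    def run(self, payload: dict) -> dict:
        return {"output": sum(payload["inputs"])}


class VendorBModel:
    def predict_batch(self, batch: List[List[float]]) -> List[float]:
        return [max(row) for row in batch]


# Thin adapters: the only place where package-specific details live.
class VendorAScorer:
    def __init__(self, model: VendorALinearModel) -> None:
        self._model = model

    def score(self, features: Sequence[float]) -> float:
        return float(self._model.run({"inputs": list(features)})["output"])


class VendorBScorer:
    def __init__(self, model: VendorBModel) -> None:
        self._model = model

    def score(self, features: Sequence[float]) -> float:
        return float(self._model.predict_batch([list(features)])[0])


def rank(scorer: Scorer, candidates: List[List[float]]) -> List[List[float]]:
    """Production code sees only `Scorer`, so swapping packages is a local change."""
    return sorted(candidates, key=scorer.score, reverse=True)


ranked_a = rank(VendorAScorer(VendorALinearModel()), [[1.0, 1.0], [3.0, 0.0]])
ranked_b = rank(VendorBScorer(VendorBModel()), [[1.0, 1.0], [0.5, 2.0]])
```

Testing an alternative package then means writing one more adapter instead of rewriting the code around it.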
Likewise, pipeline jungles often appear in data preparation and are a particular case of glue code. These pipelines can become a jungle of scrapes or joins spread into different files, making them challenging to correct and maintain.
Pipeline jungles can only be avoided by thinking holistically about data collection and feature extraction. Once a jungle has formed, it is often better to scrap it and rebuild a clean pipeline from the ground up, even though this is a significant engineering investment.
Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles.
A common consequence of glue code and pipeline jungles is Dead Experimental Codepaths. In the short term, it becomes increasingly attractive to run experiments with alternative methods by implementing experimental code paths as conditional branches within the main production code.
Over time, these accumulated branches create growing technical debt, because testing and maintaining every possible combination of code paths becomes difficult or impossible. To remedy this issue, it is often beneficial to periodically reassess the use and need of each experimental branch and remove the ones that are no longer needed.
In software engineering, a design smell may indicate an underlying problem in a component or system. Below are some Machine Learning smells that can be used as indicators:
Plain-Old-Data Type Smell The rich information that Machine Learning systems consume and produce is often encoded with plain data types like raw floats and integers. In a robust system, a model parameter should know whether it is a log-odds multiplier or a decision threshold.
Multiple-Language Smell It can be tempting to use multiple programming languages to build a Machine Learning system, but it often increases the cost of adequate testing. It can increase the difficulty of transferring ownership to other individuals.
Prototype Smell While it is convenient to test new ideas via prototypes, regularly relying on a prototyping environment may indicate that the full-scale system is broken, difficult to change, or could benefit from improved abstractions and interfaces. Also, maintaining a prototyping environment comes at a cost, and there is a risk that time pressures encourage a prototyping system to be used as a production solution. Additionally, results found at a small scale rarely reflect the reality at full scale.
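To illustrate the plain-old-data type smell described above, the sketch below wraps raw floats in small types so that a log-odds value cannot be passed where a probability threshold is expected without an explicit conversion. The class names are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Probability:
    value: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"probability out of range: {self.value}")


@dataclass(frozen=True)
class LogOdds:
    value: float

    def to_probability(self) -> Probability:
        # Logistic function: maps log-odds to the probability scale.
        return Probability(1.0 / (1.0 + math.exp(-self.value)))


@dataclass(frozen=True)
class DecisionThreshold:
    """A cut-off expressed on the probability scale."""

    cutoff: Probability

    def accepts(self, p: Probability) -> bool:
        return p.value >= self.cutoff.value


threshold = DecisionThreshold(Probability(0.5))
is_positive = threshold.accepts(LogOdds(2.0).to_probability())
```

A static type checker can then flag code that mixes the two scales, which a bare `float` would let through silently.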
Configuration Debt In Machine Learning Systems
Another area where debt can accumulate is in the configuration of Machine Learning systems. Verification or testing of configurations may not even be seen as important. However, in a mature system, the number of configuration lines can far exceed the number of lines of traditional code, and each configuration line has a potential for mistakes.
It is important to note that configuration mistakes can be costly, leading to serious loss of time, waste of computing resources, or production issues.
Here are some principles of sound configuration systems:
- It should be easy to specify a configuration as a slight change from a previous configuration.
- It should be hard to make manual errors, omissions, or oversights.
- It should be easy to see, visually, the difference in configuration between two models.
- It should be easy to automatically assert and verify basic facts about the configuration: the number of features used, the transitive closure of data dependencies, etc.
- It should be possible to detect unused or redundant settings.
- Configurations should undergo a full code review and be checked into a repository.
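Several of the principles above can be sketched with plain dictionaries: deriving a configuration as a small change from a base, rejecting misspelled settings, diffing two configurations, and asserting basic facts. The setting names and helper functions are hypothetical, and a real system would likely rely on a dedicated configuration library.

```python
from typing import Any, Dict, List, Tuple

Config = Dict[str, Any]

BASE: Config = {
    "learning_rate": 0.1,
    "features": ["word_count", "sender_reputation"],
    "threshold": 0.5,
}


def derive(base: Config, **overrides: Any) -> Config:
    """Specify a new configuration as a small change from an existing one."""
    unknown = set(overrides) - set(base)
    if unknown:
        # Catches typos such as `learnin_rate` instead of silently adding a key.
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**base, **overrides}


def diff(a: Config, b: Config) -> Dict[str, Tuple[Any, Any]]:
    """Settings whose values differ, as {name: (value_in_a, value_in_b)}."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }


def validate(config: Config, known_features: List[str]) -> None:
    """Assert basic facts before a configuration is used."""
    assert 0.0 <= config["threshold"] <= 1.0, "threshold must be a probability"
    assert len(config["features"]) == len(set(config["features"])), "duplicate feature"
    missing = set(config["features"]) - set(known_features)
    assert not missing, f"features not produced upstream: {sorted(missing)}"


experiment = derive(BASE, learning_rate=0.05)
validate(experiment, known_features=["word_count", "sender_reputation", "raw_length"])
changes = diff(BASE, experiment)
```

Because the configuration is ordinary data checked into a repository, the diff can also be shown in code review.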
Dealing with Changes in the External World
Machine Learning systems often interact directly with the external world, which implies an ongoing maintenance cost.
Fixed Thresholds in Dynamic Systems can present a risk for Machine Learning systems. It is often necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given ad.
One classic approach in machine learning is choosing a threshold from a set of possible points to get good tradeoffs on specific metrics, such as precision and recall.
However, such thresholds are often set manually. When a model is updated on new data, the old threshold may no longer be valid, and manually updating many thresholds across many models is time-consuming and brittle.
One way of reducing this risk is to develop systems in which thresholds are learned via simple evaluation on held-out validation data.
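A minimal sketch of learning a threshold from held-out data is shown below. It assumes that F1 is the metric of interest, which is a choice made for this example; any metric that trades off precision and recall could be used instead. The validation scores and labels are made up.

```python
from typing import Sequence, Tuple


def f1_at(threshold: float, scores: Sequence[float], labels: Sequence[int]) -> float:
    """F1 score when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Pick the candidate threshold with the best F1 on held-out data.

    Candidates are the observed scores themselves; ties keep the lowest one.
    """
    best = max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))
    return best, f1_at(best, scores, labels)


# Made-up validation set: model scores and true labels.
val_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
val_labels = [0, 0, 1, 0, 1, 1]
threshold, f1 = learn_threshold(val_scores, val_labels)
```

Rerunning this step whenever the model is retrained keeps the threshold in sync with the model instead of leaving a stale constant in the code.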
Monitoring and testing of individual components and end-to-end systems are important and valuable. Still, in the face of a changing world, such tests are insufficient to prove that a system is working as intended. Real-time monitoring and automated responses are essential.
This raises the question of what to monitor. Testable invariants are not always evident, given that many Machine Learning systems are intended to adapt over time.
Below are some parameters that are important to monitor:
Prediction Bias In a system that is working as intended, the distribution of predicted labels should usually be equal to the distribution of observed labels. A change in this metric often indicates an issue that requires attention. It can reveal a sudden change in real-world dynamics, meaning that the training distribution no longer reflects reality.
Action Limits In systems used to take actions in the real world, such as bidding on items or marking messages as spam, it can be helpful to set and enforce action limits as a sanity check.
Upstream Producers Data is often fed to a learning system from various upstream producers, which need to be monitored and tested to ensure their accuracy.
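The first two checks in this list, prediction bias and action limits, can be sketched in a few lines. The tolerance value, the budget, and the names below are assumptions made for illustration; in practice they depend on the application.

```python
from typing import Sequence


def prediction_bias(predicted: Sequence[int], observed: Sequence[int]) -> float:
    """Difference between the predicted and the observed positive rates."""
    if not predicted or not observed:
        raise ValueError("need at least one prediction and one observation")
    return sum(predicted) / len(predicted) - sum(observed) / len(observed)


def needs_attention(predicted: Sequence[int], observed: Sequence[int],
                    tolerance: float = 0.05) -> bool:
    """True when the two rates differ by more than `tolerance` (an assumed value)."""
    return abs(prediction_bias(predicted, observed)) > tolerance


class ActionLimiter:
    """Refuses further actions once a fixed budget is spent."""

    def __init__(self, max_actions: int) -> None:
        self.max_actions = max_actions
        self.taken = 0

    def allow(self) -> bool:
        if self.taken >= self.max_actions:
            return False  # a real system would also alert a human here
        self.taken += 1
        return True


stable = needs_attention([1, 0, 0, 0], [0, 1, 0, 0])
drifted = needs_attention([1, 0, 0, 0], [1, 1, 1, 0])
limiter = ActionLimiter(max_actions=2)
decisions = [limiter.allow() for _ in range(3)]
```

Both checks are deliberately simple: they are sanity checks that should trigger investigation, not replacements for a full evaluation.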
Last but not least, because external changes occur in real time, the response must also happen in real time. This is why responses should be automated as much as possible, especially for time-sensitive matters.
Other Areas of Machine Learning Related Debt
Let’s briefly highlight some additional areas where Machine Learning related technical debt may occur.
Data Testing Debt If data replaces code in Machine Learning systems, and code should be tested, then it follows that testing input data is critical to a well-functioning system.
Reproducibility Debt As scientists, we need to be able to reproduce experiments and get similar results. However, designing real-world systems to allow for strict reproducibility is made difficult by randomized algorithms, the non-determinism inherent in parallel learning, the reliance on initial conditions, and interactions with the external world.
Process Management Debt Mature systems may have dozens or hundreds of models running simultaneously. This raises a wide range of problems, including how to update many configurations for many similar models safely and automatically, how to manage and assign resources among models with different business priorities, and how to visualize and detect blockages in the flow of data in a production pipeline.
In this context, it is essential to develop appropriate tools to facilitate recovery from production incidents. An important system-level smell to avoid is common processes with many manual steps, which should be automated as much as possible.
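For the data testing debt mentioned above, a basic input check can be as simple as validating each batch against expected types and ranges before it reaches the model. The schema, column names, and bounds below are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical expectations for one input table: column -> (type, min, max).
SCHEMA = {
    "age": (int, 0, 130),
    "click_rate": (float, 0.0, 1.0),
}


def check_rows(rows: List[Dict[str, Any]], schema: Dict[str, tuple] = SCHEMA) -> List[str]:
    """Return a list of violations; an empty list means the batch passed."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        for column, (kind, low, high) in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing {column}")
            elif not isinstance(row[column], kind):
                problems.append(f"row {i}: {column} is not {kind.__name__}")
            elif not low <= row[column] <= high:
                problems.append(f"row {i}: {column}={row[column]} outside [{low}, {high}]")
    return problems


batch = [
    {"age": 34, "click_rate": 0.12},
    {"age": -1, "click_rate": 0.40},
    {"age": 51},
]
issues = check_rows(batch)
```

Running such a check as an automated step in the pipeline also removes one of the manual steps that process management debt warns about.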
Conclusion & Closing Thoughts
In this article, we defined technical debt in Machine Learning systems and described how it can occur. We also mentioned some ways of reducing the risk and preventing it from building up.
In the end, Machine Learning systems are living entities and should be treated as such, meaning they can evolve and will change over time; hence monitoring and testing are essential at every step of the pipeline.
If you are curious to learn more about this topic, you can access the paper here.