Monday, September 16, 2019

Building a Reproducible Machine Learning Pipeline

This post covers the problems we normally encounter when building machine learning pipelines, and how to minimize them by designing the ML pipeline correctly from the start.

Lack of reproducibility can carry a significant financial cost, along with lost time and a potential loss of reputation.



Remember that we don't just deploy ML models; we deploy the entire ML pipeline, so every step of the pipeline must be reproducible. In the ML pipeline (refer to Machine Learning Model Pipeline Overview), all the steps except data analysis need to be reproducible. Given the same data, each of these steps must produce an identical result in both the research environment and the deployed production environment.



When data is loaded from SQL in random order, the data loaded in one environment may not match the data loaded in another, and we get reproducibility problems. The issue arises because we use a random function to divide the data into train and test sets, so the training set must be exactly the same in both environments (research and production). We solve this by using the same seed for the random function in every environment.
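A minimal sketch of the seeded split described above, using only the Python standard library. The table name `samples`, its columns, the `SEED` value and the helper names are all hypothetical and chosen for illustration. The `ORDER BY` clause is an added assumption: a fixed seed only reproduces the split if the rows arrive in the same order, and SQL does not guarantee an order without it.

```python
import random
import sqlite3

SEED = 42  # hypothetical value; any fixed integer shared by both environments works


def load_rows(conn):
    # ORDER BY fixes the row order; without it the database may return rows in any order.
    query = "SELECT id, feature, target FROM samples ORDER BY id"
    return conn.execute(query).fetchall()


def split_train_test(rows, test_fraction=0.2, seed=SEED):
    # A local generator keeps the split independent of the global random state.
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def make_demo_connection():
    # In-memory stand-in for the real database (illustration only).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (id INTEGER PRIMARY KEY, feature REAL, target INTEGER)"
    )
    conn.executemany(
        "INSERT INTO samples (id, feature, target) VALUES (?, ?, ?)",
        [(i, i * 0.5, i % 2) for i in range(100)],
    )
    return conn


# Simulate the research and production environments with two separate connections.
research_train, research_test = split_train_test(load_rows(make_demo_connection()))
production_train, production_test = split_train_test(load_rows(make_demo_connection()))
```

Both environments should end up with the same train and test rows, provided they run the same Python version and read the same underlying data.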

Also, storing a snapshot of the data is not always an option: under GDPR, you might not be allowed to store the data anywhere other than the source.







Neural networks pose a particular challenge because the seed has to be set in several places, depending on the setup we are using, so that the many random parameter initializations needed for training can be reproduced. For a neural network, all of the required seeds need to be saved.
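One way to keep track of those seeds is to set them in a single place and write them to a file next to the model, as in the sketch below. The helper names `set_all_seeds` and `save_seeds` are hypothetical, and NumPy and TensorFlow are only examples of libraries that keep their own random state; they are seeded only if they are installed. Seeds alone may not give bit-identical results on every hardware setup (for example on GPUs), so treat this as a starting point, not a guarantee.

```python
import json
import random


def set_all_seeds(seed):
    """Seed every available random number generator and return what was set."""
    applied = {"python_random": seed}
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
        applied["numpy"] = seed
    except ImportError:
        pass  # NumPy is not installed; nothing to seed

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # global TensorFlow seed (TF 2.x API)
        applied["tensorflow"] = seed
    except ImportError:
        pass  # TensorFlow is not installed; nothing to seed

    return applied


def save_seeds(applied, path):
    # Store the seeds with the model artifacts so the training run can be repeated.
    with open(path, "w") as handle:
        json.dump(applied, handle, indent=2, sort_keys=True)
```

Calling `set_all_seeds(seed)` at the start of training and `save_seeds(...)` when the model is saved keeps the record of seeds together with the model it produced.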





Much of the benefit a model should provide is lost through incomplete or erroneous integration of the model with the other systems in its environment.

Additional Resources

Scaling Machine Learning as a service: Uber’s pipeline

A systems perspective to reproducibility in Production Machine Learning

Hidden technical debt in machine learning systems
