Tech Repo: Feature Engineering

http://localhost:8888/notebooks/Desktop/TechStack/DMLM/MLPipeline-Notebooks/02.7_ML_Pipeline_Step2-FeatureEngineering.ipynb

Why is setting the seed important in both the research and the development/production environment for an ML model?

It is important to note that we are engineering variables and pre-processing data with the idea of deploying the model if we find business value in it. Therefore, from now on, for each step that includes some element of randomness, it is extremely important that we set the seed. This way, we can obtain reproducibility between our research and our development code. Reproducibility here means that the same input produces the same output in both the research code and the development (production deployment) code.

Why do we want to set the seed? Many of the ML algorithms we use involve an element of randomness. Setting the seed controls that randomness and makes results reproducible between the research and the development/production environments.

This is perhaps one of the most important lessons that you need to take away from this course: Always set the seeds.
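As a rough illustration of what "same input, same output" means for a random step, here is a minimal sketch of a seeded train/test split using only the standard library. The helper name `split_rows`, its parameters, and the toy data are assumptions made for this example; they are not taken from the notebook.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows with a private, seeded generator and split them (hypothetical helper)."""
    rng = random.Random(seed)      # fixed seed, so the shuffle order is reproducible
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # toy stand-in for a dataset

# The "research" run and the "production" run use the same seed...
train_research, test_research = split_rows(rows, seed=0)
train_prod, test_prod = split_rows(rows, seed=0)

# ...so they end up with identical splits.
print(train_research == train_prod and test_research == test_prod)  # True
```

If the seed were left out, each run could shuffle the rows differently, and the research and production code would no longer be comparable step by step.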

What is seeding?

>>> import random
>>> random.seed(9001)
>>> random.randint(1, 10)
1
>>> random.randint(1, 10)
3
>>> random.randint(1, 10)
6
>>> random.randint(1, 10)
6
>>> random.randint(1, 10)
7

Pseudo-random number generators work by performing some operation on a value. Generally this value is the previous number generated by the generator. However, the first time you use the generator, there is no previous value.

Seeding a pseudo-random number generator gives it its first "previous" value. Each seed value will correspond to a sequence of generated values for a given random number generator. That is, if you provide the same seed twice, you get the same sequence of numbers twice.
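The idea of a "previous" value can be sketched with a toy linear congruential generator. This is only an illustration: Python's `random` module uses a different algorithm (the Mersenne Twister), and the function name `lcg_sequence` and the constants below are assumptions chosen for the example.

```python
def lcg_sequence(seed, count, a=1664525, c=1013904223, m=2**32):
    """Return `count` values from a toy linear congruential generator."""
    values = []
    state = seed  # the seed plays the role of the first "previous" value
    for _ in range(count):
        state = (a * state + c) % m  # each value is computed from the previous one
        values.append(state)
    return values

run_a = lcg_sequence(seed=9001, count=5)
run_b = lcg_sequence(seed=9001, count=5)
run_c = lcg_sequence(seed=9002, count=5)

print(run_a == run_b)  # True: same seed, same sequence
print(run_a == run_c)  # False: a different seed starts a different sequence
```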

Generally, you want to seed your random number generator with some value that will change each execution of the program. For instance, the current time is a frequently-used seed. The reason why this doesn't happen automatically is so that if you want, you can provide a specific seed to get a known sequence of numbers.
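In Python, one way to support both behaviours is to make the seed an optional argument. The sketch below assumes a hypothetical helper `make_rng`; passing `None` lets `random.Random` seed itself from the operating system's randomness source or the current time, while passing a number gives a known sequence.

```python
import random

def make_rng(seed=None):
    """Return a private generator; seed=None lets Python choose a fresh seed (hypothetical helper)."""
    return random.Random(seed)

# Fixed seed: two generators produce the same numbers.
fixed_a = make_rng(42)
fixed_b = make_rng(42)
same = [fixed_a.random() for _ in range(3)] == [fixed_b.random() for _ in range(3)]
print(same)  # True

# No seed: the value is expected to change from one execution to the next.
unseeded = make_rng()
print(unseeded.random())
```

Using a private `random.Random` instance rather than the module-level functions keeps the seeded sequence from being disturbed by other code that also draws random numbers.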

Why do we need randomness?
https://www.kdnuggets.com/2017/06/surprising-complexity-randomness.html

Embrace randomness in machine learning

https://machinelearningmastery.com/randomness-in-machine-learning/

Reproducible machine learning
http://www.rctatman.com/files/Tatman_2018_ReproducibleML.pdf

Machine learning reproducibility crisis.
https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/

Regularization

Pre-step:
Bias and Variance:

https://www.youtube.com/watch?v=EuBBz3bI-aA&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=5

1. Ridge Regularization

https://www.youtube.com/watch?v=Q81RR3yKn30

2. Lasso Regularization

https://www.youtube.com/watch?v=NGf0voTMlcs

3. Elastic Net Regularization

https://www.youtube.com/watch?v=1dKRdX9bfIo

Saturday, September 14, 2019
