Feature Scaling
MLAI/Preprocessing · 2020. 1. 18. 21:39

    1. Issue

Let's explain what feature scaling is and why we need to do it. As you can see, we have two columns, age and salary, that contain numerical values. Notice that the two variables are not on the same scale: age goes from 27 to 50, while salary goes from about 40K to 90K. Because the age variable and the salary variable don't have the same scale, this will cause issues in your machine learning models, since a lot of machine learning models are based on what is called the Euclidean distance.

Since salary has a much wider range of values, going roughly from zero to 100K, the Euclidean distance will be dominated by the salary. To see this, take two observations and compute the squared difference for each feature.

An age difference of 21 squares to 441, while a salary difference of, say, 30,000 squares to 900,000,000, so you can see very clearly how the squared salary difference dominates the squared age difference. And that's because the two variables are not on the same scale. In the distance equation it is as if the age term did not exist, because it is dominated by the salary term. That's why we absolutely need to put the variables on the same scale.
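A minimal sketch of this dominance effect, assuming two hypothetical observations with ages 48 and 27 (a difference of 21, whose square is the 441 mentioned above) and salaries in the 40K–90K range discussed in the text:

```python
import numpy as np

# Two hypothetical (age, salary) observations on their original scales.
a = np.array([48.0, 79000.0])
b = np.array([27.0, 48000.0])

# Squared per-feature differences: the salary term dwarfs the age term.
sq_diff = (a - b) ** 2
print(sq_diff)  # [441.0, 961000000.0]

# The Euclidean distance is therefore driven almost entirely by salary.
dist = np.sqrt(sq_diff.sum())
print(dist)
```

The age contribution (441) is roughly two million times smaller than the salary contribution, so the distance is effectively just the salary difference.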

    2. Resolution

A very common technique is standardization, which means that for each observation and each feature you subtract the mean of all the values of the feature and divide by the standard deviation: x_stand = (x - mean(x)) / sd(x). That's the first type of feature scaling. Another type is normalization, which means that you subtract from your observation's feature value x the minimum of all the feature values, and divide by the difference between the maximum and the minimum of the feature values: x_norm = (x - min(x)) / (max(x) - min(x)). What you need to understand is that we are putting our variables in the same range, on the same scale, so that no variable is dominated by another.
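The two formulas above can be sketched directly with NumPy, using hypothetical age and salary columns that match the ranges in the text:

```python
import numpy as np

# Hypothetical (age, salary) rows on their original scales.
X = np.array([[27.0, 48000.0],
              [35.0, 58000.0],
              [44.0, 72000.0],
              [50.0, 90000.0]])

# Standardization: x' = (x - mean(x)) / sd(x), computed per column.
X_stand = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization (min-max): x' = (x - min(x)) / (max(x) - min(x)), per column.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_stand.mean(axis=0))          # each column now has mean ~0
print(X_norm.min(axis=0))            # each column now starts at 0
print(X_norm.max(axis=0))            # and ends at 1
```

After either transformation, the age and salary columns occupy comparable ranges, so neither dominates a Euclidean distance. In practice scikit-learn's `StandardScaler` and `MinMaxScaler` implement the same formulas.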

That will largely improve your machine learning models. Besides, even when a machine learning model is not based on Euclidean distances, we will still need to do feature scaling, because the algorithm will converge much faster. That will be the case for decision trees: they are not based on Euclidean distances, but you will see that we still need to do feature scaling, because if we don't, they will run for a very long time.

You will also see that for regression, when the dependent variable takes a huge range of values, we will need to apply feature scaling to the dependent variable y as well.
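A minimal sketch of scaling the dependent variable, assuming hypothetical target values and using scikit-learn's `StandardScaler` (which expects a 2D column, hence the reshape); predictions made on the scaled target can be mapped back with `inverse_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical dependent variable with a huge range of values.
y = np.array([40000.0, 55000.0, 72000.0, 90000.0]).reshape(-1, 1)

sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y)   # standardized target used for training

# After predicting on the scaled target, recover the original units.
y_back = sc_y.inverse_transform(y_scaled)
print(y_back.ravel())  # original values restored
```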
