ABOUT ME

-

Today
-
Yesterday
-
Total
-
  • [Data Cleansing] scaling과 normalization의 차이
    ML&DL 2021. 2. 26. 16:37

    Scaling vs. Normalization: What's the difference?

    One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that:

    • in scaling, you're changing the range of your data, while
    • in normalization, you're changing the shape of the distribution of your data.

    Let's talk a little more in-depth about each of these options.

     

    Scaling

    This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.

    For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what about if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

    By scaling your variables, you can help compare different variables on equal footing. To help solidify what scaling looks like, let's look at a made-up example. (Don't worry, we'll work with real data in the following exercise!)

    In [2]:

    # generate 1000 data points randomly drawn from an exponential distribution original_data = np.random.exponential(size=1000) # mix-max scale the data between 0 and 1 scaled_data = minmax_scaling(original_data, columns=[0]) # plot both together to compare fig, ax = plt.subplots(1,2) sns.distplot(original_data, ax=ax[0]) ax[0].set_title("Original Data") sns.distplot(scaled_data, ax=ax[1]) ax[1].set_title("Scaled data")

    Out[2]:

    Text(0.5, 1.0, 'Scaled data')

    Notice that the shape of the data doesn't change, but that instead of ranging from 0 to 8ish, it now ranges from 0 to 1.

    Normalization

    Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

    Normal distribution: Also known as the "bell curve", this is a specific statistical distribution where a roughly equal observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.

    In general, you'll normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)

    The method we're using to normalize here is called the Box-Cox Transformation. Let's take a quick peek at what normalizing some data looks like:

    In [3]:

    # normalize the exponential data with boxcox normalized_data = stats.boxcox(original_data) # plot both together to compare fig, ax=plt.subplots(1,2) sns.distplot(original_data, ax=ax[0]) ax[0].set_title("Original Data") sns.distplot(normalized_data[0], ax=ax[1]) ax[1].set_title("Normalized data")

    Out[3]:

    Text(0.5, 1.0, 'Normalized data')

    Notice that the shape of our data has changed. Before normalizing it was almost L-shaped. But after normalizing it looks more like the outline of a bell (hence "bell curve").

    'ML&DL' 카테고리의 다른 글

    [dacon] 와인품질 EDA 및 1차 모델 개발  (0) 2021.06.14
    roc curve  (0) 2021.05.06
    bagging Emsemble, semi-supervised  (2) 2021.03.11
    [Kaggle] santander product recomendation EDA  (0) 2021.02.19
    데이터 분석시 해야 할 작업  (0) 2020.04.20
Designed by Tistory.