Matrix Factorization

    Introduction

    • Split the matrix into the product of 2 other matrices
    • We call it R hat because it only approximates R - it is our model of R
    • We would like W and U to be very skinny
    • W(N×K) - users matrix, U(M×K) - movie matrix
    • K somewhere from 10-50

    The scale of matrix R, W, and U

    • Key: W and U should be much smaller than R
    • R is N×M
    • Generally, we can't store R in memory
    • We can represent it using a special data structure
      • Dict{(u, m) -> r}
    • If N = 130k, M = 26k
      • N×M = 3.38 billion
      • # ratings = 20 million
      • Space used: 20 million / 3.38 billion = 0.006
    • This is called a sparse representation
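
    A minimal sketch of this sparse representation in Python, assuming the ratings arrive as (user, movie, rating) tuples; the variable names here are illustrative, not from any particular library:

    # Hypothetical example data: (user_id, movie_id, rating)
    ratings = [(0, 10, 4.5), (0, 42, 3.0), (7, 10, 5.0)]

    r = {}               # Dict{(u, m) -> r}: store only observed ratings
    user2movie = {}      # u -> movies rated by u (this will be Psi_i later)
    movie2user = {}      # m -> users who rated m (this will be Omega_j later)
    for u, m, rating in ratings:
        r[(u, m)] = rating
        user2movie.setdefault(u, []).append(m)
        movie2user.setdefault(m, []).append(u)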

    Scale down to K

    • If K = 10, N = 130k, M = 26k, then size of W and U: NK + MK = 1.56 million
    • How much savings? 1.56 million / 3.38 billion = 0.0005
      • This takes up less space than the ratings
      • This is good; we like # parameters < # of data points

    Calculate only at i and j

    • What happens if you try to calculate WU^{T} in code?
    • Don't do it. The result is N×M, which can't be loaded into memory unless you've selected a small subset of your data

    What if I just want one rating, for user i, item j?

    • Just a dot product between 2 vectors of size K
    • Unlike calculating all of our \hat{R}, this is very fast and takes up a trivial amount of space

    \hat{r}_{ij}=w_{i}^{T}u_{j}

    where \hat{r}_{ij}=\hat{R}[i,j],\ w_{i}=W[i],\ u_{j}=U[j]
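
    A quick sketch of this single-rating prediction, assuming W and U are NumPy arrays produced by training (the matrices below are random placeholders):

    import numpy as np

    N, M, K = 130_000, 26_000, 10
    W = np.random.randn(N, K)   # user matrix (placeholder; comes from training)
    U = np.random.randn(M, K)   # movie matrix (placeholder; comes from training)

    i, j = 123, 456
    r_hat_ij = W[i].dot(U[j])   # one dot product of two length-K vectors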

    Singular value decomposition (SVD)

    A matrix X can be decomposed into 3 separate matrices multiplied together

    X=USV^{T}

    What's the justification for calling matrix factorization SVD?

    • S can be absorbed into U or V, so we get an (N×K)×(K×M) product, which is N×M
    • The model is thus exactly like MF

    \hat{x}_{ij}=\sum_{k}u_{ik}s_{kk}v_{kj}=\sum_{k}(u_{ik}s_{kk})v_{kj}=\sum_{k}u'_{ik}v_{kj}={u'}_{i}^{T}v_{j}

    Linear algebra

    • X(N×M), U(N×K), S(K×K), V(M×K)
    • If I multiply U by S, I just get another N×K matrix
      • Then X is a product of 2 matrices, just like matrix factorization
      • Or equivalently, I could combine S with VT
    • R (rating matrix) is sparse
      • If U, S, and V can properly approximate a full X matrix, then surely they can approximate a mostly empty R matrix
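
    A small sketch of this absorption step using NumPy's SVD on a dense placeholder matrix (the real R is sparse, so you would not actually form it like this):

    import numpy as np

    X = np.random.randn(100, 40)                      # placeholder N x M matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V^T

    K = 10
    W = U[:, :K] * s[:K]       # absorb the top-K singular values into U -> N x K
    X_hat = W @ Vt[:K, :]      # rank-K approximation: a product of 2 matrices, like MF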

    Interpretation

    • Each of the K elements in wi and uj is a feature
    • Let's suppose K=5, and they are:
      • Action/adventure
      • Comedy
      • Romance
      • Horror
      • Animation
    • w_{i}(1) is how much user_i likes action
    • w_{i}(2) is how much user_i likes comedy, etc.
    • u_{j}(1) is how much movie_j contains action
    • u_{j}(2) is how much movie_j contains comedy, etc.

    Questions

    • What happens when we dot w_{i}^{T}u_{j}?
    • How well do user_i's preferences correlate with movie_j's attributes?

    w_{i}^{T}u_{j}=\sum_{k=1}^{K}w_{i}(k)\,u_{j}(k)

    Example

    Batman Animation

    • w_{i}=(1, 0.8, -1, 0.1, 1)
    • u_{j}=(1, 1.5, -1.3, 0, 1.2)
    • result = 1*1 + 0.8*1.5 + (-1)*(-1.3) + 0.1*0 + 1*1.2 = 4.7

    THE NOTEBOOK

    • w_{i}=(1, 0.8, -1, 0.1, 1)
    • u_{j}=(-1, -1, 1, 0, -1)
    • result = 1*(-1) + 0.8*(-1) + (-1)*1 + 0.1*0 + 1*(-1) = -3.8
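
    The same two dot products in NumPy, just to verify the arithmetic above:

    import numpy as np

    w_i = np.array([1, 0.8, -1, 0.1, 1])
    u_batman = np.array([1, 1.5, -1.3, 0, 1.2])
    u_notebook = np.array([-1, -1, 1, 0, -1])

    print(w_i.dot(u_batman))    # ~4.7: preferences and attributes align
    print(w_i.dot(u_notebook))  # ~-3.8: preferences and attributes clash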

    Supervised Machine Learning

    • We could predict how much a user likes an item by extracting features from both and feeding them into a model like a Random Forest or a Neural Network
      • We might include features about the user like age, gender, location, and so forth
      • Some attributes about the item like RAM, Display, etc.
    • The difference is that MF extracts the features automatically using only ratings
      • Only by looking at the patterns between users, items, and the ratings

    Dimensionality Reduction

    • The reconstruction is exactly equal to X if K=M (assuming N > M and the rank of X is M)
    • By shrinking U, S, and V, we make it an approximation
    • This is called truncated SVD and yields our best rank-K approximation of X
    • Similarly, MF reduces the dimensionality of R
    • We encourage the model to learn the most important features required to generate R

    Training

    How can we ensure our approximation is good? We would like the rating matrices R and \hat{R} to be close together

    R\approx \hat{R}=WU^{T}

    Squared error loss

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \Omega=set of pairs (i, j) where user_i rated movie_j
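
    With the sparse dict representation from earlier, this loss is a simple sum over the observed pairs only (a sketch, assuming r maps (i, j) to a rating and W, U are the factor matrices):

    def squared_error_loss(W, U, r):
        # sum over Omega: only the pairs (i, j) that were actually rated
        return sum((rating - W[i].dot(U[j])) ** 2 for (i, j), rating in r.items())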

    Minimize the loss

    How? Find the gradient, set it to 0, and solve for the parameters

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    Solving for W

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j})(-u_{j})=0

    • Careful about which sets are being summed over
    • \Omega represents the set of pairs (i,j) where user_i has rated movie_j
    • \Psi_{i} is the set of all movies that user_i has rated 
    • For J, we want to sum over all ratings
    • For a particular user vector w_{i}, we only care about the movies that user rated (because only those ratings involve w_{i})

    \sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Try to isolate w_{i}
    • Dot product is commutative

    \sum_{j\in \Psi_{i} }(u_{j}^{T}w_{i})u_{j}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Result of dot product is scalar
    • scalar \times vector = vector \times scalar

    \sum_{j\in \Psi_{i} }u_{j}(u_{j}^{T}w_{i})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T}w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • w_{i} doesn't depend on the summation index j, so it can be factored out
    • Add brackets to make the grouping explicit
    • The outer products of u_{j} with itself are summed over all values of j in \Psi_{i}

    \left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Now it's just Ax = b, which we know how to solve
    • x = np.linalg.solve(A, b)

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}
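
    As a sketch, one such update in NumPy, assuming the r and user2movie structures from the sparse-representation sketch above:

    import numpy as np

    def update_user_vector(i, U, r, user2movie, K):
        A = np.zeros((K, K))
        b = np.zeros(K)
        for j in user2movie[i]:          # sum over Psi_i only
            A += np.outer(U[j], U[j])    # u_j u_j^T
            b += r[(i, j)] * U[j]        # r_ij u_j
        return np.linalg.solve(A, b)     # solve A w_i = b rather than inverting A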

    Solving for U

    • Loss is symmetric in W and U, so the steps should be the same
    • Remember to be careful which set to sum over
    • Since we only want the terms involving u_{j}, we sum over only the users who rated item_j; that set is \Omega_{j}

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \frac{\partial J}{\partial u_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j})(-w_{i})=0

    • Try to isolate u_{j}

    \sum_{i\in \Omega_{j} }(w_{i}^{T}u_{j})w_{i}=\sum_{i\in \Omega_{j} }r_{ij}w_{i}

    • Change the order of multiplication

    \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T}u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    \left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    Training algorithm

    2-way dependency

    • Solution for W depends on U
    • Solution for U depends on W
    • The issue is that we end up with two separate update steps, not one global closed-form equation
      • This is the nature of our model
    • Simply apply the equation as is, iteratively
    • Is called Alternating Least Squares

    Algorithm

    W = randn(N, K); U = randn(M, K);
    for t in range(T):

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}
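
    A hedged end-to-end sketch of this loop, reusing the r, user2movie (Psi_i), and movie2user (Omega_j) structures from earlier. Note that without regularization, A can be singular when a user or movie has fewer than K ratings:

    import numpy as np

    def als(r, user2movie, movie2user, N, M, K=10, T=25):
        W = np.random.randn(N, K)
        U = np.random.randn(M, K)
        for t in range(T):
            for i in user2movie:                                    # update every user vector
                A = sum(np.outer(U[j], U[j]) for j in user2movie[i])
                b = sum(r[(i, j)] * U[j] for j in user2movie[i])
                W[i] = np.linalg.solve(A, b)
            for j in movie2user:                                    # then every movie vector
                A = sum(np.outer(W[i], W[i]) for i in movie2user[j])
                b = sum(r[(i, j)] * W[i] for i in movie2user[j])
                U[j] = np.linalg.solve(A, b)
        return W, U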

    FAQ

    • Does it matter which order you update in? No
    • Should you use the old values of W when updating U?
      • Tends to go faster if you use the new values
      • Computationally, if you wanted to use the old values, you'd have to make a copy (very slow)

    Bias Terms

    • The predicted rating \hat{r}_{ij} is the dot product of the user vector and the movie vector, plus the user bias b_{i}, the movie bias c_{j}, and the global average \mu
    • Normally, simple linear regression needs only one bias term, but MF would like to have 3

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    • \mu: the global average rating; it centers the data set and is calculated directly from the training data
    • The reason we like both a user bias and a movie bias is that both these dimensions can be biased
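
    The prediction with bias terms, as a short sketch; b and c are assumed to be length-N and length-M arrays of biases, and mu is computed from the training ratings only:

    import numpy as np

    mu = np.mean(list(r.values()))                  # global average, from train data only
    r_hat_ij = W[i].dot(U[j]) + b[i] + c[j] + mu    # dot product + user bias + movie bias + mean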

    Ex: Movie Bias

    • Consider a universally well-liked movie like James Cameron's Avatar
    • MF might find latent features, e.g.
      • Sci-fi
      • Aliens
      • Humans fighting aliens
    • Battlefield Earth has these same attributes
    • Yet it is widely considered to be the worst sci-fi movie of all time
    • Thus, a movie-specific bias term makes sense

    Training

    Solving for W

    • Objective function

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    • Differentiate wrt w_{i}

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})=0

    • Move other terms to RHS

    \sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Use the same tricks from before to isolate w

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    Solving for U

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    Solving for b

    • Differentiate wrt b

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    \frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0

    • Isolate b
    • Common mistake: assuming a variable moved outside a summation is just the variable itself (don't forget to multiply by the number of terms being summed)

    b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • b_{i} is the average deviation between the target and the model's prediction, if that prediction did not involve b_{i}
    • i.e. b_{i} is exactly how much you need to add to the model prediction without b_{i}

    Solving for c

    • The only difference is the set we sum over, which is now all the users who rated movie_j

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    \frac{\partial J}{\partial c_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0

    • Again, it makes sense that c_{j} is the average deviation of a model that doesn't involve c_{j}

    c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Summary

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Regularization

    A technique to prevent overfitting and help generalization

    In linear regression

    • Model: \hat{y}=w^{T}x
    • Objective: J=\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}+\lambda\left\| w\right\|_{2}^{2}
    • Solution: w=(\lambda I + X^{T}X)^{-1}X^{T}y

    where \left\| w\right\|_{2}^{2}, the squared magnitude of the weights, is called the penalty term. The idea is that if the weights become too large, that's a good sign of overfitting, so large weights are penalized.
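
    As a sketch, the regularized solution in NumPy; X is the N×D design matrix, y the targets, and lam the constant \lambda:

    import numpy as np

    def ridge_solution(X, y, lam=1.0):
        D = X.shape[1]
        # solve (lambda*I + X^T X) w = X^T y instead of forming the inverse
        return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)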

    Regularization in Matrix Factorization

    • Same approach, add squared magnitude of each parameter multiplied by regularization constant
    • \left\| \cdot \right\|_{F} denotes the Frobenius norm of a matrix

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})

    Solve for W

    Derivatives are additive, so we just need to differentiate the 2nd term and add it to the existing derivative

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i}=0

    • If you can't see how to differentiate the Frobenius norm wrt w_{i}, expand it
    • Now it's just a dot product which we know how to differentiate
    • The squared Frobenius norm is just the sum of the squared magnitudes of the individual elements
    • Therefore, I can also split them up in terms of each row as well
    • The squared magnitude of a vector is just the dot product of the vector with itself
    • The derivative of the dot product of a vector with itself is 2 times the vector (the 2 matches the factor in front of the sum)

    \left\| W\right\|_{F}^{2}=\sum_{i=1}^{N}\sum_{k=1}^{K}\left| w_{ik}\right|^{2}=\sum_{i=1}^{N}\left\| w_{i}\right\|_{2}^{2}=\sum_{i=1}^{N}w_{i}^{T}w_{i}

    • Group like-terms

    \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T}w_{i}+\lambda w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Factor out w

    \left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Apply our usual solution

    w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    Solve for U

    • Use symmetry reasoning

    u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    Solve for b

    • differentiate

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})

    \frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)+2\lambda b_{i}=0

    • Isolate b_{i}

    \sum_{j\in \Psi_{i}}b_{i}+\lambda b_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • Factor out b_{i}

    b_{i}\left\{ \left ( \sum_{j\in \Psi_{i}}1 \right )+\lambda \right\} =\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • Divide by constant

    The denominator \left| \Psi_{i} \right| + \lambda is larger than \left| \Psi_{i} \right| alone, so b_{i} becomes smaller (it is shrunk toward zero)

    b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    Solve for c

    • Apply symmetry reasoning again

    c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Summary

    w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
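
    A hedged sketch of these regularized updates for one user i; the item-side updates are symmetric. It assumes the r and user2movie structures from earlier, plus the current U, b, c, and mu:

    import numpy as np

    def update_w_i(i, U, b, c, mu, r, user2movie, K, lam=20.0):
        A = lam * np.eye(K)                              # lambda*I + sum of u_j u_j^T
        rhs = np.zeros(K)
        for j in user2movie[i]:
            A += np.outer(U[j], U[j])
            rhs += (r[(i, j)] - b[i] - c[j] - mu) * U[j]
        return np.linalg.solve(A, rhs)

    def update_b_i(i, W, U, b, c, mu, r, user2movie, lam=20.0):
        total = sum(r[(i, j)] - W[i].dot(U[j]) - c[j] - mu for j in user2movie[i])
        return total / (len(user2movie[i]) + lam)        # |Psi_i| + lambda in the denominator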
