Matrix Factorization

    Introduction

    • Split the matrix into the product of 2 other matrices
    • We call it R hat because it only approximates R - it is our model of R
    • We would like W and U to be very skinny
    • $W(N \times K)$ - users matrix, $U(M \times K)$ - movie matrix
    • K somewhere from 10-50

    The scale of matrix R, W, and U

    • Key: $W$ and $U$ should be much smaller than $R$
    • $R$ is $N \times M$
    • Generally, we can't store $R$ in memory
    • We can represent it using a special data structure
      • Dict{(u, m) -> r}
    • If N = 130k, M = 26k
      • $N \times M$ = 3.38 billion
      • # ratings = 20 million
      • Space used: 20 million / 3.38 billion = 0.006
    • This is called a sparse representation
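
    As a concrete illustration, here is a minimal Python sketch of that dictionary-based sparse representation. The ratings triples and the helper maps `user2movies` / `movie2users` are hypothetical names (they correspond to $\Psi_{i}$ and $\Omega_{j}$ used later in training):

    ```python
    from collections import defaultdict

    # Toy ratings triples (user_id, movie_id, rating); in practice these would be
    # loaded from the ratings file.
    raw_ratings = [(0, 3, 5.0), (0, 7, 3.5), (1, 3, 4.0)]

    # Dict{(u, m) -> r}: only the observed ratings are stored.
    ratings = {(u, m): r for u, m, r in raw_ratings}

    # Also handy later for training: which movies each user rated (Psi_i),
    # and which users rated each movie (Omega_j).
    user2movies = defaultdict(list)
    movie2users = defaultdict(list)
    for (u, m), r in ratings.items():
        user2movies[u].append(m)
        movie2users[m].append(u)
    ```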

    Scale down to K

    • If K = 10, N = 130k, M = 26k, then size of W and U: NK + MK = 1.56 million
    • How much savings? 1.56 million / 3.38 billion = 0.0005
      • This takes up less space than the ratings
      • This is good; we like # parameters < # of data pts

    Calculate only at i and j

    • What happens if you try to calculate $WU^{T}$ in code?
    • Don't do it. The result is $N \times M$, which can't be loaded into memory unless you've selected a small subset of your data

    What if I just want one rating, for user i, item j?

    • Just a dot product between 2 vectors of size K
    • Unlike calculating all of our $\hat{R}$, this is very fast and takes up a trivial amount of space 

    $$\hat{r}_{ij}=w_{i}^{T}u_{j}$$

    where $\hat{r}_{ij}=\hat{R}[i,j]$, $w_{i}=W[i]$, $u_{j}=U[j]$
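
    For example, a minimal numpy sketch of this single-rating prediction; the factor matrices here are random placeholders just to show the shapes:

    ```python
    import numpy as np

    # Placeholder factor matrices, only to illustrate shapes; K is small (10-50).
    N, M, K = 130_000, 26_000, 10
    W = np.random.randn(N, K)   # users matrix
    U = np.random.randn(M, K)   # movies matrix

    def predict(i, j):
        # r_hat_ij = w_i^T u_j: a length-K dot product, O(K) time and space.
        return W[i] @ U[j]

    print(predict(42, 1000))
    ```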

    Singular value decomposition (SVD)

    A matrix X can be decomposed into 3 separate matrices multiplied together

    $$X=USV^{T}$$

    What's the justification for calling matrix factorization SVD?

    • $S$ can be absorbed into $U$ or $V$; we then get $(N \times K) \times (K \times M) \to (N \times M)$
    • The model is thus exactly like MF

    $$\hat{x}_{ij}=\sum_{k}u_{ik}s_{kk}v_{kj}=\sum_{k}(u_{ik}s_{kk})v_{kj}=\sum_{k}u'_{ik}v_{kj}=u_{i}^{'T}v_{j}$$

    Linear algebra

    • $X(N \times M)$, $U(N \times K)$, $S(K \times K)$, $V(M \times K)$
    • If I multiply $U$ by $S$, I just get another $N \times K$ matrix
      • Then $X$ is a product of 2 matrices, just like matrix factorization
      • Or equivalently, I could combine $S$ with $V^{T}$
    • $R$ (rating matrix) is sparse
      • If $U$, $S$, and $V$ can properly approximate a full $X$ matrix, then surely it can approximate a mostly empty $R$ matrix
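
    A quick numeric sketch of absorbing $S$ into $U$, here on a small random dense matrix (for real ratings data we would never materialize the full $X$):

    ```python
    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 40)   # a small dense N x M example
    K = 10

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # U: N x M, s: M, Vt: M x M
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]        # truncate to rank K

    W = U_k * s_k          # absorb S into U: now an N x K matrix
    X_hat = W @ Vt_k       # rank-K approximation, same form as W U^T in MF
    print(np.linalg.norm(X - X_hat))                   # approximation error
    ```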

    Interpretation

    • Each of the K elements in $w_{i}$ and $u_{j}$ is a feature
    • Let's suppose K=5, and they are:
      • Action/adventure
      • Comedy
      • Romance
      • Horror
      • Animation
    • $w_{i}(1)$ is how much user_i likes action
    • $w_{i}(2)$ is how much user_i likes comedy, etc.
    • $u_{j}(1)$ is how much movie j contains action
    • $u_{j}(2)$ is how much movie j contains comedy, etc.

    Questions

    • What happens when we dot $w_{i}^{T}u_{j}$?
    • How well do user_i's preferences correlate with movie_j's attributes?

    $$w_{i}^{T}u_{j}=\left\| w_{i}\right\|\left\| u_{j}\right\|\cos\theta \propto \mathrm{sim}(i,j)$$

    Example

    Batman Animation

    • $w_{i}=(1, 0.8, -1, 0.1, 1)$
    • $u_{j}=(1, 1.5, -1.3, 0, 1.2)$
    • result = (1)(1) + (0.8)(1.5) + (-1)(-1.3) + (0.1)(0) + (1)(1.2) = 4.7

    THE NOTEBOOK

    • $w_{i}=(1, 0.8, -1, 0.1, 1)$
    • $u_{j}=(-1, -1, 1, 0, -1)$
    • result = (1)(-1) + (0.8)(-1) + (-1)(1) + (0.1)(0) + (1)(-1) = -3.8
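
    Both toy results can be checked directly in numpy, using the values exactly as given above:

    ```python
    import numpy as np

    w_i = np.array([1, 0.8, -1, 0.1, 1])         # user preferences
    u_batman = np.array([1, 1.5, -1.3, 0, 1.2])  # Batman animation attributes
    u_notebook = np.array([-1, -1, 1, 0, -1])    # The Notebook attributes

    print(w_i @ u_batman)    # 4.7  -> preferences and attributes correlate
    print(w_i @ u_notebook)  # -3.8 -> preferences and attributes anti-correlate
    ```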

    Supervised Machine Learning

    • We could predict how much a user likes an item by extracting features from both and feeding them into a model like a Random Forest or a Neural Network
      • We might include features about the user like age, gender, location, and so forth
      • Some attributes about the item like RAM, Display, etc.
    • The difference is that MF extracts the features automatically using only ratings
      • Only by looking at the patterns between users, items, and the ratings

    Dimensionality Reduction

    • $USV^{T}$ is exactly equal to $X$ if $K=M$ (assuming $N > M$ and the rank of $X$ is $M$)
    • By shrinking U, S, and V, we make it an approximation
    • Keeping only $K < M$ components is called truncated SVD, and it yields our best rank-$K$ approximation of $X$
    • Similarly, MF reduces the dimensionality of R
    • We encourage the model to learn the most important features required to generate R

    Training

    How can we ensure our approximation is good? We would like the rating matrices $R$ and $\hat{R}$ to be close together

    $$R\approx \hat{R}=WU^{T}$$

    Squared error loss

    $$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$

    $\Omega=$set of pairs (i, j) where user_i rated movie_j

    Minimize the loss

    How? Find the gradient, set it to 0, and solve for the parameters

    $$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$

    Solving for $W$

    $$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$

    $$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j})(-u_{j})=0$$

    • Careful about which sets are being summed over
    • $\Omega$ represents the set of pairs (i,j) where user_i has rated movie_j
    • $\Psi_{i}$ is the set of all movies that user_i has rated 
    • For $J$, we want to sum over all ratings
    • For a particular user vector $w_{i}$, we only care about the movies that user rated (because only those ratings involve $w_{i}$)

    $$\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    • Try to isolate $w_{i}$
    • Dot product is commutative

    $$\sum_{j\in \Psi_{i} }(u_{j}^{T}w_{i})u_{j}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    • Result of dot product is scalar
    • $scalar \times vector = vector \times scalar$

    $$\sum_{j\in \Psi_{i} }u_{j}(u_{j}^{T}w_{i})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    $$\sum_{j\in \Psi_{i} }u_{j}u_{j}^{T}w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    • $w_{i}$ doesn't depend on the summation index $j$, so it can be factored out of the sum
    • Add brackets to make this explicit
    • The outer products of $u_{j}$ with itself are summed over all values of $j$

    $$\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    • Now it's just $Ax = b$, which we know how to solve
    • x = np.linalg.solve(A, b)

    $$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
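
    As a sketch, this per-user update maps to code as follows, assuming the ratings dict and the `user2movies` map (i.e. $\Psi_{i}$) from the sparse representation earlier:

    ```python
    import numpy as np

    # Closed-form update of a single user vector w_i, per the equation above.
    def update_user(i, U, ratings, user2movies):
        K = U.shape[1]
        A = np.zeros((K, K))          # sum_j u_j u_j^T
        v = np.zeros(K)               # sum_j r_ij u_j
        for j in user2movies[i]:
            A += np.outer(U[j], U[j])
            v += ratings[(i, j)] * U[j]
        return np.linalg.solve(A, v)  # solves A w_i = v
    ```

    Note that $A$ is only invertible if user_i has rated enough movies whose vectors span the $K$-dimensional space; the regularized version later removes that concern.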

    Solving for $U$

    • Loss is symmetric in $W$ and $U$, so the steps should be the same
    • Remember to be careful which set to sum over
    • Since we only want the terms involving $u_{j}$, we sum only over the users $i$ who rated item_j; that set is $\Omega_{j}$

    $$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$

    $$\frac{\partial J}{\partial u_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j})(-w_{i})=0$$

    • Try to isolate $u_{j}$

    $$\sum_{i\in \Omega_{j} }(w_{i}^{T}u_{j})w_{i}=\sum_{i\in \Omega_{j} }r_{ij}w_{i}$$

    • Change the order of multiplication

    $$\sum_{i\in \Omega_{j}}w_{i}w_{i}^{T}u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$

    $$\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$

    $$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$

    Training algorithm

    2-way dependency

    • Solution for $W$ depends on $U$
    • Solution for $U$ depends on $W$
    • The issue is that we end up with two separate update steps rather than one global closed-form equation
      • This is the nature of our model
    • Simply apply each equation as is, iteratively
    • This is called Alternating Least Squares (ALS)

    Algorithm

    W = randn(N, K); U = randn(M, K);
    for t in range(T):

    $$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$

    $$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$
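
    Putting the two updates together, a minimal (unregularized, bias-free) ALS loop might look like the sketch below, assuming the sparse structures defined earlier and that every user and movie has enough ratings for each $K \times K$ system to be solvable:

    ```python
    import numpy as np

    # Minimal ALS sketch of the loop above (no biases, no regularization).
    # ratings: {(i, j): r_ij}, user2movies[i] = Psi_i, movie2users[j] = Omega_j.
    def als(ratings, user2movies, movie2users, N, M, K=10, T=20):
        W = np.random.randn(N, K)
        U = np.random.randn(M, K)
        for t in range(T):
            for i in range(N):                   # update every user vector
                A = sum(np.outer(U[j], U[j]) for j in user2movies[i])
                v = sum(ratings[(i, j)] * U[j] for j in user2movies[i])
                W[i] = np.linalg.solve(A, v)
            for j in range(M):                   # update every movie vector (uses the new W)
                A = sum(np.outer(W[i], W[i]) for i in movie2users[j])
                v = sum(ratings[(i, j)] * W[i] for i in movie2users[j])
                U[j] = np.linalg.solve(A, v)
        return W, U
    ```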

    FAQ

    • Does it matter which order you update in? No
    • Should you use the old values of $W$ when updating $U$?
      • Tends to go faster if you use the new values
      • Computationally, if you wanted to use the old values, you'd have to make a copy (very slow)

    Bias Terms

    • The predicted rating $\hat{r}_{ij}$ is the dot product of the user vector and the movie vector, plus the user bias $b_{i}$, the movie bias $c_{j}$, and the global average $\mu$
    • Normally, simple linear regression needs only one bias term, but MF would like to have 3

    $$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$

    • $\mu$: the global average rating; it centers the data set and is calculated directly from the training data
    • The reason we like both a user bias and a movie bias is that both these dimensions can be biased
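
    In code, the biased prediction is still just a cheap $O(K)$ expression; the parameter names below are illustrative:

    ```python
    import numpy as np

    # Illustrative shapes: W is N x K, U is M x K, b has length N, c has length M,
    # and mu is a scalar -- the mean of the observed training ratings,
    # e.g. mu = np.mean(list(ratings.values())) for the dict representation earlier.
    def predict(i, j, W, U, b, c, mu):
        return W[i] @ U[j] + b[i] + c[j] + mu
    ```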

    Ex: Movie Bias

    • Consider a universally well-liked movie like James Cameron's Avatar
    • MF might find latent features, e.g.
      • Sci-fi
      • Aliens
      • Humans fighting aliens
    • Battlefield Earth has the same attributes
    • Yet it is widely considered to be the worst sci-fi movie of all time
    • Thus, a movie-specific bias term makes sense

    Training

    Solving for $W$

    • Objective function

    $$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$

    $$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$

    • Differentiate wrt $w_{i}$

    $$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})=0$$

    • Move other terms to RHS

    $$\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    • Use the same tricks from before to isolate w

    $$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    Solving for $U$

    $$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$

    Solving for b

    • Differentiate wrt b

    $$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$

    $$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$

    $$\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0$$

    • Isolate b
    • Common mistake: assuming that a constant summed over a set is just the constant itself (don't forget to multiply by the number of terms being summed, $\left|\Psi_{i}\right|$)

    $$b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    • $b_{i}$ is the average deviation between the target and the model's prediction, if the prediction did not involve $b_{i}$
    • i.e. $b_{i}$ is exactly how much you need to add to the model prediction without $b_{i}$
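
    A small sketch of this update, reusing the hypothetical structures from before:

    ```python
    import numpy as np

    # b_i = average residual of a prediction that leaves out b_i, per the equation above.
    # ratings: {(i, j): r_ij}, user2movies[i] = Psi_i.
    def update_user_bias(i, W, U, c, mu, ratings, user2movies):
        residuals = [ratings[(i, j)] - W[i] @ U[j] - c[j] - mu for j in user2movies[i]]
        return np.mean(residuals)
    ```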

    Solving for c

    • The only difference is the set we sum over, which is now all the users who rated movie_j

    $$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$

    $$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$

    $$\frac{\partial J}{\partial c_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0$$

    • Again, it makes sense that $c_{j}$ is the average deviation of a model that doesn't involve $c_{j}$

    $$c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$

    Summary

    $$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    $$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$

    $$b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    $$c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$

    Regularization

    A technique to prevent overfitting and help generalization

    In linear regression

    • Model: $\hat{y}=w^{T}x$
    • Objective: $J=\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}+\lambda\left\| w\right\|_{2}^{2}$
    • Solution: $w=(\lambda I + X^{T}X)^{-1}X^{T}y$

    where $\left\| w\right\|_{2}^{2}$, the squared magnitude of the weights, is called the penalty term. The idea is that overly large weights are a good sign of overfitting, so large weights are penalized.
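
    For reference, the closed-form ridge solution above is a one-liner in numpy; the toy data and the value of $\lambda$ here are just illustrative:

    ```python
    import numpy as np

    np.random.seed(0)
    N, D = 200, 5
    X = np.random.randn(N, D)
    true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ true_w + 0.1 * np.random.randn(N)

    lam = 10.0
    # w = (lambda I + X^T X)^{-1} X^T y
    w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
    print(w)   # shrunk toward zero relative to the unregularized solution
    ```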

    Regularization in Matrix Factorization

    • Same approach, add squared magnitude of each parameter multiplied by regularization constant
    • $\left\| \cdot \right\|_{F}$ denotes the Frobenius norm of a matrix

    $$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})$$

    Solve for $W$

    Derivatives are additive, so we just need to differentiate the 2nd term and add it to the existing derivative

    $$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i}=0$$

    • If you can't see how the Frobenius norm was differentiated wrt $w_{i}$, expand it
    • Now it's just a dot product which we know how to differentiate
    • The squared Frobenius norm is just the sum of the squared magnitudes of each individual element
    • Therefore, I can also split them up in terms of each row as well
    • The squared magnitude of a vector is just the dot product of the vector with itself
    • The derivative of the dot product is just the vector itself

    $$\left\| W\right\|_{F}^{2}=\sum_{i=1}^{N}\sum_{k=1}^{K}\left| w_{ik}\right|^{2}=\sum_{i=1}^{N}\left\| w_{i}\right\|_{2}^{2}=\sum_{i=1}^{N}w_{i}^{T}w_{i}$$

    • Group like-terms

    $$\sum_{j\in \Psi_{i}}u_{j}u_{j}^{T}w_{i}+\lambda w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    • Factor out w

    $$\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    • Apply our usual solution

    $$w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    Solve for $U$

    • Use symmetry reasoning

    $$u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$

    Solve for b

    • Differentiate wrt $b_{i}$

    $$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})$$

    $$\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)+2\lambda b_{i}=0$$

    • Isolate $b_{i}$

    $$\sum_{j\in \Psi_{i}}b_{i}+\lambda b_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    • Factor out $b_{i}$

    $$b_{i}\left\{ \left ( \sum_{j\in \Psi_{i}}1 \right )+\lambda \right\} =\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    • Divide by constant

    Because we divide by $\left|\Psi_{i}\right|+\lambda$ rather than $\left|\Psi_{i}\right|$, $b_{i}$ becomes smaller, shrinking toward zero

    $$b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    Solve for c

    • Apply symmetry reasoning again

    $$c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$

    Summary

    $$w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$

    $$u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$

    $$b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$

    $$c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$
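
    Putting everything together, one epoch of the regularized, biased ALS updates might be sketched as follows. It uses the same hypothetical data structures as before, and interleaving each bias update right after its vector update is one reasonable ordering (per the FAQ, using the newest values tends to work fine):

    ```python
    import numpy as np

    # One full epoch of the regularized ALS updates summarized above.
    # ratings: {(i, j): r_ij}, user2movies[i] = Psi_i, movie2users[j] = Omega_j.
    # Parameters W, U, b, c are updated in place; mu is the global average rating.
    def als_epoch(W, U, b, c, mu, ratings, user2movies, movie2users, lam=20.0):
        K = W.shape[1]
        for i, movies in user2movies.items():
            A = lam * np.eye(K)                  # lambda I + sum_j u_j u_j^T
            v = np.zeros(K)
            for j in movies:
                A += np.outer(U[j], U[j])
                v += (ratings[(i, j)] - b[i] - c[j] - mu) * U[j]
            W[i] = np.linalg.solve(A, v)
            b[i] = sum(ratings[(i, j)] - W[i] @ U[j] - c[j] - mu
                       for j in movies) / (len(movies) + lam)
        for j, users in movie2users.items():
            A = lam * np.eye(K)                  # lambda I + sum_i w_i w_i^T
            v = np.zeros(K)
            for i in users:
                A += np.outer(W[i], W[i])
                v += (ratings[(i, j)] - b[i] - c[j] - mu) * W[i]
            U[j] = np.linalg.solve(A, v)
            c[j] = sum(ratings[(i, j)] - W[i] @ U[j] - b[i] - mu
                       for i in users) / (len(users) + lam)
        return W, U, b, c
    ```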
