Matrix Factorization
Introduction
- Split the matrix into the product of 2 other matrices
- We call it R hat because it only approximates R - it is our model of R
- We would like W and U to be very skinny
- $W(N \times K)$ - users matrix, $U(M \times K)$ - movie matrix
- K is typically somewhere from 10 to 50
The scale of the matrices R, W, and U
- Key: $W$ and $U$ should be much smaller than $R$
- $R$ is $N \times M$
- Generally, we can't store $R$ in memory
- We can represent it using a special data structure
- Dict{(u, m) -> r}
- If N = 130k, M = 26k
- $N \times M$ = 3.38 billion
- # ratings = 20 million
- Fraction of entries actually stored: 20 million / 3.38 billion ≈ 0.006
- This is called a sparse representation
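For concreteness, a minimal sketch of this sparse representation in Python (the variable names and the tiny ratings_list here are illustrative, not from any real data set):

```python
# Hypothetical sparse representation: only observed ratings are stored.
# ratings_list would come from your data file, e.g. MovieLens-style triples.
ratings_list = [(0, 3, 4.5), (0, 7, 3.0), (1, 3, 5.0)]  # (user, movie, rating)

R = {}            # dict {(u, m) -> r}
user2movie = {}   # Psi_i: movies rated by each user
movie2user = {}   # Omega_j: users who rated each movie
for u, m, r in ratings_list:
    R[(u, m)] = r
    user2movie.setdefault(u, []).append(m)
    movie2user.setdefault(m, []).append(u)
```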
Scale down to K
- If K = 10, N = 130k, M = 26k, then size of W and U: NK + MK = 1.56 million
- How much savings? 1.56 million / 3.38 billion = 0.0005
- This takes up less space than the ratings
- This is good; we like # parameters < # of data pts
Calculate only at i and j
- What happens if you try to calculate $WU^{T}$ in code?
- Don't do it. The result is $N \times M$, which can't be loaded into memory unless you've selected a small subset of your data
What if I just want one rating, for user i, item j?
- Just a dot product between 2 vectors of size K
- Unlike calculating all of our $\hat{R}$, this is very fast and takes up a trivial amount of space
$$\hat{r}_{ij}=w_{i}^{T}u_{j}$$
where $\hat{r}_{ij}=\hat{R}[i,j]$, $w_{i}=W[i]$, $u_{j}=U[j]$
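A minimal sketch of that single-rating prediction, assuming $W$ and $U$ are NumPy arrays of shape $N \times K$ and $M \times K$ (the sizes below are just the MovieLens-style numbers from above):

```python
import numpy as np

N, M, K = 130_000, 26_000, 10
W = np.random.randn(N, K)   # user factors
U = np.random.randn(M, K)   # movie factors

def predict(i, j):
    """r_hat[i, j] = w_i^T u_j -- a single K-dimensional dot product."""
    return W[i].dot(U[j])

print(predict(42, 1337))    # one float; the full N x M matrix is never built
```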
Singular value decomposition (SVD)
A matrix X can be decomposed into 3 separate matrices multiplied together
$$X=USV^{T}$$
What's the justification for calling matrix factorization SVD?
- $S$ can be absorbed into $U$ or $V^{T}$, so we get $(N \times K) \times (K \times M) \to (N \times M)$
- The model is thus exactly like MF
$$\hat{x}_{ij}=\sum_{k}u_{ik}s_{kk}v_{kj}=\sum_{k}(u_{ik}s_{kk})v_{kj}=\sum_{k}u'_{ik}v_{kj}=u_{i}^{'T}v_{j}$$
Linear algebra
- $X(N \times M)$, $U(N \times K)$, $S(K \times K)$, $V(M \times K)$
- If I multiply $U$ by $S$, I just get another $N \times K$ matrix
- Then $X$ is a product of 2 matrices, just like matrix factorization
- Or equivalently, I could combine $S$ with $V^{T}$
- $R$ (rating matrix) is sparse
- If $U$, $S$, and $V$ can properly approximate a full $X$ matrix, then surely they can approximate a mostly empty $R$ matrix
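A small numerical check of this absorption trick (illustrative only, using a dense toy matrix rather than a real sparse ratings matrix):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 40)           # toy dense matrix (N=100, M=40)
K = 10

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]

W = U_k * s_k                          # absorb S into U -> (N x K)
X_hat = W @ Vt_k                       # (N x K)(K x M) -> (N x M)

print(X_hat.shape)                     # (100, 40): the rank-K approximation of X
```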
Interpretation
- Each of the K elements in $w_{i}$ and $u_{j}$ is a feature
- Let's suppose K=5, and they are:
- Action/adventure
- Comedy
- Romance
- Horror
- Animation
- $w_{i}(1)$ is how much user_i likes action
- $w_{i}(2)$ is how much user_i likes comedy, etc.
- $u_{j}(1)$ is how much movie j contains action
- $u_{j}(2)$ is how much movie j contains comedy, etc.
Questions
- What happens when we dot $w_{i}^{T}u_{j}$?
- How well do user_i's preferences correlate with movie_j's attributes?
$$w_{i}^{T}u_{j}=\left\| w_{i}\right\|\left\| u_{j}\right\|\cos\theta \propto \mathrm{sim}(i,j)$$
Example
Batman Animation
- $w_{i}=(1, 0.8, -1, 0.1, 1)$
- $u_{j}=(1, 1.5, -1.3, 0, 1.2)$
- result = 1*1 + 0.8*1.5 + (-1)*(-1.3) + 0.1*0 + 1*1.2 = 4.7
THE NOTEBOOK
- $w_{i}=(1, 0.8, -1, 0.1, 1)$
- $u_{j}=(-1, -1, 1, 0, -1)$
- result = 1*(-1) + 0.8*(-1) + (-1)*1 + 0.1*0 + 1*(-1) = -3.8
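Both dot products are easy to verify in NumPy (numbers copied from the example above):

```python
import numpy as np

w_i = np.array([1, 0.8, -1, 0.1, 1])         # user preferences

u_batman = np.array([1, 1.5, -1.3, 0, 1.2])  # action-heavy movie
u_notebook = np.array([-1, -1, 1, 0, -1])    # romance movie

print(w_i.dot(u_batman))     # ~4.7  -> predicted to like it
print(w_i.dot(u_notebook))   # ~-3.8 -> predicted to dislike it
```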
Supervised Machine Learning
- We could predict how much a user likes an item, by extracting features from both and feeding it into a model like Random Forest or Neural Network
- We might include features about the user like age, gender, location, and so forth
- And some attributes about the item, like RAM, display, etc.
- The difference is that MF extracts the features automatically using only ratings
- Only by looking at the patterns between users, items, and the ratings
Dimensionality Reduction
- The SVD is exactly equal to $X$ if K=M (assuming N > M and the rank of $X$ is M)
- By shrinking $U$, $S$, and $V$, we make it an approximation
- This is called Truncated SVD, and it yields our best rank-K approximation of $X$
- Similarly, MF reduces the dimensionality of R
- We encourage the model to learn the most important features required to generate R
Training
How can we ensure our approximation is good? We would like the rating matrices $R$ and $\hat{R}$ to be close together
$$R\approx \hat{R}=WU^{T}$$
Squared error loss
$$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$
$\Omega=$set of pairs (i, j) where user_i rated movie_j
Minimize the loss
How? Find the gradient, set it to 0, and solve for the parameters
$$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$
Solving for $W$
$$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$
$$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j})(-u_{j})=0$$
- Careful about which sets are being summed over
- $\Omega$ represents the set of pairs (i,j) where user_i has rated movie_j
- $\Psi_{i}$ is the set of all movies that user_i has rated
- For $J$, we want to sum over all ratings
- For a particular user vector $w_{i}$, we only care about the movies that user rated (because only those ratings involve $w_{i}$)
$$\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
- Try to isolate $w_{i}$
- Dot product is commutative
$$\sum_{j\in \Psi_{i} }(u_{j}^{T}w_{i})u_{j}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
- Result of dot product is scalar
- $scalar \times vector = vector \times scalar$
$$\sum_{j\in \Psi_{i} }u_{j}(u_{j}^{T}w_{i})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
$$\sum_{j\in \Psi_{i} }u_{j}u_{j}^{T}w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
- $w_{i}$ doesn't depend on the summation index $j$, so it can be pulled out of the sum
- Add more brackets
- Sum the outer products of $u_{j}$ with itself over all values of j, then multiply by $w_{i}$
$$\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
- Now it's just $Ax = b$, which we know how to solve
- x = np.linalg.solve(A, b)
$$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
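In code, this per-user update is one call to np.linalg.solve. A hedged sketch, assuming the ratings dict R and the user2movie (i.e. $\Psi_{i}$) structure from the sparse-representation sketch earlier:

```python
import numpy as np

def update_user(i, U, R, user2movie, K):
    """Closed-form update for w_i, summing only over the movies user i rated (Psi_i)."""
    A = np.zeros((K, K))
    b = np.zeros(K)
    for j in user2movie[i]:            # Psi_i
        A += np.outer(U[j], U[j])      # accumulates sum of u_j u_j^T
        b += R[(i, j)] * U[j]          # accumulates sum of r_ij u_j
    # A may be singular if user i rated fewer than K movies;
    # the regularized version later adds lambda*I, which fixes that.
    return np.linalg.solve(A, b)       # w_i
```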
Solving for $U$
- Loss is symmetric in $W$ and $U$, so the steps should be the same
- Remember to be careful which set to sum over
- Since we only want the terms involving $u_{j}$, we only sum over the users who rated item_j, i.e. $\Omega_{j}$
$$J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}$$
$$\frac{\partial J}{\partial u_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j})(-w_{i})=0$$
- Try to isolate $u_{j}$
$$\sum_{i\in \Omega_{j} }(w_{i}^{T}u_{j})w_{i}=\sum_{i\in \Omega_{j} }r_{ij}w_{i}$$
- Change order of multiplication
$$\sum_{i\in \Omega_{j}}w_{i}w_{i}^{T}u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$
$$\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$
Training algorithm
2-way dependency
- Solution for $W$ depends on $U$
- Solution for $U$ depends on $W$
- So there isn't one global closed-form solution; we have two coupled update steps
- This is the nature of our model
- Simply apply the two equations as-is, iteratively
- This is called Alternating Least Squares (ALS)
Algorithm
- Initialize W = randn(N, K), U = randn(M, K)
- for t in range(T): update every $w_{i}$ and $u_{j}$ using
$$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}$$
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}$$
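Putting the two updates together gives a minimal ALS loop. This is only a sketch under the same assumed data structures as before (R as a dict, user2movie / movie2user index lists); a real implementation would also track the training loss and guard against users or movies with fewer than K ratings:

```python
import numpy as np

def als(R, user2movie, movie2user, N, M, K=10, T=25):
    """Alternating Least Squares on the sparse ratings dict R = {(i, j): r}."""
    W = np.random.randn(N, K)
    U = np.random.randn(M, K)
    for t in range(T):
        # update every user vector using the current U ...
        for i, movies in user2movie.items():
            A = sum(np.outer(U[j], U[j]) for j in movies)
            b = sum(R[(i, j)] * U[j] for j in movies)
            W[i] = np.linalg.solve(A, b)
        # ... then every movie vector using the freshly updated W
        for j, users in movie2user.items():
            A = sum(np.outer(W[i], W[i]) for i in users)
            b = sum(R[(i, j)] * W[i] for i in users)
            U[j] = np.linalg.solve(A, b)
    return W, U
```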
FAQ
- Does it matter which order you update in? No
- Should you use the old values of $W$ when updating $U$?
- Tends to go faster if you use the new values
- Computationally, if you wanted to use the old values, you'd have to make a copy (very slow)
Bias Terms
- The predicted rating $\hat{r}_{ij}$ is the dot product of the user vector and the movie vector, plus the user bias $b_{i}$, the movie bias $c_{j}$, and the global average $\mu$
- Normally, simple linear regression needs only one bias term, but MF would like to have 3
$$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$
- $\mu$: the global average rating. It centers the data set and can be calculated directly from the training data
- The reason we like both a user bias and a movie bias is that both these dimensions can be biased
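With the bias terms, prediction is still just a few operations per (i, j) pair. A small sketch, reusing the W, U, R, N, M names assumed in the earlier snippets:

```python
import numpy as np

mu = np.mean(list(R.values()))   # global average, computed directly from train data
b = np.zeros(N)                  # user biases
c = np.zeros(M)                  # movie biases

def predict_with_bias(i, j):
    return W[i].dot(U[j]) + b[i] + c[j] + mu
```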
Ex: Movie Bias
- Consider a universally well-liked movie like James Cameron's Avatar
- MF might find latent features, e.g.
- Sci-fi
- Aliens
- Humans fighting aliens
- Battlefield Earth has these attributes
- Widely considered to be the worst sci-fi movie of all time
- Thus, a movie-specific bias term makes sense
Training
Solving for $W$
- Objective function
$$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$
$$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$
- Differentiate wrt $w_{i}$
$$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})=0$$
- Move other terms to RHS
$$\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
- Use the same tricks from before to isolate w
$$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
Solving for $U$
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$
Solving for b
- Differentiate wrt b
$$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$
$$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$
$$\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0$$
- Isolate b
- Common mistake: assuming that moving a variable outside a summation leaves just the variable itself (don't forget to multiply by the # of terms being summed)
$$b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
- $b_{i}$ is the average deviation between the target and the model prediction, if the prediction did not involve $b_{i}$
- i.e. $b_{i}$ is exactly how much you need to add to the model prediction without $b_{i}$
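A sketch of that interpretation in code, again reusing the assumed structures from earlier: $b_{i}$ is literally the mean residual of the model-without-$b_{i}$ over the movies user i rated.

```python
import numpy as np

def update_user_bias(i, W, U, c, mu, R, user2movie):
    """b_i = mean of (r_ij - w_i^T u_j - c_j - mu) over the movies in Psi_i."""
    residuals = [R[(i, j)] - W[i].dot(U[j]) - c[j] - mu for j in user2movie[i]]
    return np.mean(residuals)
```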
Solving for c
- The only difference is what we sum over, which is now all the users who rated movie_j
$$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2$$
$$\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu$$
$$\frac{\partial J}{\partial c_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0$$
- Again makes sense that $c_{j}$ is the average deviation of a model that doesn't involve $c_{j}$
$$c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$
Summary
$$w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$
$$b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
$$c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$
Regularization
A technique to prevent overfitting and help generalization
In linear regression
- Model: $\hat{y}=w^{T}x$
- Objective: $J=\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}+\lambda\left\| w\right\|_{2}^{2}$
- Solution: $w=(\lambda I + X^{T}X)^{-1}X^{T}y$
where $\left\| w\right\|_{2}^{2}$ is the squared magnitude of the weights, called the penalty term. The idea is that if the weights become too large, that's a good sign of overfitting, so large weights are penalized.
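For reference, that closed-form ridge solution is a one-liner in NumPy (X is the design matrix, y the targets, lam the regularization constant; all names here are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """w = (lambda*I + X^T X)^{-1} X^T y, via solve instead of an explicit inverse."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
```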
Regularization in Matrix Factorization
- Same approach, add squared magnitude of each parameter multiplied by regularization constant
- $\left\| *\right\|_{F}$ is called the Frobenius norm on a matrix
$$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})$$
Solve for $W$
Derivatives are additive, so we just need to differentiate the 2nd term and add it to the existing derivative
$$\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i}=0$$
- If you can't see how to differentiate the Frobenius norm wrt $w_{i}$, expand it
- Now it's just a dot product which we know how to differentiate
- The squared Frobenius norm is just the sum of the squared magnitudes of each individual element
- Therefore, I can also split them up in terms of each row as well
- The squared magnitude of a vector is just the dot product of the vector with itself
- The derivative of the dot product is just the vector itself
$$\left\| W\right\|_{F}^{2}=\sum_{i=1}^{N}\sum_{k=1}^{K}\left| w_{ik}\right|^{2}=\sum_{i=1}^{N}\left\| w_{i}\right\|_{2}^{2}=\sum_{i=1}^{N}w_{i}^{T}w_{i}$$
- Group like-terms
$$\sum_{j\in \Psi_{i}}u_{j}u_{j}^{T}w_{i}+\lambda w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
- Factor out w
$$\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
- Apply our usual solution
$$w_{i}=\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
Solve for $U$
- Use symmetry reasoning
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$
Solve for b
- Differentiate wrt $b_{i}$
$$J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})$$
$$\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)+2\lambda b_{i}=0$$
- Isolate $b_{i}$
$$\sum_{j\in \Psi_{i}}b_{i}+\lambda b_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
- Factor out $b_{i}$
$$b_{i}\left\{ \left ( \sum_{j\in \Psi_{i}}1 \right )+\lambda \right\} =\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
- Divide by constant
The denominator $\left| \Psi_{i} \right| + \lambda$ is bigger than $\left| \Psi_{i} \right|$ alone, so $b_{i}$ becomes smaller (it is shrunk toward 0)
$$b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
Solve for c
- Apply symmetry reasoning again
$$c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$
Summary
$$w_{i}=\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}$$
$$u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}$$
$$b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)$$
$$c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)$$
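A compact sketch that strings all four regularized updates together, under the same assumed data structures as the earlier snippets (a real implementation would also monitor train and test loss per epoch):

```python
import numpy as np

def als_biased(R, user2movie, movie2user, N, M, K=10, T=25, lam=20.0):
    """Regularized ALS with user/movie biases; R = {(i, j): r} is the sparse ratings dict."""
    W = np.random.randn(N, K)
    U = np.random.randn(M, K)
    b = np.zeros(N)
    c = np.zeros(M)
    mu = np.mean(list(R.values()))        # global average
    I = np.eye(K)
    for t in range(T):
        for i, movies in user2movie.items():
            A = lam * I
            vec = np.zeros(K)
            for j in movies:
                A = A + np.outer(U[j], U[j])
                vec += (R[(i, j)] - b[i] - c[j] - mu) * U[j]
            W[i] = np.linalg.solve(A, vec)
            # bias update uses the freshly updated w_i
            b[i] = sum(R[(i, j)] - W[i].dot(U[j]) - c[j] - mu
                       for j in movies) / (len(movies) + lam)
        for j, users in movie2user.items():
            A = lam * I
            vec = np.zeros(K)
            for i in users:
                A = A + np.outer(W[i], W[i])
                vec += (R[(i, j)] - b[i] - c[j] - mu) * W[i]
            U[j] = np.linalg.solve(A, vec)
            c[j] = sum(R[(i, j)] - W[i].dot(U[j]) - b[i] - mu
                       for i in users) / (len(users) + lam)
    return W, U, b, c, mu
```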