Matrix Factorization
Introduction
- Split the matrix into the product of 2 other matrices
- We call it R hat because it only approximates R - it is our model of R
- We would like W and U to be very skinny
- W(N×K) - users matrix, U(M×K) - movie matrix
- K somewhere from 10-50
The scale of the matrices R, W, and U
- Key: W and U should be much smaller than R
- R is N×M
- Generally, we can't store R in memory
- We can represent it using a special data structure
- Dict{(u, m) -> r}
- If N = 130k, M = 26k
- N×M = 3.38 billion
- # ratings = 20 million
- Space used: 20 million / 3.38 billion = 0.006
- This is called a sparse representation
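As a rough sketch (the toy ratings and variable names below are illustrative, not from any particular dataset), the sparse representation and the index sets used later for training might look like:

    from collections import defaultdict

    # Hypothetical sparse ratings: (user_id, movie_id) -> rating
    ratings = {
        (0, 10): 5.0,
        (0, 42): 3.5,
        (1, 10): 4.0,
    }

    # Index sets used later in the ALS updates:
    #   Psi[i]   = movies rated by user i
    #   Omega[j] = users who rated movie j
    Psi, Omega = defaultdict(set), defaultdict(set)
    for (i, j), r in ratings.items():
        Psi[i].add(j)
        Omega[j].add(i)

Only the observed ratings are stored, so memory scales with the number of ratings (20 million) rather than with N×M.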
Scale down to K
- If K = 10, N = 130k, M = 26k, then size of W and U: NK + MK = 1.56 million
- How much savings? 1.56 million / 3.38 billion = 0.0005
- This takes up less space than the ratings
- This is good; we like # parameters < # of data points
Calculate only at i and j
- What happens if you try to calculate WU^{T} in code?
- Don't do it. The result is N×M, which won't fit in memory unless you've selected a small subset of your data
What if I just want one rating, for user i, item j?
- Just a dot product between 2 vectors of size K
- Unlike calculating all of \hat{R}, this is very fast and takes up a trivial amount of space
\hat{r}_{ij}=w_{i}^{T}u_{j}
where \hat{r}_{ij}=\hat{R}[i,j],\; w_{i}=W[i],\; u_{j}=U[j]
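For example (a minimal sketch, assuming W and U are NumPy arrays of shapes N×K and M×K), one rating costs only K multiplications:

    import numpy as np

    N, M, K = 130_000, 26_000, 10
    W = np.random.randn(N, K)  # user matrix
    U = np.random.randn(M, K)  # movie matrix

    i, j = 123, 456
    r_hat_ij = W[i].dot(U[j])  # O(K) work; the full N x M matrix is never formed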
Singular value decomposition (SVD)
A matrix X can be decomposed into 3 separate matrices multiplied together
X=USV^{T}
What's the justification for calling matrix factorization SVD?
- S can be absorbed into U or V, so we get (N×K)×(K×M)→(N×M)
- The model is thus exactly like MF
\hat{x}_{ij}=\sum_{k}u_{ik}s_{kk}v_{kj}=\sum_{k}(u_{ik}s_{kk})v_{kj}=\sum_{k}u'_{ik}v_{kj}={u'_{i}}^{T}v_{j}
Linear algebra
- X(N×M), U(N×K), S(K×K), V(M×K)
- If I multiply U by S, I just get another N×K matrix
- Then X is a product of 2 matrices, just like matrix factorization
- Or equivalently, I could combine S with V^{T}
- R (rating matrix) is sparse
- If U, S, and V can properly approximate a full X matrix, then surely they can approximate a mostly empty R matrix
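A small NumPy check of this argument (a sketch with arbitrary sizes): absorb S into U and confirm the product of the two remaining matrices still reproduces X.

    import numpy as np

    X = np.random.randn(100, 40)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # U: 100x40, s: (40,), Vt: 40x40

    U_prime = U * s           # absorb S into U (scale column k of U by s_k)
    X_rebuilt = U_prime @ Vt  # (N x K) @ (K x M) -> N x M, just like W @ U.T in MF
    assert np.allclose(X, X_rebuilt)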
Interpretation
- Each of the K elements in wi and uj is a feature
- Let's suppose K=5, and they are:
- Action/adventure
- Comedy
- Romance
- Horror
- Animation
- w_{i}(1) is how much user_i likes action
- w_{i}(2) is how much user_i likes comedy, etc.
- u_{j}(1) is how much movie j contains action
- u_{j}(2) is how much movie j contains comedy, etc.
Questions
- What happens when we compute the dot product w_{i}^{T}u_{j}?
- How well do user_i's preferences correlate with movie_j's attributes?
w_{i}^{T}u_{j}=\left\| w_{i}\right\| \left\| u_{j}\right\| \cos\theta
Example
Batman Animation
- w_{i}=(1, 0.8, -1, 0.1, 1)
- u_{j}=(1, 1.5, -1.3, 0, 1.2)
- result = 1*1 + 0.8*1.5 + (-1)*(-1.3) + 0.1*0 + 1*1.2 = 4.7
THE NOTEBOOK
- w_{i}=(1, 0.8, -1, 0.1, 1)
- u_{j}=(-1, -1, 1, 0, -1)
- result = 1*(-1) + 0.8*(-1) + (-1)*1 + 0.1*0 + 1*(-1) = -3.8
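Checking both examples with NumPy:

    import numpy as np

    w_i = np.array([1, 0.8, -1, 0.1, 1])         # user_i's preferences

    u_batman = np.array([1, 1.5, -1.3, 0, 1.2])  # animated Batman movie
    u_notebook = np.array([-1, -1, 1, 0, -1])    # The Notebook

    print(w_i.dot(u_batman))    # about 4.7  -> preferences and attributes line up
    print(w_i.dot(u_notebook))  # about -3.8 -> preferences and attributes clash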
Supervised Machine Learning
- We could predict how much a user likes an item, by extracting features from both and feeding it into a model like Random Forest or Neural Network
- We might include features about the user like age, gender, location, and so forth
- Some attributes about the item like RAM, Display, etc.
- The difference is that MF extracts the features automatically using only ratings
- Only by looking at the patterns between users, items, and the ratings
Dimensionality Reduction
- USV^{T} is exactly equal to X if K=M (assuming N > M and the rank of X is M)
- By shrinking U, S, and V, we make it an approximation
- This is called truncated SVD and yields our best rank-K approximation of X
- Similarly, MF reduces the dimensionality of R
- We encourage the model to learn the most important features required to generate R
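As a quick illustration (a sketch with arbitrary sizes), keeping only the top K singular values and vectors gives the rank-K approximation:

    import numpy as np

    X = np.random.randn(200, 50)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    K = 10
    X_k = (U[:, :K] * s[:K]) @ Vt[:K, :]  # truncated SVD: rank-K approximation of X
    print(np.linalg.norm(X - X_k))        # reconstruction error; shrinks as K grows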
Training
How can we ensure our approximation is good? We would like the rating matrices R and \hat{R} to be close together
R\approx \hat{R}=WU^{T}
Squared error loss
J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}
\Omega=set of pairs (i, j) where user_i rated movie_j
Minimize the loss
How? Find the gradient, set it to 0, and solve for the parameters
J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}
Solving for W
J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}
\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j})(-u_{j})=0
- Careful about which sets are being summed over
- \Omega represents the set of pairs (i,j) where user_i has rated movie_j
- \Psi_{i} is the set of all movies that user_i has rated
- For J, we want to sum over all ratings
- For a particular user vector w_{i}, we only care about the movies that user rated (because only those ratings involve w_{i})
\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}
- Try to isolate w_{i}
- Dot product is commutative
\sum_{j\in \Psi_{i} }(u_{j}^{T}w_{i})u_{j}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}
- Result of dot product is scalar
- scalar \times vector = vector \times scalar
\sum_{j\in \Psi_{i} }u_{j}(u_{j}^{T}w_{i})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}
\sum_{j\in \Psi_{i} }u_{j}u_{j}^{T}w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}
- w_{i} doesn't depend on j, the index of summation, so it can be moved outside the sum
- Add more brackets
- Sum the outer products of u_{j} with itself over all values of j first
\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}
- Now it's just Ax = b, which we know how to solve
- x = np.linalg.solve(A, b)
w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}
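In code, the update for one user vector might look like the following sketch (Psi_i, ratings, and U follow the earlier illustrative naming and are assumptions of this example):

    import numpy as np

    def update_user_vector(i, U, ratings, Psi_i, K):
        """Solve (sum_j u_j u_j^T) w_i = sum_j r_ij u_j for w_i."""
        A = np.zeros((K, K))
        b = np.zeros(K)
        for j in Psi_i:
            A += np.outer(U[j], U[j])    # accumulate the K x K matrix
            b += ratings[(i, j)] * U[j]  # accumulate the K-vector on the RHS
        return np.linalg.solve(A, b)     # the Ax = b solve from above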
Solving for U
- Loss is symmetric in W and U, so the steps should be the same
- Remember to be careful which set to sum over
- Since we only want terms involving u_{j}, we only sum over the users who rated item_j; that set is \Omega_{j}
J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}
\frac{\partial J}{\partial u_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j})(-w_{i})=0
- Try to isolate u_{j}
\sum_{i\in \Omega_{j} }(w_{i}^{T}u_{j})w_{i}=\sum_{i\in \Omega_{j} }r_{ij}w_{i}
- Change order of multiplication
\sum_{i\in \Omega_{j}}w_{i}w_{i}^{T}u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}
\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}
Training algorithm
2-way dependency
- Solution for W depends on U
- Solution for U depends on W
- The issue is that we end up with two separate update steps rather than a single closed-form solution
- This is the nature of our model
- Simply apply the equation as is, iteratively
- This is called Alternating Least Squares (ALS)
Algorithm
Initialize W = randn(N, K) and U = randn(M, K); then for t in range(T), apply the two updates below to every w_{i} and every u_{j} (a runnable sketch follows the equations)
w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}
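Putting the two updates together, a minimal ALS loop might look like the sketch below (no biases or regularization yet; `ratings`, `Psi`, and `Omega` follow the sparse-representation sketch earlier and are assumptions of this illustration):

    import numpy as np

    def als(ratings, Psi, Omega, N, M, K=10, T=25):  # K, T: illustrative defaults
        W = np.random.randn(N, K)
        U = np.random.randn(M, K)
        for t in range(T):
            for i, Psi_i in Psi.items():      # update every user vector
                A = sum(np.outer(U[j], U[j]) for j in Psi_i)
                b = sum(ratings[(i, j)] * U[j] for j in Psi_i)
                W[i] = np.linalg.solve(A, b)
            for j, Omega_j in Omega.items():  # update every movie vector
                A = sum(np.outer(W[i], W[i]) for i in Omega_j)
                b = sum(ratings[(i, j)] * W[i] for i in Omega_j)
                U[j] = np.linalg.solve(A, b)
        return W, U

Note that if a user or movie has fewer than K ratings, A can be singular; the regularization introduced later keeps the solve well-posed.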
FAQ
- Does it matter which order you update in? No
- Should you use the old values of W when updating U?
- Tends to go faster if you use the new values
- Computationally, if you wanted to use the old values, you'd have to make a copy (very slow)
Bias Terms
- The predicted rating \hat{r}_{ij} is the dot product of the user vector and the movie vector, plus the user bias b_{i}, the movie bias c_{j}, and the global average \mu
- Normally, simple linear regression needs only one bias term, but MF would like to have 3
\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu
- \mu: the global average rating; it centers the data set and is calculated directly from the training data (a one-liner follows this list)
- The reason we like both a user bias and a movie bias is that both these dimensions can be biased
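Computing \mu is just the mean of the observed training ratings, e.g. with the illustrative `ratings` dict from earlier:

    import numpy as np

    mu = np.mean(list(ratings.values()))  # global average of the observed training ratings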
Ex: Movie Bias
- Consider a universally well-liked movie like James Cameron's Avatar
- MF might find latent features, e.g.
- Sci-fi
- Aliens
- Humans fighting aliens
- Battlefield Earth has these attributes
- Widely considered to be the worst sci-fi movie of all time
- Thus, a movie-specific bias term makes sense
Training
Solving for W
- Objective function
J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2
\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu
- Differentiate wrt w_{i}
\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})=0
- Move other terms to RHS
\sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
- Use the same tricks from before to isolate w
w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
Solving for U
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}
Solving for b
- Differentiate wrt b
J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2
\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu
\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0
- Isolate b
- Common mistake: assuming that a constant moved outside a summation stays the same (don't forget to multiply by the # of terms being summed)
b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
- b_{i} is the average deviation between the target and the model prediction, if the prediction did not involve b_{i}
- i.e. b_{i} is exactly how much you need to add to the model prediction without b_{i}
Solving for c
- The only difference is what we sum over, which is now all the users who rated movie_j
J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2
\hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu
\frac{\partial J}{\partial c_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0
- Again, it makes sense that c_{j} is the average deviation of a model that doesn't involve c_{j}
c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
Summary
w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}
b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
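The bias updates translate almost directly into code; here is a sketch for b_{i} (c_{j} is symmetric), using the same illustrative names as before and assuming W and U are NumPy arrays:

    def update_user_bias(i, W, U, c, mu, ratings, Psi_i):
        """b_i = average deviation of the model's prediction made without b_i."""
        dev = sum(ratings[(i, j)] - W[i].dot(U[j]) - c[j] - mu for j in Psi_i)
        return dev / len(Psi_i)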
Regularization
A technique to prevent overfitting and help generalization
In linear regression
- Model: \hat{y}=w^{T}x
- Objective: J=\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}+\lambda\left\| w\right\|_{2}^{2}
- Solution: w=(\lambda I + X^{T}X)^{-1}X^{T}y
where \left\| w\right\|_{2}^{2}, the squared magnitude of the weights, is called the penalty term. The idea is that very large weights are a good sign of overfitting, so large weights are penalized.
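As a sketch, the closed-form ridge solution in NumPy:

    import numpy as np

    def ridge_fit(X, y, lam):
        """w = (lambda*I + X^T X)^{-1} X^T y, computed via a linear solve."""
        D = X.shape[1]
        return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)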
Regularization in Matrix Factorization
- Same approach, add squared magnitude of each parameter multiplied by regularization constant
- \left\| \cdot \right\|_{F} is called the Frobenius norm of a matrix
J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})
Solve for W
Derivatives are additive, so we just need to differentiate the 2nd term and add it to the existing derivative
\frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i}=0
- If you can't see how the Frobenius norm was differentiated wrt w_{i}, expand it
- Now it's just a dot product which we know how to differentiate
- The squared Frobenius norm is just the sum of the squared magnitudes of each individual element
- Therefore, I can also split them up in terms of each row as well
- The squared magnitude of a vector is just the dot product of the vector with itself
- The derivative of that dot product wrt w_{i} is 2w_{i}
\left\| W\right\|_{F}^{2}=\sum_{i=1}^{N}\sum_{k=1}^{K}\left| w_{ik}\right|^{2}=\sum_{i=1}^{N}\left\| w_{i}\right\|_{2}^{2}=\sum_{i=1}^{N}w_{i}^{T}w_{i}
- Group like-terms
\sum_{j\in \Psi_{i}}u_{j}u_{j}^{T}w_{i}+\lambda w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
- Factor out w
\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
- Apply our usual solution
w_{i}=\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
Solve for U
- Use symmetry reasoning
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}
Solve for b
- Differentiate wrt b_{i}
J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})
\frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)+2\lambda b_{i}=0
- Isolate b_{i}
\sum_{j\in \Psi_{i}}b_{i}+\lambda b_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
- Factor out b_{i}
b_{i}\left\{ \left ( \sum_{j\in \Psi_{i}}1 \right )+\lambda \right\} =\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
- Divide by constant
The denominator \left| \Psi_{i} \right| + \lambda is bigger than \left| \Psi_{i} \right|, so b_{i} becomes smaller
b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
Solve for c
- Apply symmetry reasoning again
c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
Summary
w_{i}=\left ( \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}
u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}
b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)
c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
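Finally, here is one possible NumPy sketch of a single training pass implementing the four regularized updates above (names such as `ratings`, `Psi`, `Omega`, and the bias arrays b and c follow the earlier sketches and are illustrative assumptions, not a reference implementation):

    import numpy as np

    def regularized_als_step(W, U, b, c, mu, ratings, Psi, Omega, lam):
        """One alternating pass: all user parameters, then all movie parameters."""
        K = W.shape[1]
        for i, Psi_i in Psi.items():
            A = lam * np.eye(K)
            v = np.zeros(K)
            for j in Psi_i:
                A += np.outer(U[j], U[j])
                v += (ratings[(i, j)] - b[i] - c[j] - mu) * U[j]
            W[i] = np.linalg.solve(A, v)
            b[i] = sum(ratings[(i, j)] - W[i].dot(U[j]) - c[j] - mu
                       for j in Psi_i) / (len(Psi_i) + lam)
        for j, Omega_j in Omega.items():
            A = lam * np.eye(K)
            v = np.zeros(K)
            for i in Omega_j:
                A += np.outer(W[i], W[i])
                v += (ratings[(i, j)] - b[i] - c[j] - mu) * W[i]
            U[j] = np.linalg.solve(A, v)
            c[j] = sum(ratings[(i, j)] - W[i].dot(U[j]) - b[i] - mu
                       for i in Omega_j) / (len(Omega_j) + lam)
        return W, U, b, c

Calling this repeatedly (for t in range(T)) gives the regularized, biased version of the ALS loop shown earlier.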