Matrix Factorization

    Introduction

    • Split the matrix into the product of 2 other matrices
    • We call it R hat because it only approximates R - it is our model of R
    • We would like W and U to be very skinny
    • W(N×K) - users matrix, U(M×K) - movie matrix
    • K somewhere from 10-50

    The scale of matrix R, W, and U

    • Key: W and U should be much smaller than R
    • R is N×M
    • Generally, we can't store R in memory
    • We can represent it using a special data structure
      • Dict{(u, m) -> r}
    • If N = 130k, M = 26k
      • N×M = 3.38 billion
      • # ratings = 20 million
      • Space used: 20 million / 3.38 billion = 0.006
    • This is called a sparse representation
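
    A minimal sketch of this sparse representation in Python, assuming the ratings arrive as (user, movie, rating) tuples; the variable names here are illustrative, not from any particular library:

    # Hypothetical example data: (user_id, movie_id, rating)
    ratings = [(0, 10, 4.5), (0, 42, 3.0), (7, 10, 5.0)]

    r = {}               # Dict{(u, m) -> r}: store only observed ratings
    user2movie = {}      # u -> movies rated by u (this will be Psi_i later)
    movie2user = {}      # m -> users who rated m (this will be Omega_j later)
    for u, m, rating in ratings:
        r[(u, m)] = rating
        user2movie.setdefault(u, []).append(m)
        movie2user.setdefault(m, []).append(u)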

    Scale down to K

    • If K = 10, N = 130k, M = 26k, then size of W and U: NK + MK = 1.56 million
    • How much savings? 1.56 million / 3.38 billion = 0.0005
      • This takes up less space than the ratings
      • This is good; we like # parameters < # of data points

    Calculate only at i and j

    • What happens if you try to calculate WU^{T} in code?
    • Don't do it. The result is N×M, which can't be loaded into memory unless you've selected a small subset of your data

    What if I just want one rating, for user i, item j?

    • Just a dot product between 2 vectors of size K
    • Unlike calculating all of our \hat{R}, this is very fast and takes up a trivial amount of space

    \hat{r}_{ij}=w_{i}^{T}u_{j}

    where \hat{r}_{ij}=\hat{R}[i,j],\ w_{i}=W[i],\ u_{j}=U[j]
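
    A quick sketch of this single-rating prediction, assuming W and U are NumPy arrays produced by training (the matrices below are random placeholders):

    import numpy as np

    N, M, K = 130_000, 26_000, 10
    W = np.random.randn(N, K)   # user matrix (placeholder; comes from training)
    U = np.random.randn(M, K)   # movie matrix (placeholder; comes from training)

    i, j = 123, 456
    r_hat_ij = W[i].dot(U[j])   # one dot product of two length-K vectors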

    Singular value decomposition (SVD)

    A matrix X can be decomposed into 3 separate matrices multiplied together

    X=USV^{T}

    What's the justification for calling matrix factorization SVD?

    • S can be absorbed into U or V, so we get an (N×K)×(K×M) product, which is N×M
    • The model is thus exactly like MF

    \hat{x}_{ij}=\sum_{k}u_{ik}s_{kk}v_{kj}=\sum_{k}(u_{ik}s_{kk})v_{kj}=\sum_{k}u'_{ik}v_{kj}={u'}_{i}^{T}v_{j}

    Linear algebra

    • X(N×M), U(N×K), S(K×K), V(M×K)
    • If I multiply U by S, I just get another N×K matrix
      • Then X is a product of 2 matrices, just like matrix factorization
      • Or equivalently, I could combine S with VT
    • R (rating matrix) is sparse
      • If U, S, and V can properly approximate a full X matrix, then surely they can approximate a mostly empty R matrix
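
    A small sketch of this absorption step using NumPy's SVD on a dense placeholder matrix (the real R is sparse, so you would not actually form it like this):

    import numpy as np

    X = np.random.randn(100, 40)                      # placeholder N x M matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U S V^T

    K = 10
    W = U[:, :K] * s[:K]       # absorb the top-K singular values into U -> N x K
    X_hat = W @ Vt[:K, :]      # rank-K approximation: a product of 2 matrices, like MF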

    Interpretation

    • Each of the K elements in wi and uj is a feature
    • Let's suppose K=5, and they are:
      • Action/adventure
      • Comedy
      • Romance
      • Horror
      • Animation
    • w_{i}(1) is how much user_i likes action
    • w_{i}(2) is how much user_i likes comedy, etc.
    • u_{j}(1) is how much movie_j contains action
    • u_{j}(2) is how much movie_j contains comedy, etc.

    Questions

    • What happens when we dot w_{i}^{T}u_{j}?
    • How well do user_i's preferences correlate with movie_j's attributes?

    w_{i}^{T}u_{j}=\sum_{k=1}^{K}w_{i}(k)\,u_{j}(k)

    Example

    Batman Animation

    • w_{i}=(1, 0.8, -1, 0.1, 1)
    • u_{j}=(1, 1.5, -1.3, 0, 1.2)
    • result = 1*1 + 0.8*1.5 + (-1)*(-1.3) + 0.1*0 + 1*1.2 = 4.7

    THE NOTEBOOK

    • w_{i}=(1, 0.8, -1, 0.1, 1)
    • u_{j}=(-1, -1, 1, 0, -1)
    • result = 1*(-1) + 0.8*(-1) + (-1)*1 + 0.1*0 + 1*(-1) = -3.8
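
    The same two dot products in NumPy, just to verify the arithmetic above:

    import numpy as np

    w_i = np.array([1, 0.8, -1, 0.1, 1])
    u_batman = np.array([1, 1.5, -1.3, 0, 1.2])
    u_notebook = np.array([-1, -1, 1, 0, -1])

    print(w_i.dot(u_batman))    # ~4.7: preferences and attributes align
    print(w_i.dot(u_notebook))  # ~-3.8: preferences and attributes clash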

    Supervised Machine Learning

    • We could predict how much a user likes an item by extracting features from both and feeding them into a model like a Random Forest or a Neural Network
      • We might include features about the user like age, gender, location, and so forth
      • Some attributes about the item like RAM, Display, etc.
    • The difference is that MF extracts the features automatically using only ratings
      • Only by looking at the patterns between users, items, and the ratings

    Dimensionality Reduction

    • The reconstruction is exactly equal to X if K=M (assuming N > M and the rank of X is M)
    • By shrinking U, S, and V, we make it an approximation
    • This is called truncated SVD and yields our best rank-K approximation of X
    • Similarly, MF reduces the dimensionality of R
    • We encourage the model to learn the most important features required to generate R

    Training

    How can we ensure our approximation is good? We would like the rating matrices R and \hat{R} to be close together

    R\approx \hat{R}=WU^{T}

    Squared error loss

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \Omega=set of pairs (i, j) where user_i rated movie_j
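
    With the sparse dict representation from earlier, this loss is a simple sum over the observed pairs only (a sketch, assuming r maps (i, j) to a rating and W, U are the factor matrices):

    def squared_error_loss(W, U, r):
        # sum over Omega: only the pairs (i, j) that were actually rated
        return sum((rating - W[i].dot(U[j])) ** 2 for (i, j), rating in r.items())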

    Minimize the loss

    How? Find the gradient, set it to 0, and solve for the parameters

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    Solving for W

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j})(-u_{j})=0

    • Careful about which sets are being summed over
    • \Omega represents the set of pairs (i,j) where user_i has rated movie_j
    • \Psi_{i} is the set of all movies that user_i has rated 
    • For J, we want to sum over all ratings
    • For a particular user vector w_{i}, we only care about the movies that user rated (because only those ratings involve w_{i})

    \sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Try to isolate w_{i}
    • Dot product is commutative

    \sum_{j\in \Psi_{i} }(u_{j}^{T}w_{i})u_{j}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Result of dot product is scalar
    • scalar \times vector = vector \times scalar

    \sum_{j\in \Psi_{i} }u_{j}(u_{j}^{T}w_{i})=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T}w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • w_{i} doesn't depend on the summation index j, so it can be factored out
    • Add brackets to make the grouping explicit
    • The outer products of u_{j} with itself are summed over all values of j in \Psi_{i}

    \left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )w_{i}=\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    • Now it's just Ax = b, which we know how to solve
    • x = np.linalg.solve(A, b)

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}
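
    As a sketch, one such update in NumPy, assuming the r and user2movie structures from the sparse-representation sketch above:

    import numpy as np

    def update_user_vector(i, U, r, user2movie, K):
        A = np.zeros((K, K))
        b = np.zeros(K)
        for j in user2movie[i]:          # sum over Psi_i only
            A += np.outer(U[j], U[j])    # u_j u_j^T
            b += r[(i, j)] * U[j]        # r_ij u_j
        return np.linalg.solve(A, b)     # solve A w_i = b rather than inverting A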

    Solving for U

    • Loss is symmetric in W and U, so the steps should be the same
    • Remember to be careful which set to sum over
    • Since we only want the terms involving u_{j}, we sum over only the users who rated item_j; that set is \Omega_{j}

    J=\sum_{i,j\in \Omega }(r_{ij}-\hat{r}_{ij})^{2}=\sum_{i,j\in \Omega}(r_{ij}-w_{i}^{T}u_{j})^{2}

    \frac{\partial J}{\partial u_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j})(-w_{i})=0

    • Try to isolate u_{j}

    \sum_{i\in \Omega_{j} }(w_{i}^{T}u_{j})w_{i}=\sum_{i\in \Omega_{j} }r_{ij}w_{i}

    • Change the order of multiplication

    \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T}u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    \left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )u_{j}=\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}

    Training algorithm

    2-way dependency

    • Solution for W depends on U
    • Solution for U depends on W
    • The issue is that we end up with two separate update steps, not one global closed-form equation
      • This is the nature of our model
    • Simply apply the equation as is, iteratively
    • Is called Alternating Least Squares

    Algorithm

    W = randn(N, K); U = randn(M, K);
    for t in range(T):

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }r_{ij}u_{j}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}r_{ij}w_{i}
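
    A hedged end-to-end sketch of this loop, reusing the r, user2movie (Psi_i), and movie2user (Omega_j) structures from earlier. Note that without regularization, A can be singular when a user or movie has fewer than K ratings:

    import numpy as np

    def als(r, user2movie, movie2user, N, M, K=10, T=25):
        W = np.random.randn(N, K)
        U = np.random.randn(M, K)
        for t in range(T):
            for i in user2movie:                                    # update every user vector
                A = sum(np.outer(U[j], U[j]) for j in user2movie[i])
                b = sum(r[(i, j)] * U[j] for j in user2movie[i])
                W[i] = np.linalg.solve(A, b)
            for j in movie2user:                                    # then every movie vector
                A = sum(np.outer(W[i], W[i]) for i in movie2user[j])
                b = sum(r[(i, j)] * W[i] for i in movie2user[j])
                U[j] = np.linalg.solve(A, b)
        return W, U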

    FAQ

    • Does it matter which order you update in? No
    • Should you use the old values of W when updating U?
      • Tends to go faster if you use the new values
      • Computationally, if you wanted to use the old values, you'd have to make a copy (very slow)

    Bias Terms

    • The predicted rating \hat{r}_{ij} is the dot product of the user vector and the movie vector, plus the user bias b_{i}, the movie bias c_{j}, and the global average \mu
    • Normally, simple linear regression needs only one bias term, but MF would like to have 3

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    • \mu: the global average rating; it centers the data set and is calculated directly from the training data
    • The reason we like both a user bias and a movie bias is that both these dimensions can be biased
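
    The prediction with bias terms, as a short sketch; b and c are assumed to be length-N and length-M arrays of biases, and mu is computed from the training ratings only:

    import numpy as np

    mu = np.mean(list(r.values()))                  # global average, from train data only
    r_hat_ij = W[i].dot(U[j]) + b[i] + c[j] + mu    # dot product + user bias + movie bias + mean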

    Ex: Movie Bias

    • Consider a universally well-liked movie like James Cameron's Avatar
    • MF might find latent features, e.g.
      • Sci-fi
      • Aliens
      • Humans fighting aliens
    • Battlefield Earth has these same attributes
    • Yet it is widely considered to be the worst sci-fi movie of all time
    • Thus, a movie-specific bias term makes sense

    Training

    Solving for W

    • Objective function

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    • Differentiate wrt w_{i}

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})=0

    • Move other terms to RHS

    \sum_{j\in \Psi_{i} }(w_{i}^{T}u_{j})(u_{j})=\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Use the same tricks from before to isolate w

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    Solving for U

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    Solving for b

    • Differentiate wrt b

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    \frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0

    • Isolate b
    • Common mistake: assuming a variable moved outside a summation is just the variable itself (don't forget to multiply by the number of terms being summed)

    b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • b_{i} is the average deviation between the target and the model's prediction, if that prediction did not involve b_{i}
    • i.e. b_{i} is exactly how much you need to add to the model prediction without b_{i}

    Solving for c

    • The only difference is the set we sum over, which is now all the users who rated movie_j

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^2

    \hat{r}_{ij}=w_{i}^{T}u_{j}+b_{i}+c_{j}+\mu

    \frac{\partial J}{\partial c_{j}}=2\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)=0

    • Again, it makes sense that c_{j} is the average deviation of a model that doesn't involve c_{j}

    c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Summary

    w_{i}=\left ( \sum_{j\in \Psi_{i} }u_{j}u_{j}^{T} \right )^{-1}\sum_{j\in \Psi_{i} }(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    u_{j}=\left ( \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} \right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    b_{i}=\frac{1}{\left|\Psi_{i} \right|}\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    c_{j}=\frac{1}{\left|\Omega_{j} \right|}\sum_{i\in \Omega_{j} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Regularization

    A technique to prevent overfitting and help generalization

    In linear regression

    • Model: \hat{y}=w^{T}x
    • Objective: J=\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}+\lambda\left\| w\right\|_{2}^{2}
    • Solution: w=(\lambda I + X^{T}X)^{-1}X^{T}y

    where \left\| w\right\|_{2}^{2}, the squared magnitude of the weights, is called the penalty term. The idea is that if the weights become too large, that's a good sign of overfitting, so large weights are penalized.
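
    As a sketch, the regularized solution in NumPy; X is the N×D design matrix, y the targets, and lam the constant \lambda:

    import numpy as np

    def ridge_solution(X, y, lam=1.0):
        D = X.shape[1]
        # solve (lambda*I + X^T X) w = X^T y instead of forming the inverse
        return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)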

    Regularization in Matrix Factorization

    • Same approach, add squared magnitude of each parameter multiplied by regularization constant
    • \left\| \cdot \right\|_{F} denotes the Frobenius norm of a matrix

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})

    Solve for W

    Derivatives are additive, so we just need to differentiate the 2nd term and add it to the existing derivative

    \frac{\partial J}{\partial w_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-u_{j})+2\lambda w_{i}=0

    • If you can't see how to differentiate the Frobenius norm wrt w_{i}, expand it
    • Now it's just a dot product which we know how to differentiate
    • The squared Frobenius norm is just the sum of the squared magnitudes of the individual elements
    • Therefore, I can also split them up in terms of each row as well
    • The squared magnitude of a vector is just the dot product of the vector with itself
    • The derivative of the dot product of a vector with itself is 2 times the vector (the 2 matches the factor in front of the sum)

    \left\| W\right\|_{F}^{2}=\sum_{i=1}^{N}\sum_{k=1}^{K}\left| w_{ik}\right|^{2}=\sum_{i=1}^{N}\left\| w_{i}\right\|_{2}^{2}=\sum_{i=1}^{N}w_{i}^{T}w_{i}

    • Group like-terms

    \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T}w_{i}+\lambda w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Factor out w

    \left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )w_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    • Apply our usual solution

    w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    Solve for U

    • Use symmetry reasoning

    u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    Solve for b

    • differentiate

    J=\sum_{i,j\in \Omega}(r_{ij}-\hat{r}_{ij})^{2}+\lambda(\left\| W\right\|_{F}^{2}+\left\| U\right\|_{F}^{2}+\left\| b\right\|_{2}^{2}+\left\| c\right\|_{2}^{2})

    \frac{\partial J}{\partial b_{i}}=2\sum_{j\in \Psi_{i} }(r_{ij}-w_{i}^{T}u_{j}-b_{i}-c_{j}-\mu)(-1)+2\lambda b_{i}=0

    • Isolate b_{i}

    \sum_{j\in \Psi_{i}}b_{i}+\lambda b_{i}=\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • Factor out b_{i}

    b_{i}\left\{ \left ( \sum_{j\in \Psi_{i}}1 \right )+\lambda \right\} =\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    • Divide by constant

    The denominator \left| \Psi_{i} \right| + \lambda is larger than \left| \Psi_{i} \right| alone, so b_{i} becomes smaller (it is shrunk toward zero)

    b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    Solve for c

    • Apply symmetry reasoning again

    c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)

    Summary

    w_{i}=\left (  \sum_{j\in \Psi_{i}}u_{j}u_{j}^{T} + \lambda I\right )^{-1}\sum_{j\in \Psi_{i}}(r_{ij}-b_{i}-c_{j}-\mu)u_{j}

    u_{j}=\left (  \sum_{i\in \Omega_{j}}w_{i}w_{i}^{T} + \lambda I\right )^{-1}\sum_{i\in \Omega_{j}}(r_{ij}-b_{i}-c_{j}-\mu)w_{i}

    b_{i}=\frac{1}{\left| \Psi_{i} \right| + \lambda}\sum_{j\in \Psi_{i}}(r_{ij}-w_{i}^{T}u_{j}-c_{j}-\mu)

    c_{j}=\frac{1}{\left| \Omega_{j} \right| + \lambda}\sum_{i\in \Omega_{j}}(r_{ij}-w_{i}^{T}u_{j}-b_{i}-\mu)
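
    A hedged sketch of these regularized updates for one user i; the item-side updates are symmetric. It assumes the r and user2movie structures from earlier, plus the current U, b, c, and mu:

    import numpy as np

    def update_w_i(i, U, b, c, mu, r, user2movie, K, lam=20.0):
        A = lam * np.eye(K)                              # lambda*I + sum of u_j u_j^T
        rhs = np.zeros(K)
        for j in user2movie[i]:
            A += np.outer(U[j], U[j])
            rhs += (r[(i, j)] - b[i] - c[j] - mu) * U[j]
        return np.linalg.solve(A, rhs)

    def update_b_i(i, W, U, b, c, mu, r, user2movie, lam=20.0):
        total = sum(r[(i, j)] - W[i].dot(U[j]) - c[j] - mu for j in user2movie[i])
        return total / (len(user2movie[i]) + lam)        # |Psi_i| + lambda in the denominator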
