FunkSVD: math, code, prediction, and validation
Today I would like to cover the following parts:
1. FunkSVD: math, process, and code
2. Based on part 1, making predictions and validating them, with emphasis on the different cases that help explain the validation process
Let’s get started.
FunkSVD: a brief intro
1. FunkSVD is a kind of matrix factorization. The original algorithm, proposed by Simon Funk in his blog post, factorizes the user-item rating matrix as the product of two lower-dimensional matrices.
2. The predicted ratings can be computed as R=HW, where R is the user-item rating matrix, H contains the users’ latent factors, and W contains the items’ latent factors. Specifically, the predicted rating r that user u will give to item i is the dot product of user u’s row in H and item i’s column in W: r(u, i) = sum over the latent features f of H[u, f] * W[f, i].
There are three matrices in this formula:
- User-rating matrix: known ratings + ratings to be predicted
- Users’ latent matrix: needs to be constructed
- Items’ latent matrix: needs to be constructed
3. Note that in Funk MF no singular value decomposition is applied; it is an SVD-like machine learning model.
4. There is a precondition for making and validating predictions in FunkSVD: the user must exist in both the training and validation data, and the item to predict must have a rating in the training matrix (the rating doesn’t need to come from that user). For any user in the validation data who is not in the training data, we will not be able to predict ratings or recommend items to that user.
5. Compared with SVD, FunkSVD works well when there are a lot of missing values. I will not show this in detail, as it is not the focus of this story.
FunkSVD: process + code
What you should have prepared:
In the beginning, two cleaned tables are needed: a movies table with features (like the genre) and a review table with ratings.
movies table: includes at least movie id and genre.
review table: includes at least user_id, movie_id, and rating.
Based on these two tables, we will:
1. Prepare the user-rating matrix (R) from the original two tables: here called ratings-mat
2. Initialize two matrices, the users’ latent matrix (H) and the items’ latent matrix (W), and fill them with random values.
3. Search the user-item matrix R for the ratings that already exist for each user.
4. For each rating found in R, predict it as the dot product between the row of H associated with the user and the column of W associated with the item, then update the random values with the following rules:
4.1 The squared error between the actual and predicted ratings is (r-hw)*(r-hw).
4.2 To minimize the error across all of the known values by changing the weights in each matrix, use gradient descent. (As it is difficult to write formulas here, I took them from the Udacity course: take y as r, u as h, and v as w.)
4.3 In each derivative, y-uv is the difference between the actual and predicted values. The derivative with respect to u is -2(y-uv)v and the derivative with respect to v is -2(y-uv)u, so we update each value by moving it in the direction opposite to its derivative. As we don’t want to move too fast, we multiply each step by a learning rate alpha.
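Written out, steps 4.1 to 4.3 can be sketched as below. This is a reconstruction from the description above, not a copy of the course's formulas; the index k over latent features and the arrow notation for the update are my own notation.

```latex
\begin{aligned}
E &= \Bigl(y - \sum_{k} u_k v_k\Bigr)^{2} \\
\frac{\partial E}{\partial u_k} &= -2\Bigl(y - \sum_{j} u_j v_j\Bigr)\, v_k,
\qquad
\frac{\partial E}{\partial v_k} = -2\Bigl(y - \sum_{j} u_j v_j\Bigr)\, u_k \\
u_k &\leftarrow u_k + 2\alpha\Bigl(y - \sum_{j} u_j v_j\Bigr)\, v_k,
\qquad
v_k \leftarrow v_k + 2\alpha\Bigl(y - \sum_{j} u_j v_j\Bigr)\, u_k
\end{aligned}
```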
The corresponding code for steps 3 and 4 is as follows:
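A minimal sketch of the update for one known rating, assuming the user's row of H and the item's column of W are NumPy vectors of the same length. The helper name sgd_step is hypothetical and only illustrates the rules in 4.1 to 4.3.

```python
import numpy as np


def sgd_step(user_row, item_col, rating, learning_rate):
    """Apply one gradient-descent update for a single known rating.

    Hypothetical helper: returns the updated user row, the updated item
    column, and the squared error measured before the update.
    """
    # Prediction is the dot product of the user's row and the item's column.
    error = rating - np.dot(user_row, item_col)
    # Move against the gradient: dE/dh = -2*error*w and dE/dw = -2*error*h.
    new_user_row = user_row + learning_rate * 2 * error * item_col
    new_item_col = item_col + learning_rate * 2 * error * user_row
    return new_user_row, new_item_col, error ** 2
```

Both updates use the values from before the step, so the item update does not see the already-updated user row.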
5. The complete FunkSVD code is as follows:
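A minimal sketch of the whole procedure, using the same function name and keyword arguments as the call shown in this post. It assumes the rating matrix is a 2-D array with NaN wherever a rating is missing; the seed argument is my own addition for reproducibility, and the author's actual implementation may differ.

```python
import numpy as np


def FunkSVD(ratings_mat, latent_features=12, learning_rate=0.005, iters=100, seed=None):
    """Factorize a user-by-movie rating matrix with gradient descent.

    Assumption: missing ratings are stored as NaN. Returns
    user_mat (users x latent_features) and movie_mat (latent_features x movies).
    """
    ratings = np.asarray(ratings_mat, dtype=float)
    n_users, n_movies = ratings.shape

    # Step 2: start from random latent factors.
    rng = np.random.default_rng(seed)
    user_mat = rng.random((n_users, latent_features))
    movie_mat = rng.random((latent_features, n_movies))

    # Step 3: collect the (user, movie) positions that hold a known rating.
    known = np.argwhere(~np.isnan(ratings))

    for _ in range(iters):
        for u, m in known:
            # Step 4: error between the actual rating and the dot-product prediction.
            error = ratings[u, m] - user_mat[u, :] @ movie_mat[:, m]
            # Keep the old user row so both updates use pre-update values.
            old_user_row = user_mat[u, :].copy()
            user_mat[u, :] += learning_rate * 2 * error * movie_mat[:, m]
            movie_mat[:, m] += learning_rate * 2 * error * old_user_row

    return user_mat, movie_mat
```

Only the known ratings drive the updates, which is why the method copes with a matrix that is mostly missing values.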
latent_features, learning_rate, and iters are hyperparameters that need to be tuned. For example:
user_mat, movie_mat = FunkSVD(rating_mat, latent_features=12, learning_rate=0.005, iters=100)
Now the three matrices (R, H, W) are prepared. Running this code takes a while, so please be patient or do something else while it runs.
Part 2: using offline validation techniques, create a training set and a validation set that we can use to test the performance of the FunkSVD algorithm
2.1 Split the data into training and validation datasets (because of the size limitation, only part of the review data is included on GitHub)
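The post does not state how the split is done, so the following is only an illustrative sketch that assumes a simple random split of the review rows. The function name split_reviews, the 80/20 default, and the seed are all my assumptions; an ordered split (for example by review date) is an equally plausible choice.

```python
import numpy as np


def split_reviews(reviews, train_fraction=0.8, seed=0):
    """Randomly split review rows into training and validation parts.

    Hypothetical helper: `reviews` is an array-like with one row per review,
    for example (user_id, movie_id, rating).
    """
    reviews = np.asarray(reviews)
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then cut at the requested fraction.
    order = rng.permutation(len(reviews))
    cut = int(len(reviews) * train_fraction)
    return reviews[order[:cut]], reviews[order[cut:]]
```

Whatever rule is used, some users and movies will end up in only one of the two parts, which is what the cases later in this post are about.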
2.2 Predict rating function: for a given user_id, predict the rating for a given movie_id:
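A minimal sketch of such a function. It assumes that the rows of the user matrix follow the order of the training user ids and the columns of the movie matrix follow the order of the training movie ids; the argument names are hypothetical. Raising an error for an unknown user or movie mirrors the errors described in the cases later in this post.

```python
import numpy as np


def predict_rating(user_matrix, movie_matrix, train_user_ids, train_movie_ids,
                   user_id, movie_id):
    """Predict one rating as the dot product of the matching latent vectors.

    Assumption: row i of user_matrix belongs to train_user_ids[i], and
    column j of movie_matrix belongs to train_movie_ids[j].
    """
    user_rows = np.where(np.asarray(train_user_ids) == user_id)[0]
    movie_cols = np.where(np.asarray(train_movie_ids) == movie_id)[0]

    # A user or movie never seen in training has no latent factors.
    if user_rows.size == 0:
        raise KeyError(f"user {user_id} is not in the training data")
    if movie_cols.size == 0:
        raise KeyError(f"movie {movie_id} is not in the training data")

    return float(user_matrix[user_rows[0], :] @ movie_matrix[:, movie_cols[0]])
```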
Here is an example for user id 49056, movie id 1598822:
An example of the result:
2.4 Let’s check the four cases below to understand the validation process, why errors occur, and how to solve them.
- User in both the training and validation sets, and the movie has ratings in the training set: for example, user_id=49056, movie_id=1598822
- User in both the training and validation sets, but the movie has no ratings in the training set: for example, user_id=29000, movie_id=287978
- User in the validation set, but not in the training set: for example, user_id=220, movie_id=1598822
- User in the training set, but not in the validation set: for example, user_id=8, movie_id=1598822
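The four cases above can be told apart with a few membership checks. This is only a sketch: the function name validation_case and its arguments are hypothetical, and the id collections are assumed to be Python sets built from the training and validation data.

```python
def validation_case(user_id, movie_id, train_users, train_movies, val_users):
    """Return the case number (1 to 4) for a user-movie pair, or None.

    Hypothetical helper: train_users, train_movies, and val_users are sets of ids.
    """
    in_train = user_id in train_users
    in_val = user_id in val_users

    if in_train and in_val:
        # Case 1 if the movie was rated in training, otherwise case 2.
        return 1 if movie_id in train_movies else 2
    if in_val:
        # Case 3: the user appears only in the validation set.
        return 3
    if in_train:
        # Case 4: the user appears only in the training set.
        return 4
    # The user appears in neither set.
    return None
```

Case 4 here only looks at the user; a prediction for that user still needs the movie to have ratings in the training set.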
Case 1: User in both the training and validation sets, and the movie has ratings in the training set.
Let’s check how the data looks in the datasets.
There is no record from user 49056 for movie 1598822 in the training dataset.
There is a record from user 49056 for movie 1598822 in the validation dataset.
How about the ratings from user 49056 in the training dataset?
Yes, the user’s ratings for other movies exist.
How about the ratings for movie 1598822 in the training dataset?
Yes, ratings for movie 1598822 from other users exist.
The validation: the predicted rating for user 49056 and movie 1598822 is 6, and the actual rating is 8.
The conclusion for case 1:
- The user exists in both datasets, the movie has ratings in the training set (not from the target user), and the user’s rating of the movie exists in the validation dataset, so we can predict and validate the user-movie pair.
Case 2: User in both training and validation set but movie rating doesn’t exist in the training set.
User 29000 exists in both datasets:
But there is no rating for movie 287979 in the training dataset:
The prediction fails with an error reporting that this movie does not exist.
This is a kind of cold-start problem.
Case 3: User in the validation set, but not in the training set.
The prediction fails with an error reporting that this user does not exist.
This is also a kind of cold-start problem.
Case 4: User in the training set, but not in the validation set
The prediction is possible, but there is nothing to validate it against.
The conclusions for FunkSVD validation are:
1. FunkSVD can make a prediction and have it validated only when:
- The user exists in both datasets
- The item has been rated in the training dataset (not necessarily by that user), and the user’s rating of the item exists in the validation dataset
2. If the user exists in the training dataset and the item has been rated in the training dataset, but the pair is not in the validation dataset, prediction is still possible, but we can’t validate how good the prediction is.
3. Otherwise, there is a cold-start problem. For this problem (new users or new movies), combining FunkSVD with content-based and rank-based recommendations will be helpful.
4. Besides user-movie pairs, this analysis method can be extended to other pairs, like customer-product, user-article, etc. The data preprocessing might vary, but once the data is cleaned, the method can be unified (this is my opinion; please feel free to comment).
5. A trick for finding which users exist in both sets and which exist in only one of them: try NumPy’s setdiff1d to get them efficiently.
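A small sketch of that trick, assuming the user ids of each set are available as arrays or lists. np.setdiff1d returns the ids found in its first argument but not in its second; np.intersect1d, which the post does not mention, is added here for the ids found in both. The function name compare_users is hypothetical.

```python
import numpy as np


def compare_users(train_user_ids, val_user_ids):
    """Return (ids in both sets, ids only in validation, ids only in training)."""
    train_ids = np.unique(train_user_ids)
    val_ids = np.unique(val_user_ids)

    in_both = np.intersect1d(train_ids, val_ids)
    # Validation-only users are the cold-start users at validation time.
    only_val = np.setdiff1d(val_ids, train_ids)
    # Training-only users can get predictions, but nothing to validate against.
    only_train = np.setdiff1d(train_ids, val_ids)
    return in_both, only_val, only_train
```

The same call works for movie ids when checking which movies have no ratings in the training set.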
Thank you for reading.