A NOVEL LATENT FACTOR MODEL FOR RECOMMENDER SYSTEM

Matrix factorization (MF) has become one of the better practices for handling sparse data in the field of recommender systems. Funk singular value decomposition (Funk-SVD) is a variant of MF and a state-of-the-art method that was instrumental in winning the Netflix Prize competition. The method is still widely used, with modifications, in present-day recommender systems research. Because the number of data points can grow at very high velocity, it is prudent to devise newer methods that handle such data as accurately as Funk-SVD but more efficiently. In view of this growth, I propose a latent factor model that addresses both accuracy and efficiency by reducing the number of latent features of either users or items, which makes it less complex than Funk-SVD, where the numbers of latent features of users and items are equal and often larger. A comprehensive empirical evaluation on two publicly available datasets, Amazon and ml-100k, shows that the proposed method achieves accuracy comparable to Funk-SVD at lower complexity.


INTRODUCTION
Around the globe, clicks on e-service providers are increasingly outpacing footfalls in traditional stores. This has enabled many new e-commerce platforms, such as Alibaba and Flipkart, to spring up. One advantage of digital platforms is their ability to offer customers a large number of choices (Adolphs & Winkelmann, 2010). However, with the rapid growth in customers and in the options available to them, it is often difficult for customers to choose the right product for their needs (Kim, Yum, Song, & Kim, 2005). Fortunately, the digitization of e-commerce data and its affordable processing have enabled e-service providers to aid customers in decision making (Bauer & Nanopoulos, 2014; Ho, Kyeong, & Hie, 2002). As decision-support tools, recommender systems (RS) collect information on the preferences of users for items (movies, news, videos, products, etc.). The literature distinguishes two categories of user preference: 1) explicit feedback and 2) implicit feedback. Explicit feedback is feedback in which users directly report their interest in a product. For example, Netflix collects star ratings for movies and TiVo users indicate their preferences for TV shows by hitting thumbs-up/down buttons. Because explicit feedback is not always available, some recommenders infer user preferences from the more abundant implicit feedback, which indirectly reflects opinion through observed user behavior. Types of implicit feedback include purchase history, browsing history, search patterns, and even mouse clicks. For example, a user who has purchased many books by the same author probably likes that author (Koren & Bell, 2011).
The accuracy of a model in RS is measured by various metrics, the most popular of which is the root mean square error (RMSE) (Bobadilla, Ortega, Hernando, & Gutiérrez, 2013; Parambath, 2013; Yang et al., 2014). The data to which the model is applied are first partitioned into a training set and a test set. The model is trained on the training set and evaluated on the test set: a rating prediction model predicts the ratings of the items in the test set, and the RMSE measured on that set indicates the accuracy of the model. The lower the RMSE on the test set, the better the model. The complexity of a model, on the other hand, is determined by the number of parameters that must be trained. The more parameters there are, the more complex the model, and vice versa (Koren & Sill, 2011; Russell & Yoon, 2008; Su & Khoshgoftaar, 2009).
In this paper, my main objective is to design a recommender system that reduces model complexity without compromising accuracy. I therefore propose a latent factor model that is less complex than regularized SVD (RSVD) but at least as accurate. RSVD maps both items and users into the same multi-dimensional latent factor space, and the rating a user gives an item is modelled as the dot product of the item and user latent factor vectors (Koren, 2008). Unlike RSVD, the proposed model maps items and users into different latent factor spaces, with users in a one-dimensional space and items in a space of more than one dimension. The rating a user gives an item is modelled as the product of the user's latent factor and the sum of the item's latent factors.
In the remainder of this paper, I first review related work on collaborative filtering (CF) in recommender systems. Because a great deal of research has been published on recommender systems, I consider only the CF models that are relevant to matrix factorization. The following section formally introduces the baseline RSVD algorithm and then the proposed model, and compares the two in terms of model formulation. The next section covers experiments with the proposed model on the MovieLens and Amazon datasets and compares the accuracy of the models. The final section discusses the observations from the experiments and the conclusions drawn from them.

Related work
Memory-based collaborative filtering algorithms (Breese et al., 1998; Shardanand & Maes, 1995) predict a user's rating of an unrated item from the ratings previously provided by users. Model-based collaborative filtering algorithms (Goldberg, Roeder, Gupta, & Perkins, 2001) use the collection of ratings to learn a model, which is then used to predict ratings.
One important aspect of memory-based algorithms is finding the similarity between users or items. The correlation-based and cosine-based approaches are the most widely used similarity measures (B. M. Sarwar, Karypis, Konstan, & Riedl, 2000; B. Sarwar, Karypis, Konstan, & Riedl, 2002). Many newer ways of calculating similarity to improve prediction have been proposed as extensions of the standard correlation-based and cosine-based techniques (Pennock, Lawrence, & Giles, 2000; Zhang, Edwards, & Harding, 2007).
Model-based methods have also become popular with the rise of high-speed computers, as they require complex calculations. The methods that have been used to model recommender systems include Bayesian models (Jin & Si, 2004) (Bauer & Nanopoulos, 2014; Koren & Sill, 2011; Mishra, Kumar, & Bhasker, 2015; Russell & Yoon, 2008; Su & Khoshgoftaar, 2009) and probabilistic models (Kumar, Raghavan, Rajagopalan, & Tomkins, 2001).

(Salakhutdinov, Mnih & Hinton 2007), Probabilistic
- Two-layer undirected graphical models with hidden units which learn features of users and items
- It is a scalable method for rating prediction

Table 1: A summary of latent factor models

Based on the above review of latent factor models, it is apparent that the focus has mostly been on improving the accuracy of the model and, to a lesser extent, on reducing space complexity. I therefore propose a very simple yet accurate model for the recommendation task on e-commerce platforms.

METHODOLOGY
Preliminaries
The model-based CF problem can be formulated as follows. Given a user set $U$ with $n$ users, an item set $I$ with $m$ items, and the preference of user $u$ for item $i$ denoted by $r_{ui}$, a user-item matrix $R$ of dimension $n \times m$ is formed, in which each row vector corresponds to a user, each column vector corresponds to an item, and each entry $r_{ui}$ denotes user $u$'s preference for item $i$. A user may express implicit preferences (clicks or purchases) as well as explicit preferences (ratings); usually a higher rating means a stronger preference and a lower rating means little or no preference for an item. Because users typically rate only a small fraction of the items in a dataset, the matrix $R$ is sparse. Given the set of known preferences, the objective is to construct a recommender that predicts users' ratings of unseen items and then recommends the items with the highest predicted ratings.
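As a minimal sketch of the setup above, the sparse matrix R can be stored as a dictionary keyed by (user, item) pairs, so that only observed preferences take up memory. The toy ratings and the names `triples`, `ratings`, `density` and `unseen` are illustrative assumptions, not part of the datasets used later.

```python
# Toy explicit ratings as (user, item, rating) triples; the values are made up.
triples = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0)]
n_users, n_items = 3, 4

# Sparse representation of R: only observed entries r_ui are stored.
ratings = {(u, i): r for u, i, r in triples}

def density(ratings, n_users, n_items):
    """Fraction of the n x m user-item matrix that is actually observed."""
    return len(ratings) / (n_users * n_items)

# Unseen (user, item) pairs are the ones a recommender has to predict.
unseen = [(u, i) for u in range(n_users) for i in range(n_items)
          if (u, i) not in ratings]
```

With this layout the memory cost grows with the number of known ratings rather than with the full n × m grid, which matters when most entries are missing.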

Notations
To distinguish users from items, separate indexing letters are used: a user is denoted by $u$ and an item by $i$. A rating $r_{ui}$ indicates the preference of user $u$ for item $i$, where high values mean a stronger preference and low values mean little or no preference for item $i$. For example, on a scale from "1 star" to "5 stars", a "1 star" rating means low interest by user $u$ in item $i$ and a "5 stars" rating means high interest. Predicted and observed ratings are distinguished by the notation $\hat{r}_{ui}$ and $r_{ui}$ respectively. In the models, the regularization parameter is denoted by $\lambda$ and the learning rate by $\gamma$.

Regularized Singular Value Decomposition
The matrix factorization (MF) technique was one of the most popular approaches to the CF problem in the Netflix Prize competition. Regularized SVD is a type of MF proposed by Simon Funk and successfully applied to the Netflix challenge (Paterek, 2007). The basic idea of regularized SVD is that users and items can be described by their latent features. Every item is associated with a feature vector $Q_i$ that describes the type of movie, e.g. comedy vs. drama or romantic vs. action. Similarly, every user is associated with a corresponding feature vector $P_u$. To build the model, the dot product of the user feature vector and the item feature vector is taken as an approximation of the actual rating given by user $u$ to item $i$. Mathematically, it can be expressed as:

$$r_{ui} \approx \hat{r}_{ui} = P_u^{T} Q_i \qquad (1)$$

More formally, an initial baseline estimate for every {user, item} pair is computed as $b_{ui} = b_u + b_i + \mu$, where the user bias $b_u$ is the observed deviation of user $u$ from the average rating of all users, the item bias $b_i$ is the observed deviation of item $i$ from the average rating of all items, and $\mu$ is the global average of all user-item ratings. The baseline estimate is added linearly to equation 1, and, to control over-fitting, a regularization parameter $\lambda$ is introduced into the resulting expression, which minimizes the sum of squared errors between predicted and actual ratings. The task is therefore to minimize the following expression:

$$\min_{P, Q} \sum_{(u,i) \in \kappa} \left( r_{ui} - \mu - b_u - b_i - P_u^{T} Q_i \right)^2 + \lambda \left( \lVert P \rVert_F^2 + \lVert Q \rVert_F^2 \right) \qquad (2)$$

Here, $\kappa$ is the set of $(u, i)$ pairs with known ratings and $\lVert \cdot \rVert_F$ denotes the Frobenius norm. The objective can be minimized using the stochastic gradient descent (SGD) method. Since the objective is non-convex, SGD may not attain the global optimum, but it can reach a point close to it, yielding a suboptimal solution (Ma, Zhou, Liu, Lyu & King 2011). At every iteration, the learning rate $\gamma$ is multiplied by the gradient of the objective in order to move towards a minimum. The updates of $P_u$ and $Q_i$ for every user and every item are carried out after every iteration in the following manner:

$$e_{ui} = r_{ui} - \hat{r}_{ui} \qquad (3)$$

$$P_u \leftarrow P_u + \gamma \left( e_{ui} Q_i - \lambda P_u \right)$$

$$Q_i \leftarrow Q_i + \gamma \left( e_{ui} P_u - \lambda Q_i \right)$$

However, the predicted ratings obtained by following the above steps need to be clipped to the range 1 to 5 to give the final predicted rating (Paterek 2007). If the predicted rating exceeds 5 it is clipped to 5, and if it is less than 1 it is clipped to 1. The rating of an unrated item for a user is predicted by taking the dot product of the learned features of the corresponding item and user and adding the global average, the user bias and the item bias. The predicted rating is given by:

$$\hat{r}_{ui} = \mu + b_u + b_i + P_u^{T} Q_i$$

The aforementioned model outperforms other popular algorithms both on densely populated datasets and as the sparsity of the dataset increases, as was the case in the Netflix Prize competition. The method is also scalable and accurate, which makes it a very important contribution to the field. In line with this model, I propose a latent factor bilinear regression model that differs from RSVD in having fewer features. With fewer features, the proposed model reduces complexity without compromising accuracy or scalability.
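The training and prediction steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the baseline terms are taken as precomputed and fixed, the update rule is the standard SGD step for a regularized squared-error objective, and the function names, the random initialization range and the fixed number of epochs are hypothetical choices.

```python
import random

def clip(x, lo=1.0, hi=5.0):
    """Clip a predicted rating to the valid rating range."""
    return max(lo, min(hi, x))

def train_rsvd(ratings, n_users, n_items, mu, b_u, b_i,
               K=5, lam=0.01, gamma=0.01, epochs=50, seed=0):
    """SGD for regularized SVD with fixed (precomputed) baseline terms.

    ratings: dict {(u, i): r_ui} of known ratings.
    Returns user factors P (n_users x K) and item factors Q (n_items x K).
    """
    rng = random.Random(seed)
    # Small random starting values (an assumed initialization).
    P = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            dot = sum(p * q for p, q in zip(P[u], Q[i]))
            e = r - (mu + b_u[u] + b_i[i] + dot)  # error on this rating
            for k in range(K):
                p_uk, q_ik = P[u][k], Q[i][k]  # use old values in both updates
                P[u][k] += gamma * (e * q_ik - lam * p_uk)
                Q[i][k] += gamma * (e * p_uk - lam * q_ik)
    return P, Q

def predict_rsvd(u, i, mu, b_u, b_i, P, Q):
    """Baseline plus dot product of latent factors, clipped to [1, 5]."""
    raw = mu + b_u[u] + b_i[i] + sum(p * q for p, q in zip(P[u], Q[i]))
    return clip(raw)
```

Here `lam` and `gamma` stand for the regularization parameter and the learning rate. Both factor matrices have K columns, so the number of trained latent parameters grows with K for users and items alike.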

Proposed latent factor model
The RSVD model is scalable and works well on sparse datasets; however, RSVD uses the same number of latent parameters for each user as for each item, and this number is often large, which increases the complexity of the model. Because complexity grows with the amount of data, the main purpose of this paper is to develop a model that is less complex than RSVD yet at least as accurate.
The proposed model trains latent features of both users and items, as RSVD does, but differs from it as follows. In RSVD, a known rating is approximated by the dot product of the user and item latent feature vectors. I propose instead to approximate the known rating of a user-item pair by the product of the user's single latent factor and the sum of the item's latent factors. Here, the number of item latent factors varies with the dataset, whereas the number of user latent factors is constant and equal to one. The model therefore has fewer latent factors to train than RSVD and is correspondingly less complex.
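To make the reduction concrete, the short sketch below counts the latent factors each model has to train, using the ml-100k dimensions reported in the Datasets section (943 users, 1682 movies) and K = 5. The function names are illustrative, and the counts cover latent factors only, not the bias terms.

```python
def rsvd_factor_count(n_users, n_items, K):
    """Latent parameters in RSVD: K per user plus K per item."""
    return n_users * K + n_items * K

def proposed_factor_count(n_users, n_items, K):
    """Latent parameters in the proposed model: 1 per user plus K per item."""
    return n_users + n_items * K

# ml-100k dimensions (943 users, 1682 movies) with K = 5 latent features.
n_users, n_items, K = 943, 1682, 5
rsvd = rsvd_factor_count(n_users, n_items, K)          # 13125
proposed = proposed_factor_count(n_users, n_items, K)  # 9353
saved = rsvd - proposed                                # 3772 = 943 * (K - 1)
```

The saving equals the number of users times (K - 1), so it grows with both the user base and the number of latent features.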
The proposed model can be written mathematically as:

$$\hat{r}_{ui} = X_u \sum_{k=1}^{K} Z_{ik}$$

where $Z_{ik}$, $k = 1, \dots, K$, are the $K$ latent features of item $i$ and $X_u$ is a latent weight applied to the sum of the latent features of an item rated by user $u$.
As in the RSVD model, an initial baseline estimate for every {user, item} pair is computed as $b_{ui} = b_u + b_i + \mu$, where $b_u$ and $b_i$ are the observed deviations of user $u$ and item $i$ from the user average rating and the item average rating, and $\mu$ is the global average of all observed user-item ratings. The baseline estimate is added linearly to the above scheme, and a regularization parameter $\lambda_1$ is introduced into the resulting expression, which minimizes the sum of squared errors between predicted and actual ratings. The scheme obtained after incorporating the baseline estimates and regularization is:

$$\min_{X, Z} \sum_{(u,i) \in \kappa} \left( r_{ui} - \mu - b_u - b_i - X_u \sum_{k=1}^{K} Z_{ik} \right)^2 + \lambda_1 \left( \lVert X \rVert^2 + \lVert Z \rVert_F^2 \right)$$

where $\kappa$ is the set of $(u, i)$ pairs with known ratings. This problem can be solved using the SGD method, as for RSVD. Since the objective is non-convex, as in the RSVD model, SGD may not attain the global optimum but can reach a point close to it. In order to reach a minimum, the learning rate $\gamma_1$ is multiplied by the gradient at each iteration. With $e_{ui} = r_{ui} - \hat{r}_{ui}$, the updates of $X_u$ and $Z_{ik}$ for every user and every item are carried out after each iteration in the following manner:

$$X_u \leftarrow X_u + \gamma_1 \left( e_{ui} \sum_{k=1}^{K} Z_{ik} - \lambda_1 X_u \right)$$

$$Z_{ik} \leftarrow Z_{ik} + \gamma_1 \left( e_{ui} X_u - \lambda_1 Z_{ik} \right)$$
An important point with regard to this algorithm is the stopping criterion: the algorithm is stopped as soon as the sum of squared errors starts to stabilize. If the sum of squared errors in the latest iteration equals that of the previous iteration, or differs from it by less than a small preset threshold $\epsilon$, the model is considered to have been learned and the algorithm is stopped. The pseudo code for learning with stochastic gradient descent is as follows:
1) Assign initial values to the matrix $Z$, the vectors $X$, $b_u$ and $b_i$, and the global average $\mu$.
2) Fix the values of $K$, $\lambda_1$ and $\gamma_1$.
3) Update the training parameters: for each known rating $r_{ui}$ in $R$, predict the rating for the user and item, compute the error $e_{ui}$, and update $X_u$ and $Z_{ik}$; repeat until the stopping criterion is met.

The space complexity of the proposed model is lower than that of RSVD, as can be seen by comparing the two models. In the RSVD model the dimensions of $P$ and $Q$ are N × K and M × K respectively, while the dimensions of the corresponding factors $X$ and $Z$ in the proposed model are N × 1 and M × K. The proposed model is therefore more compact than the RSVD model. The accuracy of both models is examined in the following sections.
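The learning procedure for the proposed model, including the stopping criterion, can be sketched as below. This is an illustrative reading of the description rather than the original code: the baseline terms are assumed precomputed and fixed, the update rule is the plain SGD step for a regularized squared-error objective, and the function names, the initialization range, `eps` and `max_epochs` are hypothetical choices.

```python
import random

def train_proposed(ratings, n_users, n_items, mu, b_u, b_i,
                   K=5, lam1=0.01, gamma1=0.01, eps=1e-4,
                   max_epochs=1000, seed=0):
    """SGD for r_hat_ui = mu + b_u + b_i + X_u * sum_k Z_ik.

    ratings: dict {(u, i): r_ui}. Returns X (length n_users) and
    Z (n_items x K). Stops when the sum of squared errors changes by
    less than eps between consecutive epochs.
    """
    rng = random.Random(seed)
    # Small positive starting values (an assumed initialization).
    X = [rng.uniform(0.05, 0.15) for _ in range(n_users)]
    Z = [[rng.uniform(0.05, 0.15) for _ in range(K)] for _ in range(n_items)]
    prev_sse = float("inf")
    for _ in range(max_epochs):
        sse = 0.0
        for (u, i), r in ratings.items():
            z_sum = sum(Z[i])
            e = r - (mu + b_u[u] + b_i[i] + X[u] * z_sum)
            sse += e * e
            x_u = X[u]  # keep the old value for the item update
            X[u] += gamma1 * (e * z_sum - lam1 * x_u)
            for k in range(K):
                Z[i][k] += gamma1 * (e * x_u - lam1 * Z[i][k])
        if abs(prev_sse - sse) < eps:  # errors have stabilized
            break
        prev_sse = sse
    return X, Z

def predict_proposed(u, i, mu, b_u, b_i, X, Z):
    """Predicted rating, clipped to the 1-5 star range."""
    raw = mu + b_u[u] + b_i[i] + X[u] * sum(Z[i])
    return max(1.0, min(5.0, raw))
```

Compared with an RSVD implementation, the only structural change is that each user carries a single scalar `X[u]` instead of a K-dimensional vector.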
Figure 2: A schematic diagram for predicting ratings from sparse user-item rating matrix

Datasets
For the experimental evaluation of the proposed method, I use two datasets. The first is the publicly available MovieLens dataset (ml-100k), which consists of movie ratings provided by users, with the corresponding user and movie IDs. It contains 943 users, 1682 movies and 100000 ratings. Had every user rated every movie, 1586126 ratings (i.e. 943 × 1682) would have been available; since only 100000 are, not every user has rated every movie and the dataset is very sparse (93.7% of the entries are missing). This dataset resembles an actual e-commerce scenario, where not every user explicitly or implicitly expresses preferences for every item.
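The sparsity figure for ml-100k follows directly from the counts above, as the short calculation below shows. The function name `sparsity` is an illustrative choice.

```python
def sparsity(n_users, n_items, n_ratings):
    """Share of user-item pairs that have no rating, as a percentage."""
    possible = n_users * n_items
    return 100.0 * (1.0 - n_ratings / possible)

# ml-100k: 943 users, 1682 movies, 100000 ratings.
possible_ml = 943 * 1682                   # 1586126 possible ratings
sparsity_ml = sparsity(943, 1682, 100000)  # about 93.7
```

The same function can be applied to any other dataset once its user, item and rating counts are known.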
The second dataset consists of movie reviews from Amazon. The data span a period of more than 10 years and include approximately 8 million reviews up to October 2012. Each review includes product and user information, a rating, a timestamp, and a plain-text review. The total number of users is 889,176 and the total number of products is 253,059. For the experiments, I randomly sub-sampled the dataset to 6466 users and 25350 products, retaining only the user, item, rating and timestamp fields. The sub-sampled dataset contains 54996 ratings out of 6466 × 25350 = 163913100 possible ones, which makes it sparser than the ml-100k dataset (about 99.97% sparsity).
To recommend movies to users on the basis of their past explicit behavior (ratings), I assume that a rating of 4 or 5 indicates a preference for a movie, while a rating of 1, 2 or 3 suggests that the user is not interested in it. Generating recommendations therefore involves predicting ratings for unrated movies and recommending to a user only those movies whose predicted rating lies between 4 and 5.
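The recommendation rule above can be sketched as a simple filter over predicted ratings. The function name, the item identifiers and the predicted values are made up for illustration; sorting the result by predicted rating is an added convenience, not something the rule requires.

```python
def recommend(predicted, rated, threshold=4.0):
    """Return unrated items whose predicted rating is at least `threshold`.

    predicted: dict {item: predicted rating} for one user.
    rated: set of items the user has already rated.
    Items are sorted from highest to lowest predicted rating.
    """
    candidates = [(item, r) for item, r in predicted.items()
                  if item not in rated and r >= threshold]
    return [item for item, _ in sorted(candidates, key=lambda x: -x[1])]

# Hypothetical predictions for one user; the numbers are made up.
predicted = {"m1": 4.6, "m2": 3.9, "m3": 4.1, "m4": 5.0}
picks = recommend(predicted, rated={"m4"})  # ["m1", "m3"]
```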

Accuracy measures
The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are popular metrics for evaluating accuracy in recommender systems. Since RMSE gives more weight to larger errors while MAE weights all errors equally, RMSE is preferred over MAE when evaluating the performance of RS (Koren, 2009). RMSE has been the standard metric in RS until very recently and many previous works have based their findings on it; it is therefore the primary metric used here to report the performance of the proposed model and the RSVD model on the two datasets. For a test set $T$ of user-item pairs $(u, i)$ whose true ratings $r_{ui}$ are known and whose predicted ratings are $\hat{r}_{ui}$, the RMSE is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( \hat{r}_{ui} - r_{ui} \right)^2}$$
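The difference between the two metrics can be seen in a small sketch. The function names and the example ratings are illustrative and are not taken from the experiments.

```python
import math

def rmse(pairs):
    """Root mean square error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Two made-up error profiles with the same MAE but different RMSE:
even = [(4.0, 3.0), (2.0, 3.0)]    # errors of 1 and 1
uneven = [(5.0, 3.0), (3.0, 3.0)]  # errors of 2 and 0
# mae(even) == mae(uneven) == 1.0, while rmse(even) == 1.0 and
# rmse(uneven) == sqrt(2), so RMSE penalizes the single large error more.
```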

Cross validation
Cross validation is a well-established technique in machine learning for evaluating algorithms. It helps ensure that the evaluation results are not due to a chance split of the data. To apply it, the dataset is split into k disjoint folds; (k-1) folds are used as the training set while the remaining fold is used for testing. The procedure is repeated k times so that each fold serves as the test set exactly once. Measures such as RMSE are thus calculated k times and then averaged to obtain the final estimate of performance.
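A minimal sketch of the k-fold procedure is given below. The function names and the shuffling seed are assumptions, and `evaluate` is a hypothetical callback that would train a model on the training folds and return a measure such as the test RMSE.

```python
import random

def k_fold_splits(samples, k, seed=0):
    """Yield (train, test) lists for k disjoint folds of `samples`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[j::k] for j in range(k)]  # k disjoint folds
    for j in range(k):
        test = folds[j]
        train = [s for m, fold in enumerate(folds) if m != j for s in fold]
        yield train, test

def cross_validate(samples, k, evaluate):
    """Average `evaluate(train, test)` (e.g. test RMSE) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(samples, k)]
    return sum(scores) / k
```

In this setting `samples` would be the list of known (user, item, rating) triples, so every rating is used for testing exactly once.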

k-Fold Cross-Validated Paired t Test
To determine which of the two algorithms, RSVD or the proposed model, performs better, I performed a k-fold cross-validated paired t test on both datasets.
First, I record the RMSE of both models on each validation set. If the two algorithms have the same RMSE, the difference between their RMSE values is expected to be zero, which is the null hypothesis; the alternative hypothesis is that the difference is not zero. This two-sided test can tell whether the difference is zero, but it cannot establish which algorithm is better. I therefore modify the null hypothesis to a one-sided one: that the RMSE obtained on the validation sets by RSVD is less than the RMSE obtained by the proposed model. Using the paired t test in this form, I can assess statistically which of the two algorithms is better on each dataset (Alpaydin, 2004).
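The test statistic for the paired comparison can be sketched as follows. The per-fold RMSE values are invented purely for illustration and are not the paper's results; the number of folds (5) is likewise an assumption, and the critical value is the standard one-sided t-table entry for 4 degrees of freedom at the 0.05 level.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences a - b (k - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation
    return mean_d / (sd_d / math.sqrt(k))

# Made-up per-fold RMSE values for illustration only (k = 5 folds).
rmse_rsvd = [0.95, 0.96, 0.94, 0.97, 0.95]
rmse_proposed = [0.93, 0.95, 0.92, 0.94, 0.93]

t = paired_t_statistic(rmse_rsvd, rmse_proposed)
# One-sided test at the 0.05 level with 4 degrees of freedom:
# the critical value from a t table is about 2.132.
rsvd_is_worse = t > 2.132
```

A large positive statistic indicates that the RMSE of RSVD exceeds that of the proposed model consistently across folds, rather than on a single favorable split.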

OBSERVATIONS AND RESULT
For the proposed model, the number of latent features K is varied from 5 to 30 in steps of 5, and the RMSE values are recorded in Table 2. To compare the two models on the MovieLens dataset (ml-100k), $\lambda$ (regularization parameter) and $\gamma$ (learning rate) are both set to 0.01, values found using cross-validation. Comparing the two models, I found that the RMSE reaches its minimum at K = 5 for both RSVD and the proposed model, and that the lower of the two minima is attained by the proposed model. To establish that the proposed model is better than RSVD on the ml-100k dataset, I used the k-fold cross-validated paired t test at the 0.05 level of significance. The null hypothesis that the RMSE obtained on the validation sets by RSVD is less than that obtained by the proposed model is rejected, with a reported p-value of 0.9995, and the alternative hypothesis is accepted.
I also applied the above procedure to the Amazon dataset to test the claim that the proposed model achieves a better RMSE than RSVD. To compare the two models on the Amazon dataset, $\lambda$ (regularization parameter) and $\gamma$ (learning rate) are set to 0.001 and 0.1 respectively. Table 3 reports the RMSE of both algorithms at various values of K. Comparing the two models, I found that the RMSE reaches its minimum at K = 5 for RSVD and at K = 20 for the proposed model, and that the overall minimum is attained by the proposed model at K = 20. I used the k-fold cross-validated paired t test at the 0.05 level of significance to determine the better of the two models. The null hypothesis that the RMSE obtained on the validation sets by RSVD is less than that obtained by the proposed model is rejected, with a reported p-value of 1, and the alternative hypothesis is accepted.
The results of the proposed model are thus reported on two different datasets, ml-100k and Amazon. On both datasets its RMSE values are significantly lower than, or comparable to, those of the state-of-the-art RSVD model. The complexity of the proposed model is also lower than that of the RSVD model, which supports its deployment in practical scenarios where large sparse datasets must be handled efficiently.

CONCLUSION
The data on e-commerce platforms are growing at increasing velocity with every passing day, so it is a growing challenge for researchers and industry practitioners to build recommenders that are both fast and accurate. The empirical evaluation in this paper shows that the proposed model is at least as accurate as RSVD while requiring much less space.
The proposed model has also been successful in handling sparsity, which is one of the main reasons for using RSVD on sparse datasets. The empirical evaluations show that the proposed model fares better than the RSVD model in terms of accuracy when the data are sparser. This observation holds for the datasets used here; generalizing it to other sparse datasets would be premature, and it should be checked thoroughly on other datasets with more or less sparsity. One limitation of the model is that predicted ratings can go out of bounds, which is also a limitation of the RSVD model. Out-of-bounds predictions can reduce the performance of the model if they are not handled, either by clipping or by some other method, as described for RSVD.
Further, the proposed model can be extended to learn parameters so that the recommendation of good items improves. This can be achieved by modelling the proposed scheme so that it takes the precision and recall of the recommender system into account. To improve the accuracy of the proposed model, gradient boosting, a kind of ensemble technique, could also be applied by varying the parameters of the proposed model.
$R$: matrix of ratings, of dimension N × M (user-item rating matrix)
$\kappa$: set of known ratings in matrix $R$
$X$: an initial vector of dimension N × 1 (user feature vector)
$Z$: an initial matrix of dimension M × K (movie feature matrix)
$b_u$: bias of user $u$
$b_i$: bias of item $i$
$\mu$: average rating of all users
$K$: number of latent features to be trained
$e_{ui}$: error between predicted and actual rating
to matrix $Z$ and vectors $X$, $b_u$, $b_i$ and $\mu$.

As I have focused mainly on the matrix factorization (SVD) technique, a detailed literature review of the relevant techniques is presented in the following table.
Probabilistic
- Features of users and items, as well as latent features learned from the database using SVD, are used to predict the ratings
- In the case of PMF a zero-mean prior over the latent factors is used, but in RLFM the prior is estimated by running a regression over the features of items and users
- Suitable for cold start and warm start situations in RS

Deterministic
- Offline phase: uses principal component analysis (PCA) for optimal dimensionality reduction and then clusters users in the lower-dimensional subspace
- Online phase: uses eigenvectors to project new users into clusters and a lookup table to recommend

Deterministic
- Decomposes the user-item preference (rating) matrix into two matrices, a user feature matrix and an item feature matrix
- It works on the principle of lowering the norm of the matrices instead of reducing their rank