An Illustrative Guide to Multimodal Recommendation Systems

General recommendation systems learn patterns in users' choices and interactions with items, then recommend items based on those patterns. Multimodal recommendation systems go a step further: they capture users' style and aesthetic preferences and recommend products based on the themes or context that interest the user. In this article, we will discuss multimodal recommendation systems in detail, along with their working principle and applications. The main points covered in this article are listed below.

Table of Contents

What are multimodal recommendation systems?
Why do we need multimodal recommendation systems?
How does it work?
Applications of multimodal recommendation systems

What are multimodal recommendation systems?

Multimodal generally means having more than one mode. Multimodal recommendation systems capture the style and aesthetic preferences of users. They recommend items based on the user's input and history, and can even match the color and pattern of the searched item. To do this, they draw on multimodal information, such as text and images, about both users and items.

This type of recommendation system saves the user a lot of time by recommending the next item with a similar theme, style, or overall atmosphere, which in turn can increase revenue for the company.

Why do we need multimodal recommendation systems?

Recommendation systems generally follow one of two approaches: collaborative or content-based recommendation. A collaborative recommendation system predicts your preferences from your own ratings and from the interests of users similar to you. A content-based recommendation system, in contrast, bases its recommendations only on your own search history and user profile.
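The difference between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the users, items, ratings, and tags are invented toy data, and cosine similarity and tag overlap are just two common scoring choices, not a method prescribed by this article.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "ana": {"red_shirt": 5, "blue_jeans": 3, "red_shoes": 4},
    "ben": {"red_shirt": 4, "blue_jeans": 2},
    "cara": {"blue_jeans": 5, "green_hat": 4},
}
# Hypothetical item attributes used by the content-based side
item_tags = {
    "red_shirt": {"red", "cotton", "casual"},
    "red_shoes": {"red", "casual"},
    "blue_jeans": {"blue", "denim", "casual"},
    "green_hat": {"green", "wool"},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def collaborative(user):
    """Rank unseen items by the ratings of similar users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, rating in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

def content_based(user):
    """Rank unseen items by tag overlap with items the user already rated."""
    liked = set().union(*(item_tags[i] for i in ratings[user]))
    scores = {
        item: len(tags & liked) / len(tags | liked)
        for item, tags in item_tags.items()
        if item not in ratings[user]
    }
    return sorted(scores, key=scores.get, reverse=True)

print(collaborative("ben"))
print(content_based("ben"))
```

Neither function looks at what the items look like or what style they share, which is the gap the next paragraph describes.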

A problem arises when you are looking for shoes in a color that matches a shirt, or furniture that matches your home: neither of the above systems handles this well. A multimodal recommendation system addresses it by finding matches for the user based on color, theme, environment, and so on.

In the image above, we can see the input image as a seed and the recommended images as a generated assortment.

How does it work?

In this multimodal recommendation system, we use transfer learning and topic modeling to maximize style compatibility based on visuals, and polylingual topic modeling to incorporate text data so that style can be inferred from both modalities. Before explaining the multimodal recommendation system itself, let's briefly introduce transfer learning, topic modeling, and LDA.

Transfer learning

Transfer learning means reusing a model trained on one task as the starting point for another. Here, it refers to deep neural networks that have been pre-trained on the ImageNet dataset. Some examples are ResNet-50, VGG-16, and VGG-19.

Topic modeling

Topic modeling is a statistical process for identifying, extracting, and analyzing topics in a given collection of documents. Topic modeling techniques determine which topics are present in the documents of a corpus and estimate how strongly each topic is represented.

Latent Dirichlet Allocation (LDA)

LDA is a generative statistical model that explains observations through unobserved groups, which account for why some parts of the data are similar. In text, the observations are words and the unobserved groups are topics.

We primarily use content data to capture user preferences and to choose the seed product around which an assortment is built. LDA topic modeling is then applied to user input text and content data to produce topic-based recommendations. The system compares individual products with each other and shows users the ones that are most similar. PolyLDA learns two distinct but coupled latent style representations, one per modality. For a given set of documents and a target number of topics, the model assumes a generative process in which each document draws a mixture over topics and each token is then drawn from the topic assigned to it.
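The generative story behind a PolyLDA-style model can be sketched in a few lines. This is a sketch under stated assumptions, not the implementation used in any particular system: the topic tables, token names, document lengths, and Dirichlet parameter are all hypothetical, and only the sampling side is shown (real use requires inferring the topics from data).

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_pair(topics_text, topics_visual, n_words, n_visual, alpha, rng):
    """Generate one (text document, visual document) tuple.

    topics_text / topics_visual: one dict per topic, mapping token -> probability.
    Both modalities share the same per-product topic mixture theta.
    """
    theta = sample_dirichlet(alpha, rng)
    topic_ids = range(len(theta))

    def draw(topics, n_tokens):
        doc = []
        for _ in range(n_tokens):
            z = rng.choices(topic_ids, weights=theta)[0]  # topic for this token
            vocab, probs = zip(*topics[z].items())
            doc.append(rng.choices(vocab, weights=probs)[0])
        return doc

    return theta, draw(topics_text, n_words), draw(topics_visual, n_visual)

# Hypothetical style topics: index 0 is "soft/floral", index 1 is "dark/leather".
topics_text = [{"floral": 0.6, "pastel": 0.4}, {"leather": 0.7, "black": 0.3}]
topics_visual = [{"v12": 0.5, "v87": 0.5}, {"v3": 0.9, "v40": 0.1}]

rng = random.Random(0)
theta, text_doc, visual_doc = generate_pair(
    topics_text, topics_visual, n_words=6, n_visual=4, alpha=[0.5, 0.5], rng=rng
)
print(theta, text_doc, visual_doc)
```

The coupling comes from `theta`: the text tokens and the visual tokens of one product are drawn from different vocabularies but from the same topic mixture.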

To capture the aesthetics of the seed images, we use deep learning. Powerful deep neural networks such as ResNet-50 and InceptionV3 perform strongly on visual tasks and can capture style-based preferences. Here, ResNet-50 pre-trained on the ImageNet dataset is used. The convolutional neural network extracts features from the product images, and its responses to each image are indexed to create visual documents.
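One plausible reading of "indexing the responses" is to treat each feature channel of the network as a visual word and keep the channels that respond most strongly to an image. The sketch below assumes exactly that; the `visual_document` helper, the `top_k` cutoff, the token prefix, and the activation values are hypothetical, and a real ResNet-50 layer would produce far more channels.

```python
def visual_document(activations, top_k=3, prefix="v"):
    """Turn one image's channel activations into a bag of visual words.

    Each channel index becomes a token. Keeping only the top_k strongest,
    strictly positive channels is an assumption made for this sketch.
    """
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return [f"{prefix}{i}" for i in ranked[:top_k] if activations[i] > 0]

# Hypothetical pooled activations for one product image.
activations = [0.0, 2.3, 0.1, 0.0, 1.7, 0.9]
print(visual_document(activations))  # ['v1', 'v4', 'v5']
```

Once images are expressed as token lists like this, they can be fed to a topic model in the same way as text documents.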

Since we use LDA for topic modeling on text and transfer learning for images, we need an extension that can interpret both. That extension is a multimodal topic model, which assumes that words and visual features occur in pairs and must be captured as tuples.
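Because the text and visual documents of a product are modeled together, each product can be summarized by a single mixture over style topics, and building an assortment around a seed reduces to comparing those mixtures. The following is a minimal sketch of that lookup. The product names and mixture values are invented, and cosine similarity is an assumed choice of measure rather than one stated in this article.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assortment(seed, style_mixtures, size=2):
    """Return the `size` products whose style mixture is closest to the seed's."""
    others = [p for p in style_mixtures if p != seed]
    others.sort(key=lambda p: cosine(style_mixtures[seed], style_mixtures[p]), reverse=True)
    return others[:size]

# Hypothetical per-product mixtures over three style topics.
style_mixtures = {
    "rustic_table": [0.8, 0.1, 0.1],
    "oak_bench": [0.7, 0.2, 0.1],
    "chrome_lamp": [0.1, 0.1, 0.8],
    "linen_rug": [0.5, 0.4, 0.1],
}
print(assortment("rustic_table", style_mixtures))  # ['oak_bench', 'linen_rug']
```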

The figure above shows the layers used to create visual documents.

The graph above visually shows the variety and performance of a multimodal recommendation system.

Applications of multimodal recommendation systems

Multimodal recommendation systems are used by e-commerce platforms to recommend additional products that are aesthetically similar to the ones the user has searched for. This type of recommendation can also boost sales of those related products, generating more revenue.

The image above shows how the search results are similar to each other. Multimodal recommendation systems are also used to recommend fashion-related products. Suppose you have just searched for a red t-shirt with a leaf pattern: the system automatically recommends other red t-shirts with a leaf pattern, and it will also recommend red shoes to go with them.

Multimodal recommendation systems are also used in the food and beverage industry. If you search for organic grape juice, for example, several other organic products will be listed.

Last words

In this article, we covered what multimodal recommendation systems are, how they work, and where they are used. Other models are also used in multimodal recommendation, such as transfer learning models and sequential recommendation systems. We also reviewed some interesting applications of multimodal recommendation systems.
