Content Recommendation Using Elasticsearch

Vishnu Chilamakuru
3 min read · Dec 5, 2019

When we read an article on a news website, Medium, or a similar site, we usually see additional sections such as Recommended Articles or Similar Articles. These list a few more articles that match the content of the one you are reading, or that are recommended based on your reading history.

Broadly, article recommendation can be done in two ways.

  • Collaborative Filtering
  • Content similarity

Collaborative Filtering:

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

This can be implemented using machine learning techniques that identify groups of users with similar behaviour.
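To make the idea of "users with similar behaviour" a bit more concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The `reads` data, the user names, and the `similarity` and `recommend` helpers are all made up for illustration; a real system would use a much larger history and most likely a dedicated library.

```python
from math import sqrt

# Hypothetical read history: user -> set of article ids they have read.
reads = {
    "alice": {1, 2, 3},
    "bob": {2, 3, 4},
    "carol": {7, 8},
}

def similarity(a, b):
    """Cosine similarity between two sets of article ids (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, reads, top_n=3):
    """Score unread articles by the similarity of the users who read them."""
    scores = {}
    for other, articles in reads.items():
        if other == user:
            continue
        sim = similarity(reads[user], articles)
        if sim == 0.0:
            continue  # ignore users with no overlap at all
        for article in articles - reads[user]:
            scores[article] = scores.get(article, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", reads))
```

Here alice and bob share two articles, so the article that only bob has read is suggested to alice, while carol has no overlap with anyone and gets no suggestions.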

Another way of identifying similar groups of users is to use graph databases such as Neo4j or ArangoDB, where you can build a graph of users connected via their interests, activity on the website, purchase patterns, and so on, and then identify correlated user groups.

Content Similarity:

Content similarity is the degree of similarity between two articles, based on their textual content (the terms appearing in them).

This can be implemented using information retrieval techniques like BoW (Bag of Words), TF-IDF, etc.
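As a rough illustration of how TF-IDF turns text into something comparable, the sketch below weights terms in plain Python and compares documents with cosine similarity. It uses three of the article titles that appear later in this post as input. The whitespace tokenizer and the smoothed IDF formula `log(1 + N/df)` are simplifying assumptions, one of several common variants, and not what Elasticsearch does internally.

```python
import math
from collections import Counter

def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(1 + n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

titles = [
    "Amla makes century De Villiers falls for 88",
    "Amla and Stephen Cook lead South Africa to 329/5",
    "Messi record as Argentina thrash Venezula",
]
vectors = tfidf_vectors(titles)
print(cosine(vectors[0], vectors[1]))  # the two cricket titles share "amla"
print(cosine(vectors[0], vectors[2]))  # no shared terms
```

The two cricket titles get a non-zero score because they share a term, while the cricket and football titles share nothing and score zero.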

In this blog post, I will explain how to implement content similarity using Elasticsearch, which internally relies on TF-IDF-style term statistics to find the articles relevant to a given search query. I took a sample articles dataset from Kaggle (articles from the https://www.thenews.com website) for this exercise.

This dataset contains 2692 articles, of which 1408 are sports articles and the remaining 1284 are business articles. I will explore the sports articles in this post.

Let’s look at the interesting part: the implementation.

Below is a sample article stored in Elasticsearch, which is about cricket.

The above article is mostly about Hashim Amla, Temba Bavuma, South Africa, England, and Test cricket. Now, let’s see the top 4 articles matching this content (article id 1121, as mentioned above).

Below are the recommended articles which have similar content (https://gist.github.com/vishnuchilamakuru/7db4a2be34581af4156c8e07a8750c13)

  • “title” : “Injured Amla stands firm as South Africa build lead”
  • “title” : “Amla makes century De Villiers falls for 88”
  • “title” : “Amla and Stephen Cook lead South Africa to 329/5”
  • “title” : “Root Stokes fire up England” (England vs South Africa Test match)

As you can see, most of the recommended articles are relevant to the source article (id 1121).

I used Elasticsearch’s More Like This query to find articles whose content is similar to the current article.

Sample Elasticsearch More Like This query:
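As a hedged sketch of what such a request body can look like, the Python snippet below builds a `more_like_this` query for a document that is already indexed. The index name `articles`, the field names `title` and `content`, and the `build_mlt_query` helper are assumptions made for illustration and may not match the setup used for the results in this post; `fields`, `like`, `min_term_freq` and `max_query_terms` are standard parameters of the query.

```python
import json

def build_mlt_query(index, doc_id, fields, size=4):
    """Build a search body asking for documents similar to an indexed one.

    index, doc_id and fields are placeholders for your own mapping.
    """
    return {
        "size": size,  # number of recommendations to return
        "query": {
            "more_like_this": {
                "fields": fields,  # text fields to compare
                "like": [{"_index": index, "_id": doc_id}],  # source document
                "min_term_freq": 1,     # keep terms seen at least once in the source
                "max_query_terms": 25,  # cap on the number of selected terms
            }
        },
    }

# Hypothetical index and field names; 1121 is the cricket article id from above.
body = build_mlt_query("articles", "1121", ["title", "content"])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the index's `_search` endpoint, and the values of `min_term_freq` and `max_query_terms` are worth tuning against your own data.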

Let’s see one more similar example, this time for football news.

The above article is mostly about Lionel Messi and Argentina. Now, let’s see the top 4 articles matching this content (article id 1817, as mentioned above).

Below are the recommended articles which have similar content (https://gist.github.com/vishnuchilamakuru/4f1c6bad86d6867e1de4caef4396f336)

  • “title” : “Messi record as Argentina thrash Venezula”
  • “title” : “Magical Messi grabs hat trick as Argentina romp into quarter”
  • “title” : “Messi scores 50th Argentina goal in 2 0 win over Bolivi”
  • “title” : “Messi primed to end Argentina drought Copa fi”

Almost all the recommended articles are relevant to the source article (id 1817).

Overall, the Elasticsearch More Like This query helps you find articles with similar content quickly and efficiently, and can give you decent recommendations based on text content alone.

Further Improvements

  • Add an NLP POS (part-of-speech) tagger token filter to the index analyser and filter tokens by their tags to extract topics from the article content.
