Thousands of research papers are published every year. During my college days, this meant that finding the specific papers that matched my research topic was a tedious and time-consuming task. After I got into the six-week internship program at Leapfrog, I chose to build a machine-learning recommendation system for journal papers to solve this problem. In this blog, I will share what I built during my internship and what I am up to now.
Here’s a solution I proposed
I envisioned a solution that clusters research papers into a small number of topics, which are then used to recommend useful papers based on topic similarity. I have shown my data modeling pipeline below.
I began by choosing an appropriate dataset to build my model on. I chose a dataset obtained from the Neural Information Processing Systems (NIPS) website. This dataset covers research papers from one of the top machine learning conferences, with topics ranging from deep learning and computer vision to cognitive science and reinforcement learning. It contains the title, authors, abstract, and extracted text for all NIPS papers to date (from the first conference in 1987 to the 2018 conference). The raw extracted data looked like this before processing:
Another important step was data preprocessing. Many words and characters in the raw text carry no useful information. The preprocessing operations clean the text so that the informative words stand out, and they significantly reduce redundant words, which makes later computation faster. The operations I included were:
- Removal of special characters, digits, and punctuation
- Removal of single-letter words and digits
- Formatting to lowercase
- Tokenization and Lemmatization
I wrote a few functions to process the papers.
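The original functions are not reproduced here, so the following is only a minimal sketch of how the operations listed above could be combined. The name `preprocess` and its `lemmatize` parameter are assumptions made for illustration. Lemmatization is left as a pluggable function (identity by default) so the sketch runs without extra downloads; a real lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize` could be passed in instead.

```python
import re
from typing import Callable, List


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> List[str]:
    """Clean one paper's raw text and return a list of tokens.

    `lemmatize` is a hypothetical hook: any function mapping a word to its
    base form. The default leaves words unchanged.
    """
    # Format to lowercase so later steps only deal with a-z.
    text = text.lower()
    # Replace special characters, digits, and punctuation with spaces.
    # Note: this simple pattern also drops non-ASCII letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize on whitespace.
    tokens = text.split()
    # Drop single-letter words, then lemmatize what is left.
    return [lemmatize(tok) for tok in tokens if len(tok) > 1]


print(preprocess("In 2018, 3 Deep Networks were trained on a GPU!"))
# ['in', 'deep', 'networks', 'were', 'trained', 'on', 'gpu']
```

Keeping the lemmatizer as a parameter makes it easy to compare the cleaned text with and without lemmatization during EDA.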
After preprocessing, the text looked clean enough for the next step: understanding the data through Exploratory Data Analysis (EDA).
After preprocessing the text, we need to gather insights that help identify the important aspects of the input data. EDA is all about making sense of the data in hand in ways that simple calculation alone cannot. For this, I used charting and plotting methods such as box plots, kernel density estimation, and histograms. I visualized the effect of year on the number of research papers published, and the distribution of word counts before and after text preprocessing, below. I obtained some interesting trends from the EDA. For example, the number of published papers and citations has risen dramatically over the years, and the average word count per paper decreases after preprocessing.
After the completion of preprocessing and EDA, the next step is to convert the text into numeric vectors, which the machine learning model uses for its computation.
Several algorithms are available for this, such as Count Vectorizer, TF-IDF, and word embeddings. To create word vectors for this project, we used the Count Vectorizer algorithm. Count Vectorizer is a simple method that vectorizes text by counting how often each word occurs in a sentence or in a whole research paper.
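As a rough illustration of the two quantities behind those plots, here is a small sketch that counts papers per year and compares average word counts before and after preprocessing. The function names and the toy data are assumptions, not the project's actual code or results; the returned values could be fed to any plotting library to draw the histogram or box plot.

```python
from collections import Counter
from statistics import mean


def papers_per_year(years):
    """Count how many papers were published in each year, sorted by year."""
    return dict(sorted(Counter(years).items()))


def mean_word_counts(raw_texts, token_lists):
    """Average word count per paper before and after preprocessing."""
    before = mean(len(text.split()) for text in raw_texts)
    after = mean(len(tokens) for tokens in token_lists)
    return before, after


# Toy data, made up for illustration only.
years = [1987, 2017, 2017, 2018, 2018, 2018]
print(papers_per_year(years))  # {1987: 1, 2017: 2, 2018: 3}

raw_texts = ["We train 3 deep networks .", "A model of the brain"]
token_lists = [["we", "train", "deep", "networks"], ["model", "of", "the", "brain"]]
print(mean_word_counts(raw_texts, token_lists))  # (5.5, 4)
```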
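To make the counting idea concrete, here is a minimal pure-Python sketch of what a count vectorizer does. It is not the library implementation used in the project (scikit-learn's `CountVectorizer` is a common choice for this step), and the function name `count_vectorize` is an assumption for illustration.

```python
from collections import Counter


def count_vectorize(token_lists):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    # The vocabulary is every distinct word in the corpus, in sorted order.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        matrix.append(row)
    return vocabulary, matrix


docs = [["image", "network", "image"], ["network", "policy"]]
vocab, X = count_vectorize(docs)
print(vocab)  # ['image', 'network', 'policy']
print(X)      # [[2, 1, 0], [0, 1, 1]]
```

Each row is one paper and each column is one word, so every paper becomes a numeric vector of the same length.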
Topic modeling is an unsupervised machine learning technique that groups texts into clusters of topics. There are many ways to cluster topics, and Latent Dirichlet Allocation (LDA) is one of them. LDA gives us the likelihood of each word occurring in a given topic cluster.
As a generative model, LDA can generalize the model it uses to separate documents into topics to documents outside the corpus. For example, if LDA is used to group online news articles into categories like Sports, Entertainment, and Politics, the fitted model can then help categorize newly published news stories. Such an application is beyond the scope of approaches like Latent Semantic Indexing (LSI). What’s more, when fitting an LSI model, the number of parameters that have to be estimated scales linearly with the number of documents in the corpus, whereas the number of parameters to estimate for an LDA model scales with the number of topics, a much lower number, which makes it much better suited to working with large data sets.
LDA builds each topic from a cluster of words and estimates the likelihood of a document belonging to a topic from the likelihood of the document's words in that topic cluster. It is a “generative probabilistic model” of a collection of composites made up of parts (here, the composites are documents and the parts are words).
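The post does not say which LDA implementation was used, so the sketch below assumes scikit-learn's `LatentDirichletAllocation` purely for illustration. The four toy documents and the choice of two topics are made up. It shows the two outputs discussed above: the per-document topic likelihoods and the per-topic word likelihoods, plus how a fitted model can score a document that was not in the corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "image object detection convolutional network image",
    "convolutional network image segmentation object",
    "reinforcement learning agent policy reward",
    "agent reward policy gradient reinforcement",
]

# Turn the documents into a document-by-word count matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA with an assumed number of topics (2 for this toy corpus).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row of topic likelihoods per document

# Normalize each topic's word weights to get word likelihoods per topic.
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topics.round(2))

# A fitted model can also score a document outside the corpus.
new_counts = vectorizer.transform(["policy gradient for an agent"])
new_topics = lda.transform(new_counts)
print(new_topics.round(2))
```

Each row of `doc_topics` sums to one, which is what later lets two papers be compared topic by topic.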
I used pyLDAvis to visualize the word clusters. It reduces the word vectors to two dimensions linearly using Principal Component Analysis (PCA). However, the structure of word vectors is not necessarily linear, and visualizing the word clusters with a non-linear method gives more insight, so we also used the t-SNE model to visualize the word clusters.
For a more detailed and clearer view, I used TensorBoard to explore the clusters interactively with a 3D t-SNE projection.
After applying LDA to the dataset, we get the likelihood of every document for each topic in the set. The output looks like the figure given below:
We chose the number of topics by trial and error, based on how the topic clusters appeared in pyLDAvis. The more clearly the clusters are separated, the more effective LDA is at predicting topic likelihoods.
We can infer that related words are clustered into the same topic. In topic 2, we see that words such as image, network, model, and object, which are related to image processing, fall under the same cluster. For each document, the likelihoods across all topics sum to one, and each number is the probability that the given document belongs to the corresponding topic.
Recommendation System: Collaborative Filtering
After getting the likelihood for each document in every topic, we have reached our final goal: implementing the recommendation system. There are many techniques for developing a recommendation system, and the one I chose was collaborative filtering. Other alternatives for recommendation were Pearson distance, Euclidean distance, or cosine similarity.
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). It can be further classified into item-based and user-based collaborative filtering. The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.
For every document (row), we have the likelihood of each topic. The idea is to recommend papers based on the root mean square error (RMSE) between the topic likelihoods of two documents: the lower the RMSE, the higher the chance that the two papers cover similar topics. In the example below, we calculated the RMSE between the selected paper and every other paper and recommended the ten research papers with the lowest RMSE.
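The ranking step described above can be sketched as follows. This is a minimal illustration and not the project's code: the function names `rmse` and `recommend` and the toy topic likelihoods are assumptions.

```python
import math


def rmse(p, q):
    """Root mean square error between two equal-length topic-likelihood vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))


def recommend(doc_topics, selected, top_n=10):
    """Return the indices of the top_n papers closest to the selected paper."""
    scores = [
        (rmse(doc_topics[selected], row), i)
        for i, row in enumerate(doc_topics)
        if i != selected  # never recommend the selected paper itself
    ]
    scores.sort()  # lowest RMSE first
    return [i for _, i in scores[:top_n]]


# Toy topic likelihoods for four papers over three topics (made up).
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
]
print(recommend(doc_topics, selected=0, top_n=2))  # [1, 3]
```

Paper 1 comes first because its topic mix is almost identical to paper 0, while paper 2, which is dominated by a different topic, is ranked last and left out.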
As this project is based on unsupervised learning, we implemented our own metric to evaluate it. For this, we took the top 500 recommended research papers and evaluated the score between the topic of the selected paper and the topics of the 500 recommended papers.
Since LDA gives the likelihood of words in each cluster, it is very difficult to assign a research paper to a single topic cluster with high confidence. This may lead to some errors. Classifying every document into the cluster with the maximum likelihood gave us an average accuracy ranging from 70% to 85%, as measured from the confusion matrix.
Because this evaluation uses single-label classification, a better result would require multi-label classification that preserves the topic likelihoods. The output of this model can also be fed into a machine learning or deep learning model to predict topics in a semi-supervised way.
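The exact metric is not spelled out in this post, so the sketch below is only one possible reading of it, stated as an assumption: label each paper with its maximum-likelihood topic, then measure the fraction of recommended papers whose label matches that of the selected paper. The function names and toy data are hypothetical.

```python
def dominant_topic(topic_probs):
    """Index of the topic with the maximum likelihood for one document."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def topic_agreement(doc_topics, selected, recommended):
    """Fraction of recommended papers sharing the selected paper's dominant topic."""
    target = dominant_topic(doc_topics[selected])
    hits = sum(dominant_topic(doc_topics[i]) == target for i in recommended)
    return hits / len(recommended)


# Toy topic likelihoods for four papers over three topics (made up).
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
]
# Papers 1 and 3 share paper 0's dominant topic; paper 2 does not.
print(topic_agreement(doc_topics, selected=0, recommended=[1, 3, 2]))
```

Collapsing each paper to a single label discards the rest of its topic mix, which is the single-label limitation this kind of score runs into.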
I am delighted that I got an opportunity to work with Data Science Pundits (the two KCs, Aviskar K.C. and Bipin K.C.) and my brilliant fellow interns (Aditya Chapagain, Binayak Pokhrel, and Sushil Ghimire), who helped me in my learning and knowledge sharing. Looking back, after suffering from high fever and an upper respiratory tract infection (which sounds like sci-fi horror), I managed to start my internship four days late. Nevertheless, I am very thankful to have joined a team that is full of dreams and has the power to turn them into reality. After this project, I started my journey at Leapfrog Technology as an Associate Software Engineer and am working on new ML projects.
This blog was written by Anish Pandey. If you want to contribute to his project or look into it, you can find it here.