Current status: Planning of the machine learning algorithm is complete. Implementation of a basic learning model using TF-IDF with supervised/semi-supervised learning is in progress.
The idea is to implement an auto-tagging feature that suggests tags to the user based on the content of the post. The tags will be populated as soon as the content text area loses focus, or via Ajax at the press of a button. I’ll be using semantic analysis and topic modeling techniques to determine the topic of the article and to extract keywords from it. Based on an algorithm and a ranking mechanism, the user will be provided with a list of tags from which they can select those that best describe the article; these selections will also train a user- and content-specific supervised or semi-supervised machine learning model in the background.
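The suggest-then-select loop described above can be sketched roughly as follows. This is only a Python illustration, not the module's code: the function names `suggest_tags` and `record_selection`, the raw term-frequency score and the minimum word length are all assumptions made for the example, standing in for the real ranking mechanism.

```python
import re
from collections import Counter

def suggest_tags(text, top_n=5, min_length=4):
    """Rank candidate tags by raw term frequency (a placeholder score, assumed here)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(top_n)]

def record_selection(training_set, text, suggested, selected):
    """Keep the suggestions the user accepted as a labeled example for later training."""
    accepted = [tag for tag in suggested if tag in selected]
    training_set.append({"text": text, "tags": accepted})
    return training_set

post = "Drupal modules extend Drupal. Modules add tagging; tagging helps search."
suggestions = suggest_tags(post, top_n=3)
# The user keeps two of the three suggested tags (hypothetical choice).
training_set = record_selection([], post, suggestions, selected={"drupal", "tagging"})
```

Each accepted selection becomes one labeled example, which is what would let a supervised or semi-supervised model improve in the background over time.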
Schedule and Updates
Before 21st May
In this period I’d like to discuss my idea with the mentors, gather feedback from the community and make the necessary changes to the idea. Since the project relies heavily on semantic analysis, which is currently an active research topic and involves NLP and machine learning, I’ll use this time to consult professors in this field to arrive at the best solution. I’ll also familiarize myself with the Drupal coding standards and study the documentation.
May 21 - May 31
Further research into ML models, their implementation details, the data structures used, etc.:
Classifiers: SVM, Naive Bayes, Decision Trees.
Instance-based learning: SVM, k-NN.
Clustering (unsupervised): k-means, probability-based clustering, incremental clustering.
Semi-supervised: EM and Co-EM.
Setting up development environment
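As an illustration of one of the classifiers listed for this research period, here is a minimal sketch of a multinomial Naive Bayes tagger with add-one smoothing. The class name `NaiveBayesTagger`, its methods and the toy training data are assumptions for the example; nothing here is confirmed as the model the project will use.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTagger:
    """Multinomial Naive Bayes over token lists (illustrative sketch only)."""

    def __init__(self):
        self.tag_counts = Counter()               # training documents seen per tag
        self.word_counts = defaultdict(Counter)   # token counts per tag
        self.vocabulary = set()

    def train(self, tokens, tag):
        self.tag_counts[tag] += 1
        self.word_counts[tag].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.tag_counts.values())
        best_tag, best_score = None, float("-inf")
        for tag, n_docs in self.tag_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[tag].values())
            denominator = total_words + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[tag][token] + 1) / denominator)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

tagger = NaiveBayesTagger()
tagger.train(["drupal", "module", "hook"], "drupal")
tagger.train(["python", "numpy", "array"], "python")
predicted = tagger.predict(["module", "hook"])
```

Working in log space avoids numeric underflow on long posts, and the add-one smoothing keeps unseen words from zeroing out a tag's score.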
June 1 - June 10
Sandbox Project up and running
Implement basic configuration pages for the project
Simple Ajax front end for nodes: Add Article and Basic Page forms
Implement a pre-processor for textual content. This includes classes for:
3. Stop words removal
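A minimal sketch of the stop words removal step might look like the following. The tiny `STOP_WORDS` set and the regex-based `tokenize` helper are assumptions added only to make the example runnable; a real pre-processor would use a much larger list and its own tokenizer.

```python
import re

# Tiny illustrative stop word list (assumption); real lists hold hundreds of words.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that carry little topical meaning."""
    return [token for token in tokens if token not in stop_words]

tokens = remove_stop_words(tokenize("The idea is to tag the content of a post"))
```

Keeping the stop word set as a parameter would let a site administrator supply a language-specific list from the configuration pages.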
June 11 - June 15
Find a suitable corpus to use as the training data set for the model.
If a suitable corpus is not found, the plan is to use the user's past posts as the data source for training the learning algorithm.
June 16 - June 23
Implementation of the classifier code
The plan is to have a vector space model ready first, as the baseline algorithm.
Alter the configuration pages for the vector model.
Classes for weight calculation, vectorization, etc.
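The weight calculation and vectorization classes could be sketched along these lines. The class name `TfIdfVectorizer`, the `fit`/`transform` split and the smoothed IDF formula are assumptions chosen for illustration, not a description of the final design.

```python
import math
from collections import Counter

class TfIdfVectorizer:
    """Turns token lists into TF-IDF weighted vectors over a shared vocabulary."""

    def fit(self, documents):
        # documents: a list of token lists, e.g. already pre-processed posts.
        self.vocabulary = sorted({token for doc in documents for token in doc})
        n_docs = len(documents)
        doc_freq = Counter(token for doc in documents for token in set(doc))
        # Smoothed IDF (assumed variant): terms seen in every document keep weight 1.
        self.idf = {
            token: math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0
            for token in self.vocabulary
        }
        return self

    def transform(self, tokens):
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty input
        # Tokens outside the vocabulary are ignored.
        return [counts[token] / total * self.idf[token] for token in self.vocabulary]

corpus = [["drupal", "module"], ["drupal", "tag"]]
vectorizer = TfIdfVectorizer().fit(corpus)
vector = vectorizer.transform(["drupal", "tag"])
```

In this toy corpus "drupal" appears in every document, so it receives a lower weight than "tag", which is the behaviour that makes TF-IDF useful for picking distinctive keywords.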
June 24 - June 30
Code the similarity class: cosine similarity, adjusted cosine similarity, or correlation-based similarity.
Code for updating and maintaining the learned vocabulary set.
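Two of the similarity measures named above can be sketched as plain functions over equal-length weight vectors. The function names are assumptions for the example, and adjusted cosine similarity is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pearson_correlation(u, v):
    """Correlation-based similarity: cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
inverse_trend = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Cosine similarity ignores vector length, so a long post and a short post on the same topic can still score as close; the correlation variant additionally ignores each vector's average weight.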
July 1 - July 10
Start work on another classifier algorithm after consultation with the mentor. The plan as of now is to have a working prototype ready by the mid-term evaluation and later to incorporate additional models/algorithms, from which the user can choose the one that best suits their needs.
July 11 - July 31
Code the new classifier, make improvements (optimization and cleanup), and apply changes based on feedback received during the mid-term evaluation.
August 1 - August 10
Complete documentation and any other pending work.
By the final submission we will have a complete, working module with the auto-tagging feature implemented.