Melissa's Forum Recommender

About This Project


Hi, my name is Melissa, and welcome to my forum classifier NLP project! This is an end-to-end NLP application that I completed during my time as a participant on Team BERTinator in the Machine Learning group at STEM-Away.

My internship at STEM-Away has taught me more than I could have ever imagined; it's a challenge just to sum up all that I have learned. When I first started at STEM-Away, I was placed on Team BERTinator. In week one, our lead picked DistilBERT as our team's transformer model and broke us into small groups to learn about NLP transformers and build web crawlers, using Scrapy to scrape data from our chosen public discourse forum. Over the following four weeks, I learned methods for web scraping, data cleaning with Pandas, and exploratory data analysis. I explored data wrangling (tokenization and preprocessing) and different approaches to text classification using machine learning models, and I trained and tested models on the dataset to evaluate them. Everyone on our team ended up with a working model trained on the forum data we had scraped.
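To give a feel for the cleaning and tokenization step described above, here is a minimal sketch using only Python's standard library. It is not the code from this project: the function names, the regular expressions, and the sample post are all assumptions made up for illustration, and a real pipeline would more likely lean on Pandas and NLTK.

```python
import html
import re
from typing import List

# All patterns below are illustrative assumptions, not the project's own rules.
TAG_RE = re.compile(r"<[^>]+>")                    # HTML tags left over from scraping
URL_RE = re.compile(r"https?://\S+")               # bare links inside a post
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")    # words, with simple contractions


def clean_post(raw: str) -> str:
    """Strip markup and links from one scraped post and normalize whitespace."""
    text = html.unescape(raw)          # turn entities such as &amp; into characters
    text = TAG_RE.sub(" ", text)       # drop tags
    text = URL_RE.sub(" ", text)       # drop links
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> List[str]:
    """Split cleaned text into lowercase word tokens."""
    return TOKEN_RE.findall(text)


# Hypothetical scraped post, invented for this example.
sample = "<p>How do I fix this?&nbsp;See https://example.com/help &amp; thanks!</p>"
cleaned = clean_post(sample)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

Entities are unescaped before tags are stripped, so any escaped markup in a post is removed as well.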

For the second part of my internship, I was given the opportunity to move to the Bioinformatics sector as a lead or to expand on my current project using Flask, Docker, and AWS. I decided to stay on for the expansion. I had only dabbled in AWS in the past, and I was excited to learn more about it and Docker. I challenged myself to redo my recommender model using even more scraped forum data from my session-one teammates. I used ktrain with DistilBERT, since ktrain wraps the underlying libraries and cuts down on the amount of code needed. I learned the hard way why it's best to use Google Colab to train your model!

I then learned how to use Flask on both the front end and the back end, and deployed my model in Flask on my local computer. I then dived into the world of Docker, and learned how to create my own Docker images and how to tag and push them to Docker Hub. I learned how useful containers are, and I am eager to take this knowledge to the next level in the future with Kubernetes and pods. My next step was to deploy my model onto AWS. I chose AWS EC2 since my project was too big to deploy serverlessly with AWS Lambda. I used Forklift as my file manager to upload and work with my files, and deployed my project on an EC2 Ubuntu instance. I decided to listen to that insistent warning in Flask (if you’ve worked in Flask, you know the one), and to the hopes of my team lead from the very beginning of this internship, and I deployed my model into production using uWSGI and NGINX.
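For context on that last step: the warning is presumably Flask's notice that its built-in server is a development server and should not be used in production. A server such as uWSGI instead loads a WSGI callable (a Flask app object is one) and calls it for each request, typically with NGINX in front as a reverse proxy. The sketch below is a bare WSGI callable written with the standard library only, not Flask and not this project's code; the `/predict` route, the JSON shape, and `predict_category` are hypothetical stand-ins for the real model.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def predict_category(text):
    """Hypothetical stand-in for the trained classifier (assumption)."""
    return "support" if "error" in text.lower() else "general"


def _json_response(start_response, status, payload):
    """Send a JSON body with the given HTTP status line."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def application(environ, start_response):
    """WSGI entry point; uWSGI looks for a callable named `application` by default."""
    if (environ.get("REQUEST_METHOD") != "POST"
            or environ.get("PATH_INFO") != "/predict"):
        return _json_response(start_response, "404 Not Found", {"error": "not found"})
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length)
    try:
        text = json.loads(raw)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return _json_response(start_response, "400 Bad Request",
                              {"error": "expected JSON like {\"text\": \"...\"}"})
    return _json_response(start_response, "200 OK",
                          {"category": predict_category(text)})


def call_app(method, path, body=b""):
    """Invoke the callable directly, the way a WSGI server would, without a network."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({"REQUEST_METHOD": method, "PATH_INFO": path,
                    "CONTENT_LENGTH": str(len(body)),
                    "wsgi.input": io.BytesIO(body)})
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    return captured, b"".join(application(environ, start_response))


info, payload = call_app("POST", "/predict", b'{"text": "I get an error on login"}')
print(info["status"], payload.decode("utf-8"))
```

In a Flask project the same role is played by the `app` object, which the uWSGI configuration points at by module and callable name.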

I am grateful to all of the people I have met from all over the world. I am especially grateful to STEM-Away. This experience has shaped who I am and who I can be. Thank you. I will be attaching a longer outline of the lessons I have learned, as well as my official certificate.


Multi-Class Confusion Matrix

Model accuracy overall = 95%
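As a reminder of how an overall accuracy figure relates to a multi-class confusion matrix: accuracy is the sum of the diagonal (correct predictions) divided by the sum of all cells. The sketch below shows that arithmetic on a small made-up matrix. The numbers are invented for illustration and are not this project's results, and it assumes the common convention that rows are true labels and columns are predicted labels.

```python
def overall_accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_recall(matrix):
    """For each true class (row), the share of its examples predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Toy 3-class matrix, invented for this example (rows = true, columns = predicted).
toy = [
    [18, 1, 1],
    [2, 17, 1],
    [0, 1, 19],
]

print(overall_accuracy(toy))   # 54 correct out of 60 predictions
print(per_class_recall(toy))
```

Per-class recall is worth checking alongside the overall number, since a high overall accuracy can hide one weak category.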


Tools used during my internship:

Python - NumPy, NLTK, scikit-learn, Pandas, Keras, PyTorch, TensorFlow, Scrapy, Seaborn, Flask
DistilBERT
PyCharm
Forklift
Docker
AWS (EC2, S3, Elastic IP)
uWSGI
NGINX
Ubuntu
Slack
Google Colab
Hugging Face Transformers
ktrain
Jupyter Notebook
Postman
Regex

Banner design comes from STEM-Away.