An NLP Topic Model & Recommender: NLTK, spaCy, and scikit-learn
Goal: Explore natural language processing techniques using TED talk transcripts.
Motivation: The official TED website lists hundreds of talk topics, both broad (e.g., identity, news) and niche (e.g., augmented reality, PTSD). Could the number of topics be reduced while remaining meaningful to the average user?
Research Question: Can we use natural language processing and topic modeling to determine latent groupings of TED talks?
Data
3,600+ TED talk transcripts (out of 4,200 TED talks available online)
Talks were recorded from 1984 to 2020, and uploaded from 2006 to 2020
Process
Tokenize transcripts
Vectorize transcripts
Apply Latent Dirichlet Allocation (LDA) to generate a document-topic matrix
Recommend TED talks using Jensen-Shannon divergence
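The recommendation step can be illustrated with a minimal sketch. It assumes the LDA step has already produced a document-topic matrix in which each row is a talk's topic distribution and sums to 1. The function names `js_divergence` and `recommend` and the toy matrix are hypothetical, chosen for illustration, and are not taken from the project code. The standard library is used so the example runs without extra dependencies.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    # Midpoint distribution between p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def recommend(doc_topic, query_index, top_n=3):
    """Return indices of the talks whose topic mix is closest to the query talk."""
    query = doc_topic[query_index]
    scored = [
        (js_divergence(query, row), i)
        for i, row in enumerate(doc_topic)
        if i != query_index  # never recommend the talk being viewed
    ]
    # Smaller divergence means more similar topic distributions.
    return [i for _, i in sorted(scored)[:top_n]]


# Hypothetical document-topic matrix: 4 talks x 3 topics, each row sums to 1.
doc_topic = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.30, 0.40, 0.30],
]

print(recommend(doc_topic, query_index=0, top_n=2))  # prints [1, 3]
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared topics), so scores are comparable across talks. In practice the matrix would come from the fitted topic model rather than being written by hand, for example from scikit-learn's `LatentDirichletAllocation.fit_transform`.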
Outcome
Applied topic modeling in practice by building a simple TED talk recommender.
Developed an interactive frontend using Streamlit and Heroku to showcase the data, exploratory data analysis, topic modeling, and recommender.
Tech Stack
Server-Side/Back-End
Python
BeautifulSoup
scikit-learn
spaCy
NLTK
Heroku
Client-Side/Front-End
Python
Streamlit