An NLP Topic Model & Recommender

Goal: To explore natural language processing techniques using TED talk transcripts
- Motivation: The official TED website lists hundreds of talk topics, both broad (e.g., identity, news) and niche (e.g., augmented reality, PTSD). Could the number of topics be reduced while remaining meaningful to the average user?
- Research Question: Can we use natural language processing and topic modeling to determine latent groupings of TED talks?

Data
- 3,600+ TED talk transcripts (out of 4,200 TED talks available online)
- Talks were recorded from 1984 to 2020, and uploaded from 2006 to 2020

Process
- Tokenize transcripts
- Vectorize transcripts
- Fit Latent Dirichlet Allocation (LDA) to generate a document-topic matrix
- Recommend TED talks by comparing their topic distributions with Jensen-Shannon divergence
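
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy transcripts, the choice of `CountVectorizer` with its default tokenizer (rather than spaCy or NLTK), the two-topic setting, and the helper names `js_divergence` and `recommend` are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for TED transcripts (hypothetical data, not from the project).
transcripts = [
    "robots and machine learning will change how computers help doctors",
    "machine learning helps computers and robots learn from data",
    "climate change threatens oceans forests and wildlife",
    "protecting forests and oceans slows climate change",
]

# Tokenize and vectorize: CountVectorizer applies its default tokenizer
# and builds a document-term count matrix.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(transcripts)

# LDA: each row of doc_topic is one talk's distribution over topics.
# n_components=2 is an arbitrary choice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(doc_term)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms where a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def recommend(index, doc_topic, top_n=2):
    """Return indices of the top_n talks whose topic mix is closest to talk `index`."""
    scores = np.array([js_divergence(doc_topic[index], row) for row in doc_topic])
    scores[index] = np.inf  # never recommend the talk itself
    return np.argsort(scores)[:top_n].tolist()


print(recommend(0, doc_topic))
```

A lower divergence means two talks have more similar topic mixes, so the recommender returns the talks with the smallest scores. With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (no overlap).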

Outcome
- Applied the topic model by building a simple TED talk recommender.
- Built an interactive front end with Streamlit, hosted on Heroku, to showcase the data, exploratory data analysis, topic modeling, and recommender.

Tech Stack
- Server-Side/Back-End
  - Python
    - BeautifulSoup
    - scikit-learn
    - spaCy
    - NLTK
  - Heroku
- Client-Side/Front-End
  - Python
    - Streamlit

A Comparison of Supervised Machine Learning Models