An NLP Topic Model & Recommender: NLTK, spaCy, and scikit-learn

  • Goal: To explore natural language processing techniques using TED talk transcripts

    • Motivation: The official TED website lists hundreds of talk topics, both broad (e.g., identity, news) and niche (e.g., augmented reality, PTSD). Could the number of topics be reduced while remaining meaningful to the average user?

    • Research Question: Can we use natural language processing and topic modeling to determine latent groupings of TED talks?

  • Data

    • 3,600+ TED talk transcripts (out of 4,200 TED talks available online)

    • Talks were recorded from 1984 to 2020, and uploaded from 2006 to 2020

  • Process

    • Tokenize transcripts

    • Vectorize transcripts

    • Apply Latent Dirichlet Allocation (LDA) to generate a document-topic matrix

    • Recommend TED talks using Jensen-Shannon divergence
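
The final step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes each talk is already represented by a row of topic proportions (such as a row of the LDA document-topic matrix) and ranks the other talks by Jensen-Shannon divergence from a chosen talk. The function names, the base-2 logarithm, the example titles, and the topic mixtures are all hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    # Mixture distribution: the pointwise average of p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def recommend(doc_topics, titles, query_index, k=2):
    """Return the k titles whose topic mixtures are closest to the query talk."""
    query = doc_topics[query_index]
    scored = [
        (js_divergence(query, row), titles[i])
        for i, row in enumerate(doc_topics)
        if i != query_index  # never recommend the talk itself
    ]
    scored.sort()  # smallest divergence first
    return [title for _, title in scored[:k]]


# Hypothetical document-topic rows (each sums to 1); not real TED data.
doc_topics = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],
    [0.05, 0.05, 0.90],
]
titles = ["Talk A", "Talk B", "Talk C", "Talk D"]

recommendations = recommend(doc_topics, titles, query_index=0, k=2)
print(recommendations)
```

Jensen-Shannon divergence is symmetric and, with a base-2 logarithm, bounded between 0 and 1, which makes it a convenient way to compare topic mixtures. Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the Jensen-Shannon distance), so the two are not interchangeable without squaring.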

  • Outcome

    • Applied the topic model by building a simple TED talk recommender.

    • Developed an interactive frontend, built with Streamlit and hosted on Heroku, to showcase the data, exploratory data analysis, topic modeling, and recommender.

  • Tech Stack

    • Server-Side/Back-End

      • Python

        • BeautifulSoup

        • scikit-learn

        • spaCy

        • NLTK

      • Heroku

    • Client-Side/Front-End

      • Python

        • Streamlit
