
Too Long, Didn't Watch: End-to-End Rolling Summarizer of Long Videos





TL/DW: Too Long, Didn't Watch

Download, Transcribe & Summarize Videos. All automated


What is TL/DW?

  • Feed the script a single URL, a single local video, or a list of URLs and/or local videos, and each video is transcribed with faster-whisper (the audio is downloaded first if the source is not local).
  • Transcriptions can then be sent to an LLM API endpoint of your choice, whether local or remote.
  • Rolling summaries (i.e. chunking up the input and doing a chain of summaries) are currently supported only through OpenAI, though the scripts in this repo will let you do it with exllama or vLLM.
  • Any site supported by yt-dlp is supported, so you can use this with sites besides YouTube.
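
The rolling-summary idea above can be sketched roughly as follows. This is a minimal illustration, not the code used in this repo: `chunk_text`, `rolling_summary`, the word-based chunking, and the prompt layout are all assumptions made for the example, and `summarize` stands in for whatever LLM call you use.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rolling_summary(text: str, summarize: Callable[[str], str], max_words: int = 500) -> str:
    """Summarize chunk by chunk, carrying the running summary into each new call."""
    summary = ""
    for chunk in chunk_text(text, max_words):
        # Hypothetical prompt layout: previous summary first, then the new chunk.
        if summary:
            prompt = f"Summary so far:\n{summary}\n\nNew text:\n{chunk}"
        else:
            prompt = chunk
        summary = summarize(prompt)
    return summary
```

Any function that takes a prompt string and returns a summary string can be passed as `summarize`, so the same loop works with a local or a remote endpoint.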

For commercial API usage, I personally recommend Sonnet. It's great quality and relatively inexpensive.

As for personal offline usage, Microsoft Phi-3 Mini 128k is great if you don't have a lot of VRAM and want to self-host. (I think it's better than anything up to 70B for summarization, though I do not have actual evidence for this.)

Application Demo

CLI tldw-summarization-cli-demo

GUI tldw-summarization-gui-demo


Quickstart after Installation

  • Transcribe audio from a YouTube URL:

    • python
  • Transcribe audio from a YouTube URL & summarize it using (anthropic/cohere/openai/llama (llama.cpp)/ooba (oobabooga/text-gen-webui)/kobold (kobold.cpp)/tabby (Tabbyapi)) API:

    • python -api <your choice of API>
      • Make sure to put your API key into config.txt under the appropriate API variable
  • Transcribe a list of YouTube URLs & summarize them using (anthropic/cohere/openai/llama (llama.cpp)/ooba (oobabooga/text-gen-webui)/kobold (kobold.cpp)/tabby (Tabbyapi)) API:

    • python ./ListofVideos.txt -api <your choice of API>
      • Make sure to put your API key into config.txt under the appropriate API variable
  • Transcribe & summarize a list of videos on your local filesystem with a text file:

    • python -v ./local/file_on_your/system
  • Download a Video with Audio from a URL:

    • python -v
  • Run it as a WebApp

    • python -ui - This requires you to either put your API keys into the config.txt file or pass them into the app every time you use it.
      • Exposing every CLI option is planned but not yet implemented.
      • Has an option to download the generated transcript and summary as text files.
      • Can also download video/audio as files if selected in the UI (WIP - doesn't currently work)



  • Linux
    1. Install the necessary packages: Python 3, ffmpeg (sudo apt install ffmpeg or dnf install ffmpeg), and possibly others
    2. Create a virtual env: python -m venv ./
    3. Launch/activate your virtual env: source ./bin/activate
    4. See Linux && Windows
  • Windows
    1. Install the necessary packages: Python 3, ffmpeg, and possibly others
    2. Create a virtual env: python -m venv .\
    3. Launch/activate your virtual env: . .\scripts\activate.ps1
    4. See Linux && Windows
  • Linux && Windows
    1. pip install -r requirements.txt - may take a bit of time...
    2. Run python ./ <video_url> - The video URL does not have to be a YouTube URL. It can be from any site that yt-dlp supports.
    3. You'll then be asked whether to run the transcription on GPU (1) or CPU (2).
    4. Next, the video will be downloaded to the local directory by yt-dlp.
    5. Then the video will be transcribed by faster_whisper. (You can see this in the console output.)
       • The resulting transcription is stored both as a JSON file with timestamps and as a txt file without timestamps.
    6. Finally, you can have the transcription summarized by feeding it into an LLM of your choice.
    7. To run it locally, pass the '--local' argument to the script. This will download and launch a local inference server as part of the script. (WIP - not in place yet)
       • This will take up at least 6 GB of space.
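
As a rough illustration of step 5, the two output files could be written like this. It is only a sketch: `save_transcript` and the segment layout (`start`, `end`, `text` keys) are assumptions made for the example, not the script's actual schema or file naming.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_transcript(segments: List[Dict], out_dir: str, stem: str) -> None:
    """Write <stem>.json (with timestamps) and <stem>.txt (text only) into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # JSON keeps the hypothetical start/end timestamps for every segment.
    (out / f"{stem}.json").write_text(json.dumps(segments, indent=2), encoding="utf-8")
    # The txt file keeps only the spoken text, one segment per line.
    plain = "\n".join(seg["text"].strip() for seg in segments)
    (out / f"{stem}.txt").write_text(plain, encoding="utf-8")
```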


  • Single file (remote URL) transcription
    • Single URL: python
  • Single file (local) transcription
    • Transcribe a local file: python /path/to/your/localfile.mp4
  • Multiple files (local & remote)
    • List of files (URLs and local files can be mixed): python ./path/to/your/text_file.txt
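
The three input modes above could be told apart along these lines. This is a hedged sketch: `is_url` and `expand_input` are hypothetical helpers, and treating any non-URL argument ending in `.txt` as a list file is an assumption, not necessarily how the script decides.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def is_url(item: str) -> bool:
    """Return True for http(s) URLs, False for anything else (e.g. local paths)."""
    parsed = urlparse(item)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def expand_input(input_path: str) -> List[str]:
    """Turn one CLI argument into the list of URLs and/or local files to process."""
    # Assumption: a local .txt argument is a list with one URL or file path per line.
    if not is_url(input_path) and input_path.lower().endswith(".txt"):
        lines = Path(input_path).read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return [input_path]
```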

To save time, put these settings in the config.txt file; they will be used whenever the script runs.

usage: [-h] [-v] [-api API_NAME] [-key API_KEY] [-ns NUM_SPEAKERS] [-wm WHISPER_MODEL] [-off OFFSET] [-vad]
                    [-log {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-ui] [-demo] [-prompt CUSTOM_PROMPT] [-overwrite] [-roll]
                    [-detail DETAIL_LEVEL]

Transcribe and summarize videos.

positional arguments:
  input_path            Path or URL of the video

  -h, --help            show this help message and exit
  -v, --video           Download the video instead of just the audio
  -api API_NAME, --api_name API_NAME
                        API name for summarization (optional)
  -key API_KEY, --api_key API_KEY
                        API key for summarization (optional)
  -ns NUM_SPEAKERS, --num_speakers NUM_SPEAKERS
                        Number of speakers (default: 2)
  -wm WHISPER_MODEL, --whisper_model WHISPER_MODEL
                        Whisper model (default: small.en)
  -off OFFSET, --offset OFFSET
                        Offset in seconds (default: 0)
  -vad, --vad_filter    Enable VAD filter
  -log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level (default: INFO)
  -ui, --user_interface
                        Launch the Gradio user interface
  -demo, --demo_mode    Enable demo mode
  -prompt CUSTOM_PROMPT, --custom_prompt CUSTOM_PROMPT
                        Pass in a custom prompt to be used in place of the existing one. (Probably should just modify the script itself...)
  -overwrite, --overwrite
                        Overwrite existing files
  -roll, --rolling_summarization
                        Enable rolling summarization
  -detail DETAIL_LEVEL, --detail_level DETAIL_LEVEL
                        Mandatory if rolling summarization is enabled, defines the chunk size. Default is 0.01(lots of chunks) -> 1.00 (few
                        chunks) Currently only OpenAI works.
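
For reference, a parser producing roughly this help text could be declared as below. It is a sketch reconstructed from the usage output alone, not the script's actual code: `build_parser` is a hypothetical name, the optional `input_path`, the argument types, and the `-detail` default of 0.01 are assumptions, and only the short `-log` flag is used because its long form is not shown above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the usage text (types and some defaults assumed)."""
    p = argparse.ArgumentParser(description="Transcribe and summarize videos.")
    # Assumption: input_path is optional so that -ui can be passed on its own.
    p.add_argument("input_path", nargs="?", help="Path or URL of the video")
    p.add_argument("-v", "--video", action="store_true",
                   help="Download the video instead of just the audio")
    p.add_argument("-api", "--api_name", help="API name for summarization (optional)")
    p.add_argument("-key", "--api_key", help="API key for summarization (optional)")
    p.add_argument("-ns", "--num_speakers", type=int, default=2)
    p.add_argument("-wm", "--whisper_model", default="small.en")
    p.add_argument("-off", "--offset", type=int, default=0)
    p.add_argument("-vad", "--vad_filter", action="store_true")
    p.add_argument("-log", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    p.add_argument("-ui", "--user_interface", action="store_true")
    p.add_argument("-demo", "--demo_mode", action="store_true")
    p.add_argument("-prompt", "--custom_prompt")
    p.add_argument("-overwrite", "--overwrite", action="store_true")
    p.add_argument("-roll", "--rolling_summarization", action="store_true")
    # Assumption: detail level is a float between 0.01 and 1.00.
    p.add_argument("-detail", "--detail_level", type=float, default=0.01)
    return p
```

Calling `build_parser().parse_args(["-ui"])` returns a namespace with `user_interface` set and every other option at its default.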

-Download Audio only from URL -> Transcribe audio:

-Transcribe audio from a Youtube URL & Summarize it using (anthropic/cohere/openai/llama (llama.cpp)/ooba (oobabooga/text-gen-webui)/kobold (kobold.cpp)/tabby (Tabbyapi)) API:
  >python -api <your choice of API>
    - Make sure to put your API key into `config.txt` under the appropriate API variable

-Download Video with audio from URL -> Transcribe audio from Video:
  >python -v

-Download Audio+Video from a list of videos in a text file (can be file paths or URLs) and have them all summarized:
  >python --video ./local/file_on_your/system --api_name <API_name>

-Transcribe & Summarize a List of Videos on your local filesystem with a text file:
  >python -v ./local/file_on_your/system

-Run it as a WebApp:
  >python -ui

By default videos, transcriptions and summaries are stored in a folder with the video's name under './Results', unless otherwise specified in the config file.
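
A per-video output folder like the one described above might be derived as follows. This is a sketch only: `results_dir` is a hypothetical helper, and the character replacement rule is an assumption, since the script's real naming rules are not documented here.

```python
import re
from pathlib import Path


def results_dir(video_title: str, base: str = "./Results") -> Path:
    """Return the folder for one video, replacing characters unsafe in file names."""
    # Assumption: characters that are invalid on common filesystems become "_".
    safe = re.sub(r'[<>:"/\\|?*]', "_", video_title).strip() or "untitled"
    return Path(base) / safe
```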

Setting up a Local LLM Inference Engine

Pieces & What's in the repo?

  • Workflow
    1. Set up Python + packages
    2. Set up ffmpeg
    3. Run python <video_url> or python <List_of_videos.txt>
    4. If you want summarization, add your API keys (if not using a local LLM) to the config.txt file, then re-run the script, passing it the name of the API [or URL endpoint - to be added].
    • python --api_name anthropic - This will attempt to download the video, then upload the resulting JSON file to the Anthropic API endpoint, using the values set in the config file (API key and model) to request summarization.
    • Anthropic:
      • claude-3-opus-20240229
      • claude-3-sonnet-20240229
      • claude-3-haiku-20240307
    • Cohere:
      • command-r
      • command-r-plus
    • Groq
      • llama3-8b-8192
      • llama3-70b-8192
      • mixtral-8x7b-32768
    • HuggingFace:
      • CohereForAI/c4ai-command-r-plus
      • meta-llama/Meta-Llama-3-70B-Instruct
      • meta-llama/Meta-Llama-3-8B-Instruct
      • Supposedly you can use any model on there, but this is for reference for the free demo instance, in case you'd like to host your own.
    • OpenAI:
      • gpt-4-turbo
      • gpt-4-turbo-preview
      • gpt-4
  • What's in the repo?
    • - download, transcribe and summarize audio
      1. First, it uses yt-dlp to download the audio (optionally the video) from the supplied URL
      2. Next, it uses ffmpeg to convert the resulting .m4a file to .wav
      3. Then it uses faster_whisper to transcribe the .wav file to .txt
      4. After that, it uses pyannote to perform diarization
      5. Finally, it sends the resulting txt to an LLM endpoint of your choice for summarization.
    • - break text into parts and prepare each part for LLM summarization
    • roller-*.py - rolling summarization
      • can-ai-code - interview executors to run LLM inference
    • - prepare LLM outputs for webapp
    • - summary viewer webapp
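
The download, convert, and transcribe steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: the function names are hypothetical, the 16 kHz mono conversion and the `int8` compute type are assumed settings, and the pyannote diarization and LLM steps are left out. It requires yt-dlp and ffmpeg on the PATH and the faster-whisper package installed.

```python
import subprocess
from typing import List


def ytdlp_audio_cmd(url: str, out_stem: str) -> List[str]:
    """Build a yt-dlp command that extracts the audio track as <out_stem>.m4a."""
    return ["yt-dlp", "-x", "--audio-format", "m4a", "-o", f"{out_stem}.%(ext)s", url]


def ffmpeg_wav_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command converting src to WAV (16 kHz mono is an assumption)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


def transcribe(wav_path: str, model_name: str = "small.en", device: str = "cpu") -> str:
    """Transcribe a WAV file with faster-whisper and return the plain text."""
    # Imported lazily so the command builders work without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device=device, compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return "\n".join(segment.text.strip() for segment in segments)


def download_and_transcribe(url: str, stem: str = "audio") -> str:
    """Run download -> convert -> transcribe for a single URL."""
    subprocess.run(ytdlp_audio_cmd(url, stem), check=True)
    subprocess.run(ffmpeg_wav_cmd(f"{stem}.m4a", f"{stem}.wav"), check=True)
    return transcribe(f"{stem}.wav")
```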

Similar/Other projects:
