How to Summarize a YouTube Transcript With AI
- The fastest manual way: paste the transcript into ChatGPT or Claude under a clear prompt (summarize this transcript into key points with timestamps). For a video under ~30 minutes the whole transcript fits in one message.
- The repeatable way: I ran a Python pipeline that fetches the transcript with youtube-transcript-api and sends it to the OpenAI API. A 522-token transcript summarized for a fraction of a cent.
- Long videos: a 3-hour podcast transcript can run past a smaller model's context window. The fix is chunking the transcript and summarizing in a map-reduce pass, which I show below.
- At scale: the transcript step throws IpBlocked from cloud IPs. For bulk jobs I pull text through a transcript API first, then summarize.
Once I have a YouTube transcript in hand, the next thing I almost always want is a summary of it. A 40-minute conference talk becomes five bullet points. A two-hour podcast becomes a paragraph and a list of timestamps worth jumping to. This guide covers the AI summarization step. It picks up exactly where pulling the transcript leaves off, which I cover separately in getting the transcript of a YouTube video.
I tested three routes to summarize a YouTube video transcript in June 2026: a plain ChatGPT prompt, a Python pipeline that fetches the transcript and calls an LLM API, and the one-click browser extensions sold as a YouTube transcript summarizer. Every code block below is something I ran, with the real numbers. The short version is that a manual prompt is the best AI tool for a single video, code wins the moment you have a list of them, and long videos need one extra trick to get past the token limit.
What is the fastest way to summarize a YouTube transcript with AI?
The fastest way to summarize a YouTube transcript with AI is to paste the transcript into ChatGPT or Claude under a clear summarization prompt. Both tools read the full transcript text and return a structured summary in a few seconds, and the free tier handles it as long as the transcript fits inside one message. This is the route I reach for when I only have one video to condense.
The catch most people hit first: ChatGPT cannot open a YouTube link and watch the video. It works on the transcript text you give it. So the real workflow is two steps: get the transcript, then summarize it.
| Step | What you do | Tool |
|---|---|---|
| 1 | Copy the transcript text | YouTube Show transcript panel, or a generator |
| 2 | Paste it under a prompt | ChatGPT, Claude, or Gemini |
| 3 | Read the summary | The model returns key points |
The quality of the YouTube transcript summary depends almost entirely on the prompt. A bare “summarize this” returns a flat paragraph. Naming the output format gets something useful, which is the next thing to get right before you paste anything.
What is the best prompt to summarize a YouTube transcript?
The best prompt to summarize a YouTube transcript names the output format, the length, and what to preserve. A prompt that specifies “5 bullet points, keep timestamps, list any tools mentioned” returns a far more useful summary than “summarize this transcript,” because the model has explicit targets instead of guessing what you want.
Here is the prompt I paste above a transcript. It is plain text, nothing clever:
Summarize the following YouTube transcript.
- Give a 2-sentence overview first.
- Then 5 to 7 key points as bullets.
- Keep the approximate timestamp for each point.
- List any tools, people, or sources mentioned by name.
Transcript:
<paste transcript here>
A few things in that prompt earn their place. Asking for the overview first gives a fast answer at the top. Asking it to keep timestamps means the summary doubles as a jump-to index for the video. Asking it to pull named tools and people stops the model from smoothing over the specifics, which is the detail a generic summary usually loses.
OpenAI’s own prompting guidance makes the same point in general terms: specific instructions about length, format, and the role the model should take produce more reliable output than vague requests. The transcript is just a long block of input text, so the normal prompting rules apply.
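If the same prompt is going to be reused across videos, it can help to build it from parameters instead of retyping it. The following is a minimal sketch, not a library API: `build_prompt` and its arguments are hypothetical names chosen for illustration, and the wording simply mirrors the prompt shown above.

```python
def build_prompt(transcript, min_points=5, max_points=7, keep_timestamps=True):
    """Assemble the summarization prompt from its parts.

    build_prompt and its parameters are illustrative names, not a library API.
    """
    lines = [
        "Summarize the following YouTube transcript.",
        "- Give a 2-sentence overview first.",
        f"- Then {min_points} to {max_points} key points as bullets.",
    ]
    if keep_timestamps:
        lines.append("- Keep the approximate timestamp for each point.")
    lines.append("- List any tools, people, or sources mentioned by name.")
    lines.append("")
    lines.append("Transcript:")
    lines.append(transcript)
    return "\n".join(lines)


# Example: print the prompt for a one-line stand-in transcript.
print(build_prompt("[00:00] Welcome to the talk..."))
```

Keeping the format, length, and "what to preserve" instructions as arguments makes it easy to ask for a shorter summary or drop timestamps without rewriting the whole prompt.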
For a single short video, pasting under that prompt is the whole job. The friction starts when the transcript is long, or when you have twenty videos to get through, and that is where a script beats copy-paste.
How do you summarize a YouTube transcript with Python?
To summarize a YouTube video transcript with Python, fetch the transcript with the youtube-transcript-api library, join the text into one string, and send it to an LLM API such as OpenAI’s with a summarization prompt. The script turns the two manual steps into one repeatable call you can run over a list of videos. This is the route I default to once I have more than a couple of videos.
Install the two pieces:
pip install youtube-transcript-api openai
Then fetch the transcript and summarize it. The video ID is the part after v= in a watch URL:
import os
from youtube_transcript_api import YouTubeTranscriptApi
from openai import OpenAI
VIDEO_ID = "dQw4w9WgXcQ"
# 1. Fetch the transcript (no API key needed for this step)
fetched = YouTubeTranscriptApi().fetch(VIDEO_ID)
transcript = " ".join(snippet.text for snippet in fetched)
# 2. Send it to an LLM with a summarization prompt
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
prompt = (
    "Summarize this YouTube transcript into a 2-sentence overview "
    "followed by 5 key points as bullets:\n\n" + transcript
)
resp = client.chat.completions.create(
    model="gpt-5.4-nano",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
When I ran the fetch step against a public video in June 2026, YouTubeTranscriptApi().fetch() returned 61 transcript snippets that joined into 2,089 characters of text. Each snippet carried text, start, and duration fields, so the timestamps are there if you want the prompt to keep them. That transcript is roughly 522 tokens, which I estimated by dividing the character count by four (2,089 / 4 ≈ 522), the standard rough estimate.
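Because each snippet carries a start time, the transcript can be joined with timestamps instead of as bare text, which gives the model something to anchor its key points to. This is a minimal sketch under a few assumptions: snippets expose `.text` and `.start` (in seconds) as described above, `format_timestamp` and `with_timestamps` are hypothetical helper names, and the `Snippet` dataclass is only a stand-in so the example runs without a network call.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """Stand-in for a fetched transcript snippet (start and duration in seconds)."""
    text: str
    start: float
    duration: float


def format_timestamp(seconds):
    """Render seconds as MM:SS, or H:MM:SS once the video passes one hour."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def with_timestamps(snippets):
    """Join snippets one per line, each prefixed with its start time."""
    return "\n".join(f"[{format_timestamp(s.start)}] {s.text}" for s in snippets)


demo = [Snippet("Welcome back.", 0.0, 2.5), Snippet("Today: chunking.", 75.4, 3.0)]
print(with_timestamps(demo))
```

Since the fetched transcript is iterable, passing it in place of `demo` should produce the same kind of output. The trade-off is a few extra tokens per line for the timestamp prefix.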
At 522 input tokens, the summarization call is effectively free. GPT-5.4 nano is priced at $0.20 per million input tokens, so the input for this transcript costs about $0.0001, roughly one-hundredth of a cent, and the short summary it returns adds a similarly small amount at the output rate. The economics only matter once you are running thousands of videos, and even then a small model keeps it cheap.
Which model should you use to summarize a transcript?
For summarizing a transcript, a small fast model is the right default, because summarization is not a hard reasoning task and the cheap models do it well. I reach for a nano or mini tier model first and only move up to a flagship if the summaries miss nuance on dense technical talks.
| Model | Input price (per 1M tokens) | Output price (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-5.4 nano | $0.20 | $1.25 | Cheapest, fine for plain summaries |
| GPT-5.4 mini | $0.75 | $4.50 | Better on dense or technical content |
| Claude Haiku 4.5 | $1.00 | $5.00 | Strong at structured, faithful summaries |
The price gap between the tiers is real but the absolute cost is tiny for summarization, since a transcript is mostly input tokens and the summary output is short. A 50-minute talk runs around 10,000 input tokens, which is a cent or less of input cost on any of these models. The constraint that actually bites is not price. It is the context window on very long videos, which is the next thing to plan for.
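The arithmetic behind those figures is simple enough to script. This is a rough sketch that uses the input prices from the table and the four-characters-per-token estimate. `estimate_tokens` and `input_cost_usd` are illustrative names, and the prices are copied from the table by hand rather than fetched from any pricing API.

```python
# Input prices in USD per 1M tokens, copied from the table above.
INPUT_PRICE_PER_MILLION = {
    "GPT-5.4 nano": 0.20,
    "GPT-5.4 mini": 0.75,
    "Claude Haiku 4.5": 1.00,
}


def estimate_tokens(char_count):
    """Rough token estimate: about four characters per token."""
    return char_count // 4


def input_cost_usd(tokens, price_per_million):
    """Input-side cost only; the summary's output tokens are billed separately."""
    return tokens * price_per_million / 1_000_000


tokens = estimate_tokens(2089)  # the 2,089-character transcript from earlier
for model, price in INPUT_PRICE_PER_MILLION.items():
    print(f"{model}: {tokens} tokens, ${input_cost_usd(tokens, price):.6f}")
```

Multiplying the per-video figure by the size of a backlog gives a quick budget check before a bulk run. Output tokens are left out here because the summary length depends on the prompt.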
How do you summarize a long YouTube video that exceeds the token limit?
You summarize a long YouTube video that exceeds the token limit by chunking the transcript into pieces, summarizing each piece, then summarizing the combined summaries. This map-reduce approach keeps every individual request inside the model’s context window, so a three-hour podcast that does not fit in one prompt still gets a clean final summary.
The token limit is a hard ceiling. Every model has a context window, and a transcript longer than that window cannot be sent in one call. A rough rule from working with these is that a 50-minute talk is about 10,000 tokens. At that rate a three-hour podcast is around 36,000 tokens, and a marathon livestream can pass 100,000, enough to overflow a smaller model. When the input exceeds the limit, the model either errors or silently drops context and returns a summary that misses whole sections.
Here is the chunked pattern I use for long transcripts:
import os
from youtube_transcript_api import YouTubeTranscriptApi
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def summarize(text):
    # Summarize one block of transcript text into three bullets
    resp = client.chat.completions.create(
        model="gpt-5.4-nano",
        messages=[{"role": "user",
                   "content": "Summarize this transcript section into 3 bullets:\n\n" + text}],
    )
    return resp.choices[0].message.content
fetched = YouTubeTranscriptApi().fetch("dQw4w9WgXcQ")
words = " ".join(s.text for s in fetched).split()
# MAP: split into ~3000-word chunks and summarize each
chunk_size = 3000
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
partials = [summarize(c) for c in chunks]
# REDUCE: summarize the combined partial summaries
final = summarize("\n\n".join(partials))
print(final)
The map step summarizes each ~3,000-word chunk independently, so no single request approaches the context window. The reduce step then condenses those partial summaries into one. This is the same technique the LangChain documentation describes for summarization, where map-reduce is the standard answer for documents that are too long to summarize in a single pass.
For most videos under an hour you never need this. The whole transcript fits in one call and the simple script from the previous section is enough. The chunking only earns its complexity on long-form content, which is exactly the kind of video people most want condensed into a summary they can read in a minute.
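A quick pre-check can decide between the single call and the chunked path before any request is sent. This is a sketch under assumptions: the four-characters-per-token estimate is rough, `CONTEXT_WINDOW_TOKENS` is a placeholder value to replace with the documented limit of whichever model you use, and `plan_summary` is a hypothetical name.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # placeholder; check your model's documented limit
RESERVED_TOKENS = 2_000          # headroom for the prompt text and the summary output


def estimate_tokens(text):
    """Rough estimate: about four characters per token."""
    return len(text) // 4


def plan_summary(text, window=CONTEXT_WINDOW_TOKENS, reserve=RESERVED_TOKENS):
    """Return 'single' if the transcript likely fits in one call, else 'chunked'."""
    return "single" if estimate_tokens(text) <= window - reserve else "chunked"


# Example: a short stand-in transcript of 2,000 words.
print(plan_summary("word " * 2_000))
```

Because the estimate is approximate, a generous reserve is safer than cutting it close. A real tokenizer would give a tighter number if the margin matters.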
How do you summarize a YouTube video without a transcript?
You summarize a YouTube video without a transcript by transcribing the audio first with a speech-to-text model, then summarizing that text. When a video has no caption track, there is no transcript to read, so the missing step is generating one from the audio. A tool advertised as a “YouTube summarizer without transcript” is running this audio-to-text step for you behind the scenes.
The reason a video can lack a transcript at all comes down to captions. If the creator disabled captions and YouTube did not auto-generate them, no caption track exists, and no summarizer can read text that was never created. Music videos, some Shorts, and audio in unsupported languages routinely fall into this gap. I cover exactly which videos have no caption track in the transcript guide.
The fix is to transcribe the audio yourself. OpenAI’s Whisper is the common open route: you download the audio, run it through Whisper to get text, then feed that text into the same summarization prompt as any other transcript. It is an extra step and it costs compute, but it is the only honest way to summarize a video that has no captions, because the summary still needs text as input.
For videos that do have captions, skipping straight to the caption track is faster and cheaper than re-transcribing the audio. That is why the methods in this guide all read the existing transcript when one is available, and only fall back to speech-to-text when there genuinely is none.
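That "captions first, speech-to-text only as a fallback" order can be written down as a small wrapper. This is a minimal sketch, not a definitive implementation: `transcript_or_transcribe` is a hypothetical name, the two callables are supplied by the caller (for example a caption fetch and a Whisper run), and signalling "no caption track" with `LookupError` is a convention assumed only for this example.

```python
def transcript_or_transcribe(video_id, fetch_captions, transcribe_audio):
    """Prefer the caption track; fall back to speech-to-text only when it is missing.

    fetch_captions(video_id) returns text or raises LookupError when no caption
    track exists; transcribe_audio(video_id) returns text from the audio.
    Both callables and the LookupError convention are assumptions of this sketch.
    """
    try:
        return fetch_captions(video_id), "captions"
    except LookupError:
        return transcribe_audio(video_id), "speech-to-text"


# Stand-in callables so the sketch runs without network access or audio files.
def fake_captions(video_id):
    if video_id == "no-captions":
        raise LookupError("no caption track")
    return "caption text"


def fake_transcribe(video_id):
    return "transcribed text"


print(transcript_or_transcribe("has-captions", fake_captions, fake_transcribe))
print(transcript_or_transcribe("no-captions", fake_captions, fake_transcribe))
```

Returning the source alongside the text makes it possible to log how often the slower, more expensive fallback is actually used.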
What about browser extensions that summarize YouTube videos?
A browser extension marketed as a YouTube transcript summarizer works by reading the page’s caption track and sending it to an LLM, then showing the summary in a panel next to the video. These free YouTube transcript summarizer tools collapse the copy-transcript and paste-into-ChatGPT steps into one button on the watch page. For casual one-off summaries while browsing, an online summarizer like this is the lowest-effort option.
I looked at the most-installed ones to see what they actually do:
| Extension | Backing model | Free tier | What it adds |
|---|---|---|---|
| Glasp | ChatGPT, Claude, Gemini | Yes, desktop use | Opens ChatGPT with transcript pre-loaded |
| Eightify | ChatGPT and Claude | Limited free | Key insights with timestamps |
| ScreenApp | Proprietary | Limited free | Timestamped notes, also non-YouTube video |
Under the hood, these do the same two things the Python script does: grab the transcript, send it to a model. Glasp is upfront that clicking its ChatGPT icon “opens a new ChatGPT window with the transcript pre-loaded and a summarization prompt ready,” which is the manual workflow automated. The convenience is real for browsing. The limitation is that an extension summarizes one video at a time in your browser, so it does not help when you need to process a backlog programmatically.
That programmatic case, summarizing transcripts across a whole channel or a long list of videos, is where the extension and the copy-paste route both run out, and where the bottleneck moves back to reliably getting the transcripts in the first place.
How do you summarize YouTube transcripts at scale?
To summarize YouTube transcripts at scale, pull each transcript through a transcript API that handles the blocking, then run the summarization prompt over the returned text. The summarization itself scales fine, since an LLM API takes as many calls as you can send. The part that breaks at scale is the transcript fetch, because YouTube blocks the cloud IPs your script runs on.
This is the wall the Python pipeline hits in production. The youtube-transcript-api library warns that YouTube “has started blocking most IPs that are known to belong to cloud providers (like AWS, Google Cloud Platform, Azure, etc.)”, which surfaces as an IpBlocked or RequestBlocked exception the moment you deploy. The summarization step is fine on a server. The fetch step is what fails.
| Where the pipeline runs | Transcript fetch | Summarization |
|---|---|---|
| Home or office laptop | Works | Works |
| AWS / GCP / Azure server | IpBlocked exception | Works |
| Through a transcript API | Works, handled server-side | Works |
The clean fix is to get the transcript text from an API that owns the proxy layer, then summarize that text with the same code as before. The request is a single call with your key and the video URL. Asking for format=text returns the whole transcript pre-joined into one text field, which is exactly what the summarization prompt wants:
curl "https://api.youtubescraperapi.com/api/v1/youtube/transcript?url=https://www.youtube.com/watch?v=dQw4w9WgXcQ&format=text&api_key=$API_KEY"
import os, requests
from openai import OpenAI
# 1. Get the transcript from the API (no IpBlocked on a server)
r = requests.get(
    "https://api.youtubescraperapi.com/api/v1/youtube/transcript",
    params={
        "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
        "format": "text",  # one combined string, ready to summarize
        "api_key": os.environ["API_KEY"],
    },
    timeout=30,
)
data = r.json()
if not data["transcript_available"]:
    raise SystemExit(f"no transcript: {data['reason']}")
transcript = data["text"]
# 2. Summarize the returned transcript text with the same prompt as before
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
summary = client.chat.completions.create(
    model="gpt-5.4-nano",
    messages=[{"role": "user",
               "content": "Summarize this transcript into 5 key points:\n\n" + transcript}],
)
print(summary.choices[0].message.content)
The server runs the transcript extraction from residential IPs it manages, so the same job that throws IpBlocked from your own cloud function returns parsed data here. When a video has only auto-generated captions, the endpoint returns that ASR track (marked source: "asr") instead of failing, so most spoken-word videos still summarize. You grab a key on the youtubescraperapi.com sign-up page and point the transcript API endpoint at any watch URL. If you want timed subtitle lines instead of plain text, the caption and subtitle scraper returns those.
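For a backlog, the fetch and summarize steps can be wrapped in a loop that records failures instead of stopping at the first one. This is a minimal sketch: `summarize_many` is a hypothetical helper, and `get_transcript` and `summarize` are whatever callables you already have, such as the API request and the OpenAI call above wrapped in functions.

```python
def summarize_many(video_urls, get_transcript, summarize):
    """Summarize each video, collecting errors instead of aborting the batch.

    get_transcript(url) -> str and summarize(text) -> str are supplied by the
    caller; both names are placeholders for illustration.
    """
    summaries, failures = {}, {}
    for url in video_urls:
        try:
            summaries[url] = summarize(get_transcript(url))
        except Exception as exc:  # broad on purpose: one bad video should not stop the run
            failures[url] = str(exc)
    return summaries, failures


# Stand-in callables so the sketch runs without network access.
def fake_transcript(url):
    if "missing" in url:
        raise ValueError("no transcript")
    return f"transcript for {url}"


done, failed = summarize_many(
    ["https://example.com/a", "https://example.com/missing"],
    fake_transcript,
    lambda text: text.upper(),
)
print(done)
print(failed)
```

Keeping the failures in their own dictionary makes it straightforward to retry only the videos that did not come back, rather than rerunning the whole list.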
For a single video, the browser extension or a paste into ChatGPT is the right tool and costs nothing. The API earns its place when you are summarizing transcripts across a channel or a thousand-video backlog and cannot have the pipeline fail on a datacenter IP. Before you automate bulk collection, it is worth knowing where the line sits with YouTube’s Terms of Service, which I break down in whether scraping YouTube is legal.
FAQ
What is the best free AI tool to summarize a YouTube transcript?
For a one-off summary, ChatGPT or Claude with a good prompt is the best free AI tool: paste the transcript, ask for key points with timestamps, and you get a clean summary in seconds. For repeatable or bulk work, a Python script that fetches the transcript and calls an LLM API is cheaper and scriptable. Browser extensions like Glasp and Eightify wrap the same idea into one click on the video page.
Can ChatGPT summarize a YouTube transcript for free?
Yes. ChatGPT summarizes a pasted YouTube transcript on the free tier, as long as the transcript fits inside the message. ChatGPT cannot open a YouTube link and read the video itself, so you paste the transcript text. For videos longer than roughly an hour you either summarize in chunks or use the API, where a small model like GPT-5.4 nano costs about $0.20 per million input tokens.
How do you summarize a YouTube video that has no transcript?
When a video has no caption track, you cannot read a transcript, so you transcribe the audio first with a speech-to-text model such as OpenAI Whisper, then summarize that text the same way. Tools advertised as a "YouTube summarizer without transcript" run this audio-to-text step for you behind the scenes. There is no way to summarize speech that was never converted to text.
Is there a token limit when summarizing a YouTube transcript?
Yes. Every model has a context window, and a long transcript can exceed it. GPT-5.4 models expose a large window, but a multi-hour podcast can still run past a smaller model's limit. A 50-minute talk is roughly 10,000 tokens, which most current models handle in one pass. Past that, you chunk the transcript and summarize each piece, then summarize the summaries.
What prompt works best for summarizing a YouTube transcript?
A prompt that names the output structure works best: "Summarize the following YouTube transcript into a 5-point bulleted summary, keep the timestamps for each point, and list any tools or names mentioned." Naming the format, the length, and what to preserve (timestamps, names) returns a far more useful summary than asking it to just summarize.