How to Scrape YouTube (Python, Selenium & More)
- A plain requests.get on a YouTube watch page returns HTTP 200 with ~1.4 MB of HTML. The video data sits inside a JSON blob called ytInitialPlayerResponse, so I pull that blob with a regex and json.loads. The visible numbers are absent from the HTML tags.
- yt-dlp is the fastest path to one video's metadata in Python. One extract_info call returned title, channel, duration, view count, like count, and upload date for me.
- BeautifulSoup alone cannot read view counts or durations, because YouTube renders them with JavaScript. You either parse the embedded JSON or drive a browser with Selenium.
- The official YouTube Data API gives 10,000 quota units per day and 100 search calls. Past that, our YouTube scraper API that returns parsed JSON is the lower-maintenance route.
I tested three ways to scrape YouTube this week to see which ones actually return usable data in 2026. I started with the request everyone tries first, a plain requests.get on the watch page. The surprise is that it returns HTTP 200 and 1.4 MB of HTML, yet BeautifulSoup finds almost nothing useful in it. The view count, the duration, and the channel do not sit in normal HTML tags. To scrape YouTube data, you read the JSON the page already ships with.
This guide is the Python code I ran against live YouTube targets, the data each method returned, and the point where each one breaks. Every snippet below is code I executed in June 2026, using the video dQw4w9WgXcQ, so the numbers are real.
What data can you scrape from YouTube?
You can scrape a YouTube video’s title, channel, duration, view count, like count, upload date, keywords, and description, plus search results, comments, and captions from the surrounding pages. All of it is data YouTube already serves to a logged-out browser.
Here is what each YouTube page exposes to a scraper, and which method reaches it cleanly:
| Data point | Where it lives | Easiest method |
|---|---|---|
| Video title | <title>, og:title meta, and JSON | requests + parse |
| View count | ytInitialPlayerResponse JSON | parse JSON or yt-dlp |
| Duration (seconds) | ytInitialPlayerResponse JSON | parse JSON or yt-dlp |
| Channel and channel ID | videoDetails JSON | parse JSON or yt-dlp |
| Like count, upload date | player + page JSON | yt-dlp |
| Video keywords / tags | videoDetails.keywords | parse JSON or yt-dlp |
| Search result titles | /results page ytInitialData | parse JSON |
| Comments | /youtubei/ continuation calls | hidden API or scraper API |
| Captions / subtitles | timed-text track | yt-dlp or caption scraper |
The pattern across every row is the same: the visible numbers you see on a video page are rendered by JavaScript, so they are not in the static HTML as tags. They are packed into two big JSON objects the page ships with, ytInitialPlayerResponse for the video itself and ytInitialData for everything around it. Once you know that, scraping YouTube becomes a JSON-parsing job, which is where the Python starts.
How do you scrape YouTube with Python?
The fastest way to scrape YouTube with Python is to request the watch page, pull the ytInitialPlayerResponse JSON blob out of the HTML with a regex, and load it with json.loads. This is how you scrape data from YouTube without the official API: the data is already in the page, packed into a JSON object that the browser reads.
Install the three packages first:
pip install requests beautifulsoup4 lxml
Then request the page and confirm what comes back. This is the part that confuses people, so I print the status and the byte count:
import requests
from bs4 import BeautifulSoup
URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/124.0 Safari/537.36")
r = requests.get(URL, headers={"User-Agent": UA,
"Accept-Language": "en-US,en;q=0.9"}, timeout=20)
print(r.status_code) # -> 200
print(len(r.text)) # -> ~1,446,801
soup = BeautifulSoup(r.text, "lxml")
print(soup.title.string) # -> "...Never Gonna Give You Up... - YouTube"
og = soup.find("meta", property="og:title")
print(og["content"] if og else None) # -> the clean video title
When I ran this, the request returned 200 and 1,446,801 characters of HTML, roughly 1.4 MB (len(r.text) counts characters, not bytes). BeautifulSoup read the title fine, because the title and the Open Graph meta tags are baked into the static markup. The view count and duration are not, which is why the next step matters.
To get the numbers, parse the JSON. The watch page assigns a variable called ytInitialPlayerResponse, and inside it videoDetails holds the fields you actually want:
import re, json
m = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;\s*(?:var|</script>)", r.text)
data = json.loads(m.group(1))
vd = data["videoDetails"]
print(vd["title"]) # Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)
print(vd["author"]) # Rick Astley
print(vd["channelId"]) # UCuAXFkgsw1L7xaCfnd5JJOw
print(vd["lengthSeconds"]) # 213
print(vd["viewCount"]) # 1783376226
print(vd["keywords"][:5]) # ['rick astley', 'Never Gonna Give You Up', 'nggyu', ...]
That returned a duration of 213 seconds and a view count of 1,783,376,226 in my run, with the channel ID and the video keywords alongside. This is the core no-API scrape: one HTTP request, one regex, one json.loads. It works because videoDetails is the same object YouTube’s own front end reads to render the page.
The fragile part is the regex. YouTube changes the surrounding markup periodically, and when the delimiter after the JSON object shifts, the pattern misses and m is None. That is the maintenance cost of parsing the page yourself, and it is the reason the next method exists.
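One way to reduce that dependence on the trailing delimiter is to locate the assignment and let the JSON decoder decide where the object ends. The sketch below is a variation on the approach above, not the code from my run: extract_json_var is a hypothetical helper name, and the sample string is a made-up stand-in for a real watch page.

```python
import json

def extract_json_var(html, name):
    """Return the JSON object assigned to `name` in the page, or None."""
    marker = html.find(name)
    while marker != -1:
        brace = html.find("{", marker)
        if brace == -1:
            return None
        # Accept only "name = {"; skip other mentions of the variable name.
        between = html[marker + len(name):brace].strip()
        if between == "=":
            try:
                # raw_decode parses one JSON value starting at `brace` and
                # ignores whatever follows it, so the delimiter does not matter.
                obj, _end = json.JSONDecoder().raw_decode(html, brace)
                return obj
            except json.JSONDecodeError:
                pass
        marker = html.find(name, marker + 1)
    return None

# Made-up sample: note the "}" and ";" inside the title string.
sample = ('<script>var ytInitialPlayerResponse = {"videoDetails": '
          '{"title": "a;b}", "lengthSeconds": "213"}};var meta = {};</script>')

data = extract_json_var(sample, "ytInitialPlayerResponse")
print(data["videoDetails"]["lengthSeconds"])
```

Because the decoder tracks strings and nesting itself, braces or semicolons inside a title do not cut the match short. The function returns None instead of raising when the variable is absent, which makes a layout change easier to detect. It assumes the page still assigns the object with a plain `name = {` form.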
How do you scrape YouTube video metadata with yt-dlp?
You scrape YouTube video metadata with yt-dlp by calling extract_info with download=False, which returns a Python dictionary of the video’s title, channel, duration, view count, like count, and upload date without downloading the video file. yt-dlp does the JSON parsing internally, so it survives YouTube’s layout changes better than a hand-written regex.
Install it, then read one video:
pip install yt-dlp
from yt_dlp import YoutubeDL

URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
opts = {"quiet": True, "skip_download": True, "noplaylist": True}

# download=False asks for metadata only; no media file is fetched
with YoutubeDL(opts) as ydl:
    info = ydl.extract_info(URL, download=False)

for field in ["title", "channel", "channel_id", "duration",
              "view_count", "like_count", "upload_date"]:
    print(field, "=", info.get(field))

# "tags" can be missing or None, so fall back to an empty list
print("tags:", (info.get("tags") or [])[:5])
My run returned these values:
| Field | Value returned |
|---|---|
| title | Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster) |
| channel | Rick Astley |
| channel_id | UCuAXFkgsw1L7xaCfnd5JJOw |
| duration | 213 |
| view_count | 1783377080 |
| like_count | 19159219 |
| upload_date | 20091025 |
yt-dlp is the gold standard for single-video scraping in Python because it tracks YouTube’s changes for you. The project ships updates often, and the maintainers document the extraction internals on their wiki. One honest caveat from my run: yt-dlp now prints a warning that it wants a JavaScript runtime (Deno) installed, and without one some download formats are skipped. For metadata extraction the warning did not block the result, but it signals that YouTube is making pure-Python extraction harder over time.
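The values yt-dlp returns are easier to work with once they are typed: upload_date arrives as a YYYYMMDD string and duration as a number of seconds. A minimal sketch of that conversion follows. The normalize function and the output key names are my own illustration, and the sample dictionary reuses the values from my run rather than calling yt-dlp.

```python
from datetime import datetime

def normalize(info):
    """Convert a few yt-dlp fields into friendlier Python types."""
    raw_date = info.get("upload_date")   # "YYYYMMDD" string, may be missing
    seconds = info.get("duration")       # seconds, may be missing

    upload_date = None
    if raw_date:
        upload_date = datetime.strptime(raw_date, "%Y%m%d").date()

    duration_mmss = None
    if seconds is not None:
        minutes, secs = divmod(int(seconds), 60)
        duration_mmss = f"{minutes}:{secs:02d}"

    return {
        "title": info.get("title"),
        "upload_date": upload_date,
        "duration_mmss": duration_mmss,
        "view_count": info.get("view_count"),
    }

sample = {
    "title": "Rick Astley - Never Gonna Give You Up (Official Video) (4K Remaster)",
    "upload_date": "20091025",
    "duration": 213,
    "view_count": 1783377080,
}

row = normalize(sample)
print(row["upload_date"].isoformat(), row["duration_mmss"])  # 2009-10-25 3:33
```

Using info.get throughout means a missing field becomes None instead of raising, which matters when you loop over many videos and some of them lack a field.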
yt-dlp is built for known video and playlist URLs. To turn a search term into a list of videos, you go back to parsing ytInitialData, which is the next piece.
How do you scrape YouTube search results?
You scrape YouTube search results by requesting https://www.youtube.com/results?search_query=YOUR+TERMS and parsing the ytInitialData JSON object embedded in that page, which contains the list of video renderers with their IDs and titles. The structure is deeper than the player response, so you walk down to the search results renderer.
import requests, re, json

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/124.0 Safari/537.36")
URL = "https://www.youtube.com/results?search_query=python+tutorial"

r = requests.get(URL, headers={"User-Agent": UA}, timeout=20)
m = re.search(r"var ytInitialData\s*=\s*(\{.+?\})\s*;\s*</script>", r.text)
data = json.loads(m.group(1))

# Walk down to the list of result sections
sections = (data["contents"]["twoColumnSearchResultsRenderer"]
            ["primaryContents"]["sectionListRenderer"]["contents"])

results = []
for sec in sections:
    for item in sec.get("itemSectionRenderer", {}).get("contents", []):
        vr = item.get("videoRenderer")  # other renderer types are skipped
        if vr:
            title = "".join(run["text"] for run in vr["title"]["runs"])
            results.append((vr["videoId"], title))

print(len(results), "videos")
for vid, title in results[:5]:
    print(vid, "-", title)
This returned 14 videos on the first results page in my run, with their video IDs and titles. From each ID you can build a watch URL and feed it back into the metadata scraper above, which is how you turn one search term into a dataset of video details.
Two things break this method. First, the same regex fragility as the player response: a layout change moves the JSON. Second, YouTube paginates search with continuation tokens fetched from its internal /youtubei/ endpoint, so the first request only gives you the first page. The /results and /youtubei/ paths are both disallowed in YouTube’s robots.txt, which is worth knowing before you scale a search crawl. If you need full result sets across many queries, a dedicated YouTube search scraper handles the pagination for you.
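If the nesting above the results changes, a hard-coded key path fails with a KeyError. A more tolerant option, sketched here as an assumption rather than something I ran against live pages, is to search the parsed object recursively for every videoRenderer and build watch URLs from the IDs. Both function names are hypothetical, and the sample dictionary is a made-up miniature of ytInitialData with a placeholder second ID.

```python
def iter_video_renderers(node):
    """Yield every dict stored under a "videoRenderer" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoRenderer" and isinstance(value, dict):
                yield value
            else:
                yield from iter_video_renderers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_renderers(item)

def to_watch_urls(data):
    """Return de-duplicated watch URLs in the order the videos appear."""
    seen, urls = set(), []
    for vr in iter_video_renderers(data):
        vid = vr.get("videoId")
        if vid and vid not in seen:
            seen.add(vid)
            urls.append(f"https://www.youtube.com/watch?v={vid}")
    return urls

# Made-up structure; the real object is far larger.
sample = {
    "contents": {
        "anyWrapper": {
            "contents": [
                {"videoRenderer": {"videoId": "dQw4w9WgXcQ"}},
                {"adRenderer": {"text": "ignored"}},
                {"shelf": [{"videoRenderer": {"videoId": "PLACEHOLDER1"}}]},
                {"videoRenderer": {"videoId": "dQw4w9WgXcQ"}},  # duplicate
            ]
        }
    }
}

print(to_watch_urls(sample))
```

The tradeoff is precision: a recursive search also picks up videos nested inside shelves or other containers, so filter afterwards if you only want the main result list.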
Can you scrape YouTube with BeautifulSoup alone?
You cannot scrape YouTube view counts, durations, or video lists with BeautifulSoup alone, because BeautifulSoup only parses the static HTML and those fields are rendered by JavaScript after the page loads. BeautifulSoup is still useful on YouTube, just for a narrow set of fields.
Here is what BeautifulSoup can and cannot read directly from the watch page HTML:
| Field | BeautifulSoup tag selector works? | Why |
|---|---|---|
| Video title | Yes | In <title> and og:title meta |
| Description (short) | Partial | og:description meta only |
| Thumbnail URL | Yes | og:image meta |
| View count | No | Rendered from JSON by JavaScript |
| Duration | No | Rendered from JSON by JavaScript |
| Comments | No | Loaded by separate API calls |
So BeautifulSoup parses the document, but the document does not contain the numbers as tags. The fix is the approach this whole guide uses: locate the JSON blob with a regex, hand the matched string to json.loads, and read the fields from the dictionary. BeautifulSoup is the parser for the meta tags, and json.loads is the parser for everything dynamic.
When a page genuinely needs JavaScript to execute before the data exists, the alternative is to render it in a real browser, which is what Selenium does.
When should you use Selenium for YouTube scraping?
You should use Selenium for YouTube scraping when you need data that only appears after scrolling or clicking, such as loading more comments, expanding a description, or paging through a channel’s videos. Selenium drives a real Chrome instance, so the JavaScript runs and the rendered values become readable.
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
driver.implicitly_wait(10)
title = driver.find_element(By.CSS_SELECTOR, "h1 yt-formatted-string").text
print(title)
driver.quit()
Selenium reads the rendered page, so the title element resolves once Chrome has executed YouTube’s scripts. The cost is speed and overhead. A browser session uses far more memory and time than an HTTP request, and YouTube’s consent and bot-check screens show up more readily for automated browsers, so you often add waits and a logged-out consent click. For one-off pulls of dynamic data, Selenium earns its place. For volume, the JSON-parsing and yt-dlp routes are faster, and the official API or a managed scraper avoids the browser fleet entirely.
Before scaling any of these methods, the quota and blocking limits decide which one is sustainable.
What are the limits, and how do you avoid getting blocked?
The hard limits on YouTube scraping are the official API quota and YouTube’s anti-bot blocking on the unauthenticated pages. The YouTube Data API v3 gives each project 10,000 quota units per day plus a separate allowance of 100 search.list calls, per Google’s getting-started documentation. A videos.list read costs 1 unit and a search.list call costs 100 units, so a search-heavy job burns the daily quota far faster than a metadata job.
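To make that arithmetic concrete, here is a small estimator that takes the unit costs quoted above at face value. It is a simplification: it prices everything against the 10,000-unit pool, ignores the separate search allowance, and the job sizes are invented for illustration.

```python
DAILY_UNITS = 10_000                            # daily quota quoted above
COST = {"videos.list": 1, "search.list": 100}   # unit costs quoted above

def units_needed(calls):
    """Total quota units for a dict of {method: number_of_calls}."""
    return sum(COST[method] * count for method, count in calls.items())

def days_needed(calls, daily_units=DAILY_UNITS):
    """Whole days required to fit the job inside the daily quota."""
    return -(-units_needed(calls) // daily_units)  # ceiling division

metadata_job = {"videos.list": 5_000}
search_job = {"search.list": 150, "videos.list": 5_000}

print(units_needed(metadata_job), days_needed(metadata_job))  # 5000 1
print(units_needed(search_job), days_needed(search_job))      # 20000 2
```

Adding 150 searches to the same metadata job quadruples the unit cost, which is the imbalance the paragraph above describes.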
Scraping the pages directly has no quota, but it has a different ceiling: blocking. From my own machine the requests above returned clean 200s. From a cloud server, YouTube starts answering automated traffic with consent interstitials and bot challenges. These are the levers that keep page scraping working, in rough order of impact:
- Slow the request rate. A steady, human-like cadence with a few seconds between requests survives far longer than parallel bursts.
- Use residential IPs for volume. Datacenter ranges get challenged quickly. Residential proxies present as ordinary home connections.
- Set a real User-Agent and Accept-Language. A missing or empty User-Agent makes a borderline request worse.
- Cache video IDs. The cheapest request is the one you skip. Store IDs and only refetch what changed.
- Respect robots.txt and the Terms of Service. YouTube disallows /results, /youtubei/, and /watch_ajax in its robots.txt, and its Terms of Service restrict automated access. Whether public-page scraping is lawful is a separate question I cover in "Is scraping YouTube legal?".
On the legal point, US courts have read the Computer Fraud and Abuse Act narrowly for public data. In hiQ Labs v. LinkedIn, the Ninth Circuit affirmed in 2022 that scraping public website data likely does not violate the CFAA, a reading that followed the Supreme Court’s narrow CFAA interpretation in Van Buren v. United States. Public data and a site’s own terms are different matters, so read both before scaling.
The practical reading: build the Python yourself for a few hundred videos, and move to a managed service when the proxy rotation, consent handling, and parser upkeep cost more time than the data is worth.
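Two of the levers listed above, pacing and caching, fit in a few lines. The sketch below is illustrative only: PoliteFetcher is a hypothetical class name, the three-second default is an arbitrary starting point, and the fetch function is injected so the example runs without touching YouTube.

```python
import time

class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and an in-memory cache."""

    def __init__(self, fetch, min_interval=3.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch
        self.min_interval = min_interval
        self.sleep = sleep
        self.clock = clock
        self.cache = {}
        self._last = None  # time of the previous real request

    def get(self, video_id):
        if video_id in self.cache:          # cheapest request: the skipped one
            return self.cache[video_id]
        if self._last is not None:
            wait = self.min_interval - (self.clock() - self._last)
            if wait > 0:
                self.sleep(wait)            # keep a steady cadence
        self._last = self.clock()
        result = self.fetch(video_id)
        self.cache[video_id] = result
        return result

calls = []

def fake_fetch(video_id):
    """Stand-in for a real request; records which IDs were fetched."""
    calls.append(video_id)
    return {"id": video_id}

fetcher = PoliteFetcher(fake_fetch, min_interval=0.0)  # no delay for the demo
for vid in ["dQw4w9WgXcQ", "dQw4w9WgXcQ", "PLACEHOLDER1"]:
    fetcher.get(vid)

print(calls)  # ['dQw4w9WgXcQ', 'PLACEHOLDER1']
```

In real use you would pass a function that performs the request and persist the cache to disk, so a restarted job does not refetch what it already has.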
How do you scrape YouTube at scale without managing proxies?
You scrape YouTube at scale without managing proxies by sending a YouTube URL to a scraper API that returns parsed JSON, with the proxy rotation, consent handling, and retries handled on the server side. You send one request and get structured data, with no ytInitialData regex to maintain when YouTube shifts its markup.
This is the route I reach for once a project outgrows a single machine. The request shape is a single GET with the target URL and an API key:
curl "https://api.youtubescraperapi.com/api/v1/youtube/video?url=https://www.youtube.com/watch?v=dQw4w9WgXcQ&api_key=$API_KEY"
In Python, the same call returns the video fields parsed and ready, the same title, channel_name, duration_seconds, view_count, and keywords the JSON-parsing method pulled, without the regex maintenance:
import requests, os
resp = requests.get(
"https://api.youtubescraperapi.com/api/v1/youtube/video",
params={
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"api_key": os.environ["API_KEY"],
},
timeout=30,
)
video = resp.json()
print(video["title"], "-", video["view_count"], "views")
The same pattern covers the other YouTube surfaces by swapping the path: a YouTube video scraper for metadata, a comment scraper for comment threads, and a channel scraper for a creator’s full video list. You can start free and test the shape against your own targets on the youtubescraperapi.com sign-up.
For a one-off pull of a few hundred videos, the Python in this guide is enough, and yt-dlp covers most single-video needs. For continuous collection across thousands of videos, channels, and searches, offloading the blocking and the parser upkeep is usually the cheaper path once you price in your own time. If you want to compare the managed options first, I ranked them in the best YouTube scrapers for 2026.
A note on cleaning scraped YouTube text
Scraped YouTube titles, descriptions, and captions arrive as raw text that usually needs preprocessing before any analysis, because they carry emoji, hashtags, timestamps, and inconsistent casing. A common next step after the scrape is to tokenize the text into words and sentences and strip the noise, often with the NLTK library.
A minimal cleanup on a batch of scraped titles looks like this:
import re

def clean(text):
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"[#@]\w+", "", text)       # drop hashtags and handles
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text.lower()

titles = ["Learn Python in 10 Minutes! #python", "Full Course for Beginners"]
cleaned = [clean(t) for t in titles]
print(cleaned)
From there a word count, a sentence split, or a frequency analysis is straightforward, and the output feeds whatever model or report you are building. The scrape gets you the data, and the preprocessing makes it usable. That two-step shape, extract then clean, is the whole reason to capture structured fields like title and view_count cleanly at the source. Clean fields at the source save hours of downstream HTML cleanup.
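As a rough sketch of the frequency analysis mentioned above, the example below counts words across cleaned titles. It uses only the standard library instead of NLTK, so the tokenizer is a simple regex and the stopword list is a tiny illustrative set. The third title is invented so the counts have something to show.

```python
import re
from collections import Counter

STOPWORDS = {"in", "for", "the", "a", "to", "of"}  # tiny illustrative list

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_words(texts, n=3):
    """Return the n most common non-stopword tokens across all texts."""
    counts = Counter(
        word
        for text in texts
        for word in tokenize(text)
        if word not in STOPWORDS
    )
    return counts.most_common(n)

cleaned = [
    "learn python in 10 minutes!",
    "full course for beginners",
    "python course for data analysis",   # invented example title
]

print(top_words(cleaned, 2))
```

Here "python" and "course" each appear twice and every other word once. For real text, a proper tokenizer and a full stopword list handle contractions and punctuation better than this regex.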
FAQ
Can you scrape YouTube without the API?
Yes. You can scrape public YouTube pages without the YouTube Data API by requesting the watch or results page with Python and parsing the embedded ytInitialPlayerResponse or ytInitialData JSON. yt-dlp does this for you. The tradeoff is maintenance: YouTube changes its page structure periodically, so a hand-rolled parser breaks more often than the API.
Why does BeautifulSoup return empty results on YouTube?
BeautifulSoup returns empty results for view counts, durations, and video lists because YouTube renders those fields with JavaScript after the page loads. The raw HTML BeautifulSoup parses only contains the <title>, Open Graph meta tags, and a large JSON object. Read the JSON object, or render the page with Selenium first.
Is yt-dlp legal to use?
yt-dlp is an open-source tool that is legal to install and run. How you use it is the question that matters. Scraping public video metadata is treated differently from downloading copyrighted video, and YouTube's Terms of Service restrict automated access. See my guide on whether scraping YouTube is legal for the detail.
What is the YouTube Data API quota?
The YouTube Data API v3 gives every project 10,000 quota units per day plus a separate allowance of 100 search.list calls, per Google's getting-started docs. A videos.list read costs 1 unit and a search costs 100, so search-heavy jobs exhaust the quota fastest.
Do you need proxies to scrape YouTube?
For a few videos from your own machine, no. For sustained scraping from a cloud server, YouTube starts returning consent walls and bot checks, so residential proxies and slower request rates become necessary. A scraper API bundles the proxy rotation so you send a URL and get JSON back.