How can I extract a YouTube URL from its JavaScript?

Keywords: extract, YouTube, URL, JavaScript      Updated: 2023-09-26

Hello, I would like to know whether my script is any good. I want the full URL as the output of my Perl script:

#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\);yt\.preload\.start\(\"(.*?)\"\);</script>,sgi){
    print "First:$2\n\n";
    print "Second:$3\n";
}

I really appreciate the DOM features built into Mojo::UserAgent. You can pull out the script you want (too bad YouTube does not attach ids to them):

use v5.10;
use Mojo::UserAgent;
my $script = Mojo::UserAgent->new->
    get("http://www.youtube.com/watch?v=Ko0c4QT5aVA" )->
    res->
    dom->
    find('script')->
    [1];
# Capture the string argument passed to yt.preload.start(
my( $yt_preload_start ) = $script =~ m|;\s*yt\Q.preload.start(\E\s*"(.*?)"|;
# Drop the JavaScript backslash escapes, then restore the ampersands
$yt_preload_start =~ s{\\(.)}{$1}g;
$yt_preload_start =~ s{u0026}{&}g;
say "URL is $yt_preload_start";

I would prefer to use a JavaScript parser to extract the argument to yt.preload.start, but I have no experience with those.
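
Short of a full JavaScript parser, one option is to treat the captured argument as a JSON string, since the escapes seen in this thread (\/ and \u0026) are also valid JSON escapes. The following is only a minimal sketch under that assumption, using the core JSON::PP module. The sample script text and the helper name decode_js_string are made up for illustration, and a literal that uses JavaScript-only escapes (such as \x26) would fail to decode this way.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: decode a double-quoted JavaScript string literal
# by treating it as a JSON string. Assumes the literal only uses escapes
# that JSON also allows (such as \/ and \uXXXX).
sub decode_js_string {
    my ($literal) = @_;    # includes the surrounding double quotes
    return JSON::PP->new->allow_nonref->decode($literal);
}

# Made-up sample of what the script text might look like.
my $sample = q{;yt.preload.start("http:\/\/example.com\/videoplayback?id=1\u0026itag=34");};

# Capture the literal, quotes included; (?:[^"\\]|\\.)* steps over escaped characters.
my ($literal) = $sample =~ m{yt\.preload\.start\(\s*("(?:[^"\\]|\\.)*")};

my $url = decode_js_string($literal);
print "$url\n";
```

Letting a decoder handle the escapes avoids a chain of hand-written substitutions, at the cost of rejecting anything that is not also valid JSON.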

Is this better?

#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $get = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
if ($get =~ m,(.*?)http:(.*?)\"\);yt\.preload\.start\(\"(.*?)\"\);</script>,sgi){
    my $out = $3;
    $out =~ s@\\/@/@g;
    $out =~ s@\\u0026@&@g;
    print "$out\n";
}

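From your question and code, it is not clear to me what you are trying to extract from the HTML. In particular, why do you capture everything before the main part of the match and then ignore the capture?

My best guess is that you want all URLs that appear as arguments to the yt.preload.start JavaScript function. You can do that like this: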

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1');
my $html = $ua->get('http://www.youtube.com/watch?v=Ko0c4QT5aVA')->content;
# In list context, /g returns the capture from every match
my @urls = $html =~ /\Qyt.preload.start("\E(http[^"]+)/gi;
print map uri_unescape($_)."\n", @urls;
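
To see what this match returns without fetching anything, here is a small offline sketch. The page fragment is made up (the real page content is not reproduced here); it only illustrates how \Q...\E and a list-context /g match behave on input of roughly this shape.

```perl
use strict;
use warnings;

# Made-up page fragment containing two calls to yt.preload.start.
my $html = <<'HTML';
<script>yt.preload.start("http:\/\/example.com\/a?x=1\u0026y=2");</script>
<script>yt.preload.start("http:\/\/example.com\/b");</script>
HTML

# \Q...\E quotes the regex metacharacters (the dots and the parenthesis);
# in list context, /g returns the capture from every match.
my @urls = $html =~ /\Qyt.preload.start("\E(http[^"]+)/gi;

# The captured strings still carry the JavaScript escapes at this point.
print "$_\n" for @urls;
```

Each element of @urls is the raw text between the opening quote and the next double quote, so any backslash escapes in the page are still present in the result.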

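Edit:

This solution leaves the URLs with the JavaScript Unicode escape "\u0026" (the same as Perl "\N{U+0026}") in place of the ampersand "&". The strings also start with "http:\/\/". Correcting these is simple. One way is to replace the final map with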

print map {
  my $ss = uri_unescape $_;
  # Turn \u0026 into & and \/ into /
  $ss =~ s/\\u0026/&/g, $ss =~ s|\\/|/|g;
  "$ss\n";    # newline keeps one URL per line, as in the first version
} @urls;