Non-Robot access to IMDB

root

2022-11-22 23:27:01 UTC

As you must have noticed by now, IMDB isn't what it used to be.

For years I have been using (a highly customized version of) W3m
for all my web use. I have a dedicated key to jump to IMDB
for information search. The modern IMDB treats me as a robot
and offers a captcha to which I can't respond. My specific
question here: is there some way of accessing the IMDB database
without a javascript browser?

A more general question: how does the site recognize that
W3m does not handle javascript, yet accept access from,
say, Chrome or Firefox even when javascript is turned off?

How feasible would it be to have a wrapper surrounding
W3m or Lynx that makes the site think javascript is
there but turned off?

The user-agent string does not trick any site that
demands javascript.

Thanks for your thoughts.

Eli the Bearded

2022-11-23 00:03:45 UTC

For years I have been using (a highly customized version of) W3m for
all my web use. I have a dedicated key to jump to IMDB for information
search. The modern IMDB treats me as a robot and offers a captcha to
which I can't respond. My specific question here: is there some way of
accessing the IMDB database without a javascript browser?

It appears to be user-agent sniffing. I get a terse 403 from my
curl-like browser emulator when it imitates any of a bunch of text
browsers, but an emulation of, say, Firefox 1.5 works fine.

You may also like to just skip the web site and get the free data:

https://datasets.imdbws.com/

I downloaded it ~10 years ago when it was still FTP, but I didn't find
it that useful. My needs may be different than yours.

A more general question is how does the site recognize that W3m does
not handle javascript but does accept access from, say, Chrome or
Firefox even when javascript is turned off.

The content it tried to shove down to that ancient Firefox was not going
to work.

The user-agent string does not trick any site that
demands javascript.

curl -H 'User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4' https://www.imdb.com/title/tt0082085/

Try that UA.

Elijah
------
whose bget tool also sends all the original Accept* headers
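
One way to check the user-agent sniffing theory is to request the same page with several User-Agent strings and compare the status codes. A minimal shell sketch follows. The `w3m/0.5.3` and `Lynx/2.8.9rel.1` strings are illustrative guesses at what those browsers send, not values taken from this thread, and the `run` argument and the helper names are made up for the example.

```shell
#!/bin/sh
# Page used in the example above.
URL='https://www.imdb.com/title/tt0082085/'

# Map an HTTP status code to a short verdict.
verdict() {
    case "$1" in
        2??) echo "accepted" ;;
        403) echo "refused" ;;
        *)   echo "other" ;;
    esac
}

# Fetch URL with the given User-Agent and print only the HTTP status code.
status_for_ua() {
    curl -s -o /dev/null -w '%{http_code}' -H "User-Agent: $1" "$URL"
}

# Only touch the network when called with the (hypothetical) "run" argument.
if [ "${1:-}" = "run" ]; then
    while IFS= read -r ua; do
        code=$(status_for_ua "$ua")
        printf '%s  %-8s  %s\n' "$code" "$(verdict "$code")" "$ua"
    done <<'EOF'
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4
w3m/0.5.3
Lynx/2.8.9rel.1
EOF
fi
```

If only the text-browser strings come back as refused, that supports the sniffing explanation; the result may well change whenever the site changes its rules.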

root

2022-11-23 01:11:30 UTC

Thanks, that is very close to what I want, but not quite. I changed your example to:
curl -H 'User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4' "https://www.imdb.com/find?q=/top+guni&ref_=nv_sr_sm"|/root/w3m/w3m
because I want to search. That gets me as far as I can get using w3m directly, but I am stopped at the
title entry. This is a problem I solved a long time ago, when I did build a robot. If it
hasn't changed since then, I can download the page source and pull the tt..... stuff for what I want from it.

Thanks for the trick. I had tried something similar with wget, with no luck.
I didn't think to try curl.

Thanks again.

root

2022-11-23 01:22:34 UTC

Yes, that can work. The search I tried (above) yields an HTML file which
contains a collection of /title/tt...... entries with sufficient information
to choose the (Top Gun) version I want. This is nothing like tapping
a key. For now I have put up a second monitor on my right, running
off a Raspberry Pi which does connect to IMDB. If I find a better
solution I will post it here.
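
Pulling the title IDs out of the downloaded search page can be sketched in a few lines of shell. This assumes the result links still have the form `/title/tt` followed by digits, as described above, and that `grep` supports `-o` (GNU, BSD and BusyBox versions do). The name `extract_title_ids` is made up for this example.

```shell
#!/bin/sh
# Read HTML on stdin and print each distinct IMDB title ID once,
# in order of first appearance, which keeps the ranking of the results.
extract_title_ids() {
    grep -o '/title/tt[0-9][0-9]*' |
        sed 's|^/title/||' |
        awk '!seen[$0]++'
}
```

A possible use is to pipe the output of the curl search command above into `extract_title_ids` and then fetch `https://www.imdb.com/title/<id>/` for the ID you pick. The IDs alone do not say which film is which, so some surrounding text would still have to be kept to choose between versions.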

Eli the Bearded

2022-11-23 23:17:14 UTC

Post by root
curl -H 'User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4' "https://www.imdb.com/find?q=/top+guni&ref_=nv_sr_sm"|/root/w3m/w3m
because I want to search.

/usr/bin/lynx \
-useragent='Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4' \
-image_links -noreferer \
-accept_all_cookies -cookie_save_file=/dev/null \
"https://www.imdb.com/find?q=/top+guni&ref_=nv_sr_sm"

Worked for me. Not sure how to switch UA in w3m.
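
For w3m, the user agent can apparently be set for a single run with `-o user_agent=...`, which overrides the `user_agent` configuration option. The sketch below assumes a stock w3m on the PATH (a heavily customized build may behave differently). The `imdb_find_url` helper and the script layout are invented for illustration.

```shell
#!/bin/sh
# User-Agent string from earlier in the thread.
UA='Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'

# Build an IMDB search URL, joining the words with '+'.
# The subshell keeps the IFS change local. Words are not percent-encoded.
imdb_find_url() {
    ( IFS='+'; printf 'https://www.imdb.com/find?q=%s\n' "$*" )
}

# With search words on the command line, open the results in w3m.
if [ "$#" -gt 0 ]; then
    exec w3m -o user_agent="$UA" "$(imdb_find_url "$@")"
fi
```

Saved under a name such as `imdbfind`, running `imdbfind top gun` would open the search results. When piping curl output into w3m instead, `w3m -T text/html` tells it to treat standard input as HTML.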

Another thing that works, for search only, is to find the movie at Wikipedia
and then get the IMDB link from the External links section. That's been
quite reliable for me. Duckduckgo and Wikipedia are fine via lynx. I use
a script for command line web searches with lynx:

$ cat ~/bin/ddg
#!/usr/bin/perl
# Mon Dec 18 19:56:20 EST 2017
# quickly search duck duck go from command line

use strict;
use warnings;

my $query = join('+',@ARGV);
my $base = 'https://duckduckgo.com/lite/';

my $url = $base;

if(defined($query) and $query ne '') {
    $url .= "?q=$query";
}

exec('lynx',
     '-image_links',
     '-noreferer',
     '-accept_all_cookies',
     '--cookie_save_file=/dev/null',
     $url);
__END__
$

Adding additional options to that should be obvious.

Elijah
------
imdb was second link for 'ddg top gun'
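
The ddg script puts the search words into the URL unescaped, so a word containing `&`, `#` or `%` would change the query. A rough shell sketch of one workaround is below: it percent-encodes every byte of every word, which is legal in a URL but makes it hard to read. The function names are hypothetical, and the approach assumes `od`, `tr` and `sed` behave as on a typical Linux or BSD system.

```shell
#!/bin/sh
# Percent-encode every byte of the argument (letters included).
urlencode() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \t\n' | sed 's/\(..\)/%\1/g'
}

# Build a DuckDuckGo Lite URL from the search words, joined with '+'.
ddg_url() {
    query=''
    for word in "$@"; do
        query="${query:+$query+}$(urlencode "$word")"
    done
    printf 'https://duckduckgo.com/lite/%s\n' "${query:+?q=$query}"
}
```

The resulting URL could then be handed to lynx with the same options the Perl script uses.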