Discussion: Filtering incoming HTML [slightly OT]
Mike Spencer
2018-01-25 06:03:36 UTC
Not Slackware-specific but I surmise Slackers might tilt more than
others toward The Hacker Nature and have good suggestions. And I'm
working with Slackware 14.2.

I would like a utility/feature such that HTML pages incoming to my web
browser can optionally be filtered/edited with a script, preferably
Perl because I've already done similar tasks with Perl.

1. Something opens the socket to the remote host, does IP, certs if
any, crypto if any, HTTP GET and whatever other details may arise
such as "chunked" or compression.

2. If the incoming data is text/html, passes it to my script.

3. Script edits the HTML, writes to stdout or other suitable place.

4. Script passes result to browser, possibly creating (or not) new HTTP
headers.

I've done this for individual, often-visited sites with:

Home page on localhost, containing link to foo.com

Link to foo.com actually points to localhost/cgi-bin/foo.pl

foo.pl opens socket on real foo.com, fetches "page" via HTTP, edits,
writes result to stdout (along with suitable HTTP headers.)

Browser gets result from the cgi-bin mechanism.
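
Stripped to its essentials, foo.pl amounts to something like the sketch
below (LWP::UserAgent stands in here for the hand-rolled socket code, and
foo.com and the edits are only placeholders):

    #!/usr/bin/perl
    # foo.pl -- sketch of the cgi-bin filter described above
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->get('http://foo.com/');        # placeholder URL
    my $html = $resp->is_success ? $resp->decoded_content : '';

    # edit the HTML before the browser ever sees it (placeholder edits)
    $html =~ s/<link[^>]+>//gi;
    $html =~ s/<script[^>]*>.*?<\/script>//sgi;

    # hand the result back through the cgi-bin mechanism
    print "Content-Type: text/html\r\n\r\n";
    print $html;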

Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).

I'd like to generalize that process so that if FILTER is ON, any link
clicked will get that treatment. If FILTER is OFF, browser will
get unadorned input as usual.

Where do I go to do this? Can Apache (on localhost) be made to
do step 1. above, pass results to a script? Do I have to find out
about writing "addons" or "extensions" for Firefox? Might there be a
secret API/feature in FF that would make this easy? Is there an
existing stand-alone utility that implements this?

This is essentially a man in the middle attack on myself. Someone
must have done it already.

Any pointers welcome, TIA, etc.
--
Mike Spencer Nova Scotia, Canada
Henrik Carlqvist
2018-01-25 06:38:56 UTC
Post by Mike Spencer
I would like a utility/feature such that HTML pages incoming to my web
browser can optionally be filtered/edited with a script, preferably Perl
because I've already done similar tasks with Perl.
This sounds like some kind of dynamic content from the web server to me.
The standard way to do this is LAMP (Linux, Apache, Mysql and PHP) but it
seems you have no need for Mysql in your application.
Post by Mike Spencer
Link to foo.com actually points to localhost/cgi-bin/foo.pl
Of course you could replace PHP with perl and you could also use CGI
scripts instead of the apache built in support for server generated pages
with php.
Post by Mike Spencer
Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).
Let Apache handle https, no need to handle HTTP(S) requests directly in
your script.
Post by Mike Spencer
Might there be a secret API/feature in FF that would make this easy?
To me it seems as if all you want to do should be done on the server side.

A pointer:

How to edit html uploaded to a server (if this is what you want to do):
https://stackoverflow.com/questions/30325905/edit-an-html-file-using-php

regards Henrik
--
The address in the header is only to prevent spam. My real address is:
hc351(at)poolhem.se Examples of addresses which go to spammers:
***@localhost ***@localhost
Mike Spencer
2018-01-25 07:48:12 UTC
Post by Henrik Carlqvist
Post by Mike Spencer
I would like a utility/feature such that HTML pages incoming to my web
browser can optionally be filtered/edited with a script, preferably Perl
because I've already done similar tasks with Perl.
This sounds like some kind of dynamic content from the web server to me.
The standard way to do this is LAMP (Linux, Apache, Mysql and PHP) but it
seems you have no need for Mysql in your application.
I don't think you're getting what I want to do.

As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Examples of filtering would be any one or combination of: elide some
or all <LINK... tags; elide some or all <META... tags; elide some or
all <SCRIPT... blocks; elide <STYLE... blocks; alter some
URLs in <A HREF=... tags; elide some or all <IMG... tags; elide or
convert some UTF-8 chars; and so on.

Perl regular expressions can do that. How do I get between the code
that does the HTTP[S] stuff and the rendering engine of the browser?

AFAIUI, either a proxy on localhost that lets me do that or some way
to get inside the Firefox (or other) browser's usual operation. But
AFAIUI may not be far enough, thus advice sought.

The Apache docs are intimidating. If I tell my browser that localhost
is an all-purpose proxy, will Apache let me edit everything it proxies?
How do I do a man in the middle attack on myself?

I don't know where to begin to get my own code (effectively) inside
Firefox.
Post by Henrik Carlqvist
Post by Mike Spencer
Link to foo.com actually points to localhost/cgi-bin/foo.pl
You're digressing here. That line is part of a description of how I
do the filtering via a cgi-bin script on localhost. That only works
for an <A... tag I've put on my home page or for a received page that
my script has already edited so that URLs originally pointing to
foo.com/whatever point instead to localhost/cgi-bin/foo&whatever. So
it only works for specific, pre-planned one-off cases.
Post by Henrik Carlqvist
Of course you could replace PHP with perl and you could also use CGI
scripts instead of the apache built in support for server generated pages
with php.
Post by Mike Spencer
Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).
Let Apache handle https, no need to handle HTTP(S) requests directly in
your script.
Post by Mike Spencer
Might there be a secret API/feature in FF that would make this easy?
To me it seems as if all you want to do should be done on the server side.
Huh. See above. AFAICT, you're not getting it.
It's not. When my browser tries to access
HTTP[S]://random.com/some-file, I want to edit some-file before my
browser tries to render it.

Tnx for the reply but I'm no further ahead yet.
Post by Henrik Carlqvist
https://stackoverflow.com/questions/30325905/edit-an-html-file-using-php
regards Henrik
--
Mike Spencer Nova Scotia, Canada
Rich
2018-01-25 11:40:21 UTC
Post by Mike Spencer
Examples of filtering would be any one or combination of: elide some
or all <LINK... tags; elide some or all <META... tags; elide some or
all <SCRIPT... blocks; elide <STYLE... blocks; alter some
URLs in <A HREF=... tags; elide some or all <IMG... tags; elide or
convert some UTF-8 chars; and so on.
Perl regular expressions can do that.
Not in general. Don't try to do this. You'll have a never ending set
of edge cases that you'll be chasing your tail trying to fix, and an
ever growing, more and more ugly set of regular expressions. Use a
real html parser.
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

And for what you want to do, a parser that returns a DOM tree you can
manipulate, and then serialize back to HTML, would make your life
*much* easier here.
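
For instance, with HTML::TreeBuilder from CPAN's HTML-Tree distribution
(a sketch; it reads the page on stdin just to stay self-contained):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = do { local $/; <STDIN> };           # the fetched page
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # drop whole elements instead of pattern-matching on tag text
    for my $tag (qw(script style link)) {
        $_->delete for $tree->look_down(_tag => $tag);
    }

    print $tree->as_HTML;   # serialize the modified tree back to HTML
    $tree->delete;          # free the tree's circular references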
Post by Mike Spencer
How do I get between the code that does the HTTP[S] stuff and the
rendering engine of the browser?
That is called a "proxy" and your browser already has built in support
for using one. Do some googling on http proxy.
Post by Mike Spencer
I don't know where to begin to get my own code (effectively) inside
Firefox.
That would be a Firefox extension.

But for what you are talking about doing, an external proxy will be by
far more capable and easier to put together.

And, it seems perl already has a module in cpan (did you do any
searching at all?) that sounds like it is most of the way towards what
you want to do:

http://search.cpan.org/~book/HTTP-Proxy-0.304/lib/HTTP/Proxy.pm
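
From its docs, a filtering proxy along those lines is roughly this (a
sketch; the port and the regex edit are arbitrary):

    use strict;
    use warnings;
    use HTTP::Proxy;
    use HTTP::Proxy::BodyFilter::simple;

    my $proxy = HTTP::Proxy->new( port => 8080 );

    # only touch text/html responses; everything else passes through
    $proxy->push_filter(
        mime     => 'text/html',
        response => HTTP::Proxy::BodyFilter::simple->new(
            sub {
                my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
                $$dataref =~ s/<link[^>]+>//gi;     # example edit
            }
        ),
    );

    $proxy->start;

Point the browser's HTTP proxy setting at localhost:8080 and every
text/html response goes through that sub.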
Zaphod Beeblebrox
2018-01-25 12:08:43 UTC
(snip)
Post by Mike Spencer
How do I get between the code that does the HTTP[S] stuff and the
rendering engine of the browser?
That is called a "proxy" and your browser already has built in support
for using one. Do some googling on http proxy.
Here is a hilarious example of how to set up a proxy:
http://www.ex-parrot.com/pete/upside-down-ternet.html
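
The trick on that page is Squid's url_rewrite_program hook: the helper
reads one request per line on stdin and prints either a replacement URL or
a blank line meaning "leave it alone". A Perl sketch of such a helper
(mangle.pl is a made-up local CGI):

    #!/usr/bin/perl
    # sketch of a classic Squid redirector (url_rewrite_program) helper
    use strict;
    use warnings;
    $| = 1;    # Squid needs unbuffered, line-by-line answers

    while (my $line = <STDIN>) {
        my ($url) = split ' ', $line;   # "URL client_ip/fqdn user method ..."
        if ($url =~ m{\.(?:jpe?g|gif|png)(?:\?|$)}i) {
            # hand images to a local CGI that rewrites them (made-up path)
            print "http://localhost/cgi-bin/mangle.pl?url=$url\n";
        }
        else {
            print "\n";                 # blank line = don't rewrite
        }
    }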
Mike Spencer
2018-01-26 08:09:25 UTC
Post by Zaphod Beeblebrox
(snip)
Post by Mike Spencer
How do I get between the code that does the HTTP[S] stuff and the
rendering engine of the browser?
That is called a "proxy" and your browser already has built in support
for using one. Do some googling on http proxy.
http://www.ex-parrot.com/pete/upside-down-ternet.html
Just lovely. He's using the Squid proxy to call a script to invert images?
I'll have to look more closely at how that works. Still have doubts
about HTTPS, though.

Tnx,
--
Mike Spencer Nova Scotia, Canada
Mike Spencer
2018-01-26 08:02:27 UTC
Thanks for the long reply, Rich. Comments interspersed as usual.
Executive summary: Not much further ahead yet.
Post by Rich
Post by Mike Spencer
Examples of filtering would be any one or combination of: elide some
or all <LINK... tags; elide some or all <META... tags; elide some or
all <SCRIPT... blocks; elide <STYLE... blocks; alter some
URLs in <A HREF=... tags; elide some or all <IMG... tags; elide or
convert some UTF-8 chars; and so on.
Perl regular expressions can do that.
Not in general. Don't try to do this. You'll have a never ending set
of edge cases that you'll be chasing your tail trying to fix, and an
ever growing, more and more ugly set of regular expressions.
In the sites I'm presently handling this way, such things as:

$all =~ s/<link[^>]+>//gi;
$all =~ s/<script[^>]*>(.*?)<\/script>//sgi;

work fine. I realize there are potential complications but...
Post by Rich
Use a real html parser.
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
Ha! Well and good. I agree that "You can't parse [X]HTML with
regex. Because HTML can't be parsed by regex." But I've no intention
of trying to do that. Mostly, just identify what I regard as crap and
elide it.

In addition to just tossing unwanted crap (see above), one of my
existing scripts even re-writes URLs in different ways for a
particular site I visit often. If I could get access to the incoming
doc between the HTTP[S] stuff and the rendering (or DOM analysis or
whatever) via a general proxy, I wouldn't have to fool with that. All
follow-on clicks would go back through the proxy without explicitly
making the URLs do so.
Post by Rich
And for what you want to do, a parser that returns a DOM tree you can
manipulate, and then serialize back to HTML, would make your life
*much* easier here.
Having a look at https://javascript.info/dom-nodes, I'm pretty sure
that's above my pay grade.
Post by Rich
Post by Mike Spencer
How do I get between the code that does the HTTP[S] stuff and the
rendering engine of the browser?
That is called a "proxy" and your browser already has built in support
for using one. Do some googling on http proxy.
BTDT. Like many subjects/keywords, there's a vast collection of
stuff out there explaining how to do *other* things with proxies --
load and bandwidth management, blocking domains or sites, cacheing.
No one I've found is doing my simple task which is similar to tearing
all the packaging off of a new widget you just bought. See "package
rage" elsewhere.
Post by Rich
Post by Mike Spencer
I don't know where to begin to get my own code (effectively) inside
Firefox.
That would be a Firefox extension.
Haven't looked much at how to write a FF extension. I did just order an
O'Reilly book on js. But, based on one tutorial, it looks messy.
Have to write RDF and XML files as well as js? Huh.
Post by Rich
But for what you are talking about doing, an external proxy will be by
far more capable and easier to put together.
I'd have thought so.
Post by Rich
And, it seems perl already has a module in cpan (did you do any
searching at all?)...
Yeah. See "vast collection of stuff", above.
Post by Rich
...that sounds like it is most of the way towards what you want to do:
http://search.cpan.org/~book/HTTP-Proxy-0.304/lib/HTTP/Proxy.pm
I'm looking at that. Old guy, not as quick a study as I used to be.
But AFAICT, that doesn't handle HTTPS. Separate, stand-alone proxy
programs apparently do HTTPS "transparently", which I take to mean
that they (intentionally) don't get access to the content but pass the
cert & crypto stuff and encrypted data straight through. It's the big
corporate sites that have the most crap I want to elide and those are
the ones that are going all HTTPS.

Digressing here, HTTP meant "Ask for a library book, librarian hands
it to you". HTTPS means keeping a database of corporate entities,
asking one of them for permission to get the book, then getting it in
a "plain brown wrapper". Great for banking but not so much for XKCD
or the WaPo. You can actually fetch an HTTP doc from the commandline
with telnet. HTTPS needs messy data architecture and protocols. So I
need to work around that somehow.


Tnx,
--
Mike Spencer Nova Scotia, Canada
Rich
2018-01-26 12:25:56 UTC
Post by Mike Spencer
Thanks for the long reply, Rich. Comments interspersed as usual.
Executive summary: Not much further ahead yet.
Post by Rich
Post by Mike Spencer
Examples of filtering would be any one or combination of: elide some
or all <LINK... tags; elide some or all <META... tags; elide some or
all <SCRIPT... blocks; elide <STYLE... blocks; alter some
URLs in <A HREF=... tags; elide some or all <IMG... tags; elide or
convert some UTF-8 chars; and so on.
Perl regular expressions can do that.
Not in general. Don't try to do this. You'll have a never ending set
of edge cases that you'll be chasing your tail trying to fix, and an
ever growing, more and more ugly set of regular expressions.
$all =~ s/<link[^>]+>//gi;
$all =~ s/<script[^>]*>(.*?)<\/script>//sgi;
work fine. I realize there are potential complications but...
Yep, even those will not work for all possible html inputs that you
might find on the web.
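
For example, a '>' inside a quoted attribute value is perfectly legal HTML
and trips up the first one (a made-up but plausible tag):

    <link rel="alternate" type="application/rss+xml"
          title="Posts > Comments" href="/feed.xml">

The s/<link[^>]+>//gi stops at the '>' inside title="...", so the tail of
the tag is left behind as visible junk in the page. Script blocks whose
code contains the literal string </script> break the second one the same
way.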
Post by Mike Spencer
Post by Rich
Use a real html parser.
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
Ha! Well and good. I agree that "You can't parse [X]HTML with
regex. Because HTML can't be parsed by regex." But I've no intention
of trying to do that. Mostly, just identify what I regard as crap
and elide it.
But, to 'identify what [you] regard as crap' you have to, in some
manner, /parse/ the HTML. So you are /parsing/ the HTML, just not in a
full sense of an HTML parser.
Post by Mike Spencer
In addition to just tossing unwanted crap (see above), one of my
existing scripts even re-writes URLs in different ways for a
particular site I visit often.
With a proper parser, you can then walk the DOM tree, find all anchor
nodes (no matter how the author specified them in the html), extract
just the href= attribute (no matter how the author wrote them [1]) and
receive back a simple string that is the URL. Then after your filter
does whatever it does to 'rewrite' you can simply replace the old URL
with your new URL and you are done (beyond serializing back to HTML).
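
Roughly, with HTML::TreeBuilder (a sketch; rewrite_url() is a placeholder
for whatever transformation your filter wants, and the page is read from
stdin just to keep it self-contained):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    sub rewrite_url {                    # placeholder transformation
        my ($u) = @_;
        $u =~ s{^https://}{http://}i;
        return $u;
    }

    my $html = do { local $/; <STDIN> };
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # walk every anchor, however the author wrote it, get href as a string
    for my $a ($tree->look_down(_tag => 'a')) {
        my $href = $a->attr('href');
        next unless defined $href;
        $a->attr('href', rewrite_url($href));
    }

    print $tree->as_HTML;                # serialize back to HTML
    $tree->delete;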
Post by Mike Spencer
If I could get access to the incoming doc between the HTTP[S] stuff
and the rendering (or DOM analysis or whatever) via a general proxy,
I wouldn't have to fool with that.
A general proxy provides you *just that*. That's how a proxy works,
*all* your http fetches from the browser go to the proxy. The proxy is
then responsible for actually performing the http fetch from the site,
downloading the content at that URL, and returning the data back to the
browser. It sits right in the middle of *everything*. It can modify
any of the outgoing fetch data, and any of the return response data.
Post by Mike Spencer
All follow-on clicks would go back through the proxy without
explicitly making the URLs do so.
With a proxy, the browser already handles that for you, because you
tell your browser to use your proxy (it is one of the config options in
the network setup inside the browser). After which, *all* http traffic
goes to the proxy. There is also a way to filter based on url to
selectively send some http to the proxy, others not, but it sounds like
you want /everything/ to go, which is actually the simpler
configuration.
Post by Mike Spencer
Post by Rich
And for what you want to do, a parser that returns a DOM tree you
can manipulate, and then serialize back to HTML, would make your
life *much* easier here.
Having a look at https://javascript.info/dom-nodes, I'm pretty sure
that's above my pay grade.
Ah, but the Javascript API to its DOM tree also includes a lot of
windowing/styling/etc. material plus JS's general "anything can be
stuffed in there, if you try" that you don't need for what you want to
do. The DOM interface you'd get for a DOM tree from "just a HTML
parser" won't have half the crap the JS DOM contains, because you are
not styling, are not adding JS events, etc. at that point.
Post by Mike Spencer
Post by Rich
Post by Mike Spencer
How do I get between the code that does the HTTP[S] stuff and the
rendering engine of the browser?
That is called a "proxy" and your browser already has built in
support for using one. Do some googling on http proxy.
BTDT. Like many subjects/keywords, there's a vast collection of
stuff out there explaining how to do *other* things with proxies --
load and bandwidth management, blocking domains or sites, caching.
No one I've found is doing my simple task which is similar to tearing
all the packaging off of a new widget you just bought. See "package
rage" elsewhere.
Look at how those "intercept" the http transactions. All you'll be
doing is replacing the module "block domains" (or "cache") with code
that edits the HTML instead. The rest of the 'intercept' is identical
among all of them. The "cache" one might be a bit closer in that it
has to hold a copy of the response, so it will already have the
response stored somewhere.

But you are looking at "specific" kinds of proxies above, when the
"proxy" part is generic to them all, and you appear to be dismissing
the generic proxy because the specific one you've seen isn't the exact
item you are looking for. Years ago (circa 1997-1999) there was a blob
of perl code that was a proxy that edited the returning HTML, which
would be a specific example towards your goal. I unfortunately no
longer remember what it was named.
Post by Mike Spencer
Post by Rich
Post by Mike Spencer
I don't know where to begin to get my own code (effectively) inside
Firefox.
That would be a Firefox extension.
Haven't looked much at how to write a FF extension. I did just order an
O'Reilly book on js. But, based on one tutorial, it looks messy.
Have to write RDF and XML files as well as js? Huh.
And an extension will be in large part specific to one browser. An http
proxy will be generic to any browser you want to run.
Post by Mike Spencer
Post by Rich
But for what you are talking about doing, an external proxy will be
by far more capable and easier to put together.
I'd have thought so.
Post by Rich
And, it seems perl already has a module in cpan (did you do any
searching at all?)...
Yeah. See "vast collection of stuff", above.
Post by Rich
...that sounds like it is most of the way towards what you want to do:
http://search.cpan.org/~book/HTTP-Proxy-0.304/lib/HTTP/Proxy.pm
I'm looking at that. Old guy, not as quick a study as I used to be.
But AFAICT, that doesn't handle HTTPS. Separate, stand-alone proxy
programs apparently do HTTPS "transparently", which I take to mean
that they (intentionally) don't get access to the content but pass the
cert & crypto stuff and encrypted data straight through. It's the big
corporate sites that have the most crap I want to elide and those are
the ones that are going all HTTPS.
No, when a proxy is active, the proxy gets to MITM the https request.
So you'd get to see the internals of the https stream as well. But
you'd need to research how to add https proxy handling to the Perl
module.

That's why corporate 'locked down' environments, when they do lock down
https, do it via the proxy method. They then get to MITM all the HTTPS
traffic as well.
Post by Mike Spencer
Digressing here, HTTP meant "Ask for a library book, librarian hands
it to you". HTTPS means keeping a database of corporate entities,
asking one of them for permission to get the book, then getting it in
a "plain brown wrapper". Great for banking but not so much for XKCD
or the WaPo. You can actually fetch an HTTP doc from the commandline
with telnet. HTTPS needs messy data architecture and protocols. So I
need to work around that somehow.
There are numerous Perl modules that take care of the complexity of
HTTPS for you. There's likely one that plugs in to one or more of the
Perl Proxy modules that adds HTTPS proxy capability to the module. At
which point you don't have to work out all that complexity yourself.
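
For example, LWP::UserAgent with LWP::Protocol::https installed hides all
the certificate and TLS details (a sketch; the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;    # needs LWP::Protocol::https for https:// URLs

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('https://www.example.com/');
    die $res->status_line, "\n" unless $res->is_success;

    my $html = $res->decoded_content;    # plain text, ready for your filter
    print $html;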
Henrik Carlqvist
2018-01-26 06:53:41 UTC
Post by Mike Spencer
As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Aha, so this is not about modifying pages being served from your own web
server, possibly by fetching data from other servers...

As others have said, you should instead be looking for some web proxy
functionality. Grant mentioned squid and squid seems able to modify pages
for you: https://wiki.squid-cache.org/SquidFaq/ContentAdaptation

regards Henrik
--
The address in the header is only to prevent spam. My real address is:
hc351(at)poolhem.se Examples of addresses which go to spammers:
***@localhost ***@localhost
root
2018-01-26 15:04:19 UTC
Post by Henrik Carlqvist
Post by Mike Spencer
As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Aha, so this is not about modifying pages being served from your own web
server, possibly by fetching data from other servers...
As others have said, you should instead be looking for some web proxy
functionality. Grant mentioned squid and squid seems able to modify pages
for you: https://wiki.squid-cache.org/SquidFaq/ContentAdaptation
That page only talks about an http connection. Somebody else
has pointed out the problem of an https connection.
Doug713705
2018-01-29 19:58:37 UTC
On 26-01-2018, root explained to us in
alt.os.linux.slackware
Post by root
Post by Henrik Carlqvist
Post by Mike Spencer
As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Aha, so this is not about modifying pages being served from your own web
server, possibly by fetching data from other servers...
As others have said, you should instead be looking for some web proxy
functionality. Grant mentioned squid and squid seems able to modify pages
for you: https://wiki.squid-cache.org/SquidFaq/ContentAdaptation
That page only talks about an http connection. Somebody else
has pointed out the problem of an https connection.
You can't intercept an https connection without mounting a 'man in the middle'
attack (which is possible if you can add a crafted CA in the web browser).

The URL request is sent only after the connection is established (the server
has identified itself with a cert and all) and the encrypted tunnel is up.
The proxy cannot see anything beyond the CONNECT request, which contains
only the remote server's domain name or IP address.
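
That is, all the proxy ever sees of an https request is something like this
(host made up):

    CONNECT www.example.com:443 HTTP/1.1
    Host: www.example.com:443

    HTTP/1.1 200 Connection established

...after which only opaque TLS records flow in both directions.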

Nobody can see what is going on between your browser and the remote server
once this 'encrypted tunnel' is established.
--
And you've got to stir it, that's for sure
Otherwise it turns into jam
Cancoillotte is a whole art
You mustn't leave anything to chance
-- H.F. Thiéfaine, La cancoillote
Mike Spencer
2018-01-29 23:36:29 UTC
On 26-01-2018, root explained to us in alt.os.linux.slackware
Post by root
Post by Henrik Carlqvist
Post by Mike Spencer
As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Aha, so this is not about modifying pages being served from your own web
server, possibly by fetching data from other servers...
As others have said, you should instead be looking for some web proxy
functionality. Grant mentioned squid and squid seems able to modify pages
for you: https://wiki.squid-cache.org/SquidFaq/ContentAdaptation
That page only talks about an http connection. Somebody else
has pointed out the problem of an https connection.
You can't intercept an https connection without mounting a 'man in the
middle' attack (which is possible if you can add a crafted CA in the
web browser).
That was my understanding (although some comments seemed to be to the
contrary.) In principle, I want to make a man in the middle attack
against myself.
The URL request is sent only after the connection is established (the
server has identified itself with a cert and all) and the encrypted tunnel
is up. The proxy cannot see anything beyond the CONNECT request,
which contains only the remote server's domain name or IP address.
which means that, in order to get access to the transported data, the
proxy has to establish the encrypted connection itself and acquire the
data. I suppose that means that it then has to spoof the
encrypted/certified connection back to the browser because the browser
is expecting it?
Nobody can see what is going on between your browser and the remote
server once this 'encrypted tunnel' is established.
As the OP, you'll ask, "What's this all about?". I want to remove
stuff from HTML pages before my browser sees it. Among other things:
On a slow connection, it's bad enough that some useful sites send 450K
of data in order to deliver 45K of readable text. But even with
javascript *disabled*, the browser parses <LINK... tags and fetches
referenced javascript (and icon, image, whatever-else) data, pushing
the load toward a meg. You can demonstrate this by loading an
unexpurgated HTML page from a local file. If the <LINK...-referenced
files aren't locally available, the browser will promptly start/try to
fetch them from the net.
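
A one-file demonstration (made-up URL): save something like this locally,
open it with javascript disabled, and watch the network traffic anyway:

    <html><head>
    <link rel="stylesheet" href="https://cdn.example.com/huge-framework.css">
    </head><body>Nothing to see here.</body></html>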

Basically, I want to scrutinize every incoming web page and, in so far
as possible, defeat the intentions of the "web designers", whose
notions of what's good for me are vexatious.

Merci,
--
Mike Spencer Nova Scotia, Canada

"Strive to subvert and defeat web bloat wherever encoutered."
-- Imafa Kinasso, Research Fellow, Bridgwater Institute for Advanced Study
Eli the Bearded
2018-01-30 00:23:53 UTC
Post by Mike Spencer
Post by Doug713705
You can't intercept an https connection without mounting a 'man in the
middle' attack (which is possible if you can add a crafted CA in the
web browser).
That was my understanding (although some comments seemed to be to the
contrary.) In principle, I want to make a man in the middle attack
against myself.
Yeah, that's what I understood you to be doing from your first request.
I think the "easy" solution is to make your proxy *only* talk http to
your browser, and rewrite the html to have http URLs, but maintain a
record of what sites are http/https. When talking to the remote site,
use the correct protocol.

Compared to stripping <LINK> tags, it's a bit different but not all that
hard. The hard part is finding https in javascript, but you don't want the
javascript anyway.
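
A sketch of that bookkeeping in Perl (%was_https, rewrite_links() and
upstream_url() are made-up names, not from any module):

    my %was_https;    # hosts the proxy has seen using https

    # strip https from links before the page reaches the browser
    sub rewrite_links {
        my ($html) = @_;
        $html =~ s{https://([^/"'\s>]+)}{ $was_https{lc $1} = 1; "http://$1" }gei;
        return $html;
    }

    # restore the real scheme when the proxy fetches on the browser's behalf
    sub upstream_url {
        my ($url) = @_;
        $url =~ s{^http://([^/]+)}{ ($was_https{lc $1} ? 'https' : 'http') . "://$1" }ei;
        return $url;
    }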
Post by Mike Spencer
As the OP, you'll ask, "What's this all about?". I want to remove
Yeah, I use text mode browsers like "lynx" by default on some sites.
Places like Medium.com add a "shitton" of useless (to me) javascript and
styling plus huge images that I almost always don't want.
Post by Mike Spencer
On a slow connection, it's bad enough that some useful sites send 450K
of data in order to deliver 45K of readable text.
A talk that veers into a rant you may find sympathetic:

http://idlewords.com/talks/website_obesity.htm

Elijah
------
no longer has a slow connection, but prefers the text view of text
Doug713705
2018-01-30 10:12:19 UTC
On 29-01-2018, Mike Spencer explained to us in
alt.os.linux.slackware
Post by Mike Spencer
On 26-01-2018, root explained to us in alt.os.linux.slackware
Post by root
Post by Henrik Carlqvist
Post by Mike Spencer
As a humble user, I want to (optionally) filter every web page of type
text/html (not jpegs, PDFs, WAVs etc) that my browser visits.
Aha, so this is not about modifying pages being served from your own web
server, possibly by fetching data from other servers...
As others have said, you should instead be looking for some web proxy
functionality. Grant mentioned squid and squid seems able to modify pages
for you: https://wiki.squid-cache.org/SquidFaq/ContentAdaptation
That page only talks about an http connection. Somebody else
has pointed out the problem of an https connection.
You can't intercept an https connection without mounting a 'man in the
middle' attack (which is possible if you can add a crafted CA in the
web browser).
That was my understanding (although some comments seemed to be to the
contrary.) In principle, I want to make a man in the middle attack
against myself.
Yes, but that is *not* an easy thing.
Post by Mike Spencer
The URL request is sent only after the connection is established (the
server has identified itself with a cert and all) and the encrypted tunnel
is up. The proxy cannot see anything beyond the CONNECT request,
which contains only the remote server's domain name or IP address.
which means that, in order to get access to the transported data, the
proxy has to establish the encrypted connection itself and acquire the
data. I suppose that means that it then has to spoof the
encrypted/certified connection back to the browser because the browser
is expecting it?
Yes. In fact your proxy has to pretend to be the requested remote server
and give back a valid SSL certificate signed by a trusted CA (Certificate
Authority).

In fact, your proxy needs to read the real certificate from the remote
server (which is 'easy') and use the data in it to generate *on the fly*
a new certificate signed by its own CA and then send this new certificate
to your browser.

Prior to that you'll have to create a CA (easy) and put it in your
browser's trusted CA list. This way, your browser will accept any
certificate signed by this CA.

There is an appliance that does all that for you. It's an ugly appliance
that offers enterprises a man-in-the-middle attack to watch employees'
behavior on the Internet (they claim it is to offer a better user
experience, but really it is spying on people).

This appliance is called 'Olfeo'. AFAIK it's not very stable, probably
expensive, and based on Ubuntu (OMG!). IMHO, this should never have existed.

You should be able to find other similar solutions.
--
I will know nothing of your habits
You may even be deceased
But I'll ask for your hand, to cut it off
-- H.F. Thiéfaine, L'ascenceur de 22H43
Eli the Bearded
2018-01-30 22:50:53 UTC
Post by Doug713705
There is an appliance that does all that for you. It's an ugly appliance
that offers enterprises a man-in-the-middle attack to watch employees'
behavior on the Internet (they claim it is to offer a better user
experience, but really it is spying on people).
The way to detect if something similar has been installed upstream from
you is to periodically inspect certificates and make sure they are not
all signed by the same authority.[*]
Post by Doug713705
This appliance is called 'Olfeo'. AFAIK it's not very stable, probably
expensive, and based on Ubuntu (OMG!). IMHO, this should never have existed.
You should be able to find other similar solutions.
I've already mentioned two others:

One: get his MITM proxy to only talk http not https to the browser, and
rewrite pages to not use https links. Will need some cookie header
editing, too, now that I think more about it, don't want the browser to
reject secure-only cookies. Since he's doing this to rewrite pages
*anyway*, it seems like the best choice to me.

Two: The MITM Proxy project from mitmproxy.org; written in Python, which
is not the language originally requested.

Elijah
------
[*] I do that while checking the expire dates on my Let's Encrypt certs
Eli the Bearded
2018-01-25 18:22:10 UTC
Post by Mike Spencer
browser can optionally be filtered/edited with a script, preferably
Perl because I've already done similar tasks with Perl.
1. Something opens the socket to the remote host, does IP, certs if
any, crypto if any, HTTP GET and whatever other details may arise
such as "chunked" or compression.
...
Post by Mike Spencer
Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).
You don't have to.

use Net::SSLeay::Handle;
...
if (!tie(*SOCK, "Net::SSLeay::Handle", $remotehost, $port)) {
    die "$remotehost:$port -- $!\n";
}

# now use SOCK as if you had socket() / connect()ed to the site
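
# e.g. (added sketch) send a request and read the reply through the tied
# handle, as in the Net::SSLeay::Handle synopsis:
print SOCK "GET / HTTP/1.0\r\nHost: $remotehost\r\n\r\n";
print while <SOCK>;
close SOCK;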

Elijah
------
has written his own wget/curl tool in perl
root
2018-01-25 18:34:51 UTC
Post by Eli the Bearded
------
has written his own wget/curl tool in perl
But why?
Eli the Bearded
2018-01-25 19:47:08 UTC
Post by root
Post by Eli the Bearded
has written his own wget/curl tool in perl
But why?
I started in 1999 when I needed features that weren't in the versions
available back then. Since then I've kept it up for (a) keeping my
understanding of web protocols sharp and (b) a browser "emulation"
feature that has helped me understand some esoteric cases.

https://qaz.wtf/tmp/bget

Should work with any perl newer than say, 5.005, if you have the
matching SSLeay module. No other modules outside of the standard
install are required. The code is not pretty.

The "emulation" makes this tool send request headers that match other
browsers. User-Agent, Accept-*, etc. Variations in Accept-* can have
surprising results on some servers. Other servers might seem to hate
lynx, but then you can figure out that, no, it's just the libwww library
mentioned in the UA that they hate. My bget program has no "native"
User-Agent, it pretends to be a libwww-perl project by default (but
does not use LWP at all).

I've considered trying to write a full MITM https proxy, but there are
some subtle complications with that. There's a Python project that has
already done it, which is good, but I don't like hacking on Python to
add / modify it for my tastes.

https://mitmproxy.org/

Elijah
------
has written his own test web servers, too, but not as full featured
root
2018-01-25 20:35:09 UTC
Post by Eli the Bearded
Post by root
But why?
I started in 1999 when I needed features that weren't in the versions
available back then. Since then I've kept it up for (a) keeping my
understanding of web protocols sharp and (b) a browser "emulation"
feature that has helped me understand some esoteric cases.
I well understand writing your own stuff to add your features.

The browser emulation feature most important to me is emulating
the javascript processing of the page. Much of my time is
spent extracting data from a downloaded page that would
require manual clicks if I were to use a browser. I pipe the
wget output through a filter to extract what I need. None
of these filters use regular expressions to process the
page contents.

I've saved your perl/wget for later study.
Eli the Bearded
2018-01-25 22:55:10 UTC
Post by root
The browser emulation feature most important to me is emulating
the javascript processing of the page.
Yeah, I don't do any of that.

For a http / https agent with a nice scriptable interface and javascript
support, I'd suggest edbrowse. The UI mimics ed, which is terse, line
based, easy to script but not very forgiving of errors.

I used edbrowse as a twitter UI for a couple of months some years ago.
Unlike lynx, I could both read and tweet with the program. I'm not sure if
you can even log in with lynx.
Post by root
Much of my time is
spent extracting data from a downloaded page that would
require manual clicks if I were to use a browser.
For several years I played a browser-based multiplayer game and used my
bget script as part of a macro engine. The cookie based auth required
similar headers from my browser and my tool; the emulation feature
worked well there. I wrote separate perl scripts for actually parsing
and generating the form responses for the gameplay. Having a javascript
parser or engine might have made that easier, but I did it all by hand.

(Due to the use of framesets, edbrowse was not effective for that game.)
Post by root
I pipe the
wget output through a filter to extract what I need. None
of these filters use regular expressions to process the
page contents.
I still use bget for web API scripts, but generally don't download
HTML in bulk with it anymore. I create "emulations" with auth headers I
want/need and download JSON and/or binaries with it.
Post by root
I've saved your perl/wget for later study.
If you want to use it for reals, the full ecosystem includes two
other scripts:

https://qaz.wtf/tmp/mkpost

Used to generate complex post forms for bget's --filepost option.

https://qaz.wtf/tmp/nullhttpd

Run as "nullhttpd --bget" and it starts a trivial httpd server which
logs (to STDOUT) requests in a way that mimics the formatting of
emulations in bget source. Used to create new emulations from any
browser.

With other options nullhttpd is handy for studying what gets sent from
various things. You can use it, eg, to handle callback URLs and see
the headers and body formatting of the call. This tool does not support
https yet, though.

Elijah
------
finds most callbacks are quite willing to accept http only addresses
Ars Ivci
2018-01-25 20:21:24 UTC
On 25 Jan 2018 02:03:36 -0400
Post by Mike Spencer
Not Slackware-specific but I surmise Slackers might tilt more than
others toward The Hacker Nature and have good suggestions. And I'm
working with Slackware 14.2.
I would like a utility/feature such that HTML pages incoming to my web
browser can optionally be filtered/edited with a script, preferably
Perl because I've already done similar tasks with Perl.
1. Something opens the socket to the remote host, does IP, certs if
any, crypto if any, HTTP GET and whatever other details may arise
such as "chunked" or compression.
2. If the incoming data is text/html, passes it to my script.
3. Script edits the HTML, writes to stdout or other suitable place.
4. Script passes result to browser, possibly creating (or not) new
HTTP headers.
Home page on localhost, containing link to foo.com
Link to foo.com actually points to localhost/cgi-bin/foo.pl
foo.pl opens socket on real foo.com, fetches "page" via HTTP,
edits, writes result to stdout (along with suitable HTTP headers.)
Browser gets result from the cgi-bin mechanism.
Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).
I'd like to generalize that process so that if FILTER is ON, any link
clicked will get that treatment. If FILTER is OFF, browser will
get unadorned input as usual.
Where do I go to do this? Can Apache (on localhost) be made to
do step 1. above, pass results to a script? Do I have to find out
about writing "addons" or "extensions" for Firefox? Might there be a
secret API/feature in FF that would make this easy? Is there an
existing stand-alone utility that implements this?
This is essentially a man in the middle attack on myself. Someone
must have done it already.
Any pointers welcome, TIA, etc.
Maybe Privoxy (https://www.privoxy.org/) can give you some ideas.
peace,
--
Ars Ivci
Grant Taylor
2018-01-26 03:07:12 UTC
Post by Mike Spencer
I would like a utility/feature such that HTML pages incoming to my web
browser can optionally be filtered/edited with a script, preferably Perl
because I've already done similar tasks with Perl.

Post by Mike Spencer
Works fine for HTTP, *not* for HTTPS. (I'm not smart enough to write a
complete RFC-compliant package to do HTTPS with certs & crypto).
Yep, HTTPS is going to be tricky, but not impossible.
Post by Mike Spencer
I'd like to generalize that process so that if FILTER is ON, any link
clicked will get that treatment. If FILTER is OFF, browser will get
unadorned input as usual.
I think that combining a couple of different technologies will allow
this behavior.
Post by Mike Spencer
Where do I go to do this? Can Apache (on localhost) be made to do
step 1. above, pass results to a script?
I don't think Apache will do what you want to do.
Post by Mike Spencer
Do I have to find out about writing "addons" or "extensions" for Firefox?
I don't think so.
Post by Mike Spencer
Might there be a secret API/feature in FF that would make this easy?
Maybe. I'm not aware of one, but that doesn't mean that it doesn't exist.

I think that you can get Firefox to save ephemeral keys used for SSL,
but I don't think that will help with this.
Post by Mike Spencer
Is there an existing stand-alone utility that implements this?
I'm sure that you can find some utilities of the darker hat persuasion
that might be able to do some of this.
Post by Mike Spencer
This is essentially a man in the middle attack on myself.
Exactly. }:-)
Post by Mike Spencer
Someone must have done it already.
I've played with some of this, years ago.
Post by Mike Spencer
Any pointers welcome, TIA, etc.
Research Squid's SSL bump-in-the-wire decryption support. Squid can
behave like a standard proxy that your web browser uses, /and/ it can
intercept the CONNECT request to HTTPS sites. (Remember that HTTPS is
encrypted, even through proxies.)

I would configure Squid to use Web Cache Communication Protocol (WCCP)
to pass requests and / or replies to your script to do the filtering and
/ or modification that you want to do. - Your script wouldn't even
need to bother with HTTP or HTTPS. You would only need to appear as a
WCCP server to Squid. Squid will handle a LOT of the gory details for you.

This also means that you can leverage all of Squid's other features,
caching, URL filtering, ACLs, you name it.

As for enabling and disabling the feature, I'd leverage FoxyProxy to
decide if your client would use the Squid proxy or not. - FoxyProxy is
really neat and can be enabled / disabled easily, or it can even
dynamically choose to use the proxy based on the site that you're
connecting to.

I don't know if that's all that you want to do or not, but it should get
you a long way down the road.
--
Grant. . . .
unix || die
Mike Spencer
2018-01-26 20:44:20 UTC
[snipity-snip]
Thanks all for the replies and discussion. Further discussion welcome
but I now have some pointers to pursue -- Squid, Eli's code and more
-- and reading to do.

I'm using Slack 11 & Netscape 4.76 on my main box with FF 2 for HTTPS.
That allows more control of web crap than newer browsers but now
many sites are using crypto (TLS 1.2?) that even FF 2 doesn't support.

So I'm doing a testbed of Slack 14.2. Spent hours already groveling
over the FF about:config, hunting up on the web how to defeat the
numerous unwanted "features" (bugs? :-) in FF 45. I think I have it
under control.

The Subject: line refers to further efforts, beyond controlling the
browser itself, to defeat web bloat and intrusive "stuff". Thirty
years ago, Nicholas Negroponte (founder of MIT's Media Lab) was
pontificating in glowing terms about the future when you would send
your "intelligent agents" out onto the net to more or less
autonomously carry out tasks for you. Hah! Fast forward 30 years: We
all allow web sites to dump active code -- their "intelligent agents"
-- into our boxen to carry out *their* tasks autonomously (or in
real-time collaboration with other data collection/analysis hosts).
It's *their* "computer surrogates that possess a body of knowledge
both about something (a process, a field of interest, a way of doing)
and about you in relation to that something (your taste, your
inclinations, your acquaintances)" [1], not our own computer surrogates.

But it's unidirectional. The other direction is felonious intrusion.

[1] Nicholas Negroponte, _Being Digital_, p. 151

Tnx,
--
Mike Spencer Nova Scotia, Canada