python - Extract JavaScript from an HTML document using a regular expression

- September 15, 2013

I am trying to extract the JavaScript from google.com using a regular expression.

Program:

    # Python 2 (urllib.urlopen and the print statement)
    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis

Output:

['']

Can anyone tell me how to extract the JavaScript from an HTML document using only a regular expression?

This works:

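    # Python 2 (urllib.urlopen and the print statement)
    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    # (?si) turns on the DOTALL and IGNORECASE flags for the whole pattern
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis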

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes "." in the regex match newlines as well. That was the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
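A minimal sketch of the difference the "s" flag makes. The `html` string here is a made-up sample for illustration, not the actual content of google.com:

```python
import re

# Hypothetical sample: a script whose body spans several lines.
html = "<html><script>\nvar a = 1;\nvar b = 2;\n</script></html>"

# Without DOTALL, "." stops at newlines, so the multi-line script is missed.
without_dotall = re.findall(r'<script>(.*?)</script>', html)

# With (?s), "." also matches newlines, so the whole body is captured.
with_dotall = re.findall(r'(?s)<script>(.*?)</script>', html)

print(without_dotall)
print(with_dotall)
```

On this sample the first call finds nothing, while the second returns the full script body including its newlines.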

The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which lets the pattern match the tag however it is capitalized. This isn't strictly necessary here, because Google codes pretty well. But if you had poorly written markup that did things like <SCRIPT>...</SCRIPT>, you would need this flag.
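A small sketch of the "i" flag, again on a made-up `html` sample. It also passes the flags as an argument to re.findall instead of writing them inline as (?si); both spellings should behave the same way:

```python
import re

# Hypothetical sample mixing lowercase and uppercase script tags.
html = "<script>var a = 1;</script><SCRIPT>var b = 2;</SCRIPT>"

# Case-sensitive: only the lowercase <script> block is found.
case_sensitive = re.findall(r'(?s)<script>(.*?)</script>', html)

# Case-insensitive: flags given as an argument rather than inline.
case_insensitive = re.findall(
    r'<script>(.*?)</script>', html, re.DOTALL | re.IGNORECASE
)

print(case_sensitive)
print(case_insensitive)
```

Note that this pattern only matches a bare `<script>` tag, exactly as in the answer above; a tag carrying attributes would not match it.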
