python - Extract JavaScript from an HTML document using a regular expression


I am trying to extract the JavaScript from google.com using a regular expression.

Program:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis

Output:

[''] 

Can anyone tell me how to extract JavaScript from an HTML document using regular expressions only?

This works:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes the "." in the regex match newlines as well. That is the root of the problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
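To see the effect of the "s" flag without fetching a page, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 snippets above), and the html string is a made-up sample, not real google.com markup.

```python
import re

# Hypothetical sample standing in for a fetched page (assumption for illustration).
html = "<html><script>\nvar a = 1;\n</script><script>var b = 2;</script></html>"

# Without DOTALL, "." does not match "\n", so the multi-line script is skipped.
without_flag = re.findall(r'<script>(.*?)</script>', html)

# With (?s), "." also matches "\n", so both scripts are captured.
with_flag = re.findall(r'(?s)<script>(.*?)</script>', html)

print(without_flag)
print(with_flag)
```

Only the single-line script is found by the first pattern; the second pattern returns both, including the newlines inside the first script body.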

The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which lets the pattern match the script tags regardless of case. This isn't strictly necessary here, because Google codes pretty well. But if you had poor code that did something like <SCRIPT>...</Script>, you would need this flag.
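The case-insensitivity point can be checked the same way. This is a small Python 3 sketch on an invented string with mixed-case tags; it passes the flags as arguments, which should behave the same as the inline (?si) form.

```python
import re

# Hypothetical markup with inconsistent tag casing (assumption for illustration).
html = "<SCRIPT>var a = 1;</Script><script>var b = 2;</script>"

# Case-sensitive: only the all-lowercase tag pair matches.
case_sensitive = re.findall(r'<script>(.*?)</script>', html)

# re.IGNORECASE | re.DOTALL is the argument form of the inline (?si) flags.
case_insensitive = re.findall(r'<script>(.*?)</script>', html,
                              re.IGNORECASE | re.DOTALL)

print(case_sensitive)
print(case_insensitive)
```

Note that both patterns still require a bare opening tag with no attributes, exactly as in the answer's regex.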

