python - Extract JavaScript from an HTML document using a regular expression
I am trying to extract the JavaScript from google.com using a regular expression.
Program:
    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis
output:
['']
Can anyone tell me how to extract the JavaScript from an HTML document using regular expressions only?
This works:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis
The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes "." in the regex match newlines as well. This is the root of the problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
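A minimal sketch of the difference, run against a small made-up HTML string instead of the live page (the `html` sample below is an assumption for illustration, not the real google.com markup):

```python
import re

# Hypothetical sample page: one script on a single line, one spanning lines.
html = "<script>var a = 1;</script>\n<script>\nvar b = 2;\n</script>"

pattern = r'<script>(.*?)</script>'

# Without DOTALL, "." stops at newlines, so only the one-line script matches.
without_dotall = re.findall(pattern, html)

# With (?s), the inline form of re.DOTALL, "." also matches newlines.
with_dotall = re.findall('(?s)' + pattern, html)

print(without_dotall)
print(with_dotall)
```

The second list picks up the multi-line script, newlines included, that the first call skips.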
The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which lets the regex match the script tags regardless of letter case. Now, this isn't entirely necessary here because Google codes pretty well. But if you had poorly written code that did stuff like <SCRIPT>...</SCRIPT>, you would need this flag.
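As a rough illustration of the case-insensitive flag, here is a sketch with hypothetical upper-case tags (the `html` string is invented for the example). It also shows the same flags passed as an argument instead of inline:

```python
import re

# Hypothetical markup with upper-case tags (an assumption for illustration).
html = "<SCRIPT>\nalert('hi');\n</SCRIPT>"

# (?s) alone is case-sensitive, so the upper-case tags are missed.
case_sensitive = re.findall(r'(?s)<script>(.*?)</script>', html)

# (?si) is the inline form of re.DOTALL | re.IGNORECASE.
case_insensitive = re.findall(r'(?si)<script>(.*?)</script>', html)

# Equivalent call with the flags passed as the third argument.
same_with_flags = re.findall(r'<script>(.*?)</script>', html,
                             re.DOTALL | re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

Whether to use the inline `(?si)` form or the `flags` argument is mostly a matter of taste; both give the same matches here.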