python - Extract JavaScript from an HTML document using a regular expression
I am trying to extract the JavaScript from google.com using a regular expression.
Program:
    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall(r'<script>(.*?)</script>', gdoc)
    print scriptlis

Output:

    ['']

Can anyone tell me how to extract the JavaScript from an HTML doc using a regular expression only?
This works:

    import urllib
    import re

    gdoc = urllib.urlopen('http://google.com').read()
    scriptlis = re.findall('(?si)<script>(.*?)</script>', gdoc)
    print scriptlis

The key here is (?si). The "s" sets the "dotall" flag (same as re.DOTALL), which makes "." match newlines as well. This is the root of your problem: the scripts on google.com span multiple lines, so the regex can't match them unless you tell it to include newlines in (.*?).
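The effect of the "s" flag can be checked without a network request. The following is a minimal sketch in Python 3 (the code above uses Python 2 syntax); the sample_html string and its contents are made up for illustration.

```python
import re

# Hypothetical sample document; the script body spans several lines.
sample_html = """<html><head>
<script>
var a = 1;
var b = 2;
</script>
</head></html>"""

# Without DOTALL, "." does not match "\n", so the multi-line script is missed.
without_dotall = re.findall(r'<script>(.*?)</script>', sample_html)

# With the inline (?s) flag, "." also matches newlines.
with_dotall = re.findall(r'(?s)<script>(.*?)</script>', sample_html)

print(without_dotall)
print(with_dotall)
```

The first call finds nothing, while the second returns the script body including its line breaks.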
The "i" sets the "ignorecase" flag (same as re.IGNORECASE), which lets the pattern match the script tags regardless of letter case. Now, this isn't entirely necessary here because Google codes pretty well. But if you had poorly written markup that did something like <SCRIPT>...</Script>, you would need this flag.
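To see what the "i" flag changes, here is a small Python 3 sketch. The sample string with mixed-case tags is invented for the example. It also shows the same two flags passed as re.DOTALL | re.IGNORECASE instead of the inline (?si) form.

```python
import re

# Hypothetical sample with inconsistently cased tags.
sample_html = "<SCRIPT>var x = 1;</Script><script>var y = 2;</script>"

# Case-sensitive: only the all-lowercase tag pair matches.
case_sensitive = re.findall(r'(?s)<script>(.*?)</script>', sample_html)

# Case-insensitive: both tag pairs match.
case_insensitive = re.findall(r'(?si)<script>(.*?)</script>', sample_html)

# Same flags given as constants rather than inline.
with_constants = re.findall(
    r'<script>(.*?)</script>', sample_html, re.DOTALL | re.IGNORECASE
)

print(case_sensitive)
print(case_insensitive)
print(with_constants)
```

Whether to use the inline form or the constants is a matter of taste; both produce the same matches here.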