BeautifulSoup - Split a table into several tables with Beautiful Soup [Python]
I need help with a problem I can't figure out.
I have an HTML table built from tr and td elements,
for example:
<table border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td> </td>
  </tr>
  <tr>
    <td colspan="2"> <br /> <h2> macros </h2> </td>
  </tr>
  <tr>
    <td> #define </td>
    <td> <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745"> snd_lstindic </a> </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> liste sons indication <br /> </td>
  </tr>
  <tr>
    <td colspan="2"> <br /> <h2> définition de type </h2> </td>
  </tr>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top"> typedef void(* </td>
    <td class="memitemright" valign="bottom"> <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844"> f_sndchangefunc </a> )( <a class="el" href="#g4ab7db37a42f244764583a63997489a8"> e_sndsound </a> i_esound, abool i_bstart, abyte i_bydisablemodule) </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> fonction rappel sur départ/arrêt bip. <a href="#g73cba8bd62d629eb05495a5c1a7b2844"> </a> <br /> </td>
  </tr>
  <tr>
    <td colspan="2"> <br /> <h2> Énumérations </h2> </td>
  </tr>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top"> enum </td>
    <td class="memitemright" valign="bottom"> <a class="el" href="#g4ab7db37a42f244764583a63997489a8"> e_sndsound </a> { } </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> identificateurs sons <a href="group__sound.html#g4ab7db37a42f244764583a63997489a8"> plus de détails... </a> <br /> </td>
  </tr>
</table>
I am trying to split this table into several tables: take each title (h2) out of the table, and create a new table from the rows that follow it.
For example, the expected result here should be this:
<h2> macros </h2>
<table border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td> </td>
  </tr>
  <tr>
    <td colspan="2"> <br /> </td>
  </tr>
  <tr>
    <td> #define </td>
    <td> <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745"> snd_lstindic </a> </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> liste sons indication <br /> </td>
  </tr>
</table>
<h2> définition de type </h2>
<table>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top"> typedef void(* </td>
    <td class="memitemright" valign="bottom"> <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844"> f_sndchangefunc </a> )( <a class="el" href="#g4ab7db37a42f244764583a63997489a8"> e_sndsound </a> i_esound, abool i_bstart, abyte i_bydisablemodule) </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> fonction rappel sur départ/arrêt bip. <a href="#g73cba8bd62d629eb05495a5c1a7b2844"> </a> <br /> </td>
  </tr>
</table>
<h2> Énumérations </h2>
<table>
  <tr>
    <td class="memitemleft" nowrap="nowrap" align="right" valign="top"> enum </td>
    <td class="memitemright" valign="bottom"> <a class="el" href="#g4ab7db37a42f244764583a63997489a8"> e_sndsound </a> { } </td>
  </tr>
  <tr>
    <td class="mdescleft"> </td>
    <td class="mdescright"> identificateurs sons <a href="group__sound.html#g4ab7db37a42f244764583a63997489a8"> plus de détails... </a> <br /> </td>
  </tr>
</table>
I use Python and BeautifulSoup to parse the HTML code. This is what I tried first:
from BeautifulSoup import BeautifulSoup, NavigableString
import sys
import os

soup = BeautifulSoup(allhtml)  # allhtml holds the page source
for table in soup.findAll("table"):
    h2s = table.findAll("h2")
    if h2s:
        firsth2 = True
        lasth2 = False
        for i, h2 in enumerate(h2s):
            if h2 is not None:
                lasth2 = (i == len(h2s) - 1)
                h2.parent.replaceWithChildren()  # <td> deleted
                h2.parent.replaceWithChildren()  # <tr> deleted
                print h2.parent
                if firsth2:
                    # try to open a new table right after the title
                    h2.replaceWith(h2.prettify() + '<table>')
                    #h2_tag_idx = h2.parent.contents.index(h2)  # other method to add tags
                    #h2.parent.insert(h2_tag_idx + 1, '<b>ok</b>')
                else:
                    # try to close the previous table and open the next one
                    h2.replaceWith('</table>' + h2.prettify() + '<table>')
                firsth2 = False
print soup.prettify()
But it does not work: the tags I insert are replaced by their HTML entity equivalents (for example &lt;table&gt; instead of <table>)...
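A likely explanation, sketched here with Beautiful Soup 4 (the `bs4` package is an assumption, since the code above uses the older `BeautifulSoup` 3 module): a plain string handed to `replace_with` becomes a text node, so its `<` and `>` are escaped on output. Inserting a real tag object, created with `new_tag`, keeps it as markup. The tiny `html` sample is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><h2>macros</h2></div>"  # hypothetical minimal sample

# Case 1: a plain string becomes a text node, so the markup is escaped.
soup = BeautifulSoup(html, "html.parser")
soup.h2.replace_with("<table></table>")
escaped = str(soup)

# Case 2: a Tag object created by the soup stays real markup.
soup = BeautifulSoup(html, "html.parser")
new_table = soup.new_tag("table")
soup.h2.insert_after(new_table)
as_tag = str(soup)

print(escaped)
print(as_tag)
```

This also suggests why an unbalanced fragment such as `'</table>' + ... + '<table>'` cannot be spliced into the tree: the tree holds whole elements, not opening and closing tags.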
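I also tried taking every element of the table's contents, rebuilding several tables from them and putting them back into the soup, but that failed...
I tried converting the table to a string, splitting the string on a delimiter and putting the sub-tables back into the soup, but that failed too...
If anyone has an idea, that would be great!
Thanks in advance!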
I wrote this function, and it works:
from BeautifulSoup import BeautifulSoup, Tag

def getouttitlefromtable(htmlsoup):
    for ii, table in enumerate(htmlsoup.findAll("table")):
        h2s = table.findAll("h2")  # find every <h2></h2> in the table
        if len(h2s) > 0:  # at least one <h2> in the table
            firsth2 = True
            lasth2 = False
            newtables = BeautifulSoup()  # will hold the rebuilt tables
            for i, h2 in enumerate(h2s):
                if h2 is not None:
                    lasth2 = (i == len(h2s) - 1)
                    h2.parent.replaceWithChildren()  # remove the <td>
                    h2.parent.replaceWithChildren()  # remove the <tr>
                    idt = "table" + str(ii) + str(i)  # table id, for readability
                    wraptable = Tag(htmlsoup, "table")
                    wraptable["id"] = idt
                    wraptable["border"] = 0
                    wraptable["cellpadding"] = 0
                    wraptable["cellspacing"] = 0
                    # add <table></table> after each <h2>title</h2>
                    table.insert(h2.parent.contents.index(h2) + 1, wraptable)
                    newtable = table.find(name="table", attrs={"id": idt})
                    filltable = False
                    for tr in table.findAll(["h2", "tr"]):
                        if filltable:
                            if tr in h2s:
                                # next title reached: end of the new table
                                filltable = False
                                break
                            else:
                                if tr.find("h2") not in h2s:
                                    # add a row to the new table
                                    newtable.contents.append(tr)
                        if str(tr) == str(h2):
                            # current title reached: start of the new table
                            filltable = True
                    newtables.append(h2)
                    newtables.append(newtable)
                    firsth2 = False
            table.contents = newtables
            # Turn the outer table tag into a div. It is cheating, but I could
            # not manage to remove the wrapping <table></table>.
            table.name = "div"
If anyone has a better solution, I would be glad to see it.
bye
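One possible shorter variant, as a minimal sketch with Beautiful Soup 4 (`bs4`) rather than the `BeautifulSoup` 3 module used above. The function name `split_table_on_h2` and the small sample are hypothetical. It assumes tables are not nested, copies the original table's attributes onto each new table, leaves rows that precede the first title in the original table, and discards what remains of each title row (so the `<br />` row kept in the expected output above is dropped here).

```python
from bs4 import BeautifulSoup


def split_table_on_h2(soup):
    """Move each <h2> out of its table and start a new table after it."""
    for table in soup.find_all("table"):
        current = None   # table currently being filled, None before the first title
        anchor = table   # new nodes are inserted right after this node
        for row in table.find_all("tr"):
            h2 = row.find("h2")
            if h2 is not None:
                # Title row: pull the <h2> out and open a fresh table after it.
                current = soup.new_tag("table")
                current.attrs = dict(table.attrs)
                anchor.insert_after(h2.extract())
                h2.insert_after(current)
                anchor = current
                row.decompose()  # drop the rest of the title row
            elif current is not None:
                # Ordinary row: move it into the table of the latest title.
                current.append(row.extract())
    return soup


html = (
    '<table border="0">'
    "<tr><td>intro</td></tr>"
    '<tr><td colspan="2"><br/><h2>macros</h2></td></tr>'
    "<tr><td>#define</td><td>snd_lstindic</td></tr>"
    '<tr><td colspan="2"><br/><h2>enums</h2></td></tr>'
    "<tr><td>enum</td><td>e_sndsound</td></tr>"
    "</table>"
)
soup = split_table_on_h2(BeautifulSoup(html, "html.parser"))
print(soup.prettify())
```

Because only tag objects are moved around, nothing is escaped and the original table does not have to be renamed to a div.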