PowerShell - HtmlAgilityPack: get tables based on cell value
I have 1000+ HTML documents, each containing various tables, and I am using PowerShell to process them.
I am looking to extract specific tables. These can be identified by the first row, which is used for headings, where one of the cells has the word "measurement".
Since the HTML is a .doc export from Word, the word can be nested in a <span> or <p>. Ideally I would be able to ignore the level of nesting.
I've tried something like:
$tables = $doc.DocumentNode.SelectNodes("//table[* = 'measurement']")
but I get nothing back.
Here's some of the HTML. Unfortunately I cannot post all of it; it's an HTML document exported from MS Word:
<table class=msonormaltable border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse;mso-table-layout-alt:fixed;border:none; mso-border-alt:double windowtext 1.5pt;mso-padding-alt:0in 5.4pt 0in 5.4pt'>
 <tr style='mso-yfti-irow:0;mso-yfti-firstrow:yes'>
  <td width=192 valign=top style='width:2.0in;border:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt'>
   <p class=msoheading9><span lang=en-ca>areas</span></p>
  </td>
  <td width=288 valign=top style='width:3.0in;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt'>
   <p class=msoheading9><span lang=en-ca>measurements</span></p>
  </td>
  <td width=346 valign=top style='width:3.6in;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt'>
   <p class=msoheading9><span lang=en-ca>objectives</span></p>
  </td>
 </tr>
Without further information or sample HTML markup, I can suggest using the descendant axis (//) to match descendant nodes no matter how deeply they are nested within the <table> node:
//table[.//* = 'measurement']
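To illustrate why the child-axis attempt from the question comes back empty while the descendant version matches, here is a minimal sketch. It uses PowerShell's built-in [xml] type rather than HtmlAgilityPack so that it runs without any extra DLL. The fragment is a made-up, well-formed stand-in for one exported table, and the variable names are assumptions for illustration. Only the two XPath strings come from this thread.

```powershell
# Hypothetical, simplified stand-in for the Word export (well-formed so [xml] can parse it).
[xml]$xml = @'
<html><body>
<table>
  <tr>
    <td><p><span>areas</span></p></td>
    <td><p><span>measurement</span></p></td>
  </tr>
</table>
<table>
  <tr><td><p>other</p></td></tr>
</table>
</body></html>
'@

# Child axis: '*' only sees the <tr> children of <table>. The string value of a <tr>
# is the text of the whole row, so it is never equal to the single word.
$childOnly = $xml.SelectNodes("//table[* = 'measurement']")

# Descendant axis: './/*' also reaches the nested <td>, <p> and <span> elements,
# and the <p> and <span> have exactly the word as their string value.
$anyDepth = $xml.SelectNodes("//table[.//* = 'measurement']")

"child axis:      $($childOnly.Count) table(s)"   # 0
"descendant axis: $($anyDepth.Count) table(s)"    # 1
```

The same XPath string should be usable as the argument to $doc.DocumentNode.SelectNodes(...) from the question. Note that = compares the whole string value, so the word has to be the element's entire text.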
Update:
After looking at the sample HTML, I think there might be a more efficient way using a more specific XPath, for example:
//table[tr/td//* = 'measurement']
But a more specific XPath brings more risk of leaving out tables that are supposed to be selected. The decision is yours, depending on the overall document structure and how much efficiency is needed.
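One thing to watch for: the header cell in the posted sample reads "measurements", and an exact = comparison against 'measurement' would not match it. If that is a concern, a possible variation is to test for a substring with contains() and fold case with translate(), both standard XPath 1.0 functions. The sketch below again uses the built-in [xml] type on a made-up fragment. The id attributes, the variable names and the restriction to tr[1] (the heading row mentioned in the question) are assumptions for illustration, not something taken from the real documents.

```powershell
# Hypothetical fragment: t1 has the word in its first row, t2 only in its second row.
[xml]$xml = @'
<html><body>
<table id="t1">
  <tr><td><p><span>Areas</span></p></td><td><p><span>Measurements</span></p></td></tr>
  <tr><td>a</td><td>b</td></tr>
</table>
<table id="t2">
  <tr><td><p>Name</p></td></tr>
  <tr><td><p>measurement</p></td></tr>
</table>
</body></html>
'@

# XPath 1.0 has no lower-case function, so translate() maps A-Z to a-z instead.
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
$lower = 'abcdefghijklmnopqrstuvwxyz'

# Keep tables whose first row has any descendant element containing the word,
# regardless of letter case or nesting depth.
$xpath = "//table[tr[1]//*[contains(translate(., '$upper', '$lower'), 'measurement')]]"

$found = $xml.SelectNodes($xpath)
$found | ForEach-Object { $_.GetAttribute('id') }   # t1
```

A substring test is looser than =, so it will also pick up headings such as "measurement units". Whether that is acceptable depends on the documents.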