C# Strip HTML Markup in XML


I hope someone can help me with this issue. The solution should be in C#.

I have an XML file of 36 MB and about 900k lines. Some nodes contain a lot of HTML markup, including invalid markup such as:

<obs><p> <jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p> 

I've tried different ways to clean the file. One way was able to perform the task; however, because it is executed in a web application, it blocks the application, taking around 6 minutes to finish and consuming around 450 MB of memory.

As the file is invalid XML, I cannot use XmlTextReader. I also tried XSLT, based on "Strip HTML-like characters (not markup) from XML with XSLT?", but strangely I'm having problems with HTML entities.

The process that worked (with some tweaks) was the one from http://www.codeproject.com/articles/19652/html-tag-stripper

Thanks

Edit:

Following Kevin's suggestion, I'm trying to build a solution using the Html Agility Pack, at least for benchmarks. I'm stuck, however. Imagine the following XML node:

<obs><p>I want text<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></obs> 

How can I strip the tags inside the "obs" tag, but keep the "obs" tag itself and the text "I want text"? Like this:

<obs>I want text</obs> 

For now, this is the code I have:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);
    Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
    while (nodes.Count > 0)
    {
        HtmlNode node = nodes.Dequeue();
        HtmlNode parentNode = node.ParentNode;

        HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");

        if (childNodes != null)
        {
            foreach (HtmlNode child in childNodes)
            {
                if (child.Name != "obs")
                {
                    nodes.Enqueue(child);
                }
                else
                {
                    childNodes = child.SelectNodes("//p|//jantes");
                    foreach (HtmlNode nodeToStrip in childNodes)
                        nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
                }
            }
        }
    }
    string s = doc.DocumentNode.InnerHtml;

Thanks :)

Edit 2

OK, I was able to complete the task, but it is taking a long time: 3 hours, and consuming 800 MB of memory.

I still need help!

Here is the code; it might help someone.

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);
    Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
    while (nodes.Count > 0)
    {
        HtmlNode node = nodes.Dequeue();
        HtmlNode parentNode = node.ParentNode;

        HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");

        if (childNodes != null)
        {
            foreach (HtmlNode child in childNodes)
            {
                if (child.Name != "obs")
                {
                    nodes.Enqueue(child);
                }
                else
                {
                    childNodes = child.SelectNodes("//p|//jantes");
                    if (childNodes != null)
                    {
                        foreach (HtmlNode nodeToStrip in childNodes)
                        {
                            var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                            nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                        }
                    }
                }
            }
        }
    }
    string s = doc.DocumentNode.InnerHtml;

Have you tried the Html Agility Pack? Among its claims:

  • The parser is very tolerant of "real world" malformed HTML
  • You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it
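
If loading the whole 36 MB file into one DOM turns out to be what drives the time and memory up, one option is to treat each "obs" element on its own. The following is only a minimal sketch of that idea, and it swaps in plain Regex from the base class library instead of the Html Agility Pack. The names ObsCleaner and StripInsideObs are made up for illustration. It assumes that "obs" elements carry no attributes, are not nested, and that no attribute value inside them contains a '>' character. HTML entities are left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ObsCleaner
{
    // Matches one <obs>...</obs> element; Singleline lets '.' span line breaks.
    // Assumption: <obs> has no attributes and is never nested.
    private static readonly Regex ObsElement = new Regex(
        @"<obs>(.*?)</obs>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Matches anything that looks like a tag, valid or not.
    // Assumption: no attribute value contains a '>' character.
    private static readonly Regex AnyTag = new Regex(
        @"<[^>]*>",
        RegexOptions.Compiled);

    // Removes every tag found inside an <obs> element, keeping the <obs>
    // element itself and its text (trimmed). Markup outside <obs> is not touched.
    public static string StripInsideObs(string xml)
    {
        return ObsElement.Replace(xml, match =>
        {
            string inner = match.Groups[1].Value;
            string textOnly = AnyTag.Replace(inner, string.Empty).Trim();
            return "<obs>" + textOnly + "</obs>";
        });
    }
}
```

For the node in the question, ObsCleaner.StripInsideObs(...) gives back <obs>I want text</obs>. The per-element step could just as well hand the captured fragment to the Html Agility Pack; the point of the sketch is only that each fragment is small, so no single large tree has to be built or searched repeatedly.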
