C# Strip HTML Markup in XML
I hope someone can help me with this issue. The solution should be in C#.
I have an XML file of 36 MB, about 900k lines. Some nodes contain a lot of HTML markup, including invalid markup such as:
<obs><p> <jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>
I've tried different ways to clean the file. One way was able to perform the task; however, since it runs inside a web application, it blocks the application, takes around 6 minutes to finish, and consumes around 450 MB of memory.
As the file is invalid XML, I cannot use XmlTextReader. Using XSLT, based on "Strip HTML-like characters (not markup) from XML with XSLT?", I'm strangely having problems with HTML entities.
The process that worked (with some tweaks) is the one at http://www.codeproject.com/articles/19652/html-tag-stripper
Thanks
Edit:
Following Kevin's suggestion, I'm trying to build a solution using the HTML Agility Pack, at least for benchmarks. I'm stuck, however. Imagine the following XML node:
<obs><p>I want text<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></obs>
How can I strip the tags inside the "obs" tag, but keep the "obs" tag itself and the text "I want text"? Like this:
<obs>I want text</obs>
For now, the code I have is:
    using System.Collections.Generic;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);
    // Breadth-first walk over the document, starting from the top-level nodes.
    Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
    while (nodes.Count > 0)
    {
        HtmlNode node = nodes.Dequeue();
        HtmlNode parentNode = node.ParentNode;
        HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
        if (childNodes != null)
        {
            foreach (HtmlNode child in childNodes)
            {
                if (child.Name != "obs")
                {
                    nodes.Enqueue(child);
                }
                else
                {
                    // Note: "//" is evaluated from the document root, not from child.
                    childNodes = child.SelectNodes("//p|//jantes");
                    // Removes each whole element, so the text inside it is lost as well.
                    foreach (HtmlNode nodeToStrip in childNodes)
                        nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
                }
            }
        }
    }
    string s = doc.DocumentNode.InnerHtml;

Thanks :)
Edit 2:
OK, I was able to complete the task, but it takes a long time: 3 hours, and it consumes 800 MB of memory.
I still need help!
Here is the code; it might help someone.
    using System.Collections.Generic;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);
    // Breadth-first walk over the document, starting from the top-level nodes.
    Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
    while (nodes.Count > 0)
    {
        HtmlNode node = nodes.Dequeue();
        HtmlNode parentNode = node.ParentNode;
        HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
        if (childNodes != null)
        {
            foreach (HtmlNode child in childNodes)
            {
                if (child.Name != "obs")
                {
                    nodes.Enqueue(child);
                }
                else
                {
                    // Note: "//" is evaluated from the document root, not from child.
                    childNodes = child.SelectNodes("//p|//jantes");
                    if (childNodes != null)
                    {
                        foreach (HtmlNode nodeToStrip in childNodes)
                        {
                            // Replace the element with a text node holding its inner text.
                            var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                            nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                        }
                    }
                }
            }
        }
    }
    string s = doc.DocumentNode.InnerHtml;
Have you tried the HTML Agility Pack? Among its claims:
- The parser is very tolerant of "real world" malformed HTML.
- You can fix a page any way you want, modify the DOM, add nodes, copy nodes, well... you name it.
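If building a DOM for the whole 36 MB file turns out to be too slow or memory-hungry, another option is to skip the DOM and clean only the contents of each "obs" element with regular expressions. This is a different technique from the HTML Agility Pack, shown only as a minimal sketch under assumptions: every "obs" element is written exactly as <obs>...</obs> with no attributes, every opening tag has its closing tag, "obs" elements are not nested, and no attribute value inside the markup contains a ">" character. The names ObsCleaner and StripObsMarkup are hypothetical, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ObsCleaner
{
    // One <obs>...</obs> element; Singleline lets the body span line breaks,
    // and the lazy quantifier stops at the first closing tag.
    private static readonly Regex ObsElement = new Regex(
        @"<obs>(?<body>.*?)</obs>",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Anything that looks like a tag: '<', a run of non-'>' characters, '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>", RegexOptions.Compiled);

    // Runs of whitespace, collapsed to a single space after the tags are gone.
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    // Returns the input with all tag-like markup removed from inside each
    // <obs> element. Text outside <obs> elements is returned unchanged.
    // Entities such as &amp; are left as they are so the output stays escaped.
    public static string StripObsMarkup(string xml)
    {
        return ObsElement.Replace(xml, match =>
        {
            // Tags become a space so that adjacent paragraphs do not run together.
            string text = AnyTag.Replace(match.Groups["body"].Value, " ");
            text = Whitespace.Replace(text, " ").Trim();
            return "<obs>" + text + "</obs>";
        });
    }
}
```

For the example node from the question, ObsCleaner.StripObsMarkup returns <obs>I want text</obs>. A whole file could be processed with File.WriteAllText(outputPath, ObsCleaner.StripObsMarkup(File.ReadAllText(inputPath))), where inputPath and outputPath are placeholders; this holds the file in memory as a string rather than as a node tree. Whether it is fast enough for the real file would need to be measured.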