C# - Export Wikipedia article to get summary information
I'm trying to get the introduction of a Wikipedia article to include it in a report. For example, for this article: http://en.wikipedia.org/wiki/map3k8
I want to get:
Mitogen-activated protein kinase kinase kinase 8 is an enzyme that in humans is encoded by the MAP3K8 gene. This gene was identified by its oncogenic transforming activity in cells. The encoded protein is a member of the serine/threonine protein kinase family.
This kinase can activate both the MAP kinase and JNK kinase pathways. This kinase was shown to activate IkappaB kinases, and thus induce the nuclear production of NF-kappaB. This kinase was also found to promote the production of TNF-alpha and IL-2 during T lymphocyte activation. Studies of a similar gene in rat suggested the direct involvement of this kinase in the proteolysis of NF-kappaB1,p105 (NFKB1). This gene may also utilize a downstream in-frame translation start codon, and thus produce an isoform containing a shorter N-terminus. The shorter isoform has been shown to display weaker transforming activity. In mice, this gene is known as Tpl2 and it is a tumor suppressor gene whose absence contributes to the development and progression of cancer.
I'm getting the page from this URL: http://en.wikipedia.org/wiki/special:export/map3k8
and I converted the code from this post: http://forums.asp.net/t/1066507.aspx/1 to C#:
    // Needs: using System.IO; using System.Net; using System.Xml; using System.Xml.XPath;
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://en.wikipedia.org/wiki/special:export/map3k8");
    request.Accept = "text/html";
    request.Credentials = System.Net.CredentialCache.DefaultCredentials;
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    XmlTextReader reader = new XmlTextReader(responseStream);
    string ns = "http://www.mediawiki.org/xml/export-0.8/";
    XPathDocument doc = new XPathDocument(reader);
    reader.Close();
    response.Close();
    XPathNavigator myXPathNav = doc.CreateNavigator();
    // Collect the wikitext of every <text> element in the export namespace.
    XPathNodeIterator nodesText = myXPathNav.SelectDescendants("text", ns, false);
    while (nodesText.MoveNext())
    {
        ViewBag.Message += nodesText.Current.InnerXml;
    }
    ViewBag.Summary = GetSummary(ViewBag.Message);
    return View();
The GetSummary method is written according to the PBB template: http://en.wikipedia.org/wiki/template:pbb_controls
I only want information about proteins, so the articles should be following this template.
    public string GetSummary(string page)
    {
        string res = "";
        // The introduction is in 2 parts:
        // 1st: between "{{pbb|geneid=1326}}" and <!-- pbb_summary (.)* -->
        string intro = "";
        // 2nd: between "summary_text =" and "=="
        // http://en.wikipedia.org/wiki/special:export/map3k8 is used as the example
        string summary = "";
        try
        {
            intro = page.Split(new string[] { "}}" }, StringSplitOptions.None)[1];
            intro = intro.Split(new string[] { "<!--" }, StringSplitOptions.None)[0];
            intro = DeleteMediaWikiTag(intro);
        }
        catch (Exception)
        {
            intro = "";
        }
        try
        {
            summary += page.Split(new string[] { "summary_text =" }, StringSplitOptions.None)[1];
            summary = summary.Split(new string[] { "==" }, StringSplitOptions.None)[0];
            summary = DeleteMediaWikiTag(summary);
        }
        catch (Exception)
        {
            summary = "";
        }
        res = intro + "\n\n" + summary;
        return res;
    }

    public string DeleteMediaWikiTag(string text)
    {
        string res = "";
        // This one is working
        Regex reg = new Regex("{{.*(}})*|{{|}}|'''|<!--.*-->|]]|([[]){2}");
        res = reg.Replace(text, "");
        // I don't understand what is wrong with this regex
        Regex regPrime = new Regex("<(.)*(>){1}");
        res = regPrime.Replace(res, "prime");
        return res;
    }
My problem is in the execution of DeleteMediaWikiTag(summary),
because I'm losing the end of the summary part:
In mice, this gene is known as Tpl2 and it is a tumor suppressor gene whose absence contributes to the development and progression of cancer.
Before it is handled by the regex, the text looks like this:
<ref name="entrez" /> In mice, this gene is known as Tpl2 and it is a tumor suppressor gene whose absence contributes to the development and progression of cancer. <ref>{{cite web|last=DeCicco-Skinner|first=Kathleen|title=Loss of tumor progression locus 2 (tpl2) enhances tumorigenesis and inflammation in two-stage skin carcinogenesis|url=http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3460638/}}</ref>
So, according to my regex, I'm expecting something like this (prime is used to highlight the matches; in the end, I will delete whatever matches the regex):
prime In mice *.....* prime
But I get:
prime
So "<(.)*(>){1}"
is matching the whole part, from the first < to the last >. I'm asking it to match the > pattern only 1 time, not more than 1 time, yet it behaves as if it takes everything...
What is wrong with my regex? Did I miss something? Or maybe there is a better way to parse this? (None of the parsers I've found convinced me.)
P.S. The parser works with: http://en.wikipedia.org/wiki/nfkb2 or http://en.wikipedia.org/wiki/apoa4 but I want it to work more reliably.
I can't find an issue with the existing one. Both regular expressions are working fine. I recommend using an online regular expression tester before implementing them in code. Try this one out: http://gskinner.com/regexr/
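The same kind of check can also be scripted instead of run in an online tester. The following is only a minimal sketch, not the poster's code: the class name TagStripDemo and its method names are made up for illustration. It runs the question's pattern "<(.)*(>){1}" next to two common alternatives, a lazy quantifier and a negated character class. In .NET, * is greedy by default, so it may be worth comparing the three results on a line that starts and ends with a tag before settling on a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TagStripDemo
{
    // The pattern from the question. "(.)*" is greedy: it takes as many
    // characters as it can, so one match can run from the first '<'
    // to the last '>' on the line.
    public static string StripGreedy(string text)
    {
        return Regex.Replace(text, "<(.)*(>){1}", "prime");
    }

    // Lazy variant: ".*?" takes as few characters as possible, so each
    // match ends at the first '>' after its '<'.
    public static string StripLazy(string text)
    {
        return Regex.Replace(text, "<.*?>", "prime");
    }

    // Negated character class: '<', then anything except '>', then '>'.
    public static string StripNegated(string text)
    {
        return Regex.Replace(text, "<[^>]*>", "prime");
    }
}
```

Called on the shortened sample `<ref name="entrez" /> In mice. <ref>`, StripGreedy returns `prime`, while StripLazy and StripNegated both return `prime In mice. prime`.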