C# - Export Wikipedia article to get summary information

I'm trying to get the introduction of a Wikipedia article to include it in a report. For example, for this article: http://en.wikipedia.org/wiki/map3k8

I want to get:

Mitogen-activated protein kinase kinase kinase 8 is an enzyme that in humans is encoded by the MAP3K8 gene. This gene was identified by its oncogenic transforming activity in cells. The encoded protein is a member of the serine/threonine protein kinase family.
This kinase can activate both the MAP kinase and JNK kinase pathways. This kinase was shown to activate IkappaB kinases, and induce the nuclear production of NF-kappaB. This kinase was found to promote the production of TNF-alpha and IL-2 during T lymphocyte activation. Studies of a similar gene in rat suggested the direct involvement of this kinase in the proteolysis of NF-kappaB1,p105 (NFKB1). This gene may utilize a downstream in-frame translation start codon, and produce an isoform containing a shorter N-terminus. The shorter isoform has been shown to display weaker transforming activity. In mice, this gene is known as Tpl2 and is a tumor suppressor gene whose absence contributes to the development and progression of cancer.

I'm getting the page from this URL: http://en.wikipedia.org/wiki/special:export/map3k8

and I converted the code from this post: http://forums.asp.net/t/1066507.aspx/1 to C#:

    // Download the article's XML export
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://en.wikipedia.org/wiki/special:export/map3k8");
    request.Accept = "text/html";
    request.Credentials = System.Net.CredentialCache.DefaultCredentials;
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    Stream responseStream = response.GetResponseStream();
    XmlTextReader reader = new XmlTextReader(responseStream);
    string ns = "http://www.mediawiki.org/xml/export-0.8/";
    XPathDocument doc = new XPathDocument(reader);
    reader.Close();
    response.Close();

    // Collect the wikitext of every <text> element (InnerXml keeps it entity-escaped)
    XPathNavigator myXPathNav = doc.CreateNavigator();
    XPathNodeIterator nodesText = myXPathNav.SelectDescendants("text", ns, false);
    while (nodesText.MoveNext())
    {
        ViewBag.Message += nodesText.Current.InnerXml;
    }
    ViewBag.Summary = GetSummary(ViewBag.Message);
    return View();

The GetSummary method is written according to the PBB template: http://en.wikipedia.org/wiki/template:pbb_controls

I only want information about proteins, so I'm relying on this template.

    public string GetSummary(string page)
    {
        string res = "";
        // The introduction is in 2 parts:
        // 1st: between "{{pbb|geneid=1326}}" and &lt;!-- pbb_summary (.)* --&gt;
        string intro = "";
        // 2nd: between "summary_text =" and "=="
        // http://en.wikipedia.org/wiki/special:export/map3k8 is used as the example
        string summary = "";
        try
        {
            intro = page.Split(new string[] { "}}" }, StringSplitOptions.None)[1];
            intro = intro.Split(new string[] { "&lt;!--" }, StringSplitOptions.None)[0];
            intro = DeleteMediaWikiTag(intro);
        }
        catch (Exception)
        {
            intro = "";
        }
        try
        {
            summary += page.Split(new string[] { "summary_text =" }, StringSplitOptions.None)[1];
            summary = summary.Split(new string[] { "==" }, StringSplitOptions.None)[0];
            summary = DeleteMediaWikiTag(summary);
        }
        catch (Exception)
        {
            summary = "";
        }
        res = intro + "\n\n" + summary;
        return res;
    }

    public string DeleteMediaWikiTag(string text)
    {
        string res = "";
        // This one is working
        Regex reg = new Regex("{{.*(}})*|{{|}}|'''|&lt;!--.*--&gt;|]]|([[]){2}");
        res = reg.Replace(text, "");
        // I don't understand what is wrong with this regex
        Regex regPrime = new Regex("&lt(.)*(&gt;){1}");
        res = regPrime.Replace(res, "prime");
        return res;
    }

My problem is in the execution of DeleteMediaWikiTag(summary), because I'm losing the end of the summary part:

In mice, this gene is known as Tpl2 and is a tumor suppressor gene whose absence contributes to the development and progression of cancer.

Before it is handled by the regex, the text looks like this:

    &lt;ref name=&quot;entrez&quot; /&gt;     In mice, this gene is known as Tpl2 and is a tumor suppressor gene whose absence contributes to the development and progression of cancer.    &lt;ref&gt;{{cite web|last=decicco-skinner|first=kathleen|title=loss of tumor progression locus 2 (tpl2) enhances tumorigenesis and inflammation in two-stage skin carcinogenesis|url=http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3460638/}}&lt;/ref&gt;

So according to my regex, I'm expecting something like this ("prime" is only used to highlight the matches; in the end, I will delete whatever matches the regex):

   prime in  mice *.....* prime

But I get:

   prime

so "&lt(.)*(>){1}" matching whole part (the first &lt , last > i'm asking match 1 time pattern > more 1 time if take everything...

what wrong regex? did miss something? maybe better way parse this? (but none of parsers i've found convinced me)

p.s. parser works with: http://en.wikipedia.org/wiki/nfkb2 or http://en.wikipedia.org/wiki/apoa4 want more reliably.

I can't find an issue with the existing one. Both regular expressions are working fine. I recommend using an online regular expression tester before implementing them in code. Try this one out: http://gskinner.com/regexr/
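If the end of the summary still disappears, one thing worth checking is greedy matching. In .NET, `.*` is greedy, so `&lt(.)*(&gt;){1}` runs from the first `&lt` to the last `&gt;` on the line, and `{{.*(}})*` runs from the first `{{` to the end of the line. The lazy form `.*?` stops at the first closing delimiter instead. Below is a minimal sketch of that idea. The class `WikiTextCleaner`, the method `Clean` and the shortened sample string are hypothetical, not from the original post, and the template pattern does not handle nested `{{...}}`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiTextCleaner
{
    // Hypothetical helper patterns (not from the original post).
    // ".*?" is lazy: each match ends at the first closing delimiter.
    // Nested templates such as {{a|{{b}}}} are NOT handled by this sketch.
    private static readonly Regex Template =
        new Regex(@"\{\{.*?\}\}", RegexOptions.Singleline);
    private static readonly Regex EscapedTag =
        new Regex("&lt;.*?&gt;", RegexOptions.Singleline);

    public static string Clean(string text)
    {
        // Remove {{...}} templates first, then entity-escaped tags and comments.
        string withoutTemplates = Template.Replace(text, "");
        return EscapedTag.Replace(withoutTemplates, "").Trim();
    }

    public static void Main()
    {
        string sample =
            "&lt;ref name=&quot;entrez&quot; /&gt; In mice, this gene is known as Tpl2. "
            + "&lt;ref&gt;{{cite web|title=Example}}&lt;/ref&gt;";

        // Greedy pattern from the question: a single match swallows the sentence.
        Console.WriteLine("[" + Regex.Replace(sample, "&lt(.)*(&gt;){1}", "") + "]");
        // prints "[]"

        // Lazy patterns: only the tags and the template are removed.
        Console.WriteLine("[" + Clean(sample) + "]");
        // prints "[In mice, this gene is known as Tpl2.]"
    }
}
```

An alternative with the same effect for tags is a negated class such as `&lt;[^&]*&gt;`, but that fails on tags containing entities like `&quot;`, which is why the lazy quantifier is used here.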
