How to index Apache Nutch fetched content into Solr without parsing


I need to index the fetched content crawled by Nutch into Solr. The Solr job in Nutch indexes only the parsed content, but I need the content with its HTML tags. Can anyone guide me on this?

Thanks, Sudh

Nutch has a series of parsers and filters that extract content from the fetched HTML.

You need to implement an HtmlParseFilter, write the raw content to a metatag, and insert it into a Solr field.

The tutorial below, which covers an indexing filter, follows the same flow.

Nutch plugin

Your class should implement "HtmlParseFilter" instead of "IndexingFilter". Override the filter() method:

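@Override
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
    // Parse metadata of the page that was just fetched.
    Metadata metadata = parseResult.get(content.getUrl()).getData().getParseMeta();
    // Raw bytes as fetched, HTML tags included.
    byte[] rawContent = content.getContent();
    // StandardCharsets (java.nio.charset) avoids the checked UnsupportedEncodingException.
    String str = new String(rawContent, StandardCharsets.UTF_8);
    metadata.add("rawcontent", str);
    return parseResult;
}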
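The byte-to-string step inside filter() can be tried on its own, without a Nutch installation. The sketch below is only an illustration: RawContentDecoder and decode are hypothetical names, not part of Nutch, and it assumes the fetched page is UTF-8 encoded. Passing StandardCharsets.UTF_8 rather than the charset name as a string means no checked UnsupportedEncodingException has to be handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: mirrors the decoding step in filter().
final class RawContentDecoder {

    // Decodes the fetched bytes as UTF-8 and leaves all markup untouched.
    // Assumption: the page is UTF-8; malformed byte sequences are replaced with U+FFFD.
    static String decode(byte[] rawContent) {
        return new String(rawContent, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the byte[] returned by content.getContent().
        byte[] fetched = "<html><body><p>caf\u00e9</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(fetched));
    }
}
```

If the crawled pages use other encodings, the charset would have to be detected per page; that is outside the scope of this sketch.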

After that, change schema.xml and add a new field:

<field name="metatag.rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>

Compile, deploy, re-crawl, and re-index.

You should see the raw HTML content in the Solr index.

Note:

Make sure you have enabled the metatags plugins. This is important because you are storing rawcontent as metadata.

