How to index Apache Nutch fetched content into Solr without parsing
I need to index the fetched content crawled by Nutch into Solr. The Solr job in Nutch indexes only the parsed content, but I need the content with its HTML tags. Can anyone guide me on this?
Thanks, Sudh
Nutch has a series of parsers and filters that extract content from the fetched HTML.
You need to implement an HtmlParseFilter, write the raw content to a metatag, and insert it into a Solr field.
The tutorial on writing an indexing filter follows the same flow.
Your class should implement "HtmlParseFilter" instead of "IndexingFilter". Override the filter() method:
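    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Parse metadata of the page currently being filtered
        Metadata metadata = parseResult.get(content.getUrl()).getData().getParseMeta();
        // Raw fetched bytes, HTML tags included
        byte[] rawContent = content.getContent();
        // StandardCharsets (java.nio.charset) avoids the checked UnsupportedEncodingException
        String str = new String(rawContent, StandardCharsets.UTF_8);
        metadata.add("rawcontent", str);
        return parseResult;
    }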
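The filter above decodes the raw bytes as UTF-8 unconditionally, which can garble pages served in another encoding. As a minimal sketch, assuming you can get the page's charset name from somewhere (how to obtain it is left open here), a small helper could use that charset and fall back to UTF-8 only when the name is missing or unsupported. RawContentDecoder and decode are hypothetical names, not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper (not a Nutch class): turns fetched bytes into a String. */
final class RawContentDecoder {

    /**
     * Decodes raw bytes with the named charset. Falls back to UTF-8 when
     * the name is null, empty, malformed, or unsupported on this JVM.
     */
    static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null && !charsetName.isEmpty()) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed name: keep the UTF-8 default
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Simulate a page that was served as ISO-8859-1
        byte[] page = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(page, "ISO-8859-1"));
    }
}
```

Inside filter(), the call new String(rawContent, StandardCharsets.UTF_8) could then be swapped for RawContentDecoder.decode(rawContent, charsetName), if a charset name is available.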
After that, change schema.xml and add a new field:
<field name="metatag.rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>
Compile, deploy, re-crawl, and re-index.
You should see the raw HTML content in the Solr index.
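One way to confirm this is to ask Solr for documents that have a value in the new field. The sketch below only builds the query URL and does not send it. It assumes the field name from the schema entry above, a unique key field called id, and a Solr base URL such as http://localhost:8983/solr, all of which may differ in your setup. RawContentQuery and selectUrl are hypothetical names, and the Charset overload of URLEncoder.encode needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds a Solr select URL for checking the raw content field. */
final class RawContentQuery {

    /**
     * Returns a select URL matching documents that have any value in
     * the given field, returning only "id" and that field.
     */
    static String selectUrl(String solrBase, String field, int rows) {
        // Drop a trailing slash so the path is not doubled
        String base = solrBase.endsWith("/")
                ? solrBase.substring(0, solrBase.length() - 1)
                : solrBase;
        // field:[* TO *] matches documents where the field has a value
        String q = URLEncoder.encode(field + ":[* TO *]", StandardCharsets.UTF_8);
        String fl = URLEncoder.encode("id," + field, StandardCharsets.UTF_8);
        return base + "/select?q=" + q + "&fl=" + fl + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // Base URL is an assumption; adjust host, port, and core as needed
        System.out.println(selectUrl("http://localhost:8983/solr", "metatag.rawcontent", 1));
    }
}
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return a document whose field contains the page markup if the indexing worked.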
Note:
Make sure you have enabled the metatags plugins. This is important because the raw content is stored as metadata.