hadoop - Pig MapReduce job to place values within the proper range
I have a list of values in one data source, and a second dataset that contains ranges, each tied to a value.
Dataset 1:
3
4
6
20
25
38

Dataset 2:
1|3|a
4|10|b
11|20|c
21|30|d
31|31|e
32|38|f
39|40|g

Result:
3,a
4,b
6,b
20,c
25,d
38,f
I'd like to create a type of "join" that ties each value in dataset 1 to the character of the range it falls into in dataset 2.
If either of Donald Miner's suggestions works fast enough, I'd go with those. But to make it faster: if dataset 2 has only 250k-500k entries, you should be able to fit the entire thing in memory. Therefore you could:

1. Write a UDF that stores dataset 2 in memory (see getCacheFiles for how to put an HDFS file in the DistributedCache).
2. Write an EvalFunc that takes a single item of dataset 1, binary searches for its location in dataset 2, and returns the answer you want.
answer = FOREACH dataset1 GENERATE mybinarysearchudf(number) AS myresult:tuple(originalnumber:int, dataset2id:chararray);
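
The lookup inside such a UDF could look roughly like the sketch below. It is plain Java with no Pig dependency: the class name RangeLookup and its methods are hypothetical, and wrapping it in an EvalFunc and loading the lines from the DistributedCache are left out. It assumes the ranges in dataset 2 are sorted by start value and do not overlap.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: holds dataset 2 in memory and finds the range containing a value.
final class RangeLookup {
    private final int[] starts;
    private final int[] ends;
    private final String[] labels;

    // Each line has the form "start|end|label", e.g. "4|10|b".
    // Assumption: lines are sorted by start and the ranges do not overlap.
    RangeLookup(List<String> lines) {
        int n = lines.size();
        starts = new int[n];
        ends = new int[n];
        labels = new String[n];
        for (int i = 0; i < n; i++) {
            String[] parts = lines.get(i).split("\\|");
            starts[i] = Integer.parseInt(parts[0].trim());
            ends[i] = Integer.parseInt(parts[1].trim());
            labels[i] = parts[2].trim();
        }
    }

    // Returns the label of the range containing value, or null if no range contains it.
    String lookup(int value) {
        int idx = Arrays.binarySearch(starts, value);
        if (idx < 0) {
            // Not an exact start: binarySearch returns -(insertionPoint) - 1,
            // so the last range starting at or before value is insertionPoint - 1.
            idx = -idx - 2;
        }
        if (idx < 0 || value > ends[idx]) {
            return null; // value lies before the first range or in a gap
        }
        return labels[idx];
    }
}
```

Built from the seven lines of dataset 2 in the question, lookup(6) falls into the 4|10|b range and lookup(38) into 32|38|f. Each lookup is O(log n), so a few hundred thousand ranges stay cheap per input record.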