Possible alternatives to speed up reads from a text file in C?


I am working on a machine learning application where the features are stored in huge text files. The way I have implemented the data input reads is way too slow to be practical. Each line of the text file represents a feature vector in sparse format. For instance, the following example contains 3 feature vectors in index:value fashion.

1:0.34 2:0.67 6:0.99 12:2.1 28:2.1
2:0.12 22:0.27 26:9.8 69:1.8
3:0.24 4:67.0 7:1.9 13:8.1 18:1.7 32:3.4

The following is how I am making the reads now. Since I don't know the length of the feature string beforehand, I read with a suitably large length that upper-bounds the length of each string. Once I have read a line from the file, I use the strtok_r function to split the string into key:value pairs and then process them further to store the result in a sparse array. Any ideas on how to speed this up are highly appreciated.

FILE *fp = fopen(feature_file, "r");
int fvec_length = 0;
char line[1000000];
size_t ln;
char *pair, *single, *brkt, *brkb;
WORD *words = NULL;

SVECTOR **fvecs = (SVECTOR **)malloc(n_fvecs * sizeof(SVECTOR *));
if (!fvecs) die("memory error.");

int i = 0;
int j = 0;

while (fgets(line, 1000000, fp)) {
    /* strip the trailing newline, if any */
    ln = strlen(line) - 1;
    if (line[ln] == '\n')
        line[ln] = '\0';

    fvec_length = 0;
    for (pair = strtok_r(line, " ", &brkt); pair; pair = strtok_r(NULL, " ", &brkt)) {
        fvec_length++;
        words = (WORD *)realloc(words, fvec_length * sizeof(WORD));
        if (!words) die("memory error.");
        j = 0;
        for (single = strtok_r(pair, ":", &brkb); single; single = strtok_r(NULL, ":", &brkb)) {
            if (j == 0)
                words[fvec_length - 1].wnum = atoi(single);
            else
                words[fvec_length - 1].weight = atof(single);
            j++;
        }
    }

    /* terminate the vector with a sentinel (wnum == 0) entry */
    fvec_length++;
    words = (WORD *)realloc(words, fvec_length * sizeof(WORD));
    if (!words) die("memory error.");
    words[fvec_length - 1].wnum = 0;
    words[fvec_length - 1].weight = 0.0;

    fvecs[i] = create_svector(words, "", 1);
    free(words);
    words = NULL;
    i++;
}
fclose(fp);
return fvecs;

  1. You should absolutely reduce the number of memory allocations. The classic approach is to double the vector's capacity on each reallocation, which gives a logarithmic number of allocation calls rather than a linear one (see the sketch after this list).

  2. Since the line pattern seems constant, there's no need to tokenize by hand; use a single sscanf() on each loaded line to scan the line's words directly (also shown in the sketch below).

  3. Your line buffer seems extremely large. This can cost you a blown stack and worsen cache locality a bit (a getline()-based alternative is sketched after the first example).
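
To make points 1 and 2 concrete, here is a minimal sketch of a line parser that combines both suggestions. It assumes the WORD layout used in the question; parse_line, the field names, and the starting capacity are illustrative choices, not part of the original code.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int wnum;       /* feature index (assumed from the question's usage) */
    double weight;  /* feature value */
} WORD;

/* Parse one "idx:val idx:val ..." line; returns a malloc'd array and
   sets *out_len, or returns NULL on allocation failure. */
static WORD *parse_line(const char *line, int *out_len)
{
    size_t cap = 16, len = 0;        /* start small ... */
    WORD *words = malloc(cap * sizeof *words);
    if (!words) return NULL;

    int wnum, consumed;
    double weight;
    const char *p = line;

    /* sscanf does the tokenizing: %n reports how many characters were
       consumed, so we can advance p past each "idx:val" pair. */
    while (sscanf(p, " %d:%lf%n", &wnum, &weight, &consumed) == 2) {
        if (len == cap) {            /* ... and double on demand, so the
                                        number of reallocs is O(log n) */
            WORD *tmp = realloc(words, 2 * cap * sizeof *words);
            if (!tmp) { free(words); return NULL; }
            words = tmp;
            cap *= 2;
        }
        words[len].wnum = wnum;
        words[len].weight = weight;
        len++;
        p += consumed;
    }
    *out_len = (int)len;
    return words;
}

A further refinement would be to keep one words buffer alive across all lines and only ever grow it, so steady-state parsing performs no allocations at all.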
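
For point 3, one way to avoid the 1 MB stack array is to let getline() manage a growable heap buffer. This is a sketch assuming a POSIX system, since getline() is POSIX rather than ISO C:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

void read_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return;

    char *line = NULL;   /* getline allocates and grows this as needed */
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';   /* strip the trailing newline */
        /* ... parse the line here ... */
    }

    free(line);
    fclose(fp);
}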

