Tuesday, January 7, 2014

Example: data preprocessing with BASH

Case situation


I ran some batch jobs on a cluster to process data files for different systems (msc, ms, sh, rd) and parameters (i and w). The data files are in one subdirectory per system, and the parameter values are encoded in the file names:
[cjj@gust pattern]$ ls d-*/*.spd
d-msc/i275w042526.spd  d-ms/i285w025017.spd  d-rd/i295w042812.spd
d-msc/i280w040241.spd  d-ms/i290w023034.spd  d-sh/i275w051138.spd
d-msc/i285w036791.spd  d-ms/i295w020787.spd  d-sh/i280w047315.spd
d-msc/i290w031925.spd  d-rd/i270w065151.spd  d-sh/i285w043415.spd
d-msc/i295w026791.spd  d-rd/i275w060475.spd  d-sh/i290w039589.spd
d-ms/i270w034433.spd   d-rd/i280w055777.spd  d-sh/i295w035791.spd
d-ms/i275w030644.spd   d-rd/i285w051257.spd
d-ms/i280w027133.spd   d-rd/i290w046948.spd
[cjj@gust pattern]$
The output log files, on the other hand, are in the current directory:
[cjj@gust pattern]$ ls *.o*
i270w034433.spd.o172489  i275w060475.spd.o172496  i285w036791.spd.o172486
i270w065151.spd.o172495  i280w027133.spd.o172491  i290w023034.spd.o172493
i275w030644.spd.o172490  i280w040241.spd.o172485  i290w031925.spd.o172487
i275w042526.spd.o172484  i285w025017.spd.o172492  i295w026791.spd.o172488
[cjj@gust pattern]$
The format of the output log files is as follows:
[cjj@gust pattern]$ cat i270w034433.spd.o172489
MinTemplateNumber =  3
JT =  5
JN =  1
spikeResolution =  2
Number of initial spike patterns have been found : 562
ans = Creating surrogate data
ans = Creating time jittering surrogate data
ans = Creating neuron jittering surrogate data
Number of spike patterns have been valid by checking with sorrogate : 542
Number of spike patterns have been ruled out because of having less complex : 205
Number of valid spike patterns have been found : 337
[cjj@gust pattern]$

Problem task

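Gather the stats from the log files, that is, the numbers after the colons (marked in red in the listing above), into a single table together with the system name and the i and w values.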
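
Before looping over all files, it helps to see the extraction for a single log. The following is only a sketch: it recreates the sample log shown above in a scratch directory (the names tmpdir, log and counts are chosen here for illustration) and keeps the last field of every line that contains a colon.

```bash
#!/usr/bin/env bash
# Recreate the sample log from the listing above in a scratch directory.
tmpdir=$(mktemp -d)
log="$tmpdir/i270w034433.spd.o172489"
cat > "$log" <<'EOF'
MinTemplateNumber =  3
JT =  5
JN =  1
spikeResolution =  2
Number of initial spike patterns have been found : 562
ans = Creating surrogate data
ans = Creating time jittering surrogate data
ans = Creating neuron jittering surrogate data
Number of spike patterns have been valid by checking with sorrogate : 542
Number of spike patterns have been ruled out because of having less complex : 205
Number of valid spike patterns have been found : 337
EOF

# Only the four "Number of ..." lines contain a colon; print their last
# whitespace-separated field and join the results with single spaces.
counts=$(awk '/:/ {print $NF}' "$log" | paste -sd' ' -)
echo "$counts"   # prints: 562 542 205 337
```

This relies on the colon appearing only on the lines of interest, which holds for the log shown here but is an assumption for other logs.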

Solution 1

This is done with a one-liner:
[cjj@gust pattern]$ for i in d-*/*.spd;do n=${i%/*};n=${n#d-};s=${i#*/};if [ -f ${s}.o* ];then w=${s%.spd};w=${w#*w}; echo ${n} ${s:1:2}.${s:3:1} $((1${w:0:2}-100)).${w:2} `grep ':' ${s}.o* | awk '{print $NF}'`;fi;done > matching_stat.txt
which can be broken down into:
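for i in d-*/*.spd;do
  n=${i%/*}       # directory part of the path, e.g. d-msc
  n=${n#d-}       # system name, e.g. msc
  s=${i#*/}       # file name, e.g. i275w042526.spd
  if [ -f ${s}.o* ];then   # continue only if a log exists (assumes at most one match)
    w=${s%.spd}   # strip the extension: i275w042526
    w=${w#*w}     # keep the digits after "w": 042526
    # i: 275 -> 27.5; w: 042526 -> 4.2526 (1xx-100 avoids reading "04" as octal)
    echo ${n} ${s:1:2}.${s:3:1} $((1${w:0:2}-100)).${w:2} `grep ':' ${s}.o* | awk '{print $NF}'`
  fi
done > matching_stat.txt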
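
To make the parameter expansions in the loop easier to follow, here is a small sketch that applies them to one sample path from the listing above and prints the intermediate results. The variable names i_val, w_val and w_alt are introduced only for this illustration, and the 10#... form is an alternative way (not the one used in the loop) to force base-10 arithmetic.

```bash
#!/usr/bin/env bash
i="d-msc/i275w042526.spd"          # sample path from the listing above

n=${i%/*}                          # drop shortest "/*" suffix   -> d-msc
n=${n#d-}                          # drop "d-" prefix            -> msc
s=${i#*/}                          # drop shortest "*/" prefix   -> i275w042526.spd
w=${s%.spd}                        # drop ".spd" suffix          -> i275w042526
w=${w#*w}                          # drop everything up to "w"   -> 042526

i_val="${s:1:2}.${s:3:1}"          # characters 1-2, ".", character 3 -> 27.5
w_val="$((1${w:0:2}-100)).${w:2}"  # 104-100 = 4, then ".2526"        -> 4.2526
w_alt="$((10#${w:0:2})).${w:2}"    # same value via an explicit base-10 prefix

echo "$n $i_val $w_val"            # prints: msc 27.5 4.2526
```

Without the 1xx-100 trick (or the 10# prefix), a leading zero makes Bash treat the number as octal, which fails for values such as 08 or 09.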
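The generated data file has one row per data file that has a log; the columns are the system, i, w, and the four counts from the log: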
[cjj@gust pattern]$ cat matching_stat.txt 
msc 27.5 4.2526 81 75 22 53
msc 28.0 4.0241 237 217 103 114
msc 28.5 3.6791 393 371 156 215
msc 29.0 3.1925 335 322 132 190
msc 29.5 2.6791 445 437 144 293
ms 27.0 3.4433 562 542 205 337
ms 27.5 3.0644 1037 1006 331 675
ms 28.0 2.7133 1141 1093 341 752
ms 28.5 2.5017 1325 1274 462 812
ms 29.0 2.3034 1652 1609 747 862
rd 27.0 6.5151 1031 953 313 640
rd 27.5 6.0475 1042 963 345 618
[cjj@gust pattern]$
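
Data files without a log, such as those under d-sh, are skipped silently by the loop, and the [ -f ${s}.o* ] test prints an error if a data file happens to have more than one log. As a possible variation (a sketch, not the method used above), the hypothetical function collect_stats below collects the matching logs into an array and reports anything other than exactly one match. The demo at the end runs in a scratch directory with an abbreviated copy of one log, not the real data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one row per data file that has exactly one log.
# The body runs in a subshell, so the nullglob setting does not leak out.
collect_stats() (
  shopt -s nullglob                  # unmatched globs expand to nothing
  for f in d-*/*.spd; do
    n=${f%/*}; n=${n#d-}             # system name
    s=${f#*/}                        # file name
    logs=( "$s".o* )                 # all logs belonging to this data file
    if (( ${#logs[@]} != 1 )); then
      echo "skipping $f: found ${#logs[@]} log file(s)" >&2
      continue
    fi
    w=${s%.spd}; w=${w#*w}           # digits after "w"
    # The unquoted $(...) is intentional: it joins the counts on one line.
    echo "$n ${s:1:2}.${s:3:1} $((10#${w:0:2})).${w:2}" \
         $(awk '/:/ {print $NF}' "${logs[0]}")
  done
)

# Demo in a scratch directory (abbreviated log, for illustration only).
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir d-ms d-sh
touch d-ms/i270w034433.spd d-sh/i275w051138.spd
printf '%s\n' \
  'JT =  5' \
  'Number of initial spike patterns have been found : 562' \
  'Number of valid spike patterns have been found : 337' \
  > i270w034433.spd.o172489

collect_stats > matching_stat.txt    # the d-sh file is reported on stderr
cat matching_stat.txt                # prints: ms 27.0 3.4433 562 337
```

Writing the skipped files to stderr keeps matching_stat.txt clean while still showing which jobs have not produced a log yet.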