Posts Tagged ‘LFPW’

Script for downloading LFPW

November 23, 2012 1 comment

Today I needed to download (thus parse) the Labeled Face Parts in the Wild (LFPW). Although it seems an straight forward task, as the datasets have the urls, many of them are dead/broken or pointing to something else.

I thought it will take me a couple of minutes to put an script an have everything running smoothly. However, I found that the urls sometimes download html files, or garbage gifs. So I need to analyze the output and then choose some files and discard the others. And because I don’t want to do it manually, I put a script to achieve it.

My solution is in awk, probably it may be easier in pearl or python but I started with awk and stick with it.

To run it, you should be in the directory in which you want the images to end up.

awk '/average/ {
  result = ""
  download = ""
  name = ""
  cmd = "wget -t 1 -nv " $1 " 2>&1"
  while ( (cmd | getline line) > 0 )
    result = result " " line
  print "res: " result > "/dev/stderr"
  if (result != "" && match(result,/ERROR/) == 0) {
    downloaded = substr(result,RSTART+1,RLENGTH-2)
    print "file: " downloaded > "/dev/stderr"
    if (match(downloaded,/\.(htm|php|gif)/) != 0) {
      print "deleting: " downloaded > "/dev/stderr"
      system("rm \"" downloaded "\"")
    } else {
      if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)$/) == 0 ) {
        if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)/) != 0 )
          downloaded_fix = substr(downloaded,1,RSTART) substr(downloaded,RSTART+RLENGTH+1,length(downloaded))
          downloaded_fix = downloaded
        name = downloaded_fix ".jpg"
        print "adding ext: " downloaded_fix > "/dev/stderr"
        system("mv \"" downloaded "\" \"" name "\"")
      } else
      print "writting: " name > "/dev/stderr"
      print name, $0
  } else
    print "skipping: " name > "/dev/stderr"
  print "\n" > "/dev/stderr"
}' ../kbvt_lfpw_v1_train.csv > ../fixed_train.txt 2> ../train.log

Some comments about the code:

  • I’m processing only the “average” worker, and ignoring all the other (3) entries.
  • I choose to change the extension of the files that download without one to jpg (you may change it to something else if needed—line 23).
  • Also I’m deleting all the htm(l), php, and gifs that may be downloaded (line 14).
  • I’m checking the extensions directly, as the files that download have really random names (and I’m mean it: 86f6ec6e-9de5-11de-805f-588d52b6bd80, how is that an image name?… moving on), but you may add another ones if needed (I didn’t check thoroughly)—line 18.
  • When a downloaded file name already exist wget adds a suffix .# to the file. Thus, I’m stripping the possible extensions from each file, if it doesn’t end with a valid extension but contains such extension (line 20).
  • All the log messages (print ... > "/dev/stderr") are redirected to the standard error. That’s why at the end you get them with “2> ../<yourfile>” (line 28).
  • And to get the new list of files that were downloaded you redirect the standard output to your file by “> ../<fixed-file>” (line 34).

I haven’t checked all the files, but that is the first version. Any improvements are welcomed. :mrgreen:

%d bloggers like this: