Home > Explanations, Research, Tips > Script for downloading LFPW

Script for downloading LFPW

November 23, 2012 Leave a comment Go to comments

Today I needed to download (thus parse) the Labeled Face Parts in the Wild (LFPW). Although it seems an straight forward task, as the datasets have the urls, many of them are dead/broken or pointing to something else.

I thought it will take me a couple of minutes to put an script an have everything running smoothly. However, I found that the urls sometimes download html files, or garbage gifs. So I need to analyze the output and then choose some files and discard the others. And because I don’t want to do it manually, I put a script to achieve it.

My solution is in awk, probably it may be easier in pearl or python but I started with awk and stick with it.

To run it, you should be in the directory in which you want the images to end up.

awk '/average/ {
  result = ""
  download = ""
  name = ""
  cmd = "wget -t 1 -nv " $1 " 2>&1"
  while ( (cmd | getline line) > 0 )
    result = result " " line
  print "res: " result > "/dev/stderr"
  if (result != "" && match(result,/ERROR/) == 0) {
    downloaded = substr(result,RSTART+1,RLENGTH-2)
    print "file: " downloaded > "/dev/stderr"
    if (match(downloaded,/\.(htm|php|gif)/) != 0) {
      print "deleting: " downloaded > "/dev/stderr"
      system("rm \"" downloaded "\"")
    } else {
      if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)$/) == 0 ) {
        if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)/) != 0 )
          downloaded_fix = substr(downloaded,1,RSTART) substr(downloaded,RSTART+RLENGTH+1,length(downloaded))
          downloaded_fix = downloaded
        name = downloaded_fix ".jpg"
        print "adding ext: " downloaded_fix > "/dev/stderr"
        system("mv \"" downloaded "\" \"" name "\"")
      } else
      print "writting: " name > "/dev/stderr"
      print name, $0
  } else
    print "skipping: " name > "/dev/stderr"
  print "\n" > "/dev/stderr"
}' ../kbvt_lfpw_v1_train.csv > ../fixed_train.txt 2> ../train.log

Some comments about the code:

  • I’m processing only the “average” worker, and ignoring all the other (3) entries.
  • I choose to change the extension of the files that download without one to jpg (you may change it to something else if needed—line 23).
  • Also I’m deleting all the htm(l), php, and gifs that may be downloaded (line 14).
  • I’m checking the extensions directly, as the files that download have really random names (and I’m mean it: 86f6ec6e-9de5-11de-805f-588d52b6bd80, how is that an image name?… moving on), but you may add another ones if needed (I didn’t check thoroughly)—line 18.
  • When a downloaded file name already exist wget adds a suffix .# to the file. Thus, I’m stripping the possible extensions from each file, if it doesn’t end with a valid extension but contains such extension (line 20).
  • All the log messages (print ... > "/dev/stderr") are redirected to the standard error. That’s why at the end you get them with “2> ../<yourfile>” (line 28).
  • And to get the new list of files that were downloaded you redirect the standard output to your file by “> ../<fixed-file>” (line 34).

I haven’t checked all the files, but that is the first version. Any improvements are welcomed. :mrgreen:

  1. September 19, 2013 at 10:23 am

    Here is my attempt of doing something similar in python

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: