Script for downloading LFPW
Today I needed to download (and thus parse) the Labeled Face Parts in the Wild (LFPW) dataset. Although it seems a straightforward task, since the dataset lists the image URLs, many of them are dead, broken, or pointing to something else.
I thought it would take me a couple of minutes to put together a script and have everything running smoothly. However, I found that the URLs sometimes download html files or garbage gifs. So I need to analyze the output, keep some files, and discard the others. And because I don’t want to do it manually, I put together a script to do it.
My solution is in awk; it would probably be easier in perl or python, but I started with awk and stuck with it.
To run it, you should be in the directory in which you want the images to end up.
    awk '/average/ {
        # Reset per-row state.
        result = ""
        downloaded = ""
        name = ""
        # Single attempt (-t 1), terse output (-nv); the URL is quoted for the shell.
        cmd = "wget -t 1 -nv \"" $1 "\" 2>&1"
        while ( (cmd | getline line) > 0 )
            result = result " " line
        close(cmd)
        print "res: " result > "/dev/stderr"
        if (result != "" && match(result, /ERROR/) == 0) {
            # wget -nv reports the saved file name between double quotes.
            match(result, /".+"/)
            downloaded = substr(result, RSTART+1, RLENGTH-2)
            print "file: " downloaded > "/dev/stderr"
            if (match(downloaded, /\.(htm|php|gif)/) != 0) {
                print "deleting: " downloaded > "/dev/stderr"
                system("rm \"" downloaded "\"")
            } else {
                if (match(tolower(downloaded), /\.(jpg|jpeg|bmp|png)$/) == 0) {
                    # Drop an image extension buried in the name (e.g. before a wget ".1" suffix).
                    if (match(tolower(downloaded), /\.(jpg|jpeg|bmp|png)/) != 0)
                        downloaded_fix = substr(downloaded, 1, RSTART) substr(downloaded, RSTART+RLENGTH+1, length(downloaded))
                    else
                        downloaded_fix = downloaded
                    name = downloaded_fix ".jpg"
                    print "adding ext: " downloaded_fix > "/dev/stderr"
                    system("mv \"" downloaded "\" \"" name "\"")
                } else
                    name = downloaded
                print "writing: " name > "/dev/stderr"
                print name, $0
            }
        } else
            print "skipping: " $1 > "/dev/stderr"
        print "\n" > "/dev/stderr"
    }' ../kbvt_lfpw_v1_train.csv > ../fixed_train.txt 2> ../train.log
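Before committing to a full run, it may help to preview which URLs the /average/ pattern selects. The following is only a sketch: the sample rows are made up, preview_urls is a hypothetical helper, and the layout (URL in the first whitespace-separated column, worker name after it) is assumed from the way the script uses $1.

```shell
# Made-up sample in the layout the script assumes: URL first, worker name next.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
    'http://example.com/a.jpg' 'average'  '10.5' \
    'http://example.com/a.jpg' 'worker_1' '10.2' \
    'http://example.com/b'     'average'  '33.0' > "$sample"

# Same pattern as the download script, but it only prints the URL
# instead of calling wget.
preview_urls() {
    awk '/average/ { print $1 }' "$1"
}

preview_urls "$sample"
```

On the real data, preview_urls ../kbvt_lfpw_v1_train.csv | wc -l would give the number of downloads the script is going to attempt.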
Some comments about the download script:
- I’m processing only the “average” worker, and ignoring all the other (3) entries.
- I chose to change the extension of the files that download without one to jpg (you may change it to something else if needed, in the assignment name = downloaded_fix ".jpg").
- I’m also deleting all the htm(l), php, and gif files that may be downloaded (the check against htm|php|gif).
- I’m checking the extensions directly, as the files that download have really random names (and I mean it: 86f6ec6e-9de5-11de-805f-588d52b6bd80, how is that an image name?… moving on), but you may add other ones to the jpg|jpeg|bmp|png pattern if needed (I didn’t check thoroughly).
- When a downloaded file name already exists, wget adds a .# suffix to the file. Thus, if a name doesn’t end with a valid extension but contains one, I strip that extension from it (the assignment to downloaded_fix).
- All the log messages (print ... > "/dev/stderr") are redirected to the standard error. That’s why at the end you capture them with “2> ../<yourfile>”.
- And to get the new list of files that were downloaded, you redirect the standard output to your file with “> ../<fixed-file>”.
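The renaming rules from the list above can be tried without downloading anything. This is a minimal sketch, not part of the original script: fix_name is a hypothetical helper that replays the same match calls on a file name and prints the name the script would end up with, or DELETE (a made-up marker) when the file would be removed.

```shell
# Hypothetical helper: nothing is downloaded, renamed or deleted here.
fix_name() {
    printf '%s\n' "$1" | awk '{
        downloaded = $0
        if (match(downloaded, /\.(htm|php|gif)/) != 0) {
            print "DELETE"
        } else if (match(tolower(downloaded), /\.(jpg|jpeg|bmp|png)$/) != 0) {
            # already ends in an image extension: keep the name
            print downloaded
        } else {
            # cut a buried extension, keep what follows it (e.g. the wget counter)
            if (match(tolower(downloaded), /\.(jpg|jpeg|bmp|png)/) != 0)
                fixed = substr(downloaded, 1, RSTART) substr(downloaded, RSTART + RLENGTH + 1)
            else
                fixed = downloaded
            print fixed ".jpg"
        }
    }'
}

fix_name "photo.JPG"
fix_name "photo.jpg.1"
fix_name "86f6ec6e-9de5-11de-805f-588d52b6bd80"
fix_name "index.html"
```

The four calls print photo.JPG, photo.1.jpg, 86f6ec6e-9de5-11de-805f-588d52b6bd80.jpg and DELETE, which shows how a wget suffix such as .1 survives while the extension moves to the end of the name.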
I haven’t checked all the files, but that is the first version. Any improvements are welcome.
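One possible improvement, since the script trusts file names only, is to look at the first bytes of each downloaded file. This is a sketch of a different technique than the one used above, not something the script does: sniff_image is a hypothetical helper, and it assumes head -c and od are available. It recognizes the JPEG (ff d8 ff), PNG (89 50 4e 47) and BMP (42 4d) signatures and reports anything else as unknown.

```shell
# Hypothetical helper: prints jpeg, png, bmp or unknown from the leading bytes.
sniff_image() {
    # First four bytes as a lowercase hex string without separators.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        ffd8ff*)  echo jpeg ;;
        89504e47) echo png ;;
        424d*)    echo bmp ;;
        *)        echo unknown ;;
    esac
}

# Report every regular file in the current directory that is not a known image.
for f in *; do
    [ -f "$f" ] || continue
    if [ "$(sniff_image "$f")" = unknown ]; then
        echo "suspicious: $f" >&2
    fi
done
```

Run from the directory with the images, this lists html pages or other garbage that slipped through with an image-like name; it only reports and never deletes.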
Here is my attempt at doing something similar in python:
https://bitbucket.org/rodrigob/lfpw_download