Posts Tagged ‘How to’

Script for downloading LFPW

November 23, 2012

Today I needed to download (and thus parse) the Labeled Face Parts in the Wild (LFPW) dataset. Although it seems a straightforward task, since the dataset lists the image URLs, many of them are dead, broken, or pointing to something else.

I thought it would take me a couple of minutes to put together a script and have everything running smoothly. However, I found that the URLs sometimes download HTML files or garbage GIFs. So I need to analyze the output, keep some files, and discard the others. And because I don’t want to do it manually, I wrote a script to do it.

My solution is in awk. It would probably be easier in Perl or Python, but I started with awk and stuck with it.

To run it, you should be in the directory in which you want the images to end up.

awk '/average/ {
  result = ""
  downloaded = ""
  name = ""
  # Quote the URL so characters such as & or ? survive the shell.
  cmd = "wget -t 1 -nv \"" $1 "\" 2>&1"
  while ( (cmd | getline line) > 0 )
    result = result " " line
  close(cmd)
  print "res: " result > "/dev/stderr"
  # On success, wget -nv reports the saved file name between double quotes.
  if (result != "" && match(result,/ERROR/) == 0 && match(result,/"[^"]+"/) != 0) {
    downloaded = substr(result,RSTART+1,RLENGTH-2)
    print "file: " downloaded > "/dev/stderr"
    if (match(downloaded,/\.(htm|php|gif)/) != 0) {
      print "deleting: " downloaded > "/dev/stderr"
      system("rm \"" downloaded "\"")
    } else {
      if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)$/) == 0 ) {
        # Strip an image extension buried in the name (e.g. before a wget ".1" suffix).
        if( match(tolower(downloaded),/\.(jpg|jpeg|bmp|png)/) != 0 )
          downloaded_fix = substr(downloaded,1,RSTART) substr(downloaded,RSTART+RLENGTH+1,length(downloaded))
        else
          downloaded_fix = downloaded
        name = downloaded_fix ".jpg"
        print "adding ext: " downloaded_fix > "/dev/stderr"
        system("mv \"" downloaded "\" \"" name "\"")
      } else
        name = downloaded
      print "writing: " name > "/dev/stderr"
      print name, $0
    }
  } else
    print "skipping: " $1 > "/dev/stderr"
  print "\n" > "/dev/stderr"
}' ../kbvt_lfpw_v1_train.csv > ../fixed_train.txt 2> ../train.log

Some comments about the code:

  • I’m processing only the “average” worker rows and ignoring the other three entries for each image.
  • I chose to give files that download without an extension the jpg extension (you may change it to something else if needed, in the name = downloaded_fix ".jpg" assignment).
  • I’m also deleting all the htm(l), php, and gif files that may be downloaded (the /\.(htm|php|gif)/ check).
  • I’m checking the extensions directly, as the downloaded files have really random names (and I mean it: 86f6ec6e-9de5-11de-805f-588d52b6bd80, how is that an image name?… moving on), but you may add other extensions if needed (I didn’t check thoroughly) in the /\.(jpg|jpeg|bmp|png)$/ pattern.
  • When a downloaded file name already exists, wget adds a .# suffix to the file. Thus, if a name doesn’t end with a valid extension but contains one, I strip that extension from it (the nested match on /\.(jpg|jpeg|bmp|png)/).
  • All the log messages (print ... > "/dev/stderr") go to the standard error. That’s why at the end you collect them with “2> ../<yourfile>”.
  • And to get the new list of files that were downloaded, you redirect the standard output to your file with “> ../<fixed-file>”.
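
For anyone who would rather follow the rules above in Python, here is a minimal sketch of the same decisions. It is not a replacement for the awk script: it does not download anything, the function names saved_name and final_name are made up for this illustration, and it assumes that wget -nv reports the saved file between double quotes on success and prints ERROR on failure.

```python
import re

# Same patterns as in the awk script.
IMAGE_EXT = re.compile(r"\.(jpg|jpeg|bmp|png)", re.IGNORECASE)
IMAGE_EXT_AT_END = re.compile(r"\.(jpg|jpeg|bmp|png)$", re.IGNORECASE)
UNWANTED_EXT = re.compile(r"\.(htm|php|gif)")
QUOTED_NAME = re.compile(r'"([^"]+)"')


def saved_name(wget_output):
    """Return the file name reported by wget -nv, or None if the download failed."""
    if not wget_output or "ERROR" in wget_output:
        return None
    found = QUOTED_NAME.search(wget_output)
    return found.group(1) if found else None


def final_name(downloaded):
    """Return the name to keep the file under, or None if it should be deleted."""
    if UNWANTED_EXT.search(downloaded):
        return None
    if IMAGE_EXT_AT_END.search(downloaded):
        return downloaded
    found = IMAGE_EXT.search(downloaded)
    if found:
        # Drop the buried extension but keep what follows it (e.g. a wget ".1" suffix).
        downloaded = downloaded[:found.start() + 1] + downloaded[found.end() + 1:]
    return downloaded + ".jpg"
```

With these rules, final_name("face.jpg.1") gives "face.1.jpg", a name without any extension simply gets ".jpg" appended, and anything containing .htm, .php, or .gif comes back as None.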

I haven’t checked all the files, but that is the first version. Any improvements are welcome.


Compiling Poppler on Windows

July 14, 2011

I’ve been struggling to install Poppler under Windows, and there is not much information out there. The few people who claim that it works on Windows don’t say how they did it.

Thus, today I will try to guide you through making Poppler work on Windows with Qt. The goal of this tutorial is to compile the poppler_qt4viewer demo.

The things you will need

OK, let’s begin the journey. Before starting, you need to download some libraries.

You will need Qt; at the time of writing, the latest version is 4.7.4 (source zip, tar.gz). I recommend downloading the source, because the binaries gave me some problems. Then you need freetype, cairo, and zlib. However, those are not that easy to find for Windows, so you can download them from any site that maintains a build, like GTK+ or Inkscape. I got mine from the GTK+ site and used the developer versions. You can, however, use the version of your choice. Finally, the other library we need is openjpeg; you can download its source code and build it yourself.

Now that you have the libraries, you will need the tools to build them. So go and set up Visual Studio or another compiler for Windows; you will also need CMake.

Building Qt

Before getting into Poppler or the other libraries, you will need to build Qt. I prefer to build it instead of downloading the binaries, because the compiled version didn’t work for me 😦 . Detailed instructions exist elsewhere, so I will only give you the short story.

  1. First unzip the sources onto your hard drive; from here on I will call this path $QT_PATH=c:\QT.
  2. Then you need to add the path to the environment variables. Open the My Computer properties and, in the Advanced tab, look for Environment Variables. Then add the bin folder of your extracted Qt files ($QT_PATH\bin) to the PATH variable.
  3. Now open a console from Visual Studio, go to $QT_PATH, and execute configure and then nmake.
    cd $QT_PATH

    If you have any problems with this part, check the installation page.

Preparing the environment

After you get all the libraries (cairo, freetype, zlib, and openjpeg), you need to unzip them into a folder, which I will call $TOOLS from here on.

Then, for example, you will have $TOOLS\cairo-dev_1.10.2-1_win32 if you chose the development version of the cairo library. All the other libraries can go in this folder too. We will use them later when compiling Poppler. In my case I used the dev versions of all the libraries, and to make things easier I will refer to them as follows:

  • $CAIRO = $TOOLS\cairo-dev_1.10.2-1_win32
  • $FREETYPE = $TOOLS\freetype-dev_2.4.2-1_win32
  • $ZLIB = $TOOLS\zlib-dev_1.2.5-2_win32
  • $OPENJPEG = $TOOLS\openjpeg_v1_4_sources_r697
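
Before moving on, it can save a CMake round trip to confirm that the unzipped packages contain the files the Poppler step will point at. The following is only a sketch: the function name missing_files is made up for this illustration, and the relative paths are the ones used in the Poppler CMake settings further down in this post (openjpeg is left out because its library only exists after it is built).

```python
from pathlib import Path

# Relative paths taken from the CMake settings used in the Poppler step of this post.
EXPECTED = {
    "cairo-dev_1.10.2-1_win32": ["lib/cairo.lib"],
    "freetype-dev_2.4.2-1_win32": ["include/freetype2", "include", "lib/freetype.lib"],
    "zlib-dev_1.2.5-2_win32": ["lib/zdll.lib"],
}


def missing_files(tools_dir):
    """List the expected files or folders that are absent under tools_dir."""
    root = Path(tools_dir)
    return [
        str(Path(package) / relative)
        for package, relatives in EXPECTED.items()
        for relative in relatives
        if not (root / package / relative).exists()
    ]
```

Calling missing_files with your $TOOLS folder returns an empty list when everything is in place. If you picked other versions of the packages, adjust the folder names in EXPECTED accordingly.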

Building Openjpeg

Now we will build Openjpeg using CMake. Open the CMake GUI and choose the $OPENJPEG path as the source path; for the build binaries you need to create a folder, mine is $OPENJPEG\build. Then you just need to press Configure, select your version of Visual Studio, wait for it to finish, and then press Configure again. There should be no more red rows in the output. If there are, try to solve the problems they mention. After that, press Generate and the project should be in $OPENJPEG\build or wherever you put it.

Then open the ALL_BUILD project and compile it. The binaries will be in $OPENJPEG\build\bin.

Building Poppler Library

Note: Different target names for Debug and Release.
In case you need to give a postfix to the debug and release libraries, e.g., popplerd.lib for debug and poppler.lib for release, you need to add a line to the CMakeLists.txt (anywhere after the set(CMAKE_MODULE_PATH ... ) is OK). CMake’s standard variable for this is CMAKE_DEBUG_POSTFIX, so a line such as set(CMAKE_DEBUG_POSTFIX d) should do it.

Just change the d to the postfix you want to use, maybe _d.

Now that we have all the prerequisites, we are ready to build Poppler. Put the sources in a folder, which I will call $POPPLER, and create a folder for the build as in the openjpeg step; mine is $POPPLER\build.

Now open CMake, point the source path to $POPPLER and the build path to $POPPLER\build, and press Configure. Again, select the Visual Studio of your choice and wait for it to configure. An error message will appear regarding the freetype library; don’t worry, we are expecting it. You will see a lot of red rows in the output. Look for FREETYPE_INCLUDE_DIR_freetype2, FREETYPE_INCLUDE_DIR_ft2build, and FREETYPE_LIBRARY, and point them to:

  • FREETYPE_INCLUDE_DIR_freetype2: $FREETYPE\include\freetype2
  • FREETYPE_INCLUDE_DIR_ft2build: $FREETYPE\include
  • FREETYPE_LIBRARY: $FREETYPE\lib\freetype.lib

Then hit Configure again and wait. An error will occur again, and we will need to set the cairo, openjpeg, and zlib libraries. Look for the following variables and set them accordingly:

  • CAIRO_LIBRARY: $CAIRO\lib\cairo.lib
  • LIBOPENJPEG_LIBRARIES: $OPENJPEG\build\bin\Release\openjpeg.lib (this can be Release or Debug, depending on your build)

Also, you need to change the zlib variables. If you don’t see them, check the Advanced option.

  • ZLIB_LIBRARY: $ZLIB\lib\zdll.lib

Pay special attention to these variables if you have a LaTeX-related system on your computer, because CMake will pick up its binaries instead. However, if the include files are not there, Poppler won’t compile.

After you change those variables, you need to disable the options WITH_Iconv, WITH_PNG, and WITH_GLIB. Then press Configure again, check that there are no red rows, and press Generate. Go to $POPPLER\build, open ALL_BUILD, compile it, and that should be it for the Poppler library.
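
The same settings can also be passed to CMake on the command line as -D options instead of being typed into the GUI. The sketch below only assembles such a command from the variables named in this section; it is an assumption that this route behaves like the GUI on your machine, the function name poppler_cmake_args is made up for this illustration, and no generator is selected, so CMake will use its default.

```python
def poppler_cmake_args(poppler, freetype, cairo, zlib, openjpeg, config="Release"):
    """Build the argument list for configuring Poppler from the command line."""
    settings = {
        "FREETYPE_INCLUDE_DIR_freetype2": freetype + r"\include\freetype2",
        "FREETYPE_INCLUDE_DIR_ft2build": freetype + r"\include",
        "FREETYPE_LIBRARY": freetype + r"\lib\freetype.lib",
        "CAIRO_LIBRARY": cairo + r"\lib\cairo.lib",
        # config is "Release" or "Debug", matching how openjpeg was built.
        "LIBOPENJPEG_LIBRARIES": openjpeg + "\\build\\bin\\" + config + r"\openjpeg.lib",
        "ZLIB_LIBRARY": zlib + r"\lib\zdll.lib",
        "WITH_Iconv": "OFF",
        "WITH_PNG": "OFF",
        "WITH_GLIB": "OFF",
    }
    # The command is meant to be run from inside the build folder ($POPPLER\build).
    return ["cmake", poppler] + ["-D%s=%s" % item for item in settings.items()]
```

Printing " ".join(poppler_cmake_args(...)) with your own paths gives a line you can paste into the Visual Studio console after changing into $POPPLER\build.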

Executing the Demo

If you want to execute the demo and you didn’t add everything to your path, you will need some libraries in order to run it. The demo will be located at $POPPLER\build\qt4\demos\Debug or Release, according to your build settings. You will need the following libraries:

  • poppler-qt4.dll: $POPPLER\build\qt4\src\Debug
  • openjpeg.dll: $OPENJPEG\build\bin\Debug
  • QtCored4.dll, QtGuid4.dll, QtTestd4.dll, QtXmld4.dll: $QT_PATH\lib (the ‘d’ is needed if you are working with the Debug version; for the Release version, drop it)
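
One way to gather these files is to copy them next to the demo executable. Here is a small sketch of that, assuming you pass it the folders from the list above; the function name copy_runtime_dlls is made up for this illustration.

```python
import shutil
from pathlib import Path


def copy_runtime_dlls(demo_dir, sources):
    """Copy each listed DLL into demo_dir and return the names that were not found.

    sources maps a folder to the DLL names expected inside it.
    """
    demo = Path(demo_dir)
    missing = []
    for folder, names in sources.items():
        for name in names:
            dll = Path(folder) / name
            if dll.is_file():
                shutil.copy2(dll, demo / name)
            else:
                missing.append(name)
    return missing
```

For a Debug build you would call it with $POPPLER\build\qt4\demos\Debug as demo_dir and a dictionary such as {r"C:\QT\lib": ["QtCored4.dll", "QtGuid4.dll", "QtTestd4.dll", "QtXmld4.dll"]}, plus the poppler-qt4.dll and openjpeg.dll folders. Any name that comes back in the returned list still has to be located by hand.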

If you want to run your demo on another machine, you will need to add msvcpX.dll and msvcrX.dll, where X is the version number of your compiler (for a Debug build, the debug versions of those libraries, the ones that include the ‘d’). You can find those libraries on your system.

Fonts problems?

I encountered some problems with the demo. The application started and opened the PDFs, but no fonts appeared. After struggling a little bit, I found that I needed freetype6.dll, which I found in my installation of Inkscape. However, I’m not sure why the Debug version doesn’t ask for the library. Nevertheless, if you put that library next to your Release version, the fonts start appearing.


Up to here I was able to build the Poppler library and run the demo that comes with it. I will explore further to see the capabilities of the library and how I can incorporate it into my projects. I hope this helps you, so you won’t waste your time trying to do things that, now that I know how to do them, look easy but are not when there is not enough documentation around.

It is hard at the beginning, but if you struggle long enough it becomes easier.
