openpyxl: my python xlsx library

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

Update: openpyxl 1.0 is now out !

At a customer, we read a lot of Excel files. We’ve tried the conventional approaches, that are xlrd and xlwt, pyinex, and COM automation.

That’s COM that we mainly use, because it’s able to deal with every Excel file format, from the ancient Excel 5 to most recent Excel 2007 Office Open XML format.
However, we experience from time to time stability issues (Excel is a complex beast, sometimes you don’t fully understand why it is angry).

We then looked for a native reader for .xlsx format, to get rid of the Excel part of the equation, but unfortunately, there are only two small read-only libraries for now:

Finally, I thought that I was the only guy who needed a native .xslx writer, and decided to stick with COM for now.
I wouldn’t be doing this project now without a tweet from Tarek Ziadé, who was also looking for such a library. That meant that we were at least two in need for the same thing, so I simply decided to write it.

Trust me, the Office Open XML format is open, but it’s also a bit twisted, so I spent a few days gathering documentation, and I finally landed on the PHPExcel library, that was already doing what I needed, but in PHP.

So now, I’m busy porting the PHPExcel library under Python, which is really easy, because of the similarities between both languages, but I can also benefit from all the nice things that come with Python, so the code is much simpler.

You can follow my progress on bitbucket: http://bitbucket.org/ericgazoni/openpyxl/

Advertisements

My Python environment

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

The early days

When I first heard about Python, it was just after the 2.5 release. I heard that one of my customer was using it but I had never seen a line of Python yet. At some point in a project, I needed a Bash script equivalent on Windows, and decided to give Python a try, instead of using Windows BAT files.

I installed it on my workstation, and started reading the (excellent) documentation about the tasks I needed to do.

I’ve used IDLE at first, because, well, it’s shipped with the Windows Python distribution. It has been a very unpleasant experience I must say (although I have learned to appreciate some features IDLE has that are missing in other editors). TKinter is a summary toolkit, the look and feel makes it look like it was written in the 80’s (in fact, it probably was). The main concept of editor/runner mix-in felt also a bit weird at first. I finally returned to my all-time-favorite editor, Notepad++, and ran my scripts from the command line.

Eclipse and Pydev

Later, I (luckily) landed on a new project, and it included a lot of Python. The team in place was using an editor they were not yet familiar with, but hopefully, I already knew quite well: Eclipse. I must say the Pydev extension for Eclipse is one of the greatest blessing you can get when working with Python. It features a lot of interesting features from IDLE:

  • syntax highlighting
  • code completion
  • real-time code inspection (very useful when using a dynamic language)
  • PyLint integration (once you’ve tasted it, you can never work without anymore)
  • smart indentations (you almost don’t have to worry about your indents)
  • unittest integration (although I’ve stopped using it)

Eclipse is an excellent product on its own, I already used it for PHP development for several years and I was happy it was also my customer choice for Python.

However, there was still some occasions Eclipse was not the right tool to use:

  • the integrated console implements the basic shell only
  • executing scripts outside the project path is hard, as well as changing the current directory
  • for 10 line scripts, creating a project is a bit overkill

The revelation: IPython

After a couple of months struggling with Eclipse and Notepad++ & python.exe, I discovered IPython and at last found a way to work on small scripts without the overhead of Eclipse, but with all its interesting features:

  • serves as well as a shell replacement as Python interpreter
  • excellent autocompletion (both for paths and Python code)
  • “magic” functions such as bookmarks, list of currently defined variables (“whos” command)
  • PDB (Python Debugger) integration with IPDB, providing code completion and history to PDB
  • post-mortem debugger (“debug” command after a traceback)
  • quick access to docstrings and source code of almost every library

IPython is packaged inside the Python(x,y) distribution with the Console application, which is a kind of command line emulator for Windows. Once configured with a readable setup, it’s probably the best development environment you can find of on Windows.

Continue reading

Using fake data to test your software

A little story

A couple of years ago, I was sitting at my desk at a customer office. The guy next to me was demonstrating an application built by another company to his colleague. The tool was built with an obscure WYSIWYG IDE that packaged its own framework and language. The point of using a GUI only framework for an important accounting application might sound a dangerous idea, especially when this language turns really inefficient.

During the demo, the guy opened a couple of customer accounts, scrolled down list of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny, the colors were elegant, window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

It might be hard to spot when you are not a professional developer, but I can’t imagine how scrolling down on the 20 customers present in the demo database could be so irregular. Every 5 items, the application froze for a split second, before resuming. Not something the final user will notice, as it’s not part of his job to have a responsive interface. His job is to work with the data shown by the interface.

A few month later, I came back to the customer, and by this time, the roll out phase of the accounting application had begun. The other company came to migrate the existing accounts (a few thousands) into the new application. When I arrived, I found a few people gathered around one computer, in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

It appears that the problem I mentally noticed before had turned to a complete disaster when loaded with the production data. The customer selection screen took several minutes to just load, when you tried to open a customer record, either the application crashed, either it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

Apparently, it seems that every screen was always loading all the records in the database into memory, then sorted the ones it needed, and displayed them. Sounds just goofy, but the worst was it did this every time the screen was modified ! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

What to learn

The point of this story is: if the developer used a larger set of test data, and unless his workstation looked like a crazy NASA grade supercomputer, he would never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. “Reasonable” varies according to the business you’re into of course, when doing physics simulation, one million observation will seem low, but if you write a real estate portfolio management application, one thousand items will sound overkill.

But “how do I get that many data ?” you may ask. There are plenty of tools available to generate big datasets:

  • a Perl module to generate random data: Data::Faker (sorry, couldn’t find the equivalent in Python)
  • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
  • for everything else, any scripting language will do

An example

For a project I am currently working on, I needed to populate my database will human related data, plus a few “business specific” fields.

I downloaded a small sample (5000 names) from fakenamegenerator.com, then I made some Python post-processing in order to have 5000 dummy people with all the fields I needed. I build lists of possible values, then pick randomly them for each dummy people.

Sorry for the colors, there is no pygments plugin on wordpress.com… Update: there is one, but I didn’t know where to find it

from itertools import permutations

COMPANIES = [' '.join(t) for t in list(permutations(['buzz','works','sim','corp','data','micro'], 3))]

ACRONYMS = list(set([cn[0] + cn[len(cn)/2] +  cn[-1] for cn in COMPANIES]))

DEPARTMENTS = [' '.join(t) for t in list(permutations(['artificial','financial','science','training','marketing','sales'], 3))]

LANGUAGES = ['de', 'en', 'fr', 'nl']

TITLES = ['Ph.D.', 'MSC', 'BSC', '']

I could refine it further, so there is a correlation between the name of the company, and the company acronym for example, but it’s not relevant for my application.

Other benefits

Except database performance benchmark, you can have several other benefits from using massive fake data.

Unicode support

As Tarek Ziadé pointed out recently, his name still breaks web applications, because of the “é” in it. Most of us will use “cultural” dummies when inputting them by hand. You don’t want to spend time investigating realistic examples of chinese or russian names for your application, but what if someone with a non-ascii symbol in his name wants to join you website ? With fake data, you can have one dataset for each culture, so it’s easy to test and include new cultures as well.

Field length

When I was a student, I was doing a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous french dummy family:  “toto toto” and his wife “tata toto”, as well as his son, “tutu toto”. You see ? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and you guessed it, it was more than 20 characters long. The program crashed into flames (don’t mess with the length of C strings), and I learned the lesson.

First step to fuzzy testing

One technique I would like to use more often is fuzzing my inputs so they are “monkey ass proof”. If your application breaks just because someone typed a character he normally wouldn’t have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of “possible” combinations (not probable, but possible) of inputs and then spot your errors more efficiently.

Open source packaging goofs

I am a true Open Source enthusiast, and always advise my customers to use open software whenever alternative to proprietary software exists. It’s link to the way many (Microsoft for instance) software company try to lock customers down by using closed formats, “forgetting” to put an “export all data” option and providing unstable (thus unusable) interoperability.

But I must admit it’s not possible to advise people to use software that I (computer geek type) can’t even install or try. If I’m not able to get the software working, my customer (usually not computer savvy) probably won’t be either. Recently, I gave some Python packages a try: virtualenv, pylons, and trac.

Test bench

I installed virtualenv on a FreeBSD 8.0 box, it installed cleanly, I created my environment, but was unable to use it because the command in the manual (“$ source activate.sh”) is broken and does not give a clean error message, so I was unable to debug the issue by myself. I assume the package is tried and tested on Linux boxes and the source command there works differently than mine.

I also installed Pylons three times, on my Windows dev box, because it looks like an excellent package. Out of three attempts, I managed to make the tutorial work only once. And as soon I modified the tutorial “Hello, World” into something else, I got strange and non-related errors every time I reloaded the page. Too bad, I will use Bottle for now.

Finally, yesterday I tried to install the dev branch of the Trac bug tracker. The instructions say you have to checkout and install the latest version of Genshi before checking out the latest of Trac itself from their SVN server. That’s what I did, and after ten minutes, I got an error message saying my version of Genshi was not high enough to use Trac. Probably someone set version requirements on his computer, tested it could deploy, but forgot to push the modification in the Genshi tree and I was … stuck. Finally I was able to download and install the last stable version, but not able to test the new 0.12 features.

Conclusion

Of course, I know those goofs are just exceptions in the long life of those three excellent packages, I probably came at the wrong time, and those bugs have probably been solved by now, but it shows the real need of automated integration testing inherent to every software development. And that could be a strategic advantage of the Open Source projects compared to the closed source ones, as their reactivity can make them implement those tools faster than it could be in a big corporation, with procedures, forms to fill and so on. Developers should pay extra attention making sure their software installs smoothly on every supported platform (virtualization is wide spread nowadays), so users are not scared away and then completely miss all the shiny features of their package.

Remember that 10 years ago, linux was labeled as “computer guru stuff”, I welcomed you by asking you the size of your hard drive and the number of cylinders. Now look at what Canonical did with Ubuntu: clean, mouse-driven, colorful installer. And then people were suddenly able to install linux, and realized that “damn, it’s even better than my old Windows box, and I don’t have to pay for it anymore !”. So please Open Source developers, fix your installers 😀

Update: I agree that instead of complaining, I could have filed a bug report in each project to report the issue. Although the statement is valid, it misses the point of “ready to use” software.

Test driven development in .NET

Several months after discovering TDD in Python thanks to Eric Jones, I don’t think I could develop anymore without it. Currently working on a different project in .NET, the first thing I did was re-installing NUnit (I already gave it a try several years ago, but without understanding how to use it, it was soon abandoned).

I have to admit that having a GUI for the test runner (compared to Nosetest) is better on the psychological point of view: having all those green and red lights is way funnier than a dark listing with a big “OK” at the end.

But NUnit runner also has its drawbacks, for instance there is no way to tell it to stay on the tray/taskbar even when pressing the close button. I don’t know if it would be a useful option but I have the (bad) reflex of closing windows instead of minimizing them, so I have to start NUnit like ten times a day. I also miss the “–pdb-failure” option a lot 🙂

I’ve recently read a book that was reviewed on Slashdot about unit testing and this, and combined with my previous experiments in Python, it completely changed the way I write code today. I think unit testing should really be taught in CS classes, as it improves software internal quality by so many orders of magnitude that it’s definitely worth the extra minutes spent writing tests.