Installing Python 2.4 under Ubuntu 11.10 (Oneiric Ocelot)

As I’m currently merging 2.x and 3.x branches of openpyxl, I now need to perform tests over all existing Python releases known to man. I decided to install a new Ubuntu VM to perform the the tests, but I struggled during 2.4 installation.

Tried pythonbrew, which was a very disappointing experience as nothing worked as it was supposed to.

Then I tried to build from source myself, but here again the GCC version available on ubuntu 11.10 is apparently no longer able to compile original python sources.

Here is how I did it :

  1. download RPM version for Fedora here : http://www.python.org/download/releases/2.4/rpms/ (direct link)
  2. install alien (sudo apt-get install alien)
  3. type the following command
$ sudo alien -i python2.4-2.4-1pydotorg.i386.rpm

Now you can type

$ python2.4

and everything should work !

openpyxl starts being used

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

When I started working on openpyxl a few months ago, I didn’t know it would catch that much activity around it. I’m very happy to see that it can apparently help  so many people :)

I’ll try to follow-up on the bug fixes and new features as far as my time permits, and will usually answer emails within the day. Thank you everyone for using the library, even though it is still far from being perfect ;-) Keep posting bugs on the tracker or ideas and requests on the mailing list !

openpyxl is on PyPi

Yes, I know, yet another post about openpyxl ;-)

This time I’m announcing the release of the 1.1.0 version on PyPi.

As mentioned in the Wiki, you can now just type easy_install openpyxl to get the latest released version, if you are not much into getting the snapshop from bitbucket.

Also, thanks to Yaroslav’s great job, openpyxl is also packaged under Debian testing !

Now there is no valid reason for not giving it a try… maybe except that darn memory footprint…

openpyxl turns 1.1

After two weeks of intense activity around openpyxl, I’m releasing version 1.1 today. This new version brings support for dates and number formats.

Several bugs have been fixed, thanks to the careful testing of two new contributors, Jonathan Peirce and Yaroslav Halchenko, both working on the PsychoPy project.

Thanks guys for boosting my morale, providing valuable advises and patches !

Many thanks goes to Marko Loparic for his support and enthusiasm ;-)

You can get the sources for the latest version here http://bitbucket.org/ericgazoni/openpyxl. I expect a lot of bug reports with this new version, as it is stable but not extensively tested yet, and that the number of possibilities have seriously increased with the introduction of number formatting.

Keep in mind that the memory footprint is still high, but that it is the target for milestone 1.2. It should perform reasonably well if your needs are moderate (<100.000 cells), but if you want to add more data, then it might start consuming RAM pretty quickly. This holds for writing and reading.

Memory consumption is almost linear, and a 15MB workbook results in 450MB in RAM.

There is also a new mailing list for the project: http://groups.google.com/group/openpyxl-users. It’s pretty empty for now, but feel free to ask questions there, I’ll be reading it regularly.

Bug reports will be better handled if they are filed on the project bug tracker: http://bitbucket.org/ericgazoni/openpyxl/issues/new.

Happy coding !

openpyxl reaches 1.0 mark

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

After a few more efforts, I am pleased to announce the release of the first version of openpyxl.

The reader and the writer are working and tested for strings and numbers.

I have been able to read and write simple Excel 2007 xlsx files from Python and open them with Excel.

You can clone the repository using Mercurial:

hg clone https://ericgazoni@bitbucket.org/ericgazoni/openpyxl

or download the release in zip format.

Edit: 1.0 release is really outdated, you might want to get a more recent version here.

The (sparse for now) documentation can be found on the wiki.

Reader usage (using the “empty_book.xlsx” file from the previous example)

from openpyxl.reader.excel import load_workbook

wb = load_workbook(filename = r'empty_book.xlsx')

sheet_ranges = wb.get_sheet_by_name(name = 'range names')

print sheet_ranges.cell('D18').value # should display D18

Code is published under the MIT licence, so you can use it for whatever use you need, and I’d be very happy if  you drop me an email if  you use it :)

If you don’t find it useful, spot a bug, or want to suggest an enhancement, you can do so by filling a ticket on the tracker.

Features that will be added in the next version are listed here, so if you need something in this list, please be patient or send me a message to tell me to hurry ;)

openpyxl: simple writer done

I’ve been very busy on openpyxl the last few days, and I managed to get a working writer for basic data types (strings, numerics).

For the impatient, you can clone my bitbucket repository:

hg clone https://ericgazoni@bitbucket.org/ericgazoni/openpyxl

It’s still a work in progress, so expect some quirks here and there, and if that happens, please file a new issue here.

If you like it, you can also drop a comment below or send me an email (see Contact page).

Usage is pretty simple as you can see:

from openpyxl.workbook import Workbook
from openpyxl.writer.excel import ExcelWriter

from openpyxl.cell import get_column_letter

wb = Workbook()

ew = ExcelWriter(workbook = wb)

dest_filename = r'empty_book.xlsx'

ws = wb.worksheets[0]

ws.title = "range names"

for col_idx in xrange(1, 40):
    col = get_column_letter(col_idx)
    for row in xrange(1, 600):
        ws.cell('%s%s'%(col, row)).value = '%s%s' % (col, row)

ws = wb.create_sheet()

ws.title = 'Pi'

ws.cell('F5').value = 3.14

ew.save(filename = dest_filename)

Next features are:

  1. a working reader (so that I can read back files generated by the writer)
  2. dates support
  3. calculations
  4. formatting

IronPython and WPF

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

Last week I came across a few websites that were dealing about dynamic generation of Winforms in IronPython.

I’m not much into code-generated UIs, because it’s easy to get two or three controls on a form, but as soon as you have a dozen, it can be a nightmare to lay them out properly only with code. For example, it might need several tries to get a decent width for your text boxes, or a pleasing height for your lists. When using a WYSIWYG UI editor, at least you’re playing with the real thing, and save a lot of time on the design process.

On the other side, I’m not much into the Visual Studio way of doing UIs (aka “mouse click hell”), where it’s so tempting to put your logic behind the form, because that’s the way it expects you to do it.

The best way of designing forms I know is how Qt does it:

  1. design your interface in a WYSIWYG, drag-n-drop designer
  2. save it in a programming language agnostic format (Qt uses XML)
  3. translate it into a module in your favorite programming language, through a specialized compiler
  4. import it in your application
  5. now you can plug it to your application logic

I wanted to use the same flow in .NET, but that was not possible … until introduction of WPF and Xaml format.

Continue reading

openpyxl: my python xlsx library

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

Update: openpyxl 1.0 is now out !

At a customer, we read a lot of Excel files. We’ve tried the conventional approaches, that are xlrd and xlwt, pyinex, and COM automation.

That’s COM that we mainly use, because it’s able to deal with every Excel file format, from the ancient Excel 5 to most recent Excel 2007 Office Open XML format.
However, we experience from time to time stability issues (Excel is a complex beast, sometimes you don’t fully understand why it is angry).

We then looked for a native reader for .xlsx format, to get rid of the Excel part of the equation, but unfortunately, there are only two small read-only libraries for now:

Finally, I thought that I was the only guy who needed a native .xslx writer, and decided to stick with COM for now.
I wouldn’t be doing this project now without a tweet from Tarek Ziadé, who was also looking for such a library. That meant that we were at least two in need for the same thing, so I simply decided to write it.

Trust me, the Office Open XML format is open, but it’s also a bit twisted, so I spent a few days gathering documentation, and I finally landed on the PHPExcel library, that was already doing what I needed, but in PHP.

So now, I’m busy porting the PHPExcel library under Python, which is really easy, because of the similarities between both languages, but I can also benefit from all the nice things that come with Python, so the code is much simpler.

You can follow my progress on bitbucket: http://bitbucket.org/ericgazoni/openpyxl/

My Python environment

Hello, this blog has moved to here : http://blog.ericgazoni.me, the blog on WordPress will be kept not to break links, but will not be updated anymore, see you there !

The early days

When I first heard about Python, it was just after the 2.5 release. I heard that one of my customer was using it but I had never seen a line of Python yet. At some point in a project, I needed a Bash script equivalent on Windows, and decided to give Python a try, instead of using Windows BAT files.

I installed it on my workstation, and started reading the (excellent) documentation about the tasks I needed to do.

I’ve used IDLE at first, because, well, it’s shipped with the Windows Python distribution. It has been a very unpleasant experience I must say (although I have learned to appreciate some features IDLE has that are missing in other editors). TKinter is a summary toolkit, the look and feel makes it look like it was written in the 80’s (in fact, it probably was). The main concept of editor/runner mix-in felt also a bit weird at first. I finally returned to my all-time-favorite editor, Notepad++, and ran my scripts from the command line.

Eclipse and Pydev

Later, I (luckily) landed on a new project, and it included a lot of Python. The team in place was using an editor they were not yet familiar with, but hopefully, I already knew quite well: Eclipse. I must say the Pydev extension for Eclipse is one of the greatest blessing you can get when working with Python. It features a lot of interesting features from IDLE:

  • syntax highlighting
  • code completion
  • real-time code inspection (very useful when using a dynamic language)
  • PyLint integration (once you’ve tasted it, you can never work without anymore)
  • smart indentations (you almost don’t have to worry about your indents)
  • unittest integration (although I’ve stopped using it)

Eclipse is an excellent product on its own, I already used it for PHP development for several years and I was happy it was also my customer choice for Python.

However, there was still some occasions Eclipse was not the right tool to use:

  • the integrated console implements the basic shell only
  • executing scripts outside the project path is hard, as well as changing the current directory
  • for 10 line scripts, creating a project is a bit overkill

The revelation: IPython

After a couple of months struggling with Eclipse and Notepad++ & python.exe, I discovered IPython and at last found a way to work on small scripts without the overhead of Eclipse, but with all its interesting features:

  • serves as well as a shell replacement as Python interpreter
  • excellent autocompletion (both for paths and Python code)
  • “magic” functions such as bookmarks, list of currently defined variables (“whos” command)
  • PDB (Python Debugger) integration with IPDB, providing code completion and history to PDB
  • post-mortem debugger (“debug” command after a traceback)
  • quick access to docstrings and source code of almost every library

IPython is packaged inside the Python(x,y) distribution with the Console application, which is a kind of command line emulator for Windows. Once configured with a readable setup, it’s probably the best development environment you can find of on Windows.

Continue reading

Using fake data to test your software

A little story

A couple of years ago, I was sitting at my desk at a customer office. The guy next to me was demonstrating an application built by another company to his colleague. The tool was built with an obscure WYSIWYG IDE that packaged its own framework and language. The point of using a GUI only framework for an important accounting application might sound a dangerous idea, especially when this language turns really inefficient.

During the demo, the guy opened a couple of customer accounts, scrolled down list of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny, the colors were elegant, window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

It might be hard to spot when you are not a professional developer, but I can’t imagine how scrolling down on the 20 customers present in the demo database could be so irregular. Every 5 items, the application froze for a split second, before resuming. Not something the final user will notice, as it’s not part of his job to have a responsive interface. His job is to work with the data shown by the interface.

A few month later, I came back to the customer, and by this time, the roll out phase of the accounting application had begun. The other company came to migrate the existing accounts (a few thousands) into the new application. When I arrived, I found a few people gathered around one computer, in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

It appears that the problem I mentally noticed before had turned to a complete disaster when loaded with the production data. The customer selection screen took several minutes to just load, when you tried to open a customer record, either the application crashed, either it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

Apparently, it seems that every screen was always loading all the records in the database into memory, then sorted the ones it needed, and displayed them. Sounds just goofy, but the worst was it did this every time the screen was modified ! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

What to learn

The point of this story is: if the developer used a larger set of test data, and unless his workstation looked like a crazy NASA grade supercomputer, he would never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. “Reasonable” varies according to the business you’re into of course, when doing physics simulation, one million observation will seem low, but if you write a real estate portfolio management application, one thousand items will sound overkill.

But “how do I get that many data ?” you may ask. There are plenty of tools available to generate big datasets:

  • a Perl module to generate random data: Data::Faker (sorry, couldn’t find the equivalent in Python)
  • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
  • for everything else, any scripting language will do

An example

For a project I am currently working on, I needed to populate my database will human related data, plus a few “business specific” fields.

I downloaded a small sample (5000 names) from fakenamegenerator.com, then I made some Python post-processing in order to have 5000 dummy people with all the fields I needed. I build lists of possible values, then pick randomly them for each dummy people.

Sorry for the colors, there is no pygments plugin on wordpress.com… Update: there is one, but I didn’t know where to find it

from itertools import permutations

COMPANIES = [' '.join(t) for t in list(permutations(['buzz','works','sim','corp','data','micro'], 3))]

ACRONYMS = list(set([cn[0] + cn[len(cn)/2] +  cn[-1] for cn in COMPANIES]))

DEPARTMENTS = [' '.join(t) for t in list(permutations(['artificial','financial','science','training','marketing','sales'], 3))]

LANGUAGES = ['de', 'en', 'fr', 'nl']

TITLES = ['Ph.D.', 'MSC', 'BSC', '']

I could refine it further, so there is a correlation between the name of the company, and the company acronym for example, but it’s not relevant for my application.

Other benefits

Except database performance benchmark, you can have several other benefits from using massive fake data.

Unicode support

As Tarek Ziadé pointed out recently, his name still breaks web applications, because of the “é” in it. Most of us will use “cultural” dummies when inputting them by hand. You don’t want to spend time investigating realistic examples of chinese or russian names for your application, but what if someone with a non-ascii symbol in his name wants to join you website ? With fake data, you can have one dataset for each culture, so it’s easy to test and include new cultures as well.

Field length

When I was a student, I was doing a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous french dummy family:  “toto toto” and his wife “tata toto”, as well as his son, “tutu toto”. You see ? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and you guessed it, it was more than 20 characters long. The program crashed into flames (don’t mess with the length of C strings), and I learned the lesson.

First step to fuzzy testing

One technique I would like to use more often is fuzzing my inputs so they are “monkey ass proof”. If your application breaks just because someone typed a character he normally wouldn’t have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of “possible” combinations (not probable, but possible) of inputs and then spot your errors more efficiently.