IT projects, like tattoos, are commitments. You wouldn’t accept that kind of answer from a tattoo artist, it’s your body, you have the right to change your mind as long as the ink is not under your skin. Why would that be different for software development ? Why would requirements be written into stone once they were signed ?

One could explain the too common attitude towards change by two main factors:

  • process momentum
  • stability issues

Process momentum is the consequence of everyone trying to protect their butt from backfires in case of project failure. As long as you can blame someone else for failure, you think you are safe, but in companies, the way to design the scapegoat is by signing contracts, SLAs, and other forms of paperwork. Every time you make a change in the requirement, someone can feel the contract terms have changed, and therefore, his paper walls are not protecting him anymore. Therefore, either you don’t make changes, either you spend a lot of valuable time to sign agreements.

Stability issues are the quicksand in which developers who don’t practice testing at a sufficient level are thrown when the requirements drift too much from the initial requests. That’s the typical “We will soon release a patch to fix the bugs introduced by previous patch. Sorry for the inconvenience.”. When you designed things like this and they end up looking like this.

The Agile manifesto states “Customer collaboration over contract negotiation”. Customer don’t change the requirements because they are evil, or helpless, but because they found a new way to improve their work. That’s what we do, helping them get their work done, and if possible, done better than their competitors. Don’t think of change as a mutation, but as an evolution.

If our quadruped, monkey-smelling ancestors decided that “walking on our rear limbs was not part of the original design”, just imagine where we would be now. Evolution is the way every live being manages to keep up with changes in its environment. Not changing (or changing two years too late), means putting your own kind in danger. If you eat beef and the new breed of beefs grows two more legs, you’d better find a way to continue chasing them, either by becoming smarter, either increasing your muscular mass (or number of legs, but that’s just gross).

If you eat money, and the competitor grows a new technology to eat more money than you (and possibly start eating your money), you’d better evolve your own tech right now !

Continue reading

Using fake data to test your software

A little story

A couple of years ago, I was sitting at my desk at a customer office. The guy next to me was demonstrating an application built by another company to his colleague. The tool was built with an obscure WYSIWYG IDE that packaged its own framework and language. The point of using a GUI only framework for an important accounting application might sound a dangerous idea, especially when this language turns really inefficient.

During the demo, the guy opened a couple of customer accounts, scrolled down list of operations, and displayed a few reports. While his colleague seemed impressed by the tool, the only positive point I could find was the cuteness of the interface. It was really shiny, the colors were elegant, window decorations were well drawn. But there was something that made me sweat: the (relative) slowness of the product.

It might be hard to spot when you are not a professional developer, but I can’t imagine how scrolling down on the 20 customers present in the demo database could be so irregular. Every 5 items, the application froze for a split second, before resuming. Not something the final user will notice, as it’s not part of his job to have a responsive interface. His job is to work with the data shown by the interface.

A few month later, I came back to the customer, and by this time, the roll out phase of the accounting application had begun. The other company came to migrate the existing accounts (a few thousands) into the new application. When I arrived, I found a few people gathered around one computer, in the IT department. They were all yelling at the screen, so I joined them to see what was going on.

It appears that the problem I mentally noticed before had turned to a complete disaster when loaded with the production data. The customer selection screen took several minutes to just load, when you tried to open a customer record, either the application crashed, either it took half an hour and half of the main memory. I was happy I was not the one responsible for this mess…

Apparently, it seems that every screen was always loading all the records in the database into memory, then sorted the ones it needed, and displayed them. Sounds just goofy, but the worst was it did this every time the screen was modified ! Every time you selected an item in a list, refresh. Every time you modified a field, refresh. Well, you get the picture.

What to learn

The point of this story is: if the developer used a larger set of test data, and unless his workstation looked like a crazy NASA grade supercomputer, he would never have missed something like this. You cannot put something into production that has not been tested with a reasonable amount of data. “Reasonable” varies according to the business you’re into of course, when doing physics simulation, one million observation will seem low, but if you write a real estate portfolio management application, one thousand items will sound overkill.

But “how do I get that many data ?” you may ask. There are plenty of tools available to generate big datasets:

  • a Perl module to generate random data: Data::Faker (sorry, couldn’t find the equivalent in Python)
  • for human related data (name, email, phone, credit card): http://www.fakenamegenerator.com
  • for everything else, any scripting language will do

An example

For a project I am currently working on, I needed to populate my database will human related data, plus a few “business specific” fields.

I downloaded a small sample (5000 names) from fakenamegenerator.com, then I made some Python post-processing in order to have 5000 dummy people with all the fields I needed. I build lists of possible values, then pick randomly them for each dummy people.

Sorry for the colors, there is no pygments plugin on wordpress.com… Update: there is one, but I didn’t know where to find it

from itertools import permutations

COMPANIES = [' '.join(t) for t in list(permutations(['buzz','works','sim','corp','data','micro'], 3))]

ACRONYMS = list(set([cn[0] + cn[len(cn)/2] +  cn[-1] for cn in COMPANIES]))

DEPARTMENTS = [' '.join(t) for t in list(permutations(['artificial','financial','science','training','marketing','sales'], 3))]

LANGUAGES = ['de', 'en', 'fr', 'nl']

TITLES = ['Ph.D.', 'MSC', 'BSC', '']

I could refine it further, so there is a correlation between the name of the company, and the company acronym for example, but it’s not relevant for my application.

Other benefits

Except database performance benchmark, you can have several other benefits from using massive fake data.

Unicode support

As Tarek Ziadé pointed out recently, his name still breaks web applications, because of the “é” in it. Most of us will use “cultural” dummies when inputting them by hand. You don’t want to spend time investigating realistic examples of chinese or russian names for your application, but what if someone with a non-ascii symbol in his name wants to join you website ? With fake data, you can have one dataset for each culture, so it’s easy to test and include new cultures as well.

Field length

When I was a student, I was doing a lot of assumptions when writing code. One of my dumbest was that a family name (surname) would never exceed 20 characters. Of course, my tests included the famous french dummy family:  “toto toto” and his wife “tata toto”, as well as his son, “tutu toto”. You see ? Nothing above 20 characters. When I submitted the program to my teacher, I simply tried to input his name, and you guessed it, it was more than 20 characters long. The program crashed into flames (don’t mess with the length of C strings), and I learned the lesson.

First step to fuzzy testing

One technique I would like to use more often is fuzzing my inputs so they are “monkey ass proof”. If your application breaks just because someone typed a character he normally wouldn’t have (like a letter in a number field), then you should worry. Using random data, you can generate a lot of “possible” combinations (not probable, but possible) of inputs and then spot your errors more efficiently.

Test driven development in .NET

Several months after discovering TDD in Python thanks to Eric Jones, I don’t think I could develop anymore without it. Currently working on a different project in .NET, the first thing I did was re-installing NUnit (I already gave it a try several years ago, but without understanding how to use it, it was soon abandoned).

I have to admit that having a GUI for the test runner (compared to Nosetest) is better on the psychological point of view: having all those green and red lights is way funnier than a dark listing with a big “OK” at the end.

But NUnit runner also has its drawbacks, for instance there is no way to tell it to stay on the tray/taskbar even when pressing the close button. I don’t know if it would be a useful option but I have the (bad) reflex of closing windows instead of minimizing them, so I have to start NUnit like ten times a day. I also miss the “–pdb-failure” option a lot 🙂

I’ve recently read a book that was reviewed on Slashdot about unit testing and this, and combined with my previous experiments in Python, it completely changed the way I write code today. I think unit testing should really be taught in CS classes, as it improves software internal quality by so many orders of magnitude that it’s definitely worth the extra minutes spent writing tests.