Measurbation* can make you blind

•March 30, 2008 • No Comments

Being new at work, our team took a while before we placed our code on the standard build system (just conventions on where to place codes, reports, etc). One of the tools in the build system is checkstyle. Since we were trying to roll-out a working prototype, we had received some nasty violations from this system. Among those violations is having a cyclomatic complexity above the threshold (it was my code). My code was doing some list processing so it was rich with conditional branches and loops. I reasoned out that we should take it as a warning instead of an error (as checkstyle does) or a flag that the code “might” be improved further and not consider it as an outright violation. There are just some code that cannot be written in any other way.

I always have problems with metrics. Most of the time they are not well understood and you just have this tool that calculates it for you and a rule of thumb that says : “everything above this is bad/good”. Just in case I have this perfectly good code that violates our cyclomatic complexity threshold, I better understand it.

Cyclomatic complexity is metric that measures how complex a program is. Computing cyclomatic complexity involves modelling your program as graph. Vertices are the statements and the edges are the flow from one statement to another. Cyclomatic complexity is then computed as the number of edges minus the number of nodes plus two times the number of connected components. If you are not into graph theory, a simpler way to compute this is to count the number of conditional statements inside ifs and loops plus one.

As an example, I fed the code below to checkstyle and I got a cyclomatic complexity of 7. It’s a simple code right? If I add more colors (and consequently the score will increase by the number of colors I add to it), it would still be simple right?


    public static final int WHITE = 0;
    public static final int BLACK = 1;
    public static final int RED = 2;
    public static final int BLUE = 3;
    public static final int GREEN = 4;
    public static final int YELLOW = 5;

    public String getColor(int colorCode) {
        switch(colorCode) {
            case WHITE:
                return "white";
            case BLACK:
                return "black";
            case RED:
                return "red";
            case BLUE:
                return "blue";
            case GREEN:
                return "green";
            case YELLOW:
                return "yellow";
            default:
                throw new RuntimeException("Invalid color!");
        }
    }

————————————————–

P.S. The cyclomatic complexity violation of my code was fixed when I removed some code duplication and pushed some of my code on a different method.

*Measurbator is a term coined by Ken Rockwell in Seven Levels of Photographers. A measurbator is someone who pixel-peeps and care more about their gears than take photographs.

Revisiting the Mini Prolog Interpreter (part 2)

•February 8, 2008 • No Comments

Data Structures

Once we’ve recognized the tokens, we need to put them together to form an appropriate data abstraction. The closest thing I could come up with is to create a Functor object for a group of successfully parsed tokens. Each Functor would have a name and a list of arguments.


class Functor:
    def __init__(self, name, args):
        self.name = name
        self.args = args

Before we can create Functors, we need to first get the Identifiers (name and args in Functor). To do this, we go back to the parser and add code to collect the characters using a parameter called an accumulator. Each character is added to this list and when the parsing is finished for the Identifier, a string is created from the contents of the accumulator.


    def identifier(self, acc=[]):
        c = self.match(self.IDENT)
        acc.append(c)
        # Get the next identifier
        if self.token(self.lookahead) == self.IDENT:
            self.identifier(acc)

    def functor(self):
        id = []
        self.identifier(id)
        self.match(self.LPAREN)
        args = []
        self.identlist(args)
        self.match(self.RPAREN)
        return Functor("".join(id), args)

In case an identifier_list is encountered, we collect each Identifier in another list.


    def identlist(self, acc=[]):
        id = []
        self.identifier(id)
        acc.append("".join(id))
        if self.token(self.lookahead) == self.COMMA:
            self.match(self.COMMA)
            self.identlist(acc)

To test how Functors are created, we parse a Functor and verify that the returned object contains the name and args as its properties.


class FunctorCreation(unittest.TestCase):
    def testSingleArg(self):
        p = mp.Parser("male(bill).&quot ;)
        res = p.functor()
        self.assertEqual(res.name, 'male' ;)
        self.assertEqual(res.args, ['bill'])

For now, we’ll consider that Facts are represented as Functors. But how about Rules? A Rule is composed of more than one Functor.

 parent(X, Y) :- mother(X, Y).

We now call the left-hand side of the rule as its head and the right-hand side its body.


class Rule:
    def __init__(self, head, body):
        self.head = head
        self.body = body

Creation of a Fact or a Rule starts with a Statement. Matching a Rule from a Statement involves matching a colon followed by a dash. Once a Rule is encountered, we create a Rule object using the first Functor as its head, and the list of subsequent Functors as its body.


    def statement(self):
        f = self.functor()
        if self.token(self.lookahead) == self.COLON:
            # Try to match a rule
            self.match(self.COLON)
            self.match(self.DASH)
            acc = []
            self.functorlist(acc)
            self.match(self.DOT)
            return Rule(f, acc)
        else:
            self.match(self.DOT)
            return f

Revisiting the Mini Prolog Interpreter (part 1)

•February 5, 2008 • No Comments

I came across a machine problem for a mini Prolog interpreter. This was given to us during our CS 150 (Programming Languages) class. Although I was able to finish it back then (read: dead last), I always thought that I could have done a better job.

Where do we start? As Lewis Carroll says, “Begin at the beginning.” To build an interpreter (or a compiler), we need to have a grammar. These are rules that define how pieces of a language are put together. These pieces are known as tokens.

Grammar

The spec for the project came with this grammar:

database ::= fact | rule | fact database | rule database
fact ::= functor.
rule ::= functor :- functor_list.
functor_list ::= functor | functor, functor_list
identifier_list ::= identifier | identifier, identifier_list

Notice that the grammar doesn’t specify the valid sequence of characters for identifiers. For now, let’s assume that identifiers are made up of characters from the English alphabet (both upper and lower case characters from a to z).

Weapons of choice

The original exercise was written in C++. This time, I decided to use Python mainly because I’ve been doing some stuff with Python lately. It also comes with its own unit testing library similar to JUnit. The book Dive into Python shows how to create unit tests for your Python modules.

Parsing all the way down

The first part of this exercise is to build a parser. The parser takes an input string, scans it to check for proper syntax, and extracts from it a set of tokens. These tokens will be part of a hierarchy of tokens called a parse tree.

I created a Parser class that takes a string for its input. This parser associates a token with a numeric code. What scan does is to remove the characters of the input string starting from the leftmost character.

My first attempt at writing scan did not consider whitespaces. I soon added a statement that continues scanning if the lookahead points to a whitespace character.


class Parser:
    IDENT   = 0
    LPAREN  = 1
    RPAREN  = 2
    COLON   = 3
    DASH    = 4
    DOT     = 5
    COMMA   = 6
    UNKNOWN = 99

    def __init__(self, str):
        self.input = str
        self.scan()

    def scan(self):
        if len(self.input) > 0:
            self.lookahead = self.input[0]
            self.input = self.input[1:]
            # Skip whitespaces
            if re.match('\s', self.lookahead):
                self.scan()

Our grammar is composed of terminals and non-terminals. A terminal is a token while a non-terminal refers to another grammar rule that needs to be simplified. For this exercise, I created a recursive descent parser based on our grammar. What makes this function recursive is that non-terminals are expressed as functions that can call themselves.


    def identlist(self):
        self.identifier()
        if self.token(self.lookahead) == self.COMMA:
            self.match(self.COMMA)
            self.identlist()

A match occurs when the type of token that we expect is the same as the one we found at the lookahead. If the match is successful, scan is invoked again to move to the next character. If a match fails, the program raises a ParseError, causing the parser to terminate.


    def match(self, tokentype):
        look = self.lookahead
        if tokentype != self.token(look):
            raise ParseError
        else:
            self.scan()
            return True

The parse function drives the recursive descent parser. It starts by invoking statement — a function that determines if the input string is consistent with the grammar for facts and rules.


    # Non-terminals are function calls
    def statement(self):
        self.functor()
        if self.token(self.lookahead) == self.COLON:
            # Try to match a rule
            self.match(self.COLON)
            self.match(self.DASH)
            self.functorlist()

        self.match(self.DOT)

Resources

Most of the time in this exercise was spent on reading about parsing and the basics of compilers.

Code as Config

•December 20, 2007 • No Comments

We have this XML-driven template-based reporting tool at work. Embedded JavaScript is also supported and we usually use this for formatting. Since reports have little or no logic in them, it is considered as a configuration file (we actually suffixed the file with config).

I needed to create a report with a table whose column needs to be split. This column contains a list of values. A row in this table looks like this:

col1 | col2.1;col2.2;col2.3…;col2.x | col3

And what we need in the report is something like this:

col1 | col2.1 | col3
col1 | col2.2 | col3
col1 | col2.3 | col3
….
col1 | col2.x | col3

JavaScript came to my rescue. For each row, a script takes care of dividing it into several rows.

Another requirement is that every value in that column has a description, but this description is not kept in the database. I was thinking of putting it in a properties file. We also have a problem that we cannot access external files from the report. I was thinking that since a report is a config I can put the configuration directly in the report. What I did was create a map in JavaScript and used the column values as keys and the description as their values.

var map = {’col2.1′:’Value 1′,
‘col2.2′:’Value 2′,
‘col2.3′:’value 3′};

After development, testing phase came. Some changes are needed to be made on the configuration. The changes would need to be made by the testing team. Then something struck me: what if later on production support would refuse to make changes because it is code (only developers can touch code in our organization)? Obviously it is code, but in the context it was used it is not. It is a report and we consider report as configuration.

If my design is questioned later on, my defense would be is that it is best way to do it. I will also say that that if they can change properties and xml configuration files, they would not find it hard to edit JavaScript maps. XSLT is also categorized as config files in our organization, so how would is it any different from an embedded JavaScript?

Abstractions and Wishful Thinking

•November 20, 2007 • No Comments

Have you ever used a term so often you take it for granted? I usually use the term abstraction to describe some part of a program as hidden from the rest. I went on to do most of my projects knowing just that. Recently, I’ve been studying some of the earlier lectures on the Structure and Interpretation of Computer Programs — an introductory course in computer science at MIT.

Watching the video of this lecture gives a familiar lesson: “Whenever you see yourself writing the same thing down more than once, there’s something wrong and you shouldn’t be doing it.” This is the same lesson imparted in Don’t Repeat Yourself. Also, it’s important to divide a large system into smaller pieces that can be understood (and can be debugged) separately.

Another lesson here: “Computers to make people happy, not people to make computers happy.” One of the reasons for having abstractions is to make it easier to understand and to write programs. There is a notation called lambda which serves as an abstraction for creating procedures having no names. This is used when passing procedures as parameters to other procedures. Another form of abstraction is creating procedures that transform a procedure and returning a new procedure as a result.

The video lectures (2a and 2b) brings out a powerful tool for creating large systems: wishful thinking. Assume first that most of the parts are already implemented. Then we figure out how to use these parts in our procedure and build the rest later. This allows us to complete our procedure and not be slowed down by absorbing too much detail.

Defining a procedure usually starts with an idea. However, procedures may use different ideas, some of them we haven’t yet thought how to build. Wishful thinking helps in providing details at the last possible moment. The lecture clarifies that there is “a very narrow line between delaying decisions and outright procrastination.”

“Unfortunately, we don’t have lambda keys in our keyboards.”

•November 5, 2007 • No Comments

After ranting for a while that I was doing nothing in the office, I was assigned to work on SIT(System and Integration Testing) support. I do not like support but the other option is worse. I have to manage the development of a fairly big web application in three weeks without prior experience (I am not that familiar with our front-end framework) . The said web application is also in a critical path so I took the safer option and choose the SIT support assignment. I would not be coding anyway on the design job.

——————————————————

The environment is a bit of a kludge. To transfer test data to our SIT server I need to do the following:

  1. Transfer the data to our IQA server using scp.
  2. Login to our SIT machine
  3. Pull the data from IQA using scp.

It took a lot of steps that midway through the process I sometimes forget what I need to do.

To make the transferring of data faster, I installed a web server on my machine. This enabled me to pull data via http using some perl scripts. The transferring of data is now faster. I login to our SIT machine then pull the data via http from my development box.

I also created some scripts that fetches the report and zips them to make it ready for sending.

——————————————————

The next assignment I was given was even more boring. I needed to update some status sheets. The process is like this: 1) download(in excel format) the open bugs from our bugs database, 2) group it according to modules(the modules field in the form is free text so there are variations) 3) Then count the high, medium and low bugs for that module 4) You also need to include the bug number and the retest date as comments for the number of low, medium or high count.

What I did was to create a Java program using POI to access the excel report, group them and count the high, medium and low bugs then embed the comments. The problem is grouping since “Module 1″ might be written as “Modul 1″ or “Mod 1″. My input for a program is a list of module and module variation pair. Basically something like this:

Module 1|Modul 1
Module 1|Mod 1
Module 1|Module 1

I also created a test mode where unclassified modules will be listed. I just update the configuration file and put them into their proper classification. After that I just run the program and my report is done.

——————————————————

Creating a program for tedious things like reports and upload scripts take some time. But if you will repeat the tedious task many times, the invested time will pay-off.

Erlang: Atoms have limits

•October 28, 2007 • No Comments

While browsing the mailing list archive, I’ve come across this post on dynamically creating atoms:

The atom table has got a fixed max size, so creating atoms during runtime is often not a good idea, unless it happens a limited number of times. If creating new atoms with new different string representations is a part of the normal application processing, the atom table will be filled one day and the runtime system fails.

Atoms are character constants with a maximum length of 255 characters. Single-quotes can be used for non-alphanumeric characters.

Atoms can be used anywhere in an Erlang program and when an atom is evaluated, its value is the same atom. There are also built-in functions to convert atoms to lists and vice-versa.

Examples of atoms:

this_is_an_atom
node@host
‘hello:world’

Using atom BIFs:

> atom_to_list(this_is_an_atom).
> “this_is_an_atom”
> list_to_atom(”this_is_a_list”).
> this_is_a list

According to System Limits, the maximum number of entries that the atom table can store is 1,048,576. Does every node in a distributed Erlang system use the same atom table?

Further Reading

  • http://www.trapexit.org/Atom_Table
  • Programming Erlang, p. 34.

Iterative Process using Recursive Procedures

•October 20, 2007 • No Comments

One of the essays in the Beautiful Code is Beautiful Concurrency written by Simon Peyton Jones. It was about Software Transactional Memory and it got me interested in learning Haskell. I went on to read more tutorials on the language in order to have a better grasp on the topic (it was hard enough on its own). Unfortunately, it takes a while to get used to functional programming if you’re used to imperative languages such as Java and C. I already took Lisp from my Advanced Programming Languages class and I felt that I should re-acquaint myself with the concepts of functional programming before going further with Haskell.

I started reading the Structure and Interpretation of Computer Programs (SICP) to refresh myself of the fundamentals of functional programming. The book uses Scheme, a dialect of Lisp, for its examples. The first few pages of the book are devoted to evaluating expressions and recursion. Although I assumed that this would just be a refresher, but I was surprised that I learned a couple of new things about recursion.

A recursive function contains a base case, sometimes called a termination condition, and a call to itself. A good example of a recursive function is the fibonacci function.

It has two base case( n is equal to 0 or equal to 1). The recursive step is the sum of the fibonacci of (n-1) and of (n-2). The recursive step stops until it reaches the base case or the termination condition.

fibonacci(0) = 0
fibonacci(1) = 1
fibonacci(n) = f(n-1) + f(n -2)

The Lisp implementation would be like this:

(define (fib n)
(if (or (= n 0) (= n 1) ) n
(+ (fib (- n 1)) (fib (- n 2)))))

Lisp is a simple language. The first element is the operator and the rest are operands.

(operator operand1 operand2 … operandN)

Addition of three numbers will be written as:

(+ 1 2 3)

The define operator simply defines a function. The first element would be the list containing the function name and the parameters for that function.

The if function on the other hand would have its first parameter as its boolean condition and returns the second parameter if the first parameter is true. Otherwise(else) it returns the third parameter.

The call tree of the function would look something like this:

fib(4)
|
|–fib(3)
| |
| |-fib(2)
| | |
| | |-fib(1) = 1
| | |
| | |-fib(0) = 0
| |
| |-fib(1) = 1
|
|–fib(2)
|
|-fib(1)
|
|-fib(0)

Notice that we computed fib(2) two times. This is a very expensive operation since we need to keep track of many things in the call stack. This is a recursive process implemented as a recursive procedure.

We could implement the fibonacci function iteratively using a recursive procedure. In the recursive process we are doing top-down computation. The solution is short and elegant but wasteful of resources computation of the same fibonacci number is done many times. What we want is compute the lower fibonacci number first then use it to compute the higher fibonacci number. This is a bottom-up approach. The function for doing that is shown below:

(define (fib-iter-helper n x a b)
(if (= n x) (+ a b)
(fib-iter-helper n (+ x 1) (+ a b) a)))

(define (fib-iter n)
(if (or (= n 0) (= n 1)) n
(fib-iter-helper n 2 1 1)))

Most of the work is the by the fib-iter-helper function. A fibonacci number is defined as the sum of the two previous fibonacci number. This is indicated by the variables a(fib (- x 1)) and b(fib (- x 2)). The variable x is the current fibonacci number that can be computed. If x is not equal to the variable n we compute the next higher fibonacci number by calling the helper function and setting the formal parameter x with x + 1, the formal parameter a with the the current fibonacci number and the formal parameter b with a.

The call tree would then look like this:

fib-iter-helper(4 2 1 1) = fib(2)
|
|–fib-iter-helper(4 3 2 1) = fib(3)
| |
| |-fib-iter-helper(4 4 3 2) = fib(4)

This implementation is significantly faster. The lower fibonacci numbers are computed only once. In a recursive process the lower fibonacci numbers are computed multiple times. Try computing the fib(30) using the recursive and iterative process and you will notice a big difference in performance.

When describing a function as recursive, always be explicit if we are pertaining to the process or the procedure. As we have seen in the example an iterative process can be implemented using a recursive procedure.

Re-learning Concepts in Programming Languages

•October 2, 2007 • No Comments

I used to attend a class where students get to learn about different concepts in programming languages. The class was divided into two sessions: lecture and practice. Lectures were usually about concepts such as language paradigms, syntax, and types, to name a few. During practice sessions, we learned how to program using languages introduced to us in lectures.

When working on projects, even after graduating from University, I only used a small part of the concepts were taught in that class — you wouldn’t get to use them when studying a web framework or figuring out how to perform CRUD operations on a database. Now as I remember past projects, it seems that object-orientation and a pinch of syntax are the only topics that I was able to apply the most. One of the topics in that class was about Prolog’s resolution mechanism and it was something not used until I encountered a related concept called pattern matching.

For some time now, I’ve been practicing on different programming languages and I’ve been reading the textbook we used for our languages class. On the first chapter of that book, the author gave his reasons why it is important to study concepts for evaluating and designing languages. Among them, I find that increased capacity to express ideas and increased ability to learn new languages are enough reasons to try out different languages.

The number of basic elements (primitives) affects how programmers can easily grasp a new language. One rule-of-thumb that I apply is how much (or how little) elements do I need to know in order to create a first program using a different language. An example would be how much do you need to know to write “hello world” as your first program.

Different languages expose programmers to different abstractions (hiding details in programs). These abstractions are commonly classified as data or procedural. A simple example of data abstraction, according to the book, is an abstract data type. One of the first lessons in object-oriented programming is the concept of a class and how to create programs in terms of classes (defining attributes and operations for those attributes).

Another form of abstraction is called procedural abstraction. A function is a simple form of procedural abstraction and describes how a certain procedure is performed. Beginners to functional programming are taught how to create programs using a combination of procedures.

In learning new languages, I’ve been re-acquainted to the initial discomforts. This I think is what happens when other programmers say that using a particular language influences how you think. However, I find this renewed sense of being dumb to be important for continuous learning and improvement.

Spaghetti Coding using Yahoo Pipes

•September 16, 2007 • No Comments

While being annoyed of having no access to Twitter at work, I’ve come across Yahoo! Pipes. This is service that allows me to read RSS feeds from different sources and to group them as another feed. I began playing around by creating a Twitter viewer.

Yahoo! Pipes provides a set of modules that can be used to fetch data from the Web. It can manipulate these source and present them as a new feed. It has modules that can “pipe” its output to other modules. Here I show the output of the “Fetch Feed” module is piped to the “Pipe Output” module.

Using the drag-and-drop interface reminds me of Business Connector (also known as WebMethods for SAP). In my Twitter Viewer, I used the “Fetch Feed” module to get my RSS feed (you can try it by using this URL http://twitter.com/statuses/friends_timeline/<twitter_id>.rss).

Running my Twitter Viewer produced a result like this:

You will notice that the title and the message are similar. I used a Regex module so the title part will only show the name and the message part will only show the message.

I also replaced the link since it will originally go to the post. Which I do not like since I replaced the title with just the name of my twitter contact. I wanted to show the timeline of my contact with his/her friends. Again I replaced the url using the Regex module.

So far, so good. Now I can see my Twitter messages, but I can only use this pipe to view only my own Twitter update. To remove this hard-coded kludge, I can use the Number Input module to ask for input (Twitter ID). Using this ID, I can now form the URL using the String Builder module. Here I piped the output of my Number Input module to the second parameter of the String Builder module.

Then I piped String Builder’s output to the “base” parameter of the URL Builder module. The output of the URL Builder module is then sent to the Fetch Feed module. Now the Twitter Viewer can accept input from the user. From the image, notice that initially when the pipe is invoked, it displays an empty list. Once you enter a Twitter ID to this pipe, it will start fetching the feed.

Initially when the pipe is invoked it will display an empty list. You enter first your Twitter id before it can retrieved a users Twitter feed.

I tried running this pipe in the office and it worked! However, I observed that there is some form of caching that happens along the way and sometime I cannot get the latest updates.

You can view my Twitter Viewer here.