Monday, December 29, 2014

A simple source code line counter

Sometimes we badly want a tool that counts the "real" lines of code in an open-source project, or in a project we are working on. It should skip blank lines and comments, so that we know how big the code base actually is.

The simpler the tool, the better. Here is a simple tool I wrote, sourcelines.py, that does just this job. It currently supports C, C++, Java, Scala, Python, PHP, Perl and Go, and you can add other languages by providing a comment syntax file (explained later).

How do we run the tool? Let us print the help text:
$ python sourcelines.py -h
Usage: sourcelines.py [options]

Options:
  -h, --help            show this help message and exit
  -c COMMENT_FILE, --comment-file=COMMENT_FILE
                        comment syntrax description file
  -d SOURCE_ROOT_DIR, --root-source-dir=SOURCE_ROOT_DIR
                        root directory for source code

An example run, counting the source code lines of Go 1.4:
$ python sourcelines.py -d /home/geet/sws/go
File-type:        Go  Line-count:  473968
File-type:    Python  Line-count:     313
File-type:         C  Line-count:  170744
File-type:       C++  Line-count:       7
File-type:      Perl  Line-count:     929

The tool determines the file type from the file extension alone and does no other magic. All files with the extension .pl are assumed to be Perl files, all files with the extension .java are assumed to be Java files, and so on.
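
The extension-based lookup could be sketched roughly as follows. This is not the code from sourcelines.py: the names EXTENSION_TO_TYPE and file_type_for are hypothetical, and the table lists only a handful of the supported languages.

```python
import os

# Hypothetical lookup table; the real tool's table may look different.
EXTENSION_TO_TYPE = {
    "pl": "Perl",
    "java": "Java",
    "py": "Python",
    "go": "Go",
    "c": "C",
}


def file_type_for(path):
    """Return the file type for a path based only on its extension, or None."""
    _, ext = os.path.splitext(path)
    # splitext keeps the leading dot, so drop it before the lookup.
    return EXTENSION_TO_TYPE.get(ext[1:])
```

With this table, file_type_for("lib/Foo.java") gives "Java", while a file with no extension or an unknown one gives None and would simply be skipped.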

Out of the box, the tool knows nothing about Haskell or JavaScript files or how they are commented. So we instruct it by providing a JSON file that describes the comment syntax of both languages.

Below is the content of a sample JSON file (let us name it syntax_haskell_js.json):
{
    "hs" : {
        "output_as" : "Haskell",
        "other_extns" : [ "haskell", "hask" ],
        "start" : "{-",
        "end" : "-}",
        "whole_line" : [ "--" ]
    },
    "js" : {
        "output_as" : "Javascript",
        "other_extns" : [ "javascript" ],
        "start" : "/*",
        "end" : "*/",
        "whole_line" : [ "//" ]
    }
}

The JSON file describes how Haskell and JavaScript files are commented. The top-level keys are the primary file extensions of each language, so hs and js denote Haskell and JavaScript. Files with the extension .hs are reported as "Haskell" (the "output_as" value), and because of the "other_extns" list, files with the extensions .haskell and .hask are treated as Haskell files too.
"start" is the marker that opens a comment. For Haskell, it is '{-'.
"end" is the marker that closes a comment. For Haskell, it is '-}'.
"whole_line" lists the markers that turn the rest of the line into a comment. For Haskell it is '--', for JavaScript it is '//'.

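To make the role of these keys concrete, here is a minimal sketch of how a counter might use one entry from the JSON file. It is an illustration under simplifying assumptions, not the logic of sourcelines.py: the function names build_extension_map and count_code_lines are hypothetical, comment markers inside string literals are not recognised, and nested comments (which Haskell allows) are treated as flat.

```python
import json


def build_extension_map(config):
    """Map every extension (top-level key plus other_extns) to its syntax entry."""
    ext_map = {}
    for ext, syntax in config.items():
        ext_map[ext] = syntax
        for other in syntax.get("other_extns", []):
            ext_map[other] = syntax
    return ext_map


def count_code_lines(lines, syntax):
    """Count lines that hold something other than whitespace and comments."""
    start, end = syntax["start"], syntax["end"]
    whole_line = syntax.get("whole_line", [])
    count = 0
    in_block = False  # True while inside a start ... end comment
    for line in lines:
        rest = line
        code = ""  # text on this line that lies outside any comment
        while rest:
            if in_block:
                pos = rest.find(end)
                if pos == -1:
                    rest = ""  # the comment continues on the next line
                else:
                    rest = rest[pos + len(end):]
                    in_block = False
            else:
                # Find the earliest comment marker remaining on the line.
                hits = [(rest.find(m), m) for m in [start] + whole_line]
                hits = [(p, m) for p, m in hits if p != -1]
                if not hits:
                    code += rest
                    rest = ""
                    continue
                pos, marker = min(hits)
                code += rest[:pos]
                if marker == start:
                    rest = rest[pos + len(start):]
                    in_block = True
                else:
                    rest = ""  # a whole_line marker comments out the rest
        if code.strip():
            count += 1
    return count


config = json.loads("""
{"hs": {"output_as": "Haskell", "other_extns": ["haskell", "hask"],
        "start": "{-", "end": "-}", "whole_line": ["--"]}}
""")
ext_map = build_extension_map(config)

sample = [
    "{- module header",
    "   still a comment -}",
    "",
    "main :: IO ()  -- entry point",
    "-- a full-line comment",
    "main = putStrLn \"hi\" {- trailing -}",
]
print(count_code_lines(sample, ext_map["hask"]))  # prints 2
```

Only the two lines that contain code outside a comment are counted. Lines that carry a trailing comment still count, because the text before the marker is not blank.
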
A sample test run output is shown below:
$ python sourcelines.py -d . -c syntax_haskell_js.json
File-type:    Python  Line-count:     173
File-type:   Haskell  Line-count:      49