Ed Welch

The wrong way to count lines of code


Ed Welch    1008

I was just wondering why some projects report such inflated figures for lines of code (Google has claimed to have written a billion lines of code).
I downloaded a tool called cloc and discovered that it doubled the amount of code in my project (C++) compared with my usual count. Normally, I just search for ";", which is a much easier way to do it.
It looks to me like they are counting everything that isn't a comment and isn't white space as a line of code. A single open curly brace gets counted as an entire line of code. This is obviously wrong. If you format your code for readability you will double the lines of code, compared to someone who writes code in a more compact style, even though it's essentially the same code.

Also, every file gets counted, no matter what it is: stuff auto-generated by the IDE and even HTML files get counted.
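
A minimal sketch of the two counting approaches described in this post. It assumes that a "physical" counter treats every line that is not blank and not a whole-line // comment as code; that is an assumption for illustration, not cloc's actual algorithm, and the function names are hypothetical.

```python
def count_physical_lines(source: str) -> int:
    """Count lines that are neither blank nor a whole-line // comment."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            count += 1
    return count


def count_semicolons(source: str) -> int:
    """Count every ';' character, the quick approach from the post."""
    return source.count(";")


# The same statement written in two brace styles (illustrative C++ text).
COMPACT = """\
if (ready) { start(); }
"""

EXPANDED = """\
// Same statement, one brace per line.
if (ready)
{
    start();
}
"""

if __name__ == "__main__":
    for name, src in (("compact", COMPACT), ("expanded", EXPANDED)):
        print(name, count_physical_lines(src), count_semicolons(src))
```

For these two snippets the physical count goes from 1 to 4 while the semicolon count stays at 1, which is the style sensitivity the post complains about.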

 

WoopsASword    963

That's why nobody cares about lines of code in the first place. (And if you do, please stop)

The only proper use of counting lines of code is to determine if your function or class is too big.

Edited by WoopsASword

MarkS    3502

Not really. It's a useful metric- at least it would be if they implemented it properly. Gives you a ball park figure of how complex a project is.


It really isn't. Brackets, for instance, are arbitrary. They make the code cleaner, but do not translate to machine code. There is quite a bit in all modern languages that exists for aesthetics and code organization and does not have any effect on the final machine code.

This:

if(...){}
is equal to this:

if(...)
{
}
Your code really isn't any more compact in the first case.

Ed Welch    1008

 

Not really. It's a useful metric- at least it would be if they implemented it properly. Gives you a ball park figure of how complex a project is.


It really isn't. Brackets, for instance, are arbitrary. They make the code cleaner, but do not translate to machine code. There is quite a bit in all modern languages that exists for aesthetics and code organization and does not have any effect on the final machine code.

This:

if(...){}
is equal to this:

if(...)
{
}
Your code really isn't any more compact in the first case.

 

That's why I said if they implemented it properly.

If you count each semi-colon as a line of code then you get the same number no matter what your coding style is.

SeanMiddleditch    17565

If you count each semi-colon as a line of code then you get the same number no matter what your coding style is.


Macros, code generation, abuse of the comma operator, use of temporaries, etc. all affect the number of semi-colons in code. In some cases, more semi-colons means _less_ complex code (as you're breaking up complex expressions into simpler ones). :)

Ed Welch    1008

 

If you count each semi-colon as a line of code then you get the same number no matter what your coding style is.


Macros, code generation, abuse of the comma operator, use of temporaries, etc. all affect the number of semi-colons in code. In some cases, more semi-colons means _less_ complex code (as you're breaking up complex expressions into simpler ones). :)

 

The metric is used to get a ballpark figure. It doesn't need to be 100% accurate to be useful, and those are just corner cases that don't happen very often.

Also, it's a relative comparison: take two big projects and the corner cases will work out to be roughly even. Even if it's 10% inaccurate, it's good enough.

WozNZ    2010

To what end is it a good metric, though?

 

It can't show complexity. A well-written, very complex system can come in at far fewer lines than a badly written, more trivial system.

 

Things like lines produced per day are also meaningless. This far into my career (decades long) I find I write less code. I sit and think longer, and refactor and rewrite until I have the cleanest code I can get. I will also refactor out code duplication, which means that sometimes adding functionality can reduce the line count.

 

You can't actually infer anything meaningful from lines of code apart from the line count. The only use that would serve is if your IDE has game-like achievements :)

 

Much more meaningful metrics would be

 

- Number of functions

- Average line count per function

- Min/Max line count for functions

 

These let you see how much refactoring is required.
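
A rough sketch of how the three metrics listed above could be gathered. It assumes, naively, that every top-level {...} block in the source is a function body and that no braces appear inside strings or comments; a real tool would need a proper parser. The helper names and the sample source are made up for illustration.

```python
from statistics import mean


def function_lengths(source: str) -> list[int]:
    """Return the line span of each top-level {...} block.

    Naive assumption: every top-level brace block is a function body and
    no braces appear inside strings or comments. The span runs from the
    line with the opening brace to the line with the closing brace.
    """
    lengths = []
    depth = 0
    start = 0
    for number, line in enumerate(source.splitlines(), start=1):
        for char in line:
            if char == "{":
                if depth == 0:
                    start = number
                depth += 1
            elif char == "}" and depth > 0:
                depth -= 1
                if depth == 0:
                    lengths.append(number - start + 1)
    return lengths


def summarize(lengths: list[int]) -> dict:
    """Function count plus average, min and max length (needs >= 1 entry)."""
    return {
        "functions": len(lengths),
        "average": mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


SAMPLE = """\
int add(int a, int b)
{
    return a + b;
}

void greet()
{
    if (loud) {
        shout();
    }
    wave();
}
"""

if __name__ == "__main__":
    print(summarize(function_lengths(SAMPLE)))
```

A function whose maximum length sits far above the average is the kind of refactoring candidate the post has in mind.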


I swear when I saw "doubled" my first thought was "it counted CR LF as two newlines". Also, hunting for semicolons isn't accurate either, since it doesn't take for loops into account. Also, some languages (like JavaScript) are somewhat loose about where semicolons are required, and this would also exclude stuff like preinitialized structures.
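
To illustrate the CR LF suspicion only (nothing in the thread says cloc actually does this), here is a small sketch of how a counter that tallies CR and LF separately would double the result on Windows-style text. The function names are hypothetical.

```python
def count_newlines_naive(text: str) -> int:
    """Count CR and LF separately, so each CRLF pair counts twice."""
    return text.count("\r") + text.count("\n")


def count_lines(text: str) -> int:
    """Count lines with str.splitlines(), which treats CRLF as one break."""
    return len(text.splitlines())


UNIX_TEXT = "int a;\nint b;\nint c;\n"
WINDOWS_TEXT = "int a;\r\nint b;\r\nint c;\r\n"

if __name__ == "__main__":
    for name, text in (("unix", UNIX_TEXT), ("windows", WINDOWS_TEXT)):
        print(name, count_newlines_naive(text), count_lines(text))
```

The naive counter reports 3 for the Unix text and 6 for the Windows text, while splitlines() gives 3 for both.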

Edited by Sik_the_hedgehog

Hodgman    51325
Off by a factor of 2 is within the ballpark for the usefulness of LOC.
1M LOC and 2M LOC are both "big".
1K and 2K are both "small".

Personally, if there's a 10 line comment above a function, it should be included in the LOC stat, as it's part of the code that's expected to be read by a maintainer.
But then this gets into a grey area when EVERY function has 10 lines of stupid XML markup comments above it for autogenerated docs. That's not really part of the code anymore - especially if your IDE automatically hides it..

As for Google, they've got between 10k and 20k programmers (Samsung has over 40k!). They could write 1 billion LOC about once every 4 years.
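
The arithmetic behind that estimate can be checked with a few lines. The billion-line total, the 4-year span and the 10k to 20k headcount come from the post; the 250 working days per year is my own assumption, added only to turn the yearly figure into a daily one.

```python
TOTAL_LINES = 1_000_000_000   # the "billion lines" figure from the thread
YEARS = 4                     # time span from the post
WORKING_DAYS_PER_YEAR = 250   # assumption, not from the thread


def lines_per_programmer_per_year(programmers: int) -> float:
    """Lines each programmer must write per year to reach the total."""
    return TOTAL_LINES / (YEARS * programmers)


def lines_per_programmer_per_day(programmers: int) -> float:
    """The same rate expressed per working day."""
    return lines_per_programmer_per_year(programmers) / WORKING_DAYS_PER_YEAR


if __name__ == "__main__":
    for headcount in (10_000, 20_000):
        print(
            headcount,
            lines_per_programmer_per_year(headcount),
            lines_per_programmer_per_day(headcount),
        )
```

Under those assumptions the required rate is 12,500 to 25,000 lines per programmer per year, or 50 to 100 lines per working day.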

FRex    1778

What makes it even worse is that sometimes more lines can mean clearer code that is easier to understand than the shorter version: sticking to one operation per line, using proper if/else branching instead of ternary operators, and using less macro magic all make the line count go up but make the code more understandable.

grumpyOldDude    2740

Not really. It's a useful metric- at least it would be if they implemented it properly. Gives you a ball park figure of how complex a project is.

 

It doesn't.  I fully agree with WozNZ

 

Reminds me... back in the days when I was in uni, after a piece of coursework was assessed I read a feedback note from the tutor which said something like "if I didn't know what the coursework was about I would have thought your code was implementing a space rocket". When I checked the code of some guy who got a near-perfect score I realised what the feedback meant. His code was about 1/20th the size of mine because it had more intelligent algorithms and was thus more compact. Since then I have always striven to make my code smarter (also creating better functionality) and have never taken a huge number of lines of code to indicate more complexity.

 

Of course, no matter how smart your algorithms are, the code will still grow in size for large projects.

Edited by alwaysGrey

grumpyOldDude    2740

So, multiply line count with the IQ of the author?

 

Well said, but it was due to me being naive rather than low IQ, because I learnt my lesson and took it to heart after that assessment, and my scores improved significantly.

Alberth    9525
I meant it in the generic form. Smart people write short programs, while less-smart people solve it in more lines of code. I wouldn't be surprised if you end up with largely equivalent estimates for the same problem with different people.

Ed Welch    1008

Off by a factor of 2 is within the ballpark for the usefulness of LOC.
1M LOC and 2M LOC are both "big".
1K and 2K are both "small".

 

Yes, but being off by a factor of 1.1 is better than being off by a factor of 2.

Hodgman    51325


Yes, but being off by a factor of 1.1 is better than being off by a factor of 2.
You missed my point. The metric itself is so fuzzy that accuracy in the measurement largely doesn't matter.

 

Say you've got a laser which can tell you if an object's distance from you is within 5 brackets: larger than 1m, 1m to 10cm, 10cm to 1cm, 1cm to 1mm, or less than 1mm. Most of the time, if you double the distance of the object, you'll still get the same result from the laser because it takes a factor of 10 (not 2) to jump between brackets.

LOC is quite similar: most of the time you're categorizing projects based on the log10 of the LOC value, not the log2 :)
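
A small sketch of that log10 bucketing. The function name is made up, and it uses the digit count of a positive integer as an exact stand-in for floor(log10(loc)).

```python
def magnitude_bucket(loc: int) -> int:
    """Return floor(log10(loc)) for a positive integer, via digit count."""
    if loc < 1:
        raise ValueError("loc must be a positive integer")
    return len(str(loc)) - 1


if __name__ == "__main__":
    for loc in (1_000, 2_000, 1_000_000, 2_000_000):
        print(loc, magnitude_bucket(loc))
```

Doubling 1K to 2K or 1M to 2M leaves the bucket unchanged (3 and 6 respectively), so a factor-of-two measurement error rarely moves a project into a different category.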

 

Counting the semicolons instead of newlines is also prone to inaccuracies, as people have mentioned above. Comments can be a crucial part of the code, vital for maintainers to read and understand, just like any other part of the code - semicolon count ignores them. Many code constructs are quite complex but don't use any or many semi-colons -- macro-based code-generation, lambdas, functions, etc... Other simple constructs are semi-colon heavy, such as for-loops (2) vs while loops (0). Style can also influence the count -- some people use commas to declare multiple variables at once, whereas other people declare one per line.

I've also seen some projects that use an 80-character section delimiter in their code made up of semicolons :lol:

/*;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;*/
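
As a sketch of the points above: a plain semicolon count gives a for-loop header 2 and a while-loop header 0, and a delimiter comment like this one inflates the total unless comments are stripped first. The regex-based stripping here is naive (it ignores string literals), and the names and sample snippets are made up for illustration.

```python
import re


def strip_comments(source: str) -> str:
    """Remove /* ... */ and // comments (naive: ignores string literals)."""
    without_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", without_block)


def count_semicolons(source: str, skip_comments: bool = False) -> int:
    """Count ';' characters, optionally after stripping comments."""
    if skip_comments:
        source = strip_comments(source)
    return source.count(";")


FOR_HEADER = "for (int i = 0; i < n; ++i)"
WHILE_HEADER = "while (i < n)"
# An 80-character delimiter: "/*", 76 semicolons, "*/".
DELIMITER = "/*" + ";" * 76 + "*/"

if __name__ == "__main__":
    print(count_semicolons(FOR_HEADER), count_semicolons(WHILE_HEADER))
    print(count_semicolons(DELIMITER),
          count_semicolons(DELIMITER, skip_comments=True))
```

Each delimiter line adds 76 to a raw semicolon count and nothing once comments are skipped, so a semicolon-based metric needs at least this much preprocessing to be comparable across projects.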

 

It would actually be mildly interesting to perform different "LOC" metrics (such as semicolon count) for a large selection of different projects and see how the metrics vary. You could find out if there's a correlation between semicolons and lines in general, or if the relationship varies randomly from project to project. Maybe you could even use relationships between different metrics as a guess to the style of the code :)

swiftcoder    18437
Does anyone actually (non-ironically) use lines-of-code as a metric for comparing projects?

The primary use of lines-of-code metrics is within a single project.

When a 10,000-line code review comes across my desk, in a project with fewer than 50,000 lines of code, then I know it means trouble. If one engineer produced 5,000 lines of code last month and another produced only 500, while both adhering to the same coding guidelines, then I know that responsibility is unevenly distributed in the project.

This sort of thing is important to be aware of, not just for the pointy-haired, but also for the engineering leads.

Nypyren    12074

If one engineer produced 5,000 lines of code last month and another produced only 500, while both adhering to the same coding guidelines, then I know that responsibility is unevenly distributed in the project.


If the guy who wrote 5000 lines was implementing a DLC system with an extremely well-written spec, and the guy writing 500 lines was integrating a third party library while having to deal with a poorly defined spec and was helping a different engineer with questions at the same time, they both might be handling their responsibilities perfectly.

Lines of code per unit time is completely meaningless.

swiftcoder    18437

If the guy who wrote 5000 lines was implementing a DLC system with an extremely well-written spec, and the guy writing 500 lines was integrating a third party library while having to deal with a poorly defined spec and was helping a different engineer with questions at the same time, they both might be handling their responsibilities perfectly.

I didn't make any value judgement about their relative performance. If one engineer is being given tasks that are well defined, while the other is slogging through a wasteland, then their responsibilities are unevenly distributed.

This is a management problem, not the engineering witch hunt people so readily assume. And without the right data, you can't fix management problems.

