How Much Space Is Saved By Using Tabs In Code?

One of the arguments for using tabs over spaces as a coding style is that tabs use less disk space than, well… spaces.

It's obvious that tabs use less space. One tab character can represent what multiple space characters do, and a tab character and a space character are the same size. What people argue about is how much of a difference it makes. Most people just guess, saying either "it's too small of a degree to matter" or "I guess it would be this much smaller".

I could not find any actual tests outside of a few toy examples. So today I decided to do my own test and get a real estimate.

I downloaded the Linux Kernel source code (6.4.2), since it's a sizable public repository. Additionally, the project's style guide mandates not going beyond an indentation depth of 3, which should avoid accidentally favoring tabs by limiting how many total indentations there are. More deeply nested code would benefit more from tabs.

Looking through the source code, I found it does primarily use tabs, but there are some spots using 8 space indents. Whoops. There are also some places where an arbitrary number of spaces is used for alignment at the start of a line. The code does not seem to follow the project's own standards all that closely.

I kept Makefiles, Kconfig, scripts, plain text files, and anything else related to code, including the documentation. I felt that was more representative of what would get stored in a codebase (since it is in the codebase). Interestingly, though the project style guide says to use 8 space indentations, the Python scripts use spaces at 4 wide, per Python's style guide.

I first did a light sanitization with the `unexpand` command in a simple depth-first directory-walking script, converting any sequence of 8 spaces at the start of a line to a tab. The result represents the tab group.
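To make the conversion concrete, here is a minimal Python sketch of the same idea. It is not the script used for the test: the function name `leading_spaces_to_tabs` is hypothetical, and it only handles a run of plain spaces at the very start of a line, while `unexpand` also deals with mixed tabs and spaces.

```python
def leading_spaces_to_tabs(line: str, width: int = 8) -> str:
    """Replace each full group of `width` leading spaces with one tab."""
    stripped = line.lstrip(" ")
    n_spaces = len(line) - len(stripped)
    tabs, remainder = divmod(n_spaces, width)
    # Leftover spaces (fewer than `width`) are kept, e.g. for alignment.
    return "\t" * tabs + " " * remainder + stripped


# Made-up sample line: 16 leading spaces become 2 tabs.
sample = " " * 16 + "return 0;\n"
converted = leading_spaces_to_tabs(sample)
print(len(sample), "->", len(converted), "characters")
```

Spaces that appear after the first non-space character are left alone, so alignment inside a line is not touched.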

I did not do any manual cleanup. If you're maintaining a codebase, a linter would be a better option. I did not want to do that for a quick test: the tree mixes several languages that call for different styles, so it would have needed multiple linters, each configured separately.

After that, I copied the tab group and converted all tabs to 4 spaces with the `expand` command. The Kernel formatting style may say to use 8 spaces, but I believe 4 is a fairer estimate for space users in this test since it is the most used style. I created a third test group with 8 space indents to represent that too. I did not test with 2 space indents because I was running out of hard drive space.
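As a small illustration of what the expansion does to file size, Python's built-in `str.expandtabs` pads each tab out to the next tab stop, much like `expand -t N`. The C-like sample below is made up for the example and is not taken from the kernel tree.

```python
# Made-up sample with 4 tab characters in total.
sample = "int main(void)\n{\n\tif (1) {\n\t\treturn 0;\n\t}\n}\n"

sizes = {"tabs": len(sample.encode())}
for width in (4, 8):
    # Each tab is replaced by enough spaces to reach the next tab stop.
    expanded = sample.expandtabs(width)
    sizes[f"{width} spaces"] = len(expanded.encode())

for name, size in sizes.items():
    print(f"{name:9} {size} bytes")
```

Every leading tab here costs 3 extra bytes at width 4 and 7 extra bytes at width 8, which is why the gap grows with both nesting depth and indent width.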

I made a compressed version of each group to see what impact it had on compression. Files were compressed with `XZ_OPTS=-6e tar -Jcvf outfile.tar.xz indir`. The extreme option was used to help reduce possible differences.
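The effect of compression on the gap can be tried on a small scale with Python's `lzma` module, which writes the same `.xz` format. This is only a sketch under assumptions: the input is a made-up, highly repetitive snippet rather than real source, and `6 | lzma.PRESET_EXTREME` is used as the counterpart of `-6e`.

```python
import lzma

# Made-up C-like sample indented with tabs; real code is far less repetitive.
tabbed = "".join(
    f"\tif (x{i}) {{\n\t\tdo_thing({i});\n\t}}\n" for i in range(2000)
).encode()
# Same text with every tab expanded to 8-column tab stops.
spaced = tabbed.decode().expandtabs(8).encode()

preset = 6 | lzma.PRESET_EXTREME
tabbed_xz = lzma.compress(tabbed, preset=preset)
spaced_xz = lzma.compress(spaced, preset=preset)

print(f"raw: {len(tabbed):,} vs {len(spaced):,} bytes")
print(f"xz:  {len(tabbed_xz):,} vs {len(spaced_xz):,} bytes")
```

Long runs of identical spaces are exactly the kind of redundancy a compressor removes well, so the compressed sizes should end up much closer together than the raw ones.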

To finish, I took a size measurement of each group using `du -s`. With GNU coreutils defaults this reports allocated disk space in 1 KiB blocks rather than bytes, so the figures below are in KiB. Here are the results.
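One thing worth knowing when reading these numbers: `du -s` normally reports allocated disk space (in 1 KiB blocks with GNU coreutils defaults), not the sum of file contents in bytes. The sketch below shows the two measurements side by side. The helper name `tree_sizes` is hypothetical, and unlike `du` it skips directory entries and does not deduplicate hard links.

```python
import os
import tempfile


def tree_sizes(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for files under root."""
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks counts 512-byte units on Unix; it is absent on Windows.
            blocks = getattr(st, "st_blocks", None)
            allocated += st.st_size if blocks is None else blocks * 512
    return apparent, allocated


# Tiny demo tree: three files of 11 bytes each.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"f{i}.c"), "wb") as fh:
            fh.write(b"\treturn 0;\n")
    apparent, allocated = tree_sizes(tmp)

print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On many filesystems each small file still occupies at least one whole block, so the allocated figure can be far larger than the apparent one for a tree with many small files.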

| Type     | Decompressed | tar.xz compression |
------------------------------------------------
| Original |    1,510,240 |           134,576* |
| Tabs     |    1,389,188 |           134,008  |
| 4 Spaces |    1,542,448 |           135,340  |
| 8 Spaces |    1,747,692 |           137,504  |

*This is the compressed size of the distribution. Its size will differ from mine since I do not have the same hardware or know what settings they used to create the compression.

And as ratios, normalized to the decompressed size of the tabs group, the smallest group.

| Type     | Decompressed | tar.xz compression |
------------------------------------------------
| Original |   108.71387% |        9.6873857%* |
| Tabs     |   100.00000% |        9.6464985%  |
| 4 Spaces |   111.03234% |        9.7423819%  |
| 8 Spaces |   125.80673% |        9.8981563%  |

And the difference from the tabs group, in the same units as the first table.

| Type     | Decompressed | tar.xz compression |
------------------------------------------------
| Original |      121,052 |               568  |
| Tabs     |            0 |                 0  |
| 4 Spaces |      153,260 |             1,332  |
| 8 Spaces |      358,504 |             3,496  |
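The two derived tables can be recomputed from the first one with a few lines of Python. The numbers are copied from the measurements above; the variable names are just for the example. Note that both ratio columns are divided by the decompressed size of the tabs group.

```python
# (decompressed, tar.xz) sizes as reported in the first table.
sizes = {
    "Original": (1_510_240, 134_576),
    "Tabs": (1_389_188, 134_008),
    "4 Spaces": (1_542_448, 135_340),
    "8 Spaces": (1_747_692, 137_504),
}
base_raw, base_xz = sizes["Tabs"]

rows = {}
for name, (raw, xz) in sizes.items():
    rows[name] = {
        "raw_pct": 100 * raw / base_raw,  # ratio to decompressed tabs
        "xz_pct": 100 * xz / base_raw,    # also relative to decompressed tabs
        "raw_diff": raw - base_raw,
        "xz_diff": xz - base_xz,
    }

for name, r in rows.items():
    print(
        f"{name:9} {r['raw_pct']:10.5f}% {r['xz_pct']:10.7f}% "
        f"{r['raw_diff']:>9,} {r['xz_diff']:>6,}"
    )
```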

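The results show that tabs can noticeably reduce the size of a codebase: roughly 10% compared to 4 space indents before compression. With compression the gap shrinks to about 1% of the archive size for 4 space indents, and under 3% for 8 space indents.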

Most code is not compressed unless it is being archived or redistributed. Since we will be working with decompressed code often, it's reasonable to look at it as an important factor for size.

4 space indents used ~11% more space, and 8 space indents used ~26% more space! I find this very significant. If I could reduce my code size by 10% for free with no downsides, I'd take it straight away.

The sample codebase itself was rather large, yet the total amount of space saved is on the order of 150 MB against 4 space indents (reading the `du -s` figures as 1 KiB blocks, its default unit in GNU coreutils). Some people will look at that and say it's too small an amount of space to be worthwhile. It's a figure that will compound as a codebase grows, but realistically, how much is one going to be working on at a time? Still, a lot can be done with 150 MB. The "code is small" argument can go both ways; the space saved is enough to fit a few more smaller codebases.

Unfortunately that means anyone could look at the results either way and say either it is significant (10% reduction) or it's not significant (only about 150 MB out of well over a GB). Oh well. Something else for the computer nerds to argue about.

In my personal opinion, I like to reduce whenever possible. Not everyone has terabyte hard drives yet. Even if we do have significantly more storage space available to us, we shouldn't carelessly squander it. If we take care of what we already have, we can take it even further.
