Preventing Mistranslations with Git Repositories
Git has become the de-facto version-control system for developers. Its emphasis on decentralization makes it a natural choice for developers collaborating over the Internet, allowing them to time-shift their work schedules while also freeing them from the confines of a cubicle farm. Git makes this possible through a series of hand-offs whose complexities are mostly hidden from the Git-users. But wherever there are hand-offs there are risks of failure.
The Risks of Hand-Offs
In typical git-centric workflows, a developer works with an editor which saves source code as plain text files to the local filesystem. Git reads the text files, parses the content to identify changes, and pushes them to a remote Git server such as github.com.
Each and every one of these hand-offs increases the risk for unexpected problems:
- Every time the editor saves, there is a risk of losing formatting or writing an incorrect encoding.
- Every git commit risks unwanted content being added to the origin repo.
- Every read performed by Git risks content being misinterpreted before parsing.
And this simplified flow diagram doesn’t come close to representing the whole picture! In reality, Git-based workflows tend to look like this:
With each IDE, editor, and development machine used, the number of hand-offs grows exponentially. With every developer involved, the chance of error increases as they must communicate across mismatched schedules, physical distance, and even time itself! The green-field projects of today will become the legacy burdens of developers who are not even born yet.
To be good stewards of our projects we must actively mitigate these risks. We must assume all projects, no matter how seemingly insignificant, will transcend the space-time continuum and become production critical in the future. We must assume our code will run in environments and be worked on with tools that don’t exist yet. To not prepare our projects for this inevitability is a disservice to those who will come to depend on the decisions we make today.
“You’re just not thinking fourth dimensionally!” – “Dr. Emmett Brown”, Back to the Future Part III
When using Git it should feel like we are all working with the code directly; as though there are no handoffs, no 4th-dimensional rifts, and no chances for misunderstandings. It should feel like one seamless, collaborative experience. A flow that transcends work schedules, physical distances, and time itself.
An alternative way to think about this is with a challenge:
If a new, unaccustomed developer was left alone with our Git repo – how long before they can make a useful commit?
Git - Stupid as a Feature
The name “git” was given by Linus Torvalds when he wrote the very first version. He described the tool as “the stupid content tracker.” – The official git readme
For a system as widely praised as Git, it’s strange to hear it proudly called stupid; but this is a light-hearted jab at all the preceding version control systems that tried to be smart. Systems that forced encodings and made other ’time-saving’ decisions that resulted in countless hours lost to finding lost files and repairing bad merges. Developers found themselves fighting these “smart” mechanisms with clever hacks and macros, leading to the feeling that version control wasn’t a feature but an annoying necessity. So, by claiming itself as “stupid,” Git is bragging that it’s not going to get in the developer’s way.
The notion of not having to fight the development tool-chain is empowering, but daunting, as it’s up to the human side of the equation to keep projects ship-shape. It’s difficult to do a good job communicating, documenting, and keeping things clean when surrounded by distractions, a never-ending list of TODOs, and tight deadlines. So, more often than not, good collaboration depends on best intentions and chance.
Three Mechanisms to Prevent Mistranslations
Let’s take a quick look at three small files that can be used to mitigate the risks of miscommunication, clutter accumulation, and mistranslations.
- .editorconfig to automate file encoding and formatting.
- .gitignore to keep cruft from being added to the repo.
- .gitattributes so Git knows how to safely process files.
Automatic Formatting With .editorconfig
Nothing stifles the excitement of a new developer faster than having to stop, read, and comply with a long-winded formatting guide before making their inaugural contribution. The new developer should be spending their brainpower on learning the project’s ins and outs – not worrying about which files use 2-space indents and which files should still use Windows line endings.
What’s worse, when dueling developers differ on these particulars, commits become filled with formatting and whitespace changes, even in code that would otherwise remain unmodified.
While seemingly inconsequential, this noise adds to the mental burden of code reviews and increases the barriers for new developers to learn from the commit history graph. We want git diff reports to be short and succinct – to show us only content changes – as formatting is secondary to the code itself.
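As a rough illustration of that noise, the Python sketch below uses the standard difflib module to compare two versions of the same function that differ only in indentation. The helper name changed_lines and the sample snippets are made up for this example; it is not how Git computes diffs, only a way to see whitespace-only changes show up as changed lines.

```python
import difflib


def changed_lines(before, after):
    """Return the added/removed lines of a unified diff, without headers."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""))
    # The first two entries are the '---' and '+++' file headers.
    body = diff[2:]
    return [line for line in body if line.startswith(("+", "-"))]


# Same code, but one developer indents with a tab and the other with spaces.
tab_version = "def f():\n\treturn 1\n"
space_version = "def f():\n    return 1\n"

for line in changed_lines(tab_version, space_version):
    print(repr(line))
```

The logic of the function did not change, yet the diff reports one removed and one added line. That is the kind of entry a reviewer has to read and mentally discard.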
We can keep our git diff clean of this noise by using EditorConfig, an open source project that maintains consistent coding styles for developers working on the same project across various editors and IDEs. By dropping an .editorconfig file into our repo, we can instruct our editors how to format and encode all files before saving them to the local filesystem. EditorConfig is advantageous even when working with a tool-chain that includes automatic formatting (e.g. Go), as its effects occur during the saving process with no compilation or sub-commands required.
Here are two examples:
Formatting Rules for Humans
EditorConfig can enforce formatting rules for developers, making those tab vs space debates completely immaterial.
[*]
indent_style = space
indent_size = 4
trim_trailing_whitespace = true
[*.go]
indent_style = tab
tab_width = 8
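To make the layering of sections concrete, here is a minimal Python sketch of how rules like these could be resolved for a given file name. It is an approximation under stated assumptions: the function name resolve and the sample config are hypothetical, and fnmatchcase stands in for EditorConfig's own glob rules, which differ in details. Real editors rely on an EditorConfig plugin or built-in support rather than code like this.

```python
import configparser
from fnmatch import fnmatchcase

# Hypothetical config in the spirit of the example above.
EDITORCONFIG = """\
[*]
indent_style = space
indent_size = 4
trim_trailing_whitespace = true

[*.go]
indent_style = tab
tab_width = 8
"""


def resolve(config_text, filename):
    """Merge the properties of every section whose glob matches filename.

    Sections are applied in file order, so later matches override earlier ones.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    props = {}
    for section in parser.sections():
        if fnmatchcase(filename, section):
            props.update(parser[section].items())
    return props


print(resolve(EDITORCONFIG, "README.md"))
print(resolve(EDITORCONFIG, "main.go"))
```

For main.go both sections match, so the Go-specific indent_style = tab wins over the general rule while trim_trailing_whitespace is inherited from [*].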
Rules for Consistent Encoding
EditorConfig can enforce encoding rules to ensure our source code stays malleable across the different encoding configurations used by various operating systems.
[*]
charset = utf-8
end_of_line = lf
[*.proj]
end_of_line = crlf
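For files that were saved before such rules existed, a quick check can show whether they already comply. The following Python sketch is one possible way to do that; the function name check_text and its messages are invented for this example, and it only looks at UTF-8 validity and line endings.

```python
def check_text(data, expected_eol="lf"):
    """Report encoding and line-ending problems found in raw file bytes."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")

    crlf_count = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract those to get bare LFs.
    bare_lf_count = data.count(b"\n") - crlf_count

    if expected_eol == "lf" and crlf_count:
        problems.append(f"{crlf_count} CRLF line ending(s), expected LF")
    elif expected_eol == "crlf" and bare_lf_count:
        problems.append(f"{bare_lf_count} LF line ending(s), expected CRLF")
    return problems


print(check_text(b"a\nb\n"))        # clean LF file
print(check_text(b"a\r\nb\n"))      # mixed line endings
print(check_text(b"caf\xe9\n"))     # Latin-1 bytes, not UTF-8
```

Reading a file in binary mode (for instance with pathlib.Path.read_bytes) and passing the bytes to the function keeps Python from translating line endings before they are inspected.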
To get started visit the official EditorConfig project page: https://editorconfig.org/.
Keeping the Cruft out With .gitignore
“Don’t worry about those files, they’re generated and safe to delete” –Developers (everywhere)
Far too often developers are accustomed to directly telling each other what files to ignore, but this communication chain breaks down across the 4th dimension. These tidbits can be documented, but the phrase “ignore that documentation, it’s outdated” is another common developer expression.
When a new developer clones a repo everything they explore should be something relevant to the project they want to contribute to. Any time spent learning how to navigate around artifacts and ignoring obsolete header files is time that could have been better spent.
.gitignore is a text file that instructs Git what to ignore. By including it in the root of the repo and giving it a list of rules we can prevent the stuff from our local file system from getting committed into our repo in the first place. That means when someone clones our code they don’t need to worry about cruft getting pulled down and in their way.
Let’s look at some common offenders that often end up in Git repos — some just cause minor irritation while others lead to security breaches.
Minor Nuisances (thumbs.db, .directory)
These files are generated by operating system UIs (Windows and KDE respectively) to cache image thumbnails. Committing them to the repo provides no advantage, as the same operating system on other physical machines will regenerate them anyway. Excluding them keeps the repo’s commit history clean and keeps its total size on disk down.
Security Concern (.DS_Store)
.DS_Store is a cache file that gets created by Mac OS X anytime the Finder application accesses a folder. Because this file starts with a dot (.) it gets hidden by Mac OS X Finder by default, making it one of the most commonly committed files in public repos. This cache file includes metadata about the filesystem it came from and can be used for harm. For instance, in 2015, Telephone Communication Limited was hacked because the repo used to build the online storefront contained a .DS_Store file. This file was published during a website deployment and hackers used its contents to reverse engineer the structure of the website’s admin portal.
Dangerous: Executables (*.exe, *.jar, *.war)
In certain corporate environments it can be tempting to include executables in repos to make distribution easy. But there’s no way to know if the executable was built from the source code or purposely slipped in by a bad actor. And, because Git only cares about tracking content — and not timestamps — there is no reliable way to tell when the executable was built.
Using .gitignore
A .gitignore file is a plain text file where each line contains a pattern for files or directories that should be kept from being committed into the Git repo. The ignored files and folders will stay in the local file-system – they just won’t be added to the Git commits.
/src/*.exe
.directory
.DS_Store
[Tt]humbs.db
Steps to Implementing .gitignore Into the Repo
- Delete the files that shouldn’t be in the repository and make a commit.
- Add the .gitignore rules and commit.
- Repeat the process until all cruft is gone!
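Finding the files to delete in the first step can be partly automated. The Python sketch below takes a list of tracked paths (for example, the output of git ls-files) and returns those whose file name matches a set of cruft patterns. The function name find_tracked_cruft and the pattern list are assumptions made for this example, and fnmatchcase only approximates .gitignore matching.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical pattern list, based on the offenders discussed above.
CRUFT_PATTERNS = [".DS_Store", ".directory", "[Tt]humbs.db", "*.exe"]


def find_tracked_cruft(tracked_paths, patterns=CRUFT_PATTERNS):
    """Return the tracked paths whose file name matches any cruft pattern."""
    return [
        path for path in tracked_paths
        if any(fnmatchcase(PurePosixPath(path).name, pattern)
               for pattern in patterns)
    ]


tracked = [
    "src/main.go",
    "src/tool.exe",
    "docs/.DS_Store",
    "img/Thumbs.db",
    "README.md",
]
for path in find_tracked_cruft(tracked):
    print(path)
```

Each reported path can then be removed and committed. If the file should stay on the local disk, git rm --cached removes it from the index only, which is enough for the new ignore rule to take effect.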
To get started: search the web for documentation and tutorials, try out community provided templates, and use gitignore.io to build tailor-made .gitignore files based on your requirements.
Directing Git’s Encoding with .gitattributes
While Git does a good job being “dumb” there are still encoding decisions it makes to ensure interoperability between operating systems. And, while Git generally does a good job at making the right decision, there are situations when it’ll make less-than-ideal choices.
To make things more complicated, the default settings in older installations of Git cause it to automatically switch EOL characters to \n on Linux/Mac OS and \r\n on Windows. This setting should be disabled with:
git config --global core.autocrlf false
.gitattributes is a plain text file that instructs Git exactly how to make these decisions, so that every developer working with the repo gets the same, predictable handling. With it we can tell Git which files are binary and shouldn’t be encoded, and which files are text and how to encode them.
* text=auto eol=lf encoding=utf-8
*.proj eol=crlf
*.gif binary
*.jpg binary
*.png binary
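The reason image files need the binary marker can be shown with a simplified model of line-ending conversion. The Python sketch below is not Git's implementation; the function names are invented, and it only imitates the general idea of normalising text to LF on check-in and optionally writing CRLF on check-out. It applies that model to the 8-byte PNG signature, which deliberately contains both a CRLF and a bare LF so that this kind of damage is detectable.

```python
# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def to_repository(data):
    """Simplified check-in step for text files: CRLF becomes LF."""
    return data.replace(b"\r\n", b"\n")


def to_working_tree(data, eol="lf"):
    """Simplified check-out step: write LF or CRLF depending on eol."""
    normalised = to_repository(data)
    if eol == "crlf":
        return normalised.replace(b"\n", b"\r\n")
    return normalised


# Text survives a round trip with predictable line endings.
print(to_working_tree(b"a\nb\r\n", eol="crlf"))

# Treating a PNG as text alters its signature, so viewers reject the file.
print(to_repository(PNG_SIGNATURE) == PNG_SIGNATURE)
```

Marking *.png as binary tells Git to skip this conversion entirely, so the bytes in the repo always match the bytes on disk.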
To get started check out the official documentation or search the web for documentation and tutorials.
Final Thoughts
An old proverb says “the more you sweat in times of peace, the less you bleed in war.” As developers we should take this exact same attitude with the repos that hold our code. By spending the time now to craft these three files in our repositories, we can avoid these little mistranslations biting us when we least expect. Use these files to optimize the onboarding process for new developers, so that they can get grinding right away!
Checklist
- Always use .editorconfig: automate the formatting rules and reduce noise that’ll clutter up those git diff results.
- Always include a .gitignore file: combine templates like those found at gitignore.io. Files must be deleted from the repo before they can be properly ignored.
- Disable the line-ending auto-encoding on older Git installations: git config --global core.autocrlf false
- Always use .gitattributes: make sure to specify the default line endings and which files should be binary!