Software Engineering: A Modern Approach

Marco Tulio Valente

1 Git

The best way to learn Git is probably to first only do very basic things and not even look at some of the things you can do until you are familiar and confident about the basics. – Linus Torvalds

In this appendix, we introduce Git, currently the most widely used version control system, and discuss examples of its use. Inspired by the quote above from Linus Torvalds, the creator of Git, we will focus on the basic concepts and commands of this system. As emphasized by the quote, it’s important to master these fundamental commands before exploring more advanced ones. If you’re not familiar with the objectives and services provided by a version control system, we recommend first reading the Version Control section in Chapter 10.

1.1 Init & Clone

To start using Git to manage the versions of a system, we must execute one of the following commands: init or clone. The init command creates an empty repository. The clone command performs two actions: first, it creates an empty repository; second, it copies all the commits from a remote repository specified as a parameter into the newly created repository. Here’s an example command:

git clone https://github.com/USER-NAME/REPO-NAME

This command clones a GitHub repository into the current directory. We should opt for clone when working on a project that is already underway and has commits on a central server. In this example, GitHub serves as the central server.
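To see the difference between the two commands without a network connection, the following sketch uses a local directory in place of the GitHub URL. The directory names original and copy, the file README, and the author identity are hypothetical, chosen only for illustration.

```shell
set -eu
work=$(mktemp -d)                 # scratch directory for this sketch
cd "$work"

# init: creates an empty repository (no commits yet).
git init -q original
cd original
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "first version" > README
git add README
git commit -q -m "First commit"

# clone: creates a new repository and copies the commits into it.
cd "$work"
git clone -q original copy
git -C copy log --oneline                 # lists the commit copied from "original"
```

With a real project, the path original would be replaced by the URL of the remote repository, as in the command above.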

1.2 Commit

Commits create snapshots (or photos) of a system’s files. These snapshots are stored in the version control system in a compact and efficient manner to minimize disk space usage. Later, we can retrieve any of these snapshots. For example, we may want to restore an old implementation of a specific file.

Developers should make commits periodically, especially when they have introduced significant changes to the code. In distributed version control systems such as Git, commits are initially stored in the developer’s local repository. This makes the cost of a commit minimal, allowing developers to make multiple commits throughout a working day. However, developers should avoid making large commits that involve substantial modifications to multiple files. Moreover, changes related to more than one maintenance task should not be included in the same commit. For example, fixing two bugs in the same commit is not advisable. Instead, each bug should be addressed in a separate commit. This practice simplifies code review, especially in cases where a customer complains that a particular bug has not been resolved.

Commits also contain metadata, including date, time, author, and a message describing the modification introduced by the commit. The following figure shows a GitHub page that displays the main metadata of a commit from the google/guava repository. The commit refers to a refactoring, which is evident from its title. The refactoring is then explained in detail in the commit message. On the last line of the figure, we can see the author’s name and information indicating that the commit was made 13 days ago.

Also on the last line of the figure, we can see that every commit has a unique identifier, in this case:

1c757483665f0ba8fed31a2af7e31643a4590256

This identifier consists of 20 bytes, typically represented in hexadecimal. It provides a checksum of the commit’s content, computed using the SHA-1 hash function.
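As a small illustration, the sketch below creates a commit in a scratch repository and prints its identifier and metadata. It assumes the repository uses Git's default SHA-1 object format; the repository name, file name, and author identity are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "x = 10;" > file1
git add file1
git commit -q -m "Add file1"

id=$(git rev-parse HEAD)                  # full identifier of the most recent commit
echo "$id"
echo "${#id} hexadecimal characters"      # 40 characters represent 20 bytes
git log -1 --format='%an | %ad | %s'      # author, date, and message
```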

1.3 Add

Locally, the Git system has three distinct areas:

A working directory, where we save the files we intend to version. This area is also called a working tree.
The repository itself, which stores the commit history.
An intermediate area, called index or staging area, which temporarily stores the files intended for versioning. These files are referred to as tracked files.

Among these areas, the developer directly accesses only the working directory, which functions as a regular operating system directory. The other two areas belong to Git and are managed exclusively by it. Like any other directory, the working directory can contain various files. However, only files added to the index via the git add command are managed by Git.

In addition to storing the list of versioned files, the index also stores their content. Therefore, before executing a git commit, we must first run a git add to save the file’s content to the index. After this, we use git commit to store the version added to the index in the local repository. This process is illustrated in the next figure.

[Figure: git add and git commit]

Example: Consider the following simple file, which we’ll use to explain the add and commit commands.

// file1 
x = 10;

After creating this file, the developer executes the following command:

git add file1

This command adds file1 to the index (or staging area). However, immediately after, the developer modifies the file again:

// file1
x = 20; // new value for x

After this modification, the developer executes:

git commit -m "New value of x"

The -m flag provides the message that describes the commit. However, the key point to emphasize is this: since the user did not execute another add after changing the value of x to 20, the commit will not save the most recent version of the file. Instead, the version of file1 that will be committed is the one where x equals 10, as it is the version stored in the index.

To avoid this problem, developers commonly use a commit command like this:

git commit -a -m "New value of x"

The -a option instructs Git to add to the index all tracked files that have been modified since the last commit and only then execute the requested commit. Note that -a affects only files that are already tracked. Therefore, it does not eliminate the need to use add: we still need to run this command at least once to tell Git that we want a specific file to be tracked.
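The file1 scenario can be replayed in a scratch repository, as in the following sketch. The repository name and author identity are hypothetical, and the second commit message is changed slightly so that the two commits can be told apart.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"

printf '// file1\nx = 10;\n' > file1
git add file1                             # the index now holds the x = 10 version
printf '// file1\nx = 20; // new value for x\n' > file1
git commit -q -m "New value of x"         # commits what is in the index

git show HEAD:file1                       # committed version: x = 10
cat file1                                 # working directory: x = 20

git commit -q -a -m "New value of x (using -a)"   # stages tracked files, then commits
git show HEAD:file1                       # committed version is now x = 20
```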

Just as there is an add, there is also a command to remove a file from a Git repository. Here’s an example:

git rm file1.txt
git commit -m "Removed file1.txt"

In addition to removing the file from the local Git repository, the rm command also deletes the file from the working directory.
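A minimal sketch of this behavior, run in a scratch repository with a hypothetical author identity:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "x = 10;" > file1.txt
git add file1.txt
git commit -q -m "Add file1.txt"

git rm -q file1.txt                       # removes the file from the index and from disk
git commit -q -m "Removed file1.txt"

ls                                        # file1.txt is no longer in the working directory
git ls-files                              # and it is no longer tracked
```

The file still exists in the earlier commit, so it can be retrieved from the history if needed.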

1.4 Status, Diff & Log

The status command is one of the most frequently used Git commands. It shows the state of the working directory and the index, among other information. For example, it can be used to show information about:

Files in the working directory that have been changed but not yet added to the index.
Files in the working directory that are not tracked by Git, meaning they have not been subject to an add command.
Files that are in the index, waiting for a commit.

The git diff command highlights modifications made to files in the working directory that haven’t been added to the index yet. For each modified file, the command shows the lines that have been added (+) and removed (-). Developers frequently use git diff before add/commit operations to check the changes that will be recorded in the version control system.

The git log command shows information on the latest commits, including the commit hash, author, date and time, and message.
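The sketch below sets up one file in each of the three situations listed above and then runs the three commands. The file names a.txt, b.txt, and c.txt and the author identity are hypothetical. The --short and --oneline options are used only to keep the output compact.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "one" > a.txt
git add a.txt
git commit -q -m "Add a.txt"

echo "two" >> a.txt                       # tracked file, changed but not added
echo "draft" > b.txt                      # untracked file
echo "new" > c.txt
git add c.txt                             # in the index, waiting for a commit

git status --short                        # " M" changed, "A " staged, "??" untracked
git diff                                  # shows only the unstaged change to a.txt
git log --oneline                         # one line per commit
```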

1.5 Push & Pull

The push command copies the most recent commits from the local repository to a remote repository. It is generally slower than a commit, as it involves network communication. A push should be used when a developer wants to make a given modification visible to other developers. To update their local repositories, the other team members must use a pull command. This command performs two main operations:

First, a pull copies the most recent commits from the remote repository to the developer’s local repository. This operation is called fetch.
Then, it integrates these commits into the current branch and updates the files in the working directory. This operation is called merge.

The following figure illustrates the functioning of push and pull commands.

Example: Assume that the central repository contains the following file:

void f() {
  ... 
}

Imagine that two developers, Bob and Alice, each perform a pull to copy this file into their local repositories and working directories. The syntax is:

git pull

Later, Bob implements a second function g in this file:

void f() {
  ... 
}

void g() {   // by Bob
  ... 
}

Next, Bob executes an add, commit, and push. The syntax of the push command is:

git push origin main

The origin parameter is the default name that Git gives to the remote repository from which the local repository was cloned, such as a GitHub repository. The main parameter refers to the main branch. We’ll discuss branches in more detail later.

After running the above push command, the new version of the file is copied to the remote repository. A few days later, Alice decides to modify the same file. Since she’s been away from the project for a while, she should first execute a pull to update her local repository and working directory with any recent changes, such as the one made by Bob. After the pull, Alice’s local copy of the file will include the function g implemented by Bob.
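The exchange between Bob and Alice can be simulated on a single machine, as sketched below. A local bare repository (central.git) stands in for GitHub, and the directories bob and alice play the role of each developer's machine. These names, the file name file.c, and the author identity are hypothetical. The sketch assumes Git 2.28 or newer for the -b option of init.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare -b main central.git    # stands in for the remote repository

# Setup: Bob publishes the first version of the file.
git init -q -b main bob
cd bob
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
printf 'void f() {\n  ...\n}\n' > file.c
git add file.c
git commit -q -m "Add f"
git remote add origin "$work/central.git"
git push -q origin main

# Alice obtains her local repository and working directory.
git clone -q "$work/central.git" "$work/alice"

# Bob implements g, then executes add, commit, and push.
printf '\nvoid g() {   // by Bob\n  ...\n}\n' >> file.c
git add file.c
git commit -q -m "Add g"
git push -q origin main

# Later, Alice pulls before resuming her work.
cd "$work/alice"
git pull -q
cat file.c                                # now includes g
```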

1.6 Merge Conflicts

Merge conflicts occur when two or more developers modify the same section of the code simultaneously. Let’s examine an example to better understand this situation.

Example: Suppose Bob implements the following program:

main() {
  print("Helo, world!");
}

Upon completing the implementation, Bob executes an add, followed by commit, and push.

Later, Alice performs a pull to retrieve the file implemented by Bob. She then decides to translate the program’s message into Portuguese.

main() {
  print("Olá, mundo!");
}

While Alice is making the translation, Bob notices that he misspelled Hello with only one l. However, before Bob can make his correction, Alice completes her changes and executes the trio of commands add, commit, and push.

Bob, after correcting the typo, executes an add, followed by a commit. When he attempts to push, the command fails with the following message:

  Updates were rejected because the remote contains work that you do
  not have locally. This is usually caused by another repository
  pushing to the same ref. You may want to first integrate the
  remote changes (e.g., git pull …) before pushing again.

The message indicates that Bob can’t push because the remote repository contains a new version of the file, pushed by Alice. Bob needs to perform a pull first. However, when he does this, he receives a new error message:

CONFLICT (content): Merge conflict in file2
Automatic merge failed; fix conflicts and then commit the result.

This message clearly indicates a merge conflict in file2. Upon opening this file, Bob sees that Git has modified it to highlight, using special delimiters, the conflict-generating lines:

main() {
<<<<<<< HEAD
  print("Hello, world!");
=======
  print("Olá, mundo!");
>>>>>>> f25bce8fea85a625b891c890a8eca003b723f21b
}

These modifications should be interpreted as follows:

The code between <<<<<<< HEAD and ======= is the modification made by Bob, whose push was rejected and who therefore had to execute a pull. HEAD indicates that this code was modified in Bob’s most recent local commit.
The code between ======= and >>>>>>> f25bce8... is the modification made by Alice, who successfully executed the push. The string f25bce8... is the ID of the commit in which Alice modified this code.

Now, Bob must resolve the conflict manually. He must choose which section of the code will prevail—his code or Alice’s—and edit the file according to his choice, removing the delimiters inserted by Git.

Assuming Bob decides Alice’s code is correct, since the system is now using messages in Portuguese, he should edit the file to look like this:

main() {                
  print("Olá, mundo!");                      
}

Note that Bob has removed the delimiters inserted by Git (<<<<<<< HEAD, =======, and >>>>>>> f25bce8...) as well as the print command with the message in English. After editing the code to its correct form, Bob should execute the commands add, commit, and push again, which will now succeed.

This example demonstrates a simple conflict, confined to a single line of a single file. However, a pull can result in more complex conflicts. For instance, the same file may contain several conflicts, or conflicts may span more than one file.
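The whole scenario can be reproduced locally with the sketch below. As an assumption of this sketch, a bare repository (central.git) stands in for GitHub and the directories bob and alice simulate the two developers. The sketch assumes Git 2.28 or newer for init -b. It also passes --no-rebase to pull, because recent Git versions ask how divergent branches should be reconciled, and this option requests a merge.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare -b main central.git    # stands in for the remote repository

# Bob writes the program (with the typo) and pushes it.
git init -q -b main bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
printf 'main() {\n  print("Helo, world!");\n}\n' > file2
git add file2
git commit -q -m "Hello world program"
git remote add origin "$work/central.git"
git push -q -u origin main

# Alice clones, translates the message, and pushes first.
git clone -q "$work/central.git" "$work/alice"
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
printf 'main() {\n  print("Olá, mundo!");\n}\n' > file2
git commit -q -a -m "Translate message to Portuguese"
git push -q origin main

# Bob fixes the typo; his push is rejected and his pull reports a conflict.
cd "$work/bob"
printf 'main() {\n  print("Hello, world!");\n}\n' > file2
git commit -q -a -m "Fix typo"
git push -q origin main 2>/dev/null || echo "push rejected"
git pull --no-rebase || echo "pull stopped: merge conflict"
markers=$(grep -c '^<<<<<<< ' file2)      # number of conflict regions in file2
cat file2

# Bob keeps Alice's version, then executes add, commit, and push again.
printf 'main() {\n  print("Olá, mundo!");\n}\n' > file2
git add file2
git commit -q -m "Merge: keep the Portuguese message"
git push -q origin main
```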

1.7 Branches

Git organizes the workspace into virtual folders, called branches. So far, we have not discussed branches because every repository has a default branch, created by the init command and usually named main (depending on its version and configuration, Git may name it master instead). If we do not concern ourselves with branches, all development will occur on this branch. However, creating additional branches can often enhance the organization of development. To illustrate this concept, let’s explore an example.

Example: Suppose Bob is responsible for maintaining a certain feature of a system. For simplicity, let’s assume this feature is implemented in a single function f. Bob has an idea to completely change the implementation of f to use a more efficient algorithm and data structure. This change will require a few weeks of work. While optimistic, Bob is not sure if the new implementation will provide the gains he anticipates. Additionally, during the new implementation, Bob might need to access the original code of f, for example, to fix bugs reported by users.

This scenario presents an ideal opportunity for Bob to create a branch to implement and test this new version of f in isolation. To do this, he should use:

git branch f-new

This command creates a new branch named f-new, provided that this branch does not already exist.

To switch from the current branch to a new branch, use git checkout [branch-name]. To find the name of the current branch, simply use git branch. This command lists all branches and highlights the current one.

As mentioned earlier, we can conceptualize branches as virtual subdirectories within the working directory. The key distinction is that branches are managed by Git, not by the operating system, making them virtual in nature. Expanding on this analogy, the git branch [name] command is akin to the mkdir [name] command, but Git not only creates the branch but also copies all the files from the parent branch to it. In contrast, directories created by the operating system start empty. The git checkout [name] command is similar to the cd [name] command, while git status combines aspects of both the ls and pwd commands.

Just as we can customize the operating system prompt to include information about the current directory, we can also customize it to show the current Git branch. As a result, the prompt can take, for example, the following form: ~/projects/systemXYZ/main.

However, there’s an important difference between branches and directories. A developer can only switch the current branch from A to B if they have saved their modifications to A, typically by executing add and commit. If there are uncommitted changes that would be overwritten by the switch, git checkout B will fail, resulting in an error message similar to this:

  Your local changes to the following files would be overwritten by checkout:
  [list of files]
  Please commit your changes or stash them before you switch branches.
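The following sketch provokes this error in a scratch repository and then shows that the switch succeeds once the changes are committed. The file name f.txt, the variable refused, and the author identity are hypothetical. The sketch assumes Git 2.28 or newer for init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q -b main repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "v1" > f.txt
git add f.txt
git commit -q -m "v1 on main"

git branch f-new
git checkout -q f-new
echo "v2" > f.txt
git commit -q -a -m "v2 on f-new"

echo "work in progress" > f.txt           # uncommitted change on f-new
if git checkout -q main 2>"$work/err.txt"; then
  refused=no
else
  refused=yes                             # Git stays on f-new and keeps the change
fi
echo "checkout refused: $refused"
cat "$work/err.txt"

git commit -q -a -m "Save work in progress"
git checkout -q main                      # succeeds now
cat f.txt                                 # content of f.txt on main
```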

Returning to the example, after Bob has created his branch, he must proceed as follows. When he plans to work on the new implementation of f, he should first switch the current branch to f-new. On the other hand, when he needs to modify the original code of f—the production code—he should ensure he is on the main branch. Regardless of which branch he is working on, Bob must use add and commit to save the state of his work.

Bob will continue with this workflow, alternating between the f-new and main branches until the new implementation of f is completed. Once completed, Bob needs to merge the new code into the original codebase. However, with the use of branches, he no longer needs to perform this operation manually. Git provides a command called merge that handles this integration automatically. The syntax is as follows:

git merge f-new

This command must be executed on the branch that will receive the modifications from f-new. In our example, it should be run on the main branch.

As the reader may anticipate, a merge can generate conflicts, also known as integration conflicts. These conflicts occur when both the receiving branch (main, in our example) and the branch being merged (f-new, in our example) have modified the same lines of code. As discussed in Section 1.6, Git detects and marks the conflict areas, leaving it up to the developer who initiated the merge to resolve them, i.e., to choose the code that should prevail.

After completing the merge, Bob can remove the f-new branch, since the commits made on it are now also reachable from main. To delete f-new, he should execute the following command while on the main branch:

git branch -d f-new
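Bob's workflow can be condensed into the sketch below, which creates the branch, commits on it, merges it into main, and deletes it. The file f.txt and the author identity are hypothetical. Since main receives no new commits in this simplified version, the merge is a simple fast-forward. The sketch assumes Git 2.28 or newer for init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q -b main repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "original f" > f.txt
git add f.txt
git commit -q -m "Original implementation of f"

git branch f-new                          # create the branch
git checkout -q f-new                     # switch to it
echo "new f" > f.txt
git commit -q -a -m "New implementation of f"

git checkout -q main                      # back to the production code
cat f.txt                                 # still the original implementation
git merge -q f-new                        # run on the branch that receives the changes
cat f.txt                                 # now the new implementation
git branch -d f-new                       # delete the merged branch
git branch                                # lists the remaining branches
```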

1.7.1 Commit Graphs

Commits may have zero, one, or more parents (or predecessors). As the following figure illustrates, the first commit of a repository does not have a parent. A merge commit, however, has two or more parents, representing the branches that were merged. For example, commit 10 in the figure has two parents. All other commits in this figure have exactly one parent.

A branch is simply an internal Git variable that contains the identifier of the last commit made on that branch. There is also a variable called HEAD, which points to the current branch. More specifically, HEAD contains the name of the variable that holds the identifier of the last commit on the current branch. For instance:

In this example, there are two branches, represented by the MAIN and ISSUE-45 variables. Each variable points to the last commit on its respective branch. The HEAD variable points to the MAIN variable, indicating that the current branch is MAIN. If a new commit is made, the graph changes as follows:

The new commit has the identifier 7. It was made on MAIN, as HEAD was pointing to this branch’s variable. The parent of the new commit is the previous HEAD, commit 3. The MAIN variable has moved forward to point to the new commit. Consequently, if the branch remains unchanged, the parent of the next commit will be commit 7.

However, if we switch to the ISSUE-45 branch, the graph would appear as shown in the following figure. The only change is that the HEAD variable now points to ISSUE-45. This change alone is sufficient to direct the next commit to this branch, ensuring that this commit will have commit 6 as its parent.
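These variables can be inspected directly, as in the following sketch. It builds a small graph with two branches, main and issue-45, and prints the content of HEAD and of each branch variable before and after a switch. The file names and the author identity are hypothetical, and the commit numbers in the messages do not correspond to those in the figures. The sketch assumes Git 2.28 or newer for init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q -b main repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"

echo "1" > a.txt
git add a.txt
git commit -q -m "Commit 1"               # first commit: no parent
git branch issue-45                       # new variable pointing at commit 1

echo "2" >> a.txt
git commit -q -a -m "Commit 2"            # made on main, where HEAD points

git symbolic-ref HEAD                     # prints refs/heads/main
git rev-parse main issue-45               # commit ID stored in each branch variable

git checkout -q issue-45                  # only HEAD changes
git symbolic-ref HEAD                     # prints refs/heads/issue-45
echo "3" > b.txt
git add b.txt
git commit -q -m "Commit 3"               # its parent is commit 1

git checkout -q main
git merge -q -m "Merge issue-45" issue-45 # merge commit with two parents
git log --graph --oneline
git log -1 --format='parents: %p'         # abbreviated IDs of the two parents
```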

1.8 Remote Branches

Up until now, we’ve been working with local branches, i.e., the branches we’ve discussed exist only in the local repository. However, it is also possible to push a local branch to a remote repository. To illustrate this feature, let’s use an example similar to the one in the previous section.

Example: Suppose that Bob created a branch called g-new to implement a new functionality. He made some commits on this branch, and now he would like to share it with Alice so that she can collaborate on this new implementation. To achieve this, Bob should use the following push command:

git push -u origin g-new

This command pushes the g-new branch to the remote repository, referred to as origin by Git. The remote repository can be, for instance, a GitHub repository. The -u parameter (the u refers to upstream) associates the local branch with the newly created remote branch, so that future pull and push commands on g-new know which remote branch to synchronize with. It is needed only on the first push of the branch. In subsequent commands, we can omit -u, i.e., just use git push origin g-new.

In the remote repository, a g-new branch will be created. To work on this branch, Alice must first create it on her local machine and then associate it with the remote branch. To do this, she should execute the following commands while on the main branch:

git pull

git checkout -t origin/g-new

The first command makes the remote branch visible on her local machine. The second command creates a local g-new branch, which Alice will use to track changes on the remote branch. This is indicated by the -t parameter, which stands for tracking. Next, Alice can make commits to this branch. When she is ready to publish her changes, she should execute a push, using the usual syntax, i.e., without the -u parameter.

After that, Bob can execute a pull and may conclude, for example, that the implementation of the new functionality is finished and can be merged into the main branch. He can also delete the local and remote branches using:

git branch -d g-new

git push origin --delete g-new

Alice can also delete her local branch by using:

git branch -d g-new
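The complete life cycle of the g-new branch is sketched below on a single machine. As assumptions of this sketch, a local bare repository (central.git) stands in for GitHub, the directories bob and alice simulate the two developers, and the file names are hypothetical. Git 2.28 or newer is assumed for init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare -b main central.git    # stands in for the remote repository

# Setup: Bob publishes main and Alice clones the repository.
git init -q -b main bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "Base"
git remote add origin "$work/central.git"
git push -q -u origin main
git clone -q "$work/central.git" "$work/alice"

# Bob creates g-new, commits, and shares the branch.
git branch g-new
git checkout -q g-new
echo "g, part 1 (Bob)" > g.txt
git add g.txt
git commit -q -m "Bob: start g"
git push -q -u origin g-new

# Alice creates a local tracking branch, commits, and pushes.
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
git pull -q
git checkout -q -t origin/g-new
echo "g, part 2 (Alice)" >> g.txt
git commit -q -a -m "Alice: extend g"
git push -q origin g-new

# Bob pulls, merges into main, and deletes the local and remote branches.
cd "$work/bob"
git pull -q
git checkout -q main
git merge -q g-new
git push -q origin main
git branch -d g-new
git push -q origin --delete g-new

# Alice updates main and deletes her local branch.
cd "$work/alice"
git checkout -q main
git pull -q
git branch -d g-new
```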

1.9 Pull Requests

Pull requests are a mechanism that allows a branch to be reviewed and discussed before it is integrated into the main branch. When using pull requests, a developer first implements new features in a separate branch. Once this implementation is complete, they do not immediately integrate the new code into the main branch. Instead, they open a request for their branch to be reviewed and approved by another developer. This request for review and integration is called a pull request. Pull requests are common on GitHub, but they have equivalents on other code hosting platforms, such as merge requests on GitLab.

Typically, the review and integration process takes place via a web interface provided by platforms such as GitHub. Without such an interface, the reviewer would have to start by performing a pull of the branch to their local machine. This is the origin of the name: a pull request is a request for another developer to pull, review, and integrate a certain branch.

Next, we walk through the process of submitting and reviewing pull requests using an example.

Example: Bob and Alice are members of an organization that maintains a repository called awesome-git, which contains a list of interesting links about Git. The links are stored in the README.md file in this repository. Any member of the organization can suggest adding links to this page. However, they cannot push directly to the main branch. Instead, each suggestion needs to be reviewed and approved by another team member.

Bob decides to suggest adding this appendix to the list. To do so, he first clones the repository, enters its directory, creates a branch named se-book-appendix, and switches to it, using the following commands:

git clone https://github.com/aserg-ufmg/awesome-git.git
cd awesome-git
git branch se-book-appendix
git checkout se-book-appendix

Next, Bob edits the README.md file, adding the URL of this appendix. He then performs an add, commit, and pushes the branch to GitHub:

git add README.md
git commit -m "SE: A Modern Approach - Appendix A - Git"
git push -u origin se-book-appendix

The steps described so far are similar to those presented in the previous section. The process diverges from here. First, Bob needs to go to the GitHub page and select the se-book-appendix branch. GitHub then displays a button to create a pull request. Bob clicks on this button and describes his pull request, as shown in the following figure.

A pull request is a request for another developer to review and, if appropriate, merge a branch you have created. Consequently, pull requests are a way for an organization to implement code reviews. In this case, developers do not directly integrate their code into the remote repository’s main branch. Instead, they request that other developers first review their code and then merge it.

On GitHub’s pull request creation page, Bob can invite Alice to review his code. She will then be notified that there is a pull request waiting for review. Through GitHub’s interface, Alice can review the commits from Bob’s pull request. For example, she can inspect a diff between the new and old code. If necessary, Alice can exchange messages with Bob to clarify any doubts about the code. She can also request changes to the proposed code. In such a case, Bob needs to implement the changes and perform a new add, commit, and push. The new commits will be automatically appended to the pull request, allowing Alice to verify if her requests have been addressed. Once all modifications are approved, Alice can integrate the code into the main branch by clicking a button on the pull request review page.

1.10 Squash

Squash is a Git operation that combines several commits into a single commit. It is often recommended, for example, before submitting pull requests.

Example: In the previous example, suppose the pull request created by Bob has five commits. Specifically, he is suggesting the addition of five new links to the awesome-git repository, which he gathered over a few weeks. After discovering each link, Bob performed a commit on his machine. He planned to create the pull request only after accumulating five commits.

However, to facilitate Alice’s review of his pull request, Bob decides to merge the five commits into a single one. This way, instead of analyzing five commits, Alice will need to review only one. The submitted modification remains exactly the same, i.e., it consists of adding five links to the page. However, instead of the changes being distributed across five commits, they are now consolidated into a single commit.

To perform a squash, Bob should execute the following command:

git rebase -i HEAD~5

The number 5 indicates that he intends to merge the last five commits on the current branch. After executing this command, Git opens a text editor with a list containing the ID and description of each commit, as shown below:

pick 16b5fcc Including link 1
pick c964dea Including link 2
pick 06cf8ee Including link 3
pick 396b4a3 Including link 4
pick 9be7fdb Including link 5

Bob should use the editor to replace the word pick with squash for all lines except the first one. The file will then look like this:

pick 16b5fcc Including link 1
squash c964dea Including link 2
squash 06cf8ee Including link 3
squash 396b4a3 Including link 4
squash 9be7fdb Including link 5

After saving and closing the file, Git automatically opens a new editor for Bob to enter the message for the new commit, that is, the commit that will combine the five listed commits. After providing this message, Bob saves the file, and the squash operation is complete.
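To experiment with squash in a script, the two editing steps can be automated, as in the sketch below. This is only a non-interactive stand-in for what Bob does by hand: the GIT_SEQUENCE_EDITOR variable replaces the first editor with a sed command that changes pick to squash on every line except the first, and GIT_EDITOR=true accepts the combined commit message proposed by Git. The repository name, file content, and author identity are hypothetical. Git 2.28 or newer is assumed for init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q -b main repo
cd repo
git config user.name "Bob"                # hypothetical identity
git config user.email "bob@example.com"
echo "# Links" > README.md
git add README.md
git commit -q -m "Initial list"

for i in 1 2 3 4 5; do
  echo "link $i" >> README.md
  git commit -q -a -m "Including link $i"
done
git log --oneline                         # the initial commit plus five link commits

# Keep "pick" on the first line of the list and use "squash" on the others.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -q -i HEAD~5

git log --oneline                         # the initial commit plus one squashed commit
cat README.md                             # still contains the five links
```

In everyday use, Bob would run git rebase -i HEAD~5 without these variables and edit both files in his usual editor.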

1.11 Forks

A fork is a mechanism provided by GitHub to clone remote repositories, i.e., repositories stored on GitHub. A fork is performed via GitHub’s web interface. On the page of any repository, there is a button to perform this operation. If we fork the torvalds/linux repository, a copy of this repository will be created in our GitHub account, named, for example, mtov/linux.

As in previous sections, let’s use an example to explain this operation.

Example: Consider the aserg-ufmg/awesome-git repository, used in the example about pull requests. Now, let’s introduce a third developer, named Carol. Carol has identified an important and interesting link that she believes is missing from the current list, and she would like to suggest its inclusion. However, she is not a member of the ASERG/UFMG organization, so she doesn’t have permission to push to this repository and cannot follow the same steps used by Bob in the previous example.

To solve this problem, Carol should start by forking the repository. To do so, she simply needs to click on the fork button on the page of the GitHub repository. After forking, she will have a new repository in her GitHub account, named carol/awesome-git. She can then clone this repository to her local machine, create a branch, add the link she wants to the list, and perform an add, commit, and push. This last operation will be carried out on the forked repository. Finally, Carol should go to the page of her GitHub fork and create a pull request. Since the repository is a fork, she has an extra option: to direct the pull request to the original repository. Thus, the developers of the original repository, like Bob and Alice, will be responsible for reviewing and, if appropriate, accepting the pull request.

In summary, a fork is a mechanism that, when combined with pull requests, allows an open-source project to receive contributions from external developers. Specifically, an open-source project can receive contributions—more precisely, commits—not only from its core team of developers (Bob and Alice, in our example) but also from any other developer with a GitHub account (like Carol).


Exercises

Replicate each of the examples presented in this appendix. For examples involving remote repositories, use a GitHub repository. For examples involving two users (Alice and Bob), create two local directories and use them to simulate each user’s commands.

This book was formatted using Pandoc to convert Markdown to LaTeX, which was then used to generate a PDF file. The fonts used are Bitstream Charter for text and Beramono for code, both at 11pt. The EPUB and HTML versions were also generated from the same Markdown files using Pandoc.
