All rights reserved. Version for personal use only.

Software Engineering: A Modern Approach

Marco Tulio Valente

1 Git

The best way to learn Git is probably to first only do very basic things and not even look at some of the things you can do until you are familiar and confident about the basics. – Linus Torvalds

In this appendix, we introduce and discuss examples of using the Git system, currently the most widely used version control system. Inspired by the quote above from Linus Torvalds, the creator of Git, we will focus on the basic concepts and commands of this system. As suggested by the quote, it’s important to master these commands before delving into more advanced ones. If you are not familiar with the objectives and services provided by a version control system, we recommend first reading the Version Control section from Chapter 10.

1.1 Init & Clone

To start using Git to manage the versions of a system, we must execute one of the following commands: init or clone. The init command creates an empty repository. The clone command first calls init to create an empty repository. Then, it copies into this repository all the commits from a remote repository, passed as a parameter. The following command is an example:

git clone https://github.com/USER-NAME/REPO-NAME

It clones a GitHub repository into a new subdirectory of the current directory, named after the repository (REPO-NAME, in this case). Thus, we should use clone when working on a project that is already underway and has commits on a central server. In the example, this server is GitHub.
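The two commands can be tried without a GitHub account. The following sketch is only an illustration: it assumes a scratch directory, a hypothetical identity (Example User), and a local repository named project that stands in for the remote one.

```shell
set -e
work="$(mktemp -d)"   # scratch directory (assumption: any empty directory works)
cd "$work"

# init: creates an empty repository in a new directory named project.
git init -q project
git -C project config user.name "Example User"        # hypothetical identity
git -C project config user.email "user@example.com"

# Give the repository one commit, so that there is something to clone.
echo "hello" > project/README
git -C project add README
git -C project commit -q -m "First commit"

# clone: creates the directory project-copy and copies all commits into it.
git clone -q project project-copy
commits_in_clone="$(git -C project-copy rev-list --count HEAD)"
echo "commits in clone: $commits_in_clone"
```

Here clone receives a local path instead of a URL, but the steps are the same: a new directory is created and the existing history is copied into it.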

1.2 Commit

Commits are used to create snapshots (or photos) of a system’s files. Once these snapshots have been taken, they are stored in the version control system in a compact and efficient manner to minimize disk space usage. Subsequently, we can retrieve any of the snapshots. For instance, we may want to restore an old implementation of a specific file.

Developers are advised to make commits periodically, especially when they have introduced significant changes to the code. In distributed version control systems such as Git, commits are initially stored in the developer’s local repository. Thus, the cost of a commit is minimal, allowing developers to make multiple commits throughout a working day. However, it is not recommended to make large commits that involve substantial modifications in multiple files. Additionally, changes related to more than one maintenance task should not be included in the same commit. For instance, fixing two bugs in the same commit is not advisable. Instead, each bug should be addressed in a separate commit. This practice simplifies code review, especially in cases where a customer complains that a particular bug has not been resolved.

Commits also contain metadata, including date, time, author, and a message describing the modification made by the commit. The next figure shows a GitHub page that displays the main metadata of a commit from the google/guava repository. We can see that the commit refers to a refactoring, which is clear from its title. Then, the refactoring is explained in detail in the commit message. In the last line of the figure, we can see the author’s name and the information that the commit was made 13 days ago.

GitHub Commit

In the last line of the figure, we can also note that every commit has a unique identifier, in this case:

1c757483665f0ba8fed31a2af7e31643a4590256

This identifier has 20 bytes, normally written as 40 hexadecimal digits. It is a checksum of the commit’s content, computed using the SHA-1 hash function.
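To see that the identifier is indeed derived from the commit's content, we can ask Git to hash the stored commit again and compare the result with the commit ID. This is a minimal sketch, assuming a scratch repository, a hypothetical identity, and Git's default SHA-1 object format.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "x = 10;" > file1
git add file1
git commit -q -m "Add file1"

# The ID of the most recent commit.
commit_id="$(git rev-parse HEAD)"

# cat-file prints the raw content of the commit (tree, author, date, message);
# hash-object recomputes the checksum of that content without storing anything.
recomputed_id="$(git cat-file commit HEAD | git hash-object -t commit --stdin)"

echo "commit id:     $commit_id"
echo "recomputed id: $recomputed_id"
echo "length:        ${#commit_id} hexadecimal digits"
```

Since the metadata (author, date, message) is part of the hashed content, two commits with the same files but different messages receive different identifiers.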

1.3 Add

Locally, the Git system has three distinct areas:

  • A working directory, where we save the files we intend to version. Sometimes, this area is also called a working tree.

  • The repository itself, which stores the commit history.

  • An intermediate area, called index or stage, which temporarily stores the files intended for versioning. Such files are referred to as tracked.

Among these areas, the developer only accesses the working directory, which functions as a regular operating system directory. The other two areas belong to Git and are managed solely by it. Like any directory, the working directory can contain various files. However, only the ones added to the index, by means of a git add, are managed by Git.

In addition to storing the list of versioned files, the index also stores their content. Thus, before conducting a git commit, we must execute a git add to save the file’s content to the index. Having done that, we should use a git commit to store the version added to the index in the local repository. This process is illustrated in the next figure.

git add and git commit

Example: Suppose the following simple file, which is sufficient to explain the add and commit commands.

// file1 
x = 10; 

After creating this file, the developer executed the following command:

git add file1

It adds file1 to the index (or stage). However, immediately thereafter, the developer modified the file again:

// file1
x = 20; // new value for x

Having done that, the developer executed:

git commit -m "New value of x"

The -m flag provides the message that describes the commit. However, the point we want to stress here is this: since the user did not execute a new add after changing the value of x to 20, the commit will not save the most recent version of the file. Instead, the version of file1 that will be versioned is the one where x equals 10 because it is the version in the index.

To avoid this problem, it’s common to use a commit like this:

git commit -a -m "New value of x"

The -a option indicates that before executing the commit, we want to add to the index all tracked files that have been modified since the last commit. Note, however, that the -a option does not eliminate the need to use add: files that have never been added are not tracked and, therefore, are ignored by commit -a. We still need to use add at least once to tell Git that we want to track a specific file.
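The example above can be reproduced with the following sketch. It assumes a scratch repository and a hypothetical identity, and it uses git show HEAD:file1 to print the version of file1 stored in the most recent commit.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "x = 10;" > file1
git add file1                               # the index now holds x = 10

echo "x = 20; // new value for x" > file1   # changed after the add

# Commit without a new add: the version in the index (x = 10) is saved.
git commit -q -m "New value of x"
committed_first="$(git show HEAD:file1)"
echo "first commit stores:  $committed_first"

# Commit with -a: the modified tracked file is added to the index first.
git commit -q -a -m "New value of x"
committed_second="$(git show HEAD:file1)"
echo "second commit stores: $committed_second"
```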

Just like there is an add, there is also a command to remove a file from a Git repository. An example is as follows:

git rm file1.txt
git commit -m "Removed file1.txt"

Besides removing from the local Git repository, the rm command also deletes the file from the working directory.

1.4 Status, Diff & Log

The status command is one of the most used Git commands. Among other information, it shows the state of the working directory and the index. For example, it can be used to show information about:

  • Files in the working directory that have been changed but haven’t been added to the index yet.

  • Files in the working directory that are not tracked by Git, meaning they have not been subject to an add.

  • Files that are in the index, waiting for a commit.

The git diff command highlights modifications made to the files in the working directory that haven’t been moved to the index yet. For each modified file, the command shows the lines that have been added (+) and removed (-). git diff is frequently used before an add/commit to check the changes that will be recorded in the version control system.

The git log command shows information on the latest commits, including date, author, time, and description.
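The three situations reported by status can be produced on purpose, as in the sketch below. It assumes a scratch repository, a hypothetical identity, and three illustrative files; the --short and --oneline options only make the output more compact.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "x = 10;" > file1
git add file1
git commit -q -m "Add file1"

echo "x = 20;" > file1   # tracked file, changed but not added to the index
echo "y = 1;"  > file2   # untracked file (never subject to an add)
echo "z = 2;"  > file3
git add file3            # in the index, waiting for a commit

status_out="$(git status --short)"
diff_out="$(git diff)"
log_out="$(git log --oneline)"

echo "$status_out"   # " M file1", "A  file3", and "?? file2"
echo "$diff_out"     # only the change to file1: -x = 10; and +x = 20;
echo "$log_out"      # one line per commit: abbreviated ID and message
```

Note that file3 does not appear in the output of git diff, since its content has already been moved to the index.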

1.5 Push & Pull

The push command copies the most recent commits from the local repository to a remote repository. Hence, it is generally a slower operation, as it involves network communication. A push should be used when a developer wants to make a given modification visible to other developers. To update their local repositories, the other team members must use a pull command. This command performs two main operations:

  • First, a pull copies the most recent commits from the remote repository to the developer’s local repository. This operation is called fetch.

  • Then, the files in the working directory are updated. This operation is called merge.

The next figure illustrates the functioning of push and pull commands.

Push and pull commands

Example: Assume that in the central repository of a project there is the following file:

void f() {
  ... 
}

Imagine two developers, named Bob and Alice, performed a pull and, hence, copied this file into their local repositories and their respective working directories. The syntax of this command is:

git pull

On the same day, Bob implemented a second function g in this file:

void f() {
  ... 
}

void g() {   // by Bob
  ... 
}

Next, Bob executed an add, commit, and push. The syntax of the push is:

git push origin main

The origin parameter is the default name that Git gives to the remote repository from which the local repository was cloned, such as a GitHub repository. The main parameter refers to the main branch. We will study more about branches soon.

After running the above push command, the new version of the file will be copied to the remote repository. A few days later, Alice decided she needs to modify that same file. Since she’s been away from the project for a while, it is recommended that she first executes a pull to update her local repository and working directory with the changes that have occurred in that period, like the one made by Bob. Thus, after the pull, the file in question will be updated on Alice’s machine to include the function g implemented by Bob.
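The scenario with Bob and Alice can be simulated on a single machine. The sketch below is an illustration under some assumptions: a local bare repository (central.git) plays the role of the GitHub repository, the directories bob and alice are the developers' clones, and the file is named file.c.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Central repository (assumption: a local bare repository instead of GitHub).
git init -q --bare central.git
git -C central.git symbolic-ref HEAD refs/heads/main   # default branch: main

# Bob clones the (still empty) central repository and publishes f.
git clone -q central.git bob 2>/dev/null   # hides the "empty repository" warning
cd bob
git symbolic-ref HEAD refs/heads/main
git config user.name "Bob"
git config user.email "bob@example.com"
printf 'void f() {\n}\n' > file.c
git add file.c
git commit -q -m "Add f"
git push -q origin main
cd ..

# Alice clones the repository and receives the file with f.
git clone -q central.git alice
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.com"

# Bob implements g, then runs commit and push.
cd bob
printf '\nvoid g() {   // by Bob\n}\n' >> file.c
git commit -q -a -m "Add g"
git push -q origin main
cd ..

# A few days later, Alice runs a pull (fetch + merge) before changing the file.
cd alice
git pull -q
cd ..

cat alice/file.c   # now includes the function g implemented by Bob
```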

1.6 Merge Conflicts

Merge conflicts occur when two or more developers modify the same section of the code at the same time. To better understand this situation, let’s use an example.

Example: Suppose Bob implemented the following program:

main() {
  print("Helo, world!");
}

Upon completing the implementation, Bob executed an add, followed by commit, and push.

Next, Alice performed a pull and retrieved the file implemented by Bob. Then, she decided to translate the program’s message into Portuguese.

main() {
  print("Olá, mundo!");
}

While Alice was making the translation, Bob noticed that he wrote Hello incorrectly, with only one l. However, Alice was faster and executed the trio of commands add, commit, and push.

Bob, after correcting the typo, executed an add, followed by a commit. Lastly, he performed a push, but this command failed with the following message:

  Updates were rejected because the remote contains work that you do
  not have locally. This is usually caused by another repository
  pushing to the same ref. You may want to first integrate the
  remote changes (e.g., git pull …) before pushing again.

The message is clear: Bob can’t execute a push as the remote repository contains a new version of the file, in this case, pushed by Alice. Thus, Bob needs first to perform a pull. However, when he does this, he receives a new error message:

  CONFLICT (content): Merge conflict in file2
  Automatic merge failed; fix conflicts and then commit the result.

This new message is also clear: there is a merge conflict in file2. After opening this file, Bob realizes that Git modified it to highlight the conflict-generating lines:

main() {
<<<<<<< HEAD
  print("Hello, world!");
=======
  print("Olá, mundo!");
>>>>>>> f25bce8fea85a625b891c890a8eca003b723f21b
}

These modifications should be understood as follows:

  • Between <<<<<<< HEAD and ======= we have the code modified by Bob, who couldn’t execute a push and had to execute a pull. HEAD indicates that this code was modified in Bob’s most recent commit.

  • Between ======= and >>>>>>> f25bce8... we have the code modified by Alice, who successfully executed the push. f25bce8... is the ID of the commit in which Alice modified this code.

It’s then up to Bob to resolve the conflict, which is a manual task. He must choose which section of the code will prevail—his code or Alice’s—and edit the file according to his choice, thus removing the delimiters inserted by Git.

Let’s assume that Bob decides Alice’s code is correct, since the system is now using messages in Portuguese. Therefore, he should edit the file so that it looks like this:

main() {                
  print("Olá, mundo!");                      
}                       

Note that Bob removed the delimiters inserted by Git (<<<<<<< HEAD, =======, and >>>>>>> f25bce8...), as well as the print command with the message in English. After leaving the code in the correct form, Bob should execute the commands add, commit, and push again, which will now be successful.

In this example, we showed a simple conflict, confined to a single line of a single file. However, a pull can give rise to more complex conflicts. For instance, the same file may include several conflicts. We can also have conflicts across more than one file.
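The whole conflict scenario can be reproduced locally. The sketch below makes a few assumptions: a local bare repository (central.git) replaces GitHub, and the pull uses the --no-rebase option, because recent Git versions refuse to pull divergent histories until we state how they should be reconciled (a merge, in this case).

```shell
set -e
work="$(mktemp -d)"
cd "$work"
git init -q --bare central.git                       # stands in for GitHub
git -C central.git symbolic-ref HEAD refs/heads/main

# Bob publishes the program with the typo.
git clone -q central.git bob 2>/dev/null
cd bob
git symbolic-ref HEAD refs/heads/main
git config user.name "Bob"
git config user.email "bob@example.com"
printf 'main() {\n  print("Helo, world!");\n}\n' > file2
git add file2
git commit -q -m "Hello world program"
git push -q origin main
cd ..

# Alice translates the message and pushes first.
git clone -q central.git alice
cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
printf 'main() {\n  print("Olá, mundo!");\n}\n' > file2
git commit -q -a -m "Translate message to Portuguese"
git push -q origin main
cd ..

# Bob fixes the typo; his push is rejected and his pull reports a conflict.
cd bob
printf 'main() {\n  print("Hello, world!");\n}\n' > file2
git commit -q -a -m "Fix typo"
push_status=0
git push -q origin main 2>/dev/null || push_status=$?
pull_status=0
git pull --no-rebase origin main >/dev/null 2>&1 || pull_status=$?
conflicted="$(cat file2)"
echo "$conflicted"                                   # file with the delimiters

# Bob keeps Alice's version, removes the delimiters, and publishes the result.
printf 'main() {\n  print("Olá, mundo!");\n}\n' > file2
git add file2
git commit -q -m "Merge: keep the message in Portuguese"
git push -q origin main
cd ..
```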

1.7 Branches

Git organizes the workspace into virtual folders, named branches. So far, we have not discussed branches because every repository has a default branch, named main, created by the init command (depending on the Git configuration, this branch may instead be named master). If we do not concern ourselves with branches, all development will occur on this branch. However, in some cases, it might be beneficial to create other branches to better organize the development. Thus, to explain the concept of branches, let’s use another example.

Example: Suppose Bob is responsible for maintaining a certain feature of a system. For simplicity, let’s assume this feature is implemented in a single function f. Bob had the idea to completely change the implementation of f to use a more efficient algorithm and data structure. For this, Bob will need a few weeks. However, despite being optimistic, Bob is not sure if the new implementation will provide the gains he anticipates. Finally, but not to be overlooked, during the new implementation, Bob might need to access the original code of f, for example, to fix bugs reported by users.

This is an interesting scenario for Bob to create a branch to implement and test, in isolation, this new version of f. To do this, he should use:

git branch f-new

This command creates a new branch, named f-new, presuming that this branch does not already exist.

To switch from the current branch to a new branch, we should use git checkout [branch-name]. To find the name of the current branch, we simply use git branch. In reality, this command lists all the branches and shows which one is current.

As we mentioned, we can conceptualize branches as virtual subdirectories within the working directory. The key distinction lies in the fact that branches are managed by Git, not by the operating system, making them virtual in nature. Expanding on this analogy, the git branch [name] command is akin to the mkdir [name] command, but Git not only creates the branch but also copies all the files from the parent branch to it. In contrast, directories created by the operating system are initially empty. The git checkout [name] command is similar to a cd [name] command, while git status combines aspects of both ls and pwd commands.

Usually, we also have the option to customize the operating system prompt by including information about the current directory. A similar customization is possible with Git branches. Consequently, the prompt exhibited by Git can take, for example, the following form: ~/projects/systemXYZ/main.

However, there’s an important difference between branches and directories. In general, a developer should only switch the current branch from A to B after saving their modifications to A, meaning they have first executed add and commit. If these commands are omitted and the pending modifications affect files that differ between A and B, git checkout B will fail, resulting in the following error message:

  Your local changes to the following files would be overwritten by checkout:
  [list of files]
  Please commit your changes or stash them before you switch branches.

Returning to the example, after Bob has created his branch, he must proceed in the following way. When he plans to work on the new implementation of f, he should first switch the current branch to f-new. On the other hand, when he needs to modify the original code of f—the production code—he should make sure that the current branch is main. Regardless of which branch he is on, Bob must use add and commit to save the state of his work.

Bob will continue with this workflow, alternating between the f-new and main branches until the new implementation of f is completed. When this happens, Bob should merge the new code into the original one. However, with the use of branches, he no longer needs to perform this operation manually. Git provides a command called merge that handles this integration for him. The syntax is as follows:

git merge f-new

This command must be invoked on the branch that will receive the modifications from f-new. In our case, on the main branch.

As the reader may be thinking, a merge can generate conflicts, also known as integration conflicts. In the specific case of merging branches, these conflicts will occur when both the branch receiving the modifications (main, in our example) and the branch being integrated (f-new, in our example) have modified the same lines of the code. As discussed in Section 1.6, Git detects and delimits the conflict areas, and it is up to the developer who called the merge to resolve them, i.e., choose the code that should prevail.

Finally, after performing the merge, Bob can remove the f-new branch if it’s no longer important to maintain the commit history for the new implementation. To delete f-new, he must execute the following command on the main branch:

git branch -d f-new
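Bob's workflow can be summarized by the following sketch. It assumes a scratch repository, a hypothetical file f.txt that represents the implementation of f, and a default branch explicitly named main.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main      # make sure the default branch is main
git config user.name "Bob"                 # hypothetical identity
git config user.email "bob@example.com"

echo "f: original implementation" > f.txt
git add f.txt
git commit -q -m "Original f"

# Create the branch and switch to it.
git branch f-new
git checkout -q f-new
echo "f: new algorithm" > f.txt
git commit -q -a -m "New implementation of f"

# Back on main, the original code is still available.
git checkout -q main
on_main="$(cat f.txt)"
echo "on main before the merge: $on_main"

# Integrate the new implementation into main and delete the branch.
git merge -q f-new
after_merge="$(cat f.txt)"
echo "on main after the merge:  $after_merge"
git branch -d -q f-new
remaining="$(git branch --list f-new)"     # empty: the branch no longer exists
```

In this sketch, main received no commits while Bob worked on f-new, so the merge cannot generate conflicts.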

1.7.1 Commit Graphs

Commits may have zero, one, or more parents (or predecessors). As the next figure illustrates, the first commit of a repository does not have a parent. A merge commit, however, has two or more parents, representing the branches that were merged. For example, commit 10 in the figure has two parents. The other commits in this figure have exactly one parent node.

A branch is nothing more than an internal Git variable containing the identifier of the last commit made on this branch. There is also a variable called HEAD, which points to the current branch’s variable. That is, HEAD contains the name of the variable holding the identifier of the current branch’s last commit. Here is an example:

In this example, there are two branches, represented by the MAIN and ISSUE-45 variables. Each one points to the last commit of their respective branches. The HEAD variable points to the MAIN variable. This means that the current branch is MAIN. If a commit is made, the graph changes to:

The new commit has identifier 7. It was made on MAIN, since HEAD was pointing to this branch’s variable. The parent of the new commit is the old HEAD, i.e., commit 3. The MAIN variable moved forward to point to the new commit. This means that if the branch isn’t changed, the parent of the next commit will be commit 7.

However, if we switch to the ISSUE-45 branch, the graph would be the one shown in the next figure. The only change is that the HEAD variable now points to ISSUE-45. This is enough to direct the next commit to this branch, i.e., for this commit to have commit 6 as its parent.
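These variables can be inspected directly. In the sketch below, which assumes a scratch repository and empty commits created only for illustration, git symbolic-ref HEAD prints the branch variable that HEAD points to, and git rev-parse prints the commit that a branch points to.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

git commit -q --allow-empty -m "Commit 1"  # first commit: no parent
first="$(git rev-parse HEAD)"
parents_of_first="$(git log -1 --pretty=%P "$first")"   # empty string

git branch issue-45                        # new variable, also pointing to commit 1
head_before="$(git symbolic-ref HEAD)"     # refs/heads/main

git commit -q --allow-empty -m "Commit 2"  # made on main, which moves forward
parent_of_second="$(git rev-parse HEAD^)"  # the old HEAD, i.e., commit 1

git checkout -q issue-45                   # only HEAD changes
head_after="$(git symbolic-ref HEAD)"      # refs/heads/issue-45
issue_tip="$(git rev-parse issue-45)"      # still commit 1

echo "HEAD before the checkout: $head_before"
echo "HEAD after the checkout:  $head_after"
```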

1.8 Remote Branches

Up until now, we’ve been working with local branches, i.e., the branches we’ve discussed exist only in the local repository. However, it is also possible to push a local branch to a remote repository. To illustrate this feature, let’s use an example similar to the one in the previous section.

Example: Suppose that Bob created a branch called g-new to implement a new functionality. He made some commits on this branch, and now he would like to share it with Alice so that she can collaborate on this new implementation. To achieve this, Bob should use the following push:

git push -u origin g-new

This command executes a push of the current branch (g-new) to the remote repository, referred to as origin by Git. The remote repository can be, for instance, a GitHub repository. The -u parameter (short for --set-upstream) records the remote branch as the upstream of the local branch, so that, in the future, Git knows which remote branch to use when we sync the two repositories with a pull. This syntax is needed only for the first push of a remote branch. In the following commands, we can omit -u, i.e., just use git push origin g-new.

In the remote repository, a g-new branch will be created. To work on this branch, Alice must first create it on her local machine and then associate it with the remote branch. For this, she should execute the following commands on the main branch:

git pull

git checkout -t origin/g-new

The first command makes the remote branch visible on her local machine. The second command creates a local g-new branch, which Alice will use to track changes on the remote branch. This is indicated by the -t parameter, short for tracking. Next, Alice can make commits to this branch. Finally, when she is ready to publish her changes, she should execute a push, with the usual syntax, i.e., without the -u parameter.

After that, Bob can execute a pull and conclude, for example, that the implementation of the new functionality is finished and can be merged into the main branch. He can also delete the local and remote branches using:

git branch -d g-new

git push origin --delete g-new

Alice can also delete her local branch by using:

git branch -d g-new
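The complete sequence, from the first push of g-new to its removal, is sketched below. As before, this is only an illustration: a local bare repository (central.git) replaces GitHub, and the file g.txt is a hypothetical placeholder for the new functionality.

```shell
set -e
work="$(mktemp -d)"
cd "$work"
git init -q --bare central.git                       # stands in for GitHub
git -C central.git symbolic-ref HEAD refs/heads/main

git clone -q central.git bob 2>/dev/null
cd bob
git symbolic-ref HEAD refs/heads/main
git config user.name "Bob"
git config user.email "bob@example.com"
git commit -q --allow-empty -m "Initial commit"
git push -q origin main
cd ..
git clone -q central.git alice
git -C alice config user.name "Alice"
git -C alice config user.email "alice@example.com"

# Bob creates g-new, commits, and shares the branch.
cd bob
git checkout -q -b g-new
echo "g: first draft" > g.txt
git add g.txt
git commit -q -m "Start g"
git push -q -u origin g-new
cd ..

# Alice makes the remote branch visible, tracks it, and contributes.
cd alice
git pull -q
git checkout -q -t origin/g-new
echo "g: improved by Alice" >> g.txt
git commit -q -a -m "Improve g"
git push -q origin g-new
cd ..

# Bob gets Alice's commit, merges into main, and deletes both branches.
cd bob
git pull -q
git checkout -q main
git merge -q g-new
git push -q origin main
git branch -d -q g-new
git push -q origin --delete g-new
remote_branches="$(git ls-remote --heads origin)"
echo "$remote_branches"                              # only main is left
cd ..
```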

1.9 Pull Requests

Pull requests are a mechanism that allows a branch to be reviewed and discussed before it is integrated into the main branch. When using pull requests, a developer first implements some features in a separate branch. Once this implementation is finished, they do not immediately integrate the new code into the main branch. Instead, they open a request for their branch to be reviewed and approved by a second developer. This request for review and integration is called a pull request. This mechanism is common on GitHub, but it has equivalents in other version control systems.

Nowadays, the review and integration process takes place via a web interface provided, for instance, by GitHub. However, if this interface did not exist, the reviewer would have to start their work by performing a pull of the branch to their local machine. This is the origin of the name: a pull request is a request for another developer to review and integrate a certain branch. To fulfill this request, when not using a web interface, this reviewer should begin by performing a pull of the branch.

Next, we detail the process of submitting and reviewing pull requests using an example.

Example: Suppose that Bob and Alice are members of an organization that maintains a repository called awesome-git, with a list of interesting links about Git. The links are stored in the README.md file of this repository. Any member of the organization can suggest the addition of links to this page. However, they cannot do a push directly to the main branch. Instead, the suggestion needs to be reviewed and approved by another team member.

Bob then decided to suggest adding this appendix to this list. To do so, he first cloned the repository and created a branch, named se-book-appendix, using the following commands:

git clone https://github.com/aserg-ufmg/awesome-git.git
cd awesome-git
git checkout -b se-book-appendix

Then, Bob edited the README.md file, adding the URL of this appendix. Finally, he carried out an add, commit, and pushed the branch to GitHub:

git add README.md
git commit -m "SE: A Modern Approach - Appendix A - Git"
git push -u origin se-book-appendix

Actually, these steps are not new compared to what we presented in the previous section. However, the differences start now. First, Bob should go to the GitHub page and select the se-book-appendix branch. Once this is done, GitHub displays a button to create pull requests. Bob should click on this button and describe his pull request, as shown in the next figure.

Pull request example

A pull request is a request for another developer to review and, if appropriate, merge a branch you have created. Consequently, pull requests are a way for an organization to adopt code reviews. That is, developers do not directly integrate their code into the remote repository’s main branch. Instead, they request other developers to first review this code and then merge it.

On GitHub’s pull request creation page, Bob can invite Alice to review his code. She will then be notified that there is a pull request waiting for review. Also via GitHub’s interface, Alice can review the commits from Bob’s pull request. For example, she can inspect a diff between the new and old code. If necessary, Alice can exchange messages with Bob to clarify doubts about the code. She can also request changes in this code. In this case, Bob should provide the changes and carry out a new add, commit, and push. The new commits will be automatically appended to the pull request, so Alice can check if her request has been met. Once the modification is approved, Alice should integrate the code into the main branch, by clicking a button on the pull request review page.

1.10 Squash

Squash is an operation that allows merging several commits into a single commit. It is recommended, for example, before submitting pull requests.

Example: In the previous example, suppose the pull request created by Bob has five commits. Specifically, he is suggesting the addition of five new links to the awesome-git repository, which he gathered over some weeks. After discovering each link, Bob performed a commit on his machine. In fact, he plans to create the pull request only after accumulating five commits.

However, to facilitate the review of his pull request by Alice, Bob intends to merge the five commits into a single one. Thus, instead of analyzing five commits, Alice will need to review only one. The submitted modification is exactly the same, i.e., it consists of adding five links to the page. However, instead of the changes being distributed across five commits, they are consolidated into a single one.

To perform a squash, Bob should call:

git rebase -i HEAD~5

The number 5 means that he intends to merge the last five commits in the current branch. After that, Git opens a text editor with a list containing the ID and description of each commit, as shown below:

pick 16b5fcc Including link 1
pick c964dea Including link 2
pick 06cf8ee Including link 3
pick 396b4a3 Including link 4
pick 9be7fdb Including link 5

Bob should use the editor itself to replace the word pick with squash, except for the one in the first line. The file will then look like this:

pick 16b5fcc Including link 1
squash c964dea Including link 2
squash 06cf8ee Including link 3
squash 396b4a3 Including link 4
squash 9be7fdb Including link 5

Then, Bob should save this file. Automatically, Git opens a new editor for him to enter the message of the new commit, that is, the commit merging the five listed commits. After providing this message, Bob should save the file, and then the squash is completed.
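For experimentation, the same squash can be scripted, so that no editor is opened. The sketch below is an illustration under some assumptions: it relies on GNU sed (for the -i option), on the GIT_SEQUENCE_EDITOR variable to replace pick with squash from the second line onward, and on GIT_EDITOR=true to accept the commit message proposed by Git.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.name "Bob"                 # hypothetical identity
git config user.email "bob@example.com"

echo "# awesome-git" > README.md
git add README.md
git commit -q -m "Initial list"

# Five commits, one per link.
for i in 1 2 3 4 5; do
  echo "link $i" >> README.md
  git add README.md
  git commit -q -m "Including link $i"
done
before="$(git rev-list --count HEAD)"      # 6 commits

# Instead of editing the list by hand, sed rewrites it:
# the first line keeps pick; the other four become squash.
GIT_SEQUENCE_EDITOR="sed -i -e '2,\$s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -q -i HEAD~5

after="$(git rev-list --count HEAD)"       # 2 commits
echo "commits before: $before, after: $after"
```

The content of README.md is the same before and after the squash. Only the history changes: the five commits are replaced by a single one.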

1.11 Forks

Fork is the mechanism provided by GitHub to clone remote repositories, i.e., repositories stored on GitHub. A fork is performed via GitHub’s interface. On the page of any repository, there is a button to perform this operation. If we fork the torvalds/linux repository, a copy of this repository will be created in our GitHub account, named, for example, mtov/linux.

As we always do, let’s use an example to explain this operation.

Example: Consider the aserg-ufmg/awesome-git repository, used in the example about pull requests. Also, consider a third developer, named Carol. However, since Carol is not a member of the ASERG/UFMG organization, she doesn’t have permission to perform a push in this repository, as Bob did in the previous example. Despite this, Carol believes that an important and interesting link is missing from the current list, and she would like to suggest its inclusion. But remember: Carol cannot follow the same steps used by Bob in the previous example, as she doesn’t have permission to push to the repository in question.

To solve this problem, Carol should start by forking the repository. To do so, she just needs to click on the fork button, which exists on the page of any GitHub repository. After that, she will have a new repository in her GitHub account, whose name is carol/awesome-git. Then, she can clone this repository to her local machine, create a branch, add the link she wants to the list, and perform an add, commit, and push. This last operation will be carried out in the forked repository. Finally, Carol should go to the page of her GitHub fork and create a pull request. Since the repository is a fork, she has an extra option: to direct the pull request to the original repository. Thus, the developers of the original repository, like Bob and Alice, will be responsible for reviewing and, possibly, accepting the pull request.

Therefore, a fork is a mechanism that, when combined with pull requests, allows an open-source project to receive contributions from other developers. To explain a bit better, an open-source project can receive contributions—more specifically, commits—not only from its team of developers (Bob and Alice, in our example) but also from any other developer with a GitHub account (like Carol).

Exercises

Try to reproduce each of the examples presented in this appendix. In the examples involving remote repositories, we suggest using a GitHub repository. In examples involving two users (Alice and Bob, for example), we suggest creating two local directories and using them to reproduce each user’s commands.

This book was formatted using the Pandoc system to convert Markdown to LaTeX and subsequently generate a PDF file. The font is Computer Modern, 11pt. Additionally, from the Markdown files, the EPUB and HTML versions were generated.