Software Engineering: A Modern Approach

Marco Tulio Valente

10 DevOps

Imagine a world where product owners, development, QA, IT Operations, and Infosec work together, not just to aid each other, but to guarantee the overall success of the organization. – Gene Kim, Jez Humble, Patrick Debois, John Willis

This chapter begins by discussing the concept of DevOps and its benefits (Section 10.1). Essentially, DevOps is a movement—or more specifically, a set of concepts and practices—aimed at introducing agile principles in the last mile of a software project, i.e., when the system is entering production. In addition to discussing the concept, we address three key practices for adopting DevOps: version control (Section 10.2), continuous integration (Section 10.3), and continuous deployment (Section 10.4).

10.1 Introduction

Throughout this book, we have studied a set of practices for high-quality and agile software development. From agile methods, like Scrum, XP, and Kanban, we learned that clients should be involved from day one in the software development process. We also discussed key practices for producing high-quality software, such as unit testing and refactoring. Additionally, we examined several design principles and patterns.

After applying these practices, principles, and patterns, the software product—or an increment of it resulting from a sprint—is ready for production. This process is known as deployment, release, or delivery. However, regardless of the term used, it is not as simple or straightforward as it may seem.

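Traditionally, in conventional organizations, the information technology area was divided into two main departments: development (Dev), responsible for implementing the systems, and operations (Ops), also called support, responsible for deploying them and keeping them running in production.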

Today, the problems caused by this division are evident. Typically, the support team would become aware of a system only on the eve of its deployment. Consequently, the deployment might be postponed for months due to a variety of unaddressed issues, such as suboptimal hardware, performance problems, incompatibility with the production database, and security vulnerabilities. In extreme cases, these problems could lead to the cancellation of the deployment and the abandonment of the project.

In summary, in this traditional model, a significant stakeholder—the system administrators or sysadmins—would only become aware of the characteristics and non-functional requirements of the new software just before deployment. This issue was exacerbated by systems following a monolithic architecture, where deployment could create various concerns, including bugs and regressions in previously functioning and running modules.

Therefore, to facilitate the deployment and delivery of software, the DevOps concept was proposed. As it is a relatively recent term, it still lacks a consolidated definition. However, DevOps is commonly described as a movement that aims to unify the development (Dev) and operations (Ops) cultures, facilitating faster and more agile software deployments. This objective is reflected in the quote that opens this chapter, from Gene Kim, Jez Humble, Patrick Debois, and John Willis, key figures who helped propagate DevOps principles. According to them, DevOps represents a disruption in traditional software deployment culture (link):

Instead of starting deployments at midnight on Friday and spending the weekend working to complete them, deployments occur on any business day when everyone is in the company and without customers noticing—except when they encounter new features and bug fixes.

However, DevOps does not advocate creating a new professional role responsible for both development and deployment. Instead, its goal is to foster a closer relationship between the development and operations teams, aiming to make software deployment more agile and less traumatic. In other words, the aim is to avoid creating two independent silos: developers and operators, with little to no interaction between them, as illustrated in the following figure.

Organization that is not based on DevOps because there is little communication between Devs and Ops.

Instead, DevOps advocates argue that these professionals should work together from the early sprints of a project, as illustrated in the following figure. For the customers, the benefit should be the earlier delivery of the contracted software project.

Organization based on DevOps. Devs and Ops collaborate closely to discuss issues regarding software delivery.

When transitioning to a DevOps culture, agile teams can incorporate an operations professional, who engages in the work either part-time or full-time. Depending on the demand, this professional may contribute to multiple teams. As part of their work, they proactively address performance problems, security issues, incompatibilities with other systems, and other operational concerns. They also collaborate on the installation, administration, and monitoring scripts for the production software.

DevOps strongly advocates for automating all necessary steps to put a system into production and to monitor its correct operation. This requires the adoption of practices we have already studied in this book, notably automated tests. Furthermore, it also recommends the use of new practices and tools, such as Continuous Integration and Continuous Deployment, which we will examine later in this chapter.

Real World: The term DevOps began to be used in the late 2000s by professionals frustrated with the constant friction between development and operations teams. They became convinced that the solution lay in adopting agile principles not only in development but also in the deployment phase. To provide a specific reference point, the first industry conference on the topic, called DevOpsDays, took place in Belgium in October 2009. It is generally accepted that the term DevOps was coined at this conference, which was organized by Patrick Debois (link).

Finally, we’ll discuss a set of principles for software delivery proposed by Jez Humble and David Farley (link). Although these principles were proposed before DevOps gained traction, they align perfectly with this movement. The key principles include:

10.2 Version Control

As we have mentioned numerous times in this book, software is developed in teams. Therefore, we need a repository, which is a server to store the source code of the system being implemented by these teams. The existence of this server is vital for developers to collaborate and for operators to know precisely which version of the system should be deployed to production. Moreover, it keeps a history of the most important versions of each file. This history enables developers to undo changes and recover the code of a file as it was in the past, even years ago, if needed.

A Version Control System (VCS) offers the services mentioned above. First, it provides a repository to store the most recent version of a system’s source code, as well as related files, such as documentation files, configuration files, web pages, wikis, etc. Second, it allows the retrieval of older versions of any file, if necessary. As emphasized earlier, it is inconceivable in modern software development to create any system, no matter how simple, without a VCS.

The first version control systems emerged in the early 1970s, such as the SCCS system, developed for the Unix operating system. Subsequently, other systems appeared, including CVS in the mid-1980s, and later the Subversion system, also known by its acronym SVN, in the early 2000s. These were all centralized systems based on a client/server architecture (see the next figure). In this architecture, a single server stores the repository and the version control system. Clients access this server to obtain the most recent version of a file. They can then modify the file, for example, to fix a bug or implement a new feature. Finally, they update the file on the server, performing an operation called a commit, which makes the file visible to other developers.

Centralized VCS. There is a single server.

In the early 2000s, Distributed Version Control Systems (DVCS) began to emerge. Among them, we can mention the BitKeeper system, whose first release was in 2000, and the Mercurial and Git systems, both launched in 2005. Instead of a client/server architecture, a DVCS employs a peer-to-peer architecture. In practice, this means that each developer has a full version control system on their own machine, which can communicate with systems on other machines, as illustrated in the next figure.

Distributed VCS (DVCS). Each client has a server. Thus, the architecture is peer-to-peer.

In theory, when using a DVCS, the clients (or peers) are functionally equivalent. However, in practice, there is usually a primary machine that holds the reference version of the source code. In our figure, we refer to this repository as the central repository. Each developer can work independently and even offline on their own workstation, making commits to their local repository. Periodically, they should synchronize this repository with the central one through two operations: pull and push. A pull operation updates the local repository with new commits available in the central repository. Conversely, a push operation sends the latest commits made by the developers in their local repository to the central one.
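The pull and push operations described above can be illustrated with a toy model. The sketch below is not how Git works internally: `Repo`, `commit`, `pullFrom`, and `pushTo` are hypothetical names, and histories are assumed to be simple lists of commit messages with no conflicting changes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a repository: a list of commit messages.
// Assumption: no divergent histories, so synchronizing only copies missing commits.
class Repo {
    private final List<String> commits = new ArrayList<>();

    // A commit is recorded only in this (local) repository.
    void commit(String message) {
        commits.add(message);
    }

    // pull: copy the commits that exist in 'other' but not here.
    void pullFrom(Repo other) {
        for (String c : other.commits) {
            if (!commits.contains(c)) {
                commits.add(c);
            }
        }
    }

    // push: send the local commits that 'other' does not have yet.
    void pushTo(Repo other) {
        other.pullFrom(this);
    }

    List<String> log() {
        return List.copyOf(commits);
    }
}

class PushPullDemo {
    public static void main(String[] args) {
        Repo central = new Repo();
        Repo alice = new Repo();
        Repo bob = new Repo();

        alice.commit("Fix login bug"); // local commit, possibly made offline
        alice.pushTo(central);         // publish it to the central repository
        bob.pullFrom(central);         // Bob receives Alice's commit

        System.out.println(bob.log());
    }
}
```

Note that Alice's commit only becomes visible to Bob after both the push and the pull have happened, which is why periodic synchronization matters.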

Compared to centralized VCSs, a DVCS has the following advantages:

Git is a distributed version control system developed under the leadership of Linus Torvalds, who also created the Linux operating system. From 2002 to 2005, the development of the Linux kernel used a commercial version control system called BitKeeper, which also followed a distributed architecture. However, in 2005, the company that owned BitKeeper decided to revoke the free licenses used in the development of Linux. The Linux developers, led by Torvalds, then decided to create their own DVCS, which they named Git. Like Linux, Git is an open-source system that can be freely installed on any machine. Git is primarily a command-line system. However, there are graphical clients, developed by third parties, that allow using Git without typing commands.

GitHub is a code hosting service that uses the Git system to provide version control. GitHub offers free public and private repositories, as well as paid plans with additional features for corporate use. Rather than maintaining a DVCS internally, a software company can subscribe to this service from GitHub. A comparison can be drawn with email services: instead of installing an email server locally, a company typically accesses this service from third parties, like Google, through Gmail. Although GitHub is the most popular, similar services are provided by other companies, such as GitLab and Bitbucket.

In Appendix A, we present and illustrate the main commands of the Git system. Additionally, we explain concepts specific to GitHub, such as forks and pull requests.

10.2.1 Multirepos vs Monorepos

As we mentioned before, a VCS manages repositories. Therefore, an organization needs to decide on the repositories it will create in its VCS. A common approach is to create one repository for each project or system in the organization. However, solutions based on a single repository are also possible and are often adopted by large companies, such as Google, Meta, and Microsoft. These two alternatives—referred to as multirepos and monorepos, respectively—are illustrated in the following figures.

Multirepos: the VCS manages several repositories. Normally, one repository per project.
Monorepos: the VCS manages a single repository. Projects are directories of this repository.

If we think in terms of GitHub accounts and repositories, we can give the following examples:

Among the advantages of monorepos, we can mention:

On the other hand, monorepos require specific tools to navigate large codebases. For example, those responsible for Google’s monorepo have reported that they had to implement a plugin for the Eclipse IDE to facilitate working with their very large codebase (link).

10.3 Continuous Integration

We begin with a motivational example before introducing the concept of Continuous Integration (CI). Subsequently, we discuss complementary practices that an organization should adopt along with CI. We conclude with a brief discussion about scenarios that may discourage the use of CI in an organization.

10.3.1 Motivation

Before defining Continuous Integration, let’s describe the problem that led to the proposal of this integration practice. Traditionally, developers have commonly used branches when implementing new features. Branches can be understood as internal and virtual sub-directories, managed by the version control system. In these systems, there is a principal branch, known as main (when using Git) or trunk (when using other systems, such as SVN). In addition to the main branch, users can create their own branches.

For example, before implementing a new feature, developers often create a branch to hold its code. These branches are called feature branches, and depending on the complexity of the feature, they may take months to be merged back into the main development line. In fact, in large, complex projects, there can be dozens of active feature branches.

When the new feature is completed, its code must be integrated back into the main branch using a command called merge, provided by the version control system. However, this process can lead to a variety of conflicts, known as integration or merge conflicts.
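To make the notion of a merge conflict more concrete, the following sketch applies the usual three-way merge rule to a single line of text. It is a simplification under stated assumptions: real version control systems compare whole files and regions of lines, and `LineMerge` is a hypothetical name used only for illustration.

```java
import java.util.Optional;

// Simplified three-way merge of one line of text.
// 'base' is the version at the point where the branch was created;
// 'main' and 'feature' are the current versions on each branch.
class LineMerge {
    // Returns the merged line, or Optional.empty() when there is a conflict.
    static Optional<String> merge(String base, String main, String feature) {
        if (main.equals(feature)) {
            return Optional.of(main);     // both sides ended up with the same text
        }
        if (main.equals(base)) {
            return Optional.of(feature);  // only the feature branch changed the line
        }
        if (feature.equals(base)) {
            return Optional.of(main);     // only the main branch changed the line
        }
        return Optional.empty();          // both changed it differently: conflict
    }
}
```

The last case is the one that requires a developer to decide manually which version (or which combination of both) should be kept.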

To illustrate, let’s consider a scenario where Alice created a branch to implement a new feature X in her system. Due to the feature’s complexity, Alice worked in isolation on her branch for 40 days, as shown in the following figure (where each node of the graph represents a commit). Note that while Alice was working—and committing changes on her branch—commits were also being made on the main branch.

Development using feature branches

After 40 days, when Alice merged her code into the main branch, numerous conflicts arose, such as:

In large systems, with thousands of files, dozens of developers, and several feature branches, the problems caused by conflicts can take on considerable proportions and delay the deployment of new features. Note that conflict resolution is a manual task, requiring analysis and consensus among the involved developers. This explains why the terms integration hell or merge hell are commonly used to describe the problems related to the integration of feature branches.

Long-lived feature branches can also create knowledge silos, with each new feature having a de facto owner who may work on it in isolation for weeks. Therefore, this developer may feel comfortable adopting different patterns than the rest of the team, including architectural and design patterns, code layout patterns, and user interface patterns.

10.3.2 What is Continuous Integration?

Continuous Integration (CI) is a programming practice that originated from Extreme Programming (XP). The motivation behind this practice was already discussed in the first section of this chapter: if a task causes pain, we should not let it accumulate. Instead, we should break it into subtasks that can be performed frequently. Because these subtasks are small and simple, they will cause less pain.

In our context, large integrations are a major source of pain for developers, as they have to manually resolve multiple conflicts. Therefore, CI recommends integrating the code frequently, that is, continuously. As a result, the integrations will be small and will produce fewer conflicts.

In his XP book, Kent Beck advocates the use of CI as follows (link, page 49):

Integrate and test changes after no more than a couple of hours. Team programming isn’t a divide-and-conquer problem. It’s a divide, conquer, and integrate problem. The integration step can easily take more time than the original programming. The longer you wait to integrate, the more it costs and the more unpredictable the cost becomes.

In this quote, Beck recommends several integrations over a developer’s workday. However, this recommendation is not universally accepted. Other authors, such as Martin Fowler, suggest at least one integration per day per developer (link), which seems to be a minimum threshold for a team to claim that it is using CI.

10.3.3 Best Practices When Using CI

When using CI, the main branch is constantly updated with new code. To ensure that it is not broken—that is, to ensure that the code compiles and runs successfully—some practices should be used along with CI, as discussed below.

Automated Builds

The build refers to the process of compiling and producing an executable version of a system. When using CI, this process must be automated; that is, it should not include manual steps. Furthermore, it should be as quick as possible, since with CI, builds are executed continuously. Some authors, for example, recommend a limit of 10 minutes for performing a build (link).

Automated Tests

In addition to ensuring that the system compiles without errors after a new integration, it is also important to verify that it continues to run correctly and produce the expected results. Therefore, when using CI, we should maintain good test coverage, particularly through unit tests, as discussed in Chapter 8.

Continuous Integration Servers

Automated builds and tests should be executed frequently, preferably before any code is integrated into the main branch. To achieve this, we can use CI Servers, which work as follows (also see the following figure):

Continuous Integration Server

The main goal of a CI server is to prevent the integration of code with errors, including both compilation and logic errors. For example, a build may succeed on the developer’s machine, but fail when executed on the CI server. This can occur, for instance, when the developer forgets to commit a file. Incorrect dependencies are another common reason for build failures. As an example, the code might be compiled and tested on the developer’s machine using version 2.0 of a certain library, while the CI server performs the build using version 1.0.
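The gatekeeping role of a CI server can be sketched as a sequence of checks that stops at the first failure. The code below is only an illustration: `CiPipeline` is a hypothetical class, and the lambdas stand in for the real build and test commands that an actual server would execute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical model of a CI check: steps run in the order they were added,
// and the first failing step stops the run.
class CiPipeline {
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();

    CiPipeline step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    // Returns "ok" when all steps pass, otherwise the name of the first failing step.
    String run() {
        for (Map.Entry<String, BooleanSupplier> entry : steps.entrySet()) {
            if (!entry.getValue().getAsBoolean()) {
                return entry.getKey();
            }
        }
        return "ok";
    }
}

class CiPipelineDemo {
    public static void main(String[] args) {
        String result = new CiPipeline()
            .step("build", () -> true)   // stand-in for compiling the system
            .step("test", () -> false)   // stand-in for a failing test suite
            .run();
        System.out.println(result);      // name of the step that blocked the integration
    }
}
```

Running the steps on the server, rather than only on the developer's machine, is what exposes problems such as a forgotten file or a mismatched library version.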

Several Continuous Integration servers are available in the market. Some of them are offered as independent services, typically free for public repositories, but requiring payment for private ones.

Another question is whether CI is compatible with feature branches. To maintain coherence with the definition of CI, the best answer is yes, provided that the branches are frequently integrated into the main branch, for example, every day. In other words, CI is incompatible only with long-lived feature branches.

Trunk-Based Development

As we’ve seen, when adopting CI, branches should last for a maximum of one working day. Therefore, the cost/benefit of creating them may not be worth it. For this reason, when shifting to CI, it’s common to also adopt Trunk-Based Development (TBD). With TBD, there are no longer branches for new features or bug fixes (or they exist only in the developer’s local repository and thus have a short duration). As a result, all development takes place on the main branch, also known as the trunk.

Real World: TBD is used by major software companies. For example, at Google, almost all development occurs at the HEAD of the repository, not on branches. This helps identify integration problems early and minimizes the amount of merging work needed. It also makes it much easier and faster to push out security fixes (link). Similarly, at Facebook (now Meta), all front-end engineers work on a single stable branch of the code, which also promotes rapid development, since no effort is spent on merging long-lived branches into the trunk (link).

Pair Programming

Pair Programming can be viewed as a continuous form of code review. When adopting this practice, any new piece of code is reviewed by another developer, who sits next to the lead developer during the programming session. As with continuous builds and tests, Pair Programming is often recommended for use with CI. However, this practice is not mandatory. For example, the code can be reviewed after the commit reaches the mainline. In this case, because the code is visible to other developers and can be moved into production at any time, the cost of applying a revision tends to be higher.

10.3.4 When not to use CI?

CI proponents set a firm limit for integrations on the mainline: at least one integration per day per developer. However, depending on the organization, system domain (which may be a critical application, for instance), and the developers’ profiles (who might be beginners), it can be challenging to follow this limit.

Moreover, this limit is not a law of physics. For example, it may be worthwhile to perform an integration every two or three days. In fact, any software engineering practice—including Continuous Integration—should not be applied literally, that is, exactly as it is described in a manual or textbook. Context-justified adaptations are not only possible but should be carefully considered. Therefore, experimenting with different integration intervals can help define the best setup for your organization.

CI is often not compatible with open-source projects. Frequently, the developers of these projects are volunteers and do not work on the code daily. In these cases, a model based on pull requests and forks, as popularized by GitHub, is more appropriate. We will provide more details about these concepts in Appendix A.

10.4 Continuous Deployment

With Continuous Integration, new code is often integrated into the main branch. However, this code isn’t necessarily production-ready. Rather, it can be a preliminary version, integrated so that other developers become aware of its existence and, consequently, avoid future integration conflicts. For instance, you can integrate a preliminary version of a web page with a basic interface, or a feature with known performance issues.

Beyond Continuous Integration, there is another step in the automation chain proposed by DevOps, called Continuous Deployment (CD). The difference between CI and CD is simple, but the impacts are profound: when using CD, every new commit that reaches the main branch is deployed to production, typically within a few hours. Specifically, the workflow when using CD is as follows:

Among the advantages of CD, we can mention:

Real World: Various companies that develop web apps use CD. For instance, in an article published in 2016, Savor and colleagues reported that at Facebook, each developer, on average, deployed 3.5 updates into production per week (link). These updates added or modified an average of 92 lines of code. This data suggests that, to work effectively, CD requires small updates. Therefore, developers must develop the skill to break down a programming task (e.g., a new feature, even if complex) into small parts, which can be quickly implemented, tested, and deployed.

10.4.1 Continuous Delivery

Continuous Deployment (CD) is not suitable for certain types of systems, such as desktop apps (like an IDE or a web browser) and embedded software (like a printer driver). Users typically don’t want to be notified daily about a new version of their browser or that a new driver is available for their printer. These systems require an installation process that is not transparent to users, unlike web system updates.

In such cases, a variant known as Continuous Delivery can be used. The idea is straightforward: with Continuous Delivery, every push is prepared for immediate deployment to production. However, an external authority, such as a project manager or a release manager, decides when the changes will actually be released to customers. Marketing and corporate strategies are examples of forces that can influence this decision.

To clarify the distinction between these practices:

In Continuous Deployment, both processes are automatic and continuous. However, with Continuous Delivery, delivery is performed frequently, while deployment requires manual authorization.
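The difference between the two practices can be reduced to a single decision, sketched below for a build that has already passed all automated checks. `ReleasePolicy` and its members are hypothetical names introduced only for this illustration.

```java
// Hypothetical sketch contrasting the two practices for a build that is ready for production.
class ReleasePolicy {
    enum Mode { CONTINUOUS_DEPLOYMENT, CONTINUOUS_DELIVERY }

    // Returns true when the build should go to production now.
    static boolean shouldDeploy(Mode mode, boolean approvedByReleaseManager) {
        if (mode == Mode.CONTINUOUS_DEPLOYMENT) {
            return true;                  // no human decision is involved
        }
        return approvedByReleaseManager;  // ready to deploy, but waits for authorization
    }
}
```

In both modes the build is kept in a deployable state; only the final trigger differs.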

Whether adopting Continuous Deployment or Delivery, software companies are increasingly reducing their release cycles to keep users engaged, receive feedback, maintain developer motivation, and remain competitive in the market. This trend is evident even in desktop apps. For example, as of 2024, Google releases a major version of the Chrome browser every four weeks. Additionally, weekly updates are used to deploy security fixes to keep Chrome’s patch gap short.

10.4.2 Feature Flags

It is unrealistic to assume that every commit will be ready for immediate deployment. For example, a developer may be working on a new feature X but still need to implement part of its logic. In such a situation, the developer may ask:

If new releases happen almost every day, how can I prevent my unfinished implementation, which has not been properly tested and has critical performance issues, from reaching the company’s customers?

A potential solution is to refrain from integrating the code into the main development branch. However, this practice is no longer recommended, as it leads to what is known as integration or merge hell. In other words, we don’t want to give up Continuous Integration and Trunk-Based Development.

A more pragmatic solution to this problem is to continuously integrate the partial code of feature X, but with its execution disabled, ensuring that any code related to X is guarded by a boolean variable (or flag) that evaluates to false while the implementation is unfinished. An example is shown below:

boolean featureX = false;
...
if (featureX) {
   // here is my incomplete code for X
}
...
if (featureX) {
   // more incomplete code for X
}

In the context of Continuous Deployment, variables used to keep partial implementations from running in production are called feature flags or feature toggles.

To further illustrate, consider another example. Suppose you’re working on a new page for a web application. You can use a feature flag to enable or disable this page, as shown below:

boolean new_page = false;
...
if (new_page) {
   // show new page
} else {
   // show old page
}

This code can be safely deployed while the new page is not ready. However, during development, you can enable the new page locally by setting the new_page flag to true.

This approach results in code duplication between both pages for a period of time. However, after the new page is approved, deployed, and receives positive feedback from customers, the old page’s code and the feature flag (new_page) can be removed. Thus, the duplication is temporary.
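One way to enable the new page locally without editing the code is to read the flag from outside the program. The sketch below assumes the flag is passed as a JVM system property named `new_page`; the class `PageDemo` and the returned strings are placeholders for the real page-rendering logic.

```java
// Sketch: the flag is read from a JVM system property, so it is off by default
// and can be turned on locally with: java -Dnew_page=true PageDemo
class PageDemo {
    static String render() {
        // Boolean.getBoolean returns true only if the system property exists and equals "true".
        boolean newPage = Boolean.getBoolean("new_page");
        return newPage ? "show new page" : "show old page";
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

With this setup, the deployed program behaves as before, while a developer can run the same binary with the property set to see the page under construction.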

Real World: Researchers from two Canadian universities, led by Professors Peter Rigby and Bram Adams, conducted a study on the use of feature flags across 39 releases of the Chrome browser, covering five years of development, from 2010 to 2015 (link). During this period, they identified more than 2,400 distinct feature flags in the browser’s code. In the first version analyzed, they documented 263 flags; in the last version, the number had increased to 2,409. On average, each new release introduced 73 flags and removed 43 flags, resulting in the net growth observed in the study.

However, feature flags can be retained in the code beyond the deployment phase. This can occur for two reasons, as described below.

First, feature flags help implement what is called a canary release. In this type of release, a new feature, guarded by a feature flag, is initially made available to a small group of users, for example, only 5% of the user base. This approach limits the impact of potential bugs in the new feature. After a successful initial deployment, the percentage of users with access to the new feature is gradually increased until it reaches all users. The term canary release refers to a historical practice used in coal mining. Miners would enter the mines with a canary in a cage. If the mine contained any toxic gas, it would kill the canary, alerting the miners to withdraw before being poisoned.

Second, feature flags facilitate the implementation of A/B Tests, as discussed in Chapter 3. To recap, in these tests, two versions of a feature (old versus new version, for instance) are simultaneously released to distinct user groups, aiming to verify if the new feature indeed adds value to the current implementation.
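A common way to select the users who see a flagged feature is to map each user to a stable bucket and compare it with a rollout percentage. The sketch below assumes that users are identified by a string id; `CanaryRollout` is a hypothetical class, and a production system would likely use a stronger hash function than `String.hashCode`.

```java
// Sketch of a percentage-based rollout. Assumption: each user has a stable string id.
class CanaryRollout {
    private final int percentage; // share of users who get the feature, from 0 to 100

    CanaryRollout(int percentage) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be between 0 and 100");
        }
        this.percentage = percentage;
    }

    // Maps the user id to a bucket in [0, 100) and enables the feature
    // for the buckets below the configured percentage.
    boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }
}
```

Because the bucket depends only on the user id, a given user always receives the same answer, and raising the percentage only adds users to the group. The same bucketing idea could be used to split users into the two groups of an A/B test.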

To facilitate the execution of canary releases and A/B tests, a data structure can be used to store the flags and their state (on or off). An example is shown below:

FeatureFlagsTable fft = new FeatureFlagsTable();
fft.addFeature("new-shopping-cart", false);
...
if (fft.isEnabled("new-shopping-cart")) {
   // process purchase using new cart
} else {
   // process purchase using current cart
}
...

There are also libraries dedicated to managing feature flags, which provide classes similar to FeatureFlagsTable from the previous example. The advantage of these libraries is that the flags can be set externally to the program, for example, in a configuration file. On the other hand, when the flag is an internal boolean variable, changing its value requires editing and recompiling the code.
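As an illustration of flags defined outside the program, the sketch below loads them from text in the Java properties format using the standard `java.util.Properties` class. `ConfigFlags` is a hypothetical class, not the API of any particular feature flag library, and the string passed to the constructor stands in for the contents of a configuration file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch of flags kept in external configuration (one "name=true|false" entry per line).
class ConfigFlags {
    private final Properties props = new Properties();

    ConfigFlags(String configText) {
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Flags that are missing from the configuration are treated as disabled.
    boolean isEnabled(String feature) {
        return Boolean.parseBoolean(props.getProperty(feature, "false"));
    }
}

class ConfigFlagsDemo {
    public static void main(String[] args) {
        ConfigFlags flags = new ConfigFlags("new-shopping-cart=true\n");
        if (flags.isEnabled("new-shopping-cart")) {
            System.out.println("process purchase using new cart");
        } else {
            System.out.println("process purchase using current cart");
        }
    }
}
```

Changing the value in the configuration is enough to switch carts; the program itself does not need to be edited or recompiled.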

In-Depth: In this section, we focused on the use of feature flags to prevent a code segment from reaching customers when an organization is using Continuous Deployment. Feature flags used for this purpose are also called release flags. However, feature flags can be used for other purposes. One example is creating different versions of the same software. For instance, consider a system with a free and a paid version. Customers of the paid version have access to more features, with this access controlled by feature flags. In this specific case, these flags are called business flags.

Bibliography

Gene Kim, Jez Humble, Patrick Debois, John Willis. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press, 2016.

Jez Humble, David Farley. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.

Paul Duvall, Steve Matyas, Andrew Glover. Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley, 2007.

Exercises

1. Define and describe the objectives of DevOps.

2. Job offers in the IT sector often mention vacancies for a DevOps Engineer, requiring skills such as:

Based on your definition of DevOps from the previous question, assess the appropriateness of designating an employee’s role as a DevOps Engineer. Justify your answer.

3. Describe two advantages of using a Distributed Version Control System (DVCS).

4. Identify a disadvantage associated with the use of monorepos.

5. Define and differentiate among the following terms: Continuous Integration, Continuous Delivery, and Continuous Deployment.

6. Explain the importance of Continuous Integration, Continuous Delivery, and Continuous Deployment in DevOps practices. Relate your answer to the definition of DevOps that you provided in the first question of this list.

7. Research the meaning of the term CI Theater. Then, define it in your own words.

8. Imagine you are hired by a printer company to establish DevOps practices for the development of printer drivers. Would you implement Continuous Deployment or Continuous Delivery? Justify your answer.

9. Identify a problem (or challenge) that arises when using feature flags to delimit code that is not ready for production.

10. Programming languages such as C support conditional compilation directives such as #ifdef and #endif. Research the functionality and usage of these directives. Compare and contrast them with feature flags.

11. Compare the typical lifespan of release flags and business flags in code. Which tends to persist longer? Justify your answer.

12. When companies migrate to CI, they often abandon feature branches in favor of a single, shared branch. This practice is called Trunk-Based Development (TBD), as discussed in this chapter. However, TBD does not mean that branches are no longer used in these companies. Describe an alternative use for branches unrelated to feature implementation.

13. Read the following article from the official Gmail blog, which describes a major interface update made in 2011. The article compares the challenges of this migration to those of changing the tires of a car while it is moving. Based on this article, answer:

  1. Identify the key technology discussed in this chapter that was used to enable this Gmail interface update. How does the article refer to this technology?

  2. What term do we use in this chapter to describe this technology?