While I enjoy learning about DevOps, my primary role is as a Data Engineer, though I often function more like a software engineer working on data-intensive applications. Much of what I’ve learned comes from examining existing workflows, reading documentation, exploring other blogs, and practical experimentation. If you notice any inaccuracies, please feel free to create an issue on the repository. Your feedback is greatly appreciated!
How did I get here?
I’ve found working with GitHub Actions both empowering and intimidating. It’s empowering because it automates tasks that would otherwise be manual, but intimidating due to the complexity it adds, often leading to tricky debugging scenarios. A key takeaway is that while striving for simplicity can make code more manageable, over-optimization, especially in DevOps, can lead to diminishing returns and increased fragility. Debugging a draft PR with numerous commits and failed runs can be mentally exhausting, particularly when you’re making changes just to see what happens.
Recently, I explored how to trigger certain workflows based on the specific needs of the code changes for my team. This may not seem necessary for small projects where all tests run in under ten minutes, but in a monorepo with multiple services, including SDKs, containers running Python, JS, or databases, tests could take 15 minutes or more. This can significantly slow down the review process for PRs.
One solution I delved into was `workflow_call`. I had to spend some time understanding how it works through examples in our repo, particularly how to manage inputs and outputs between workflow parts effectively.
`workflow_call` from the docs
Let’s start by looking at the docs.
The documentation introduces `workflow_call` as a method to allow a workflow to be triggered by another. It then directs you to the section on “Reusing workflows”. This resource is thorough and detailed, but can be overwhelming initially, so I’ll break down the basics first to build a solid foundation.
Basic workflows
GitHub Action workflows are located in the `.github/workflows/` directory of your repo. Workflows are `.yaml` files that specify which actions should run in response to certain triggers. Initially, I found this part of a repo daunting, often relying on the community for pre-built actions. However, workflows are actually quite straightforward once you understand the syntax and logic.
Here is an example of a simple linting workflow:
lint.yaml

```yaml
name: Lint
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install ruff
      - name: Lint with Ruff
        run: ruff check code/
```
This workflow includes three top-level elements:
- `name`: The name of the workflow.
- `on`: Specifies the triggers for the workflow, such as `push` and `pull_request` events on the `main` branch.
- `jobs`: Lists the jobs that will run, typically in parallel.
Let’s continue with a unit testing workflow:
test.yaml

```yaml
name: Test
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install pytest
      - name: Run tests
        run: pytest code/
```
Similar to the linting workflow, this workflow contains the same three top-level elements, but with a different job configuration.
Understanding the components of `jobs:` and their keys was initially confusing. These elements specify the steps to run in each job, which can be:

- Shell commands: executable commands using the `run:` key, similar to what you would type in a terminal.
- Actions: reusable actions specified with the `uses:` key, like `actions/checkout@v3` for checking out repository code.
- Composite actions: several steps grouped within a single action so they can be reused across workflows, defined under a `steps:` key in the action’s own metadata file.
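As an illustration of that third kind, a composite action bundles several steps behind one `uses:` reference. A minimal sketch, with a hypothetical path and action name, could look like this:

```yaml
# .github/actions/setup-and-lint/action.yaml (hypothetical path and name)
name: Setup and Lint
description: Set up Python and run Ruff in one reusable action
runs:
  using: composite
  steps:
    - uses: actions/setup-python@v4
      with:
        python-version: '3.10'
    - name: Lint
      # Composite run steps must declare their shell explicitly.
      shell: bash
      run: |
        pip install ruff
        ruff check code/
```

A workflow could then call it with `uses: ./.github/actions/setup-and-lint` instead of repeating those steps in every job.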
More complex workflows with workflow_call
Now let’s return to `workflow_call` and how it can be used to trigger a workflow from another workflow.
determineChanges.yaml

```yaml
name: Determine Changes
on:
  workflow_call:
    outputs:
      changed-files-data:
        description: "JSON formatted list of changed files with metadata"
        value: ${{ jobs.determine-changes.outputs.changed-files-data }}
jobs:
  determine-changes:
    runs-on: ubuntu-latest
    outputs:
      changed-files-data: ${{ steps.create-changed-files-data.outputs.result }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Fetch base and head branches
        id: gather-branches
        run: |
          git fetch origin ${{ github.base_ref }}:${{ github.base_ref }}
          git fetch origin ${{ github.head_ref }}:${{ github.head_ref }}
      - name: Create changed files data
        id: create-changed-files-data
        run: |
          echo "Base reference: ${{ github.base_ref }}"
          echo "Head reference: ${{ github.head_ref }}"
          # Get the list of changed files
          DIFF_OUTPUT=$(git diff --name-only ${{ github.base_ref }}...${{ github.head_ref }})
          mapfile -t CHANGED_FILES <<< "$DIFF_OUTPUT"
          echo "Changed files:"
          printf '%s\n' "${CHANGED_FILES[@]}"
          JSON_ARRAY="["
          # Build a JSON array of the changed files
          for FILE in "${CHANGED_FILES[@]}"; do
            EXTENSION="${FILE##*.}"
            JSON_ENTRY=$(jq -nc \
              --arg file "$FILE" \
              --arg extension "$EXTENSION" \
              '{file: $file, extension: $extension}')
            JSON_ARRAY+="$JSON_ENTRY,"
          done
          JSON_ARRAY="${JSON_ARRAY%,}]"
          echo "Changed files data: $JSON_ARRAY"
          echo "result=$JSON_ARRAY" >> "$GITHUB_OUTPUT"
```
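If you want to sanity-check the JSON-building loop without pushing a commit, you can run its core locally against a made-up file list. This sketch assumes `jq` is installed, as it is on GitHub’s Ubuntu runners:

```shell
# Hypothetical stand-in for the git diff result.
CHANGED_FILES=("src/app.py" "web/index.js")

JSON_ARRAY="["
for FILE in "${CHANGED_FILES[@]}"; do
  # Everything after the last dot in the path.
  EXTENSION="${FILE##*.}"
  # jq -nc builds one compact JSON object per file.
  JSON_ENTRY=$(jq -nc --arg file "$FILE" --arg extension "$EXTENSION" \
    '{file: $file, extension: $extension}')
  JSON_ARRAY+="$JSON_ENTRY,"
done
# Trim the trailing comma and close the array.
JSON_ARRAY="${JSON_ARRAY%,}]"
echo "$JSON_ARRAY"
```

Running it prints a single compact JSON array, one object per file, which is exactly the shape the downstream workflows will consume.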
This workflow has the same three top-level elements, but we can see some big differences. In the `on:` top-level element we find the `workflow_call` trigger with its own set of keys that are unfamiliar.
```yaml
on:
  workflow_call:
    outputs:
      changed-files-data:
        description: "JSON formatted list of changed files with metadata"
        value: ${{ jobs.determine-changes.outputs.changed-files-data }}
```
What we’re demonstrating here is that this workflow is triggered exclusively by another workflow; that is the only way it activates, via the `workflow_call` trigger. This setup is designed to ensure that the workflow can contribute data to subsequent processes.

We also define an output for this workflow. This output is named `changed-files-data` and it includes a description and a value specifying where in this workflow the data is produced. Whenever you encounter `${{ ... }}`, you’re seeing what’s known as a variable expression. These expressions help us dynamically reference data produced by the workflow.
Let’s break down the location of this variable:
- `jobs`: The variable lives under the `jobs` top-level element of the workflow.
- `determine-changes`: The name of the specific job within our workflow where the output is produced.
- `outputs`: The section within the job that declares which outputs it exposes.
- `changed-files-data`: Here’s the interesting part! Its value is another variable expression that points to where, in the job’s steps, the data is finalized.
Now, looking into the `changed-files-data` variable expression, `${{ steps.create-changed-files-data.outputs.result }}`:

- `steps`: The variable is found among the steps of the job.
- `create-changed-files-data`: The `id` of the step, labeled “Create changed files data”, where the output is generated.
- `outputs`: The subsection within the step where the output is created.
- `result`: The identifier for the output produced by the step.
By structuring workflows in this manner, we enable modular, reusable components that can interact seamlessly within GitHub Actions.
How do we use the output from `workflow_call`?
Now that we’ve seen how we can set up a workflow that can be called by another workflow, let’s return to a slightly expanded linting example that includes the workflow from our previous steps.
lintChangedFiles.yaml

```yaml
name: Lint Changed Files
on:
  workflow_call:
    inputs:
      changed-files-data:
        required: true
        type: string
jobs:
  lint-python:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Ruff
        run: |
          pip install ruff
      - name: Lint Python files
        run: |
          CHANGED_FILES_JSON='${{ inputs.changed-files-data }}'
          CHANGED_PY_FILES=$(echo "$CHANGED_FILES_JSON" | jq -r '.[] | select(.extension == "py") | .file')
          if [[ -n "$CHANGED_PY_FILES" ]]; then
            echo "Changed Python Files: $CHANGED_PY_FILES"
            ruff check --output-format=github $CHANGED_PY_FILES
            ruff format --check $CHANGED_PY_FILES
          else
            echo "No Python files to lint."
          fi
```
In this example, the `on:` top-level element specifies that this workflow is triggered by a `workflow_call` event, and it requires an input named `changed-files-data`. This input must be provided by the calling workflow, and it contains JSON-formatted data about which files have changed.

The `run` key within the “Lint Python files” step shows how we can access the output from the previous workflow. We use the `${{ inputs.changed-files-data }}` variable expression to retrieve the JSON data produced by the previous workflow. This data is then parsed with `jq` to extract the paths of Python files that have changed.
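To see what that `jq` filter does in isolation, you can feed it a hand-written sample in the same shape as the `determine-changes` output (assuming `jq` is installed locally):

```shell
# Hypothetical input, shaped like the determine-changes job's output.
CHANGED_FILES_JSON='[{"file":"src/app.py","extension":"py"},{"file":"web/index.js","extension":"js"}]'

# Keep only the paths of files whose extension is "py", one per line.
CHANGED_PY_FILES=$(echo "$CHANGED_FILES_JSON" | jq -r '.[] | select(.extension == "py") | .file')
echo "$CHANGED_PY_FILES"
```

Here only `src/app.py` survives the filter; the `.js` entry is dropped, so the lint step never touches files Ruff can’t handle.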
Now, let’s expand this concept to our unit test workflow example from earlier.
testChangedFiles.yaml

```yaml
name: Test Changed Files
on:
  workflow_call:
    inputs:
      changed-files-data:
        required: true
        type: string
jobs:
  test-python:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install Dependencies
        run: |
          pip install -U pip
          pip install pytest
      - name: Test Python files
        run: |
          # Extract Python files from JSON input
          CHANGED_FILES_JSON='${{ inputs.changed-files-data }}'
          CHANGED_PY_FILES=$(echo "$CHANGED_FILES_JSON" | jq -r '.[] | select(.extension == "py") | .file')
          if [[ -n "$CHANGED_PY_FILES" ]]; then
            echo "Changed Python Files: $CHANGED_PY_FILES"
            pytest tests/
          else
            echo "No Python files to test."
          fi
```
In this example, the workflow assumes that the tests are located in a `tests/` directory. This setup illustrates how you can use the output from a previous workflow to trigger subsequent testing only on the relevant files, although here we run the whole test directory whenever any Python file changes.
With individual workflows connected from output to input, we can create a workflow for a pull request that runs a smaller subset of tests, balancing speed and coverage efficiently. This setup helps keep “push to main” tests comprehensive for the entire codebase.
Stitching it all together
In a PR workflow, we can integrate the workflow that determines what has changed in our repo, which then sequentially triggers linting and testing based on those changes.
pullRequest.yaml

```yaml
name: Pull Request
on:
  pull_request:
    branches:
      - "**"
jobs:
  determine-changes:
    uses: ./.github/workflows/determineChanges.yaml
  lint-changed-files:
    needs: determine-changes
    uses: ./.github/workflows/lintChangedFiles.yaml
    with:
      changed-files-data: ${{ needs.determine-changes.outputs.changed-files-data }}
  test-changed-files:
    needs: determine-changes
    uses: ./.github/workflows/testChangedFiles.yaml
    with:
      changed-files-data: ${{ needs.determine-changes.outputs.changed-files-data }}
```
This setup triggers on any pull request, first determining the changes, then linting, and finally testing the code based on those changes. Using `needs` ensures that each job waits for the necessary data from its predecessor, creating an efficient and effective CI pipeline.
Now that we have established a dedicated PR workflow that efficiently manages determining changes, linting, and testing, we can streamline our existing `lint.yaml` and `test.yaml` workflows. Since the new PR workflow handles all pull requests, the original workflows can be adjusted to trigger only on pushes to the `main` branch. This reduces redundancy and focuses these workflows on final validation of changes that land on `main`.
Here is the specific section we can now remove from both the `lint.yaml` and `test.yaml` workflows:
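Based on the `on:` blocks shown earlier, the section to delete is the `pull_request` trigger:

```yaml
pull_request:
  branches:
    - main
```

After removing it, only the `push` trigger for `main` remains in each file.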
This modification ensures that the `lint.yaml` and `test.yaml` workflows are no longer triggered by pull requests, as the new PR workflow now covers that scenario. Instead, they will continue to run only on direct pushes to the `main` branch, which might include final checks or post-merge validations.
By making this change, we achieve a cleaner separation of concerns:
- The PR workflow is optimized for handling all pull request-related checks and tests.
- The main branch workflows (`lint.yaml` and `test.yaml`) are streamlined to focus solely on changes pushed directly to `main`, ensuring that these changes meet our standards without duplicating the checks done in PRs.
This setup not only organizes our workflows more logically but also helps in conserving CI/CD resources and reducing the potential for confusion about which workflows run under which circumstances.
Conclusion
There is always a trade-off when deciding to make certain parts of the codebase more DRY (Don’t Repeat Yourself) and simultaneously more complex. For many projects, determining changes and adding extra steps could introduce new points of potential failure in workflows. Like any other code, workflows require maintenance, and more complex workflows are no exception. This guide serves as a minimum viable product (MVP) for using `workflow_call`, but it can also be a good starting point for thinking about what you really want out of your GitHub Actions workflows.
Perhaps instead of relying on off-the-shelf workflows, you’ll decide to tailor a set of automations that is uniquely suited to your project’s needs. Feel free to get in touch with me if you have any questions or comments. And again, if something isn’t correct, please create an issue on the repo so I can improve it.