End-to-End Testing an AI Application with Playwright and GitHub Actions

Creating a Robust AI Testing Workflow From Localhost to Production

Why End-to-End Testing?

LLMs are notoriously finicky. You can try to corral them into an API, fine-tune them, lower their temperature, select JSON mode, pray, but in the end you may still end up with a hallucination rate of 15-20%. Developers expect their code to be deterministic, so this is not ideal. Enterprise applications typically have a great number of automated tests that can click around and point out even the slightest differences in expected behavior. For example, the automated tests for one application I worked on were so sensitive that a simple change to an existing flow could cause dozens of broken tests, leading to long hours of manual testing for the quality engineers.

End-to-end tests are meant to verify that everything in a system works as it should in a real-world scenario. That means striking a balance: the tests should be robust enough to handle acceptable levels of variance, but not so brittle that they break on every other CI run. The reality of development is that time is finite, and the smaller the company, the more painful it can be to write tests, whether they are unit, integration, or end-to-end tests. However, I'd like to make the case that if a small team has the time to write even just one test before shipping, it should be an end-to-end test.

Architecture of Eidolon AI

Recently I've been contributing to the open source AI agent framework Eidolon AI. The Eidolon team noted that one of their highest priority needs for the project was just a simple, full end-to-end test for one of their many AI agent examples. The tech stack for their simplest examples includes a MongoDB database, a server that's built with a Dockerfile (eidolon-server), and a standalone Next.js UI that's also built with a Dockerfile (webui). A Docker Compose file at the base of the repository orchestrates each of these components together.
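
To make the orchestration concrete, here is a rough sketch of what a Docker Compose file for this kind of stack could look like. The image tags, build paths, ports, and environment variable names below are illustrative assumptions, not copied from Eidolon's actual docker-compose.yml:

services:
  mongo:
    image: mongo
    ports:
      - "27017:27017"
  eidolon-server:
    # The agent server, built from its own Dockerfile
    build: ./eidolon-server
    environment:
      # Hypothetical variable name for this sketch
      - MONGO_CONNECTION_STR=mongodb://mongo:27017
    depends_on:
      - mongo
  webui:
    # The standalone Next.js UI, also built from a Dockerfile
    build: ./webui
    environment:
      # Hypothetical variable name for this sketch
      - EIDOLON_SERVER=http://eidolon-server:8080
    depends_on:
      - eidolon-server
    ports:
      - "3000:3000"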

This is what the webui looks like after it's built and running:

Adding a Test to Eidolon AI

I wanted my first E2E test on Eidolon to target the example chatbot. The chatbot was an ideal example to target with E2E tests because it requires the database, server, and front-end application to coordinate, but it doesn't require any additional services outside the scope of the existing Docker Compose.

I decided the best way to test the chatbot example would be to use Playwright with GitHub Actions. Playwright is an excellent way to add end-to-end testing to modern applications: it can be configured to hook into an already-running server (such as one started with Docker Compose), and it provides granular ways to target different parts of the DOM, such as selecting a chatbot's text box.

GitHub Actions is the ideal choice as a CI tool because Eidolon was already orchestrating the server and front end together in other GitHub Actions workflow files. It also has a useful action called upload-artifact that uploads the screenshots and results of the test as a test artifact, so we can see exactly why a test failed.

Configuring Playwright

Install

Unlike ordinary packages, when installing Playwright we have to install the package itself as well as the browser binaries:

pnpm install --save-dev @playwright/test@latest 
pnpm exec playwright install --with-deps

We need the browser binaries so that Playwright can see and control different browsers programmatically.

Configure

We need to add a playwright.config.js file to the front-end app (in Eidolon's case, webui/apps/eidolon-ui2). Here's the sample config file I added to the Eidolon repo with explanations for each of the configuration options:

const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
    // Where in the repo Playwright should search for our tests
    testDir: './tests',
    // Where the test results should be stored
    outputDir: 'tests/test-results',
    // Important for a CI to specify a timeout or it could hang
    timeout: 30000,
    // Retry a failing test up to twice before reporting it as failed
    retries: 2,
    use: {
        // When running our tests we don't open the browser (headless)
        headless: true,
        // The front end's address; relative URLs in page.goto() resolve against it
        baseURL: 'http://localhost:3000',
        // Upon test failure a screenshot of the front end will be saved
        screenshot: 'only-on-failure',
    },
    webServer: {
        // We need to launch a dev server before running the tests
        command: 'pnpm docker-compose up',
        // Working directory where we run the above command
        cwd: '../../..',
        // Wait for the front end to be reachable on this port
        port: 3000,
        timeout: 120000,
        // If a server is already running, use that for tests
        reuseExistingServer: true,
    },
});
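
If we later want to run the same test against more than one browser (which is why the install step above pulls binaries for several browsers), the config can be extended with a projects array. This is a minimal sketch, not part of the config Eidolon currently uses:

const { defineConfig, devices } = require('@playwright/test');

module.exports = defineConfig({
    // ...same options as above...
    projects: [
        { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
        { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    ],
});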

Adding a Test

Now that we've configured Playwright to properly search for our tests and know where our front end and server are running, we can add our first test. Since we're testing the chatbot, the most basic E2E test we can run is to ensure the chatbot responds to basic input. Let's break down the chatbot.test.js file, which we'll add to our tests directory as specified in the Playwright config:

const { test, expect } = require('@playwright/test');

// Test to check if the chatbot responds to input
test('Chatbot should respond to input', async ({ page }) => {
    await page.goto('/eidolon-apps/sp/chatbot');
    // If the user is not logged in, log in with a random email
    if (await page.locator('text=Eidolon Demo Cloud').isVisible()) {
        const randomEmail = `test${Math.random().toString(36).substring(7)}@example.com`;
        await page.fill('input[id="input-username-for-credentials-provider"]', randomEmail);
        await page.click('button[type="submit"]');
    }
    // Add a chat
    const addChatButton = page.locator('text=Add Chat');
    await addChatButton.click();
    const inputField = page.locator('textarea[aria-invalid="false"]');
    await inputField.waitFor();
    // Fill the input field with a message and submit it
    await inputField.fill('Hello, how are you? Type "Hello!" if you are there!');
    await page.locator('button[id="submit-input-text"]').click();
    // Wait for the chatbot's reply to appear on the page
    const response = page.getByText('Hello!', { exact: true });
    await response.waitFor();
    await expect(response).toBeVisible();
    await expect(response).toContainText('Hello!');
});

For the walkthrough below, let's assume login is disabled (the conditional block above handles the case where it isn't), so the test simply adds a chat, fills in text, and waits for a response.

When we're writing tests, how can we know which elements to target in the DOM? Well, since we downloaded the browser binaries we can run Playwright in debug mode and use the browser's developer tools to see exactly which elements might be a good fit. The relevant command to run Playwright in debug mode is the following:

pnpm exec playwright test --debug

This is what the Playwright Inspector looks like when it's open. Note that we can use the "Step Over" button that I've highlighted in red to go line by line in our test file and see exactly how our front end changes at each step of our test:

At this point, we can move forward to enter the chatbot app, click "Add Chat" (note that it's an async operation), and select the text area where we'll write our prompt for the chatbot:

When stepping over each line, the Playwright Inspector highlights the relevant locator. In this example we target the element by its element type (textarea) and an attribute (its aria-invalid value). It would be even better if our element had a unique ID so that we didn't need to target it by a combination of these two descriptors, but for now it'll suffice.
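
As a sketch of what that could look like: if the text area exposed a dedicated test attribute (say, a hypothetical data-testid="chatbot-input", which doesn't exist in the Eidolon webui today), Playwright's getByTestId() would let us target it without leaning on the element type or its aria-invalid value:

// Hypothetical: assumes the chatbot's textarea carries data-testid="chatbot-input",
// which is not currently present in the Eidolon webui.
const inputField = page.getByTestId('chatbot-input');
await inputField.waitFor();
await inputField.fill('Hello, how are you? Type "Hello!" if you are there!');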

The URL in the browser has a unique process ID for the current conversation (66b58...). The process ID is already a good sign that our server, front end, and database are working together, because a new process is created with a POST request and it's fetched with a GET request.

In the final step of our test, we fill in the textarea element with the fill() function, then target the enter arrow by its element type and id (button[id="submit-input-text"]). Finally, we use the async waitFor() function to wait for the application to respond (this is why a timeout is important), and we search the whole text of the page to find whether the "Hello!" response we're expecting is present:

Our test is successful and the Chromium browser closes automatically.

In our front end's package.json we can add some scripts to run Playwright automatically.

I added the following:

"build": "pnpm run build-eidolon && next build",
"docker:up": "docker-compose up -d",
"docker:down": "docker-compose down",
"docker:build": "docker-compose build",
"docker:rebuild": "pnpm run docker:down && pnpm run docker:build && pnpm run docker:up",
"test:e2e": "pnpm run docker:rebuild && pnpm exec playwright test && pnpm run docker:down",
"playwright:debug": "pnpm exec playwright test --debug",

We now have the test:e2e command to run our E2E tests from the CI, and the playwright:debug command to debug tests locally.

GitHub Actions Workflow

Now that our test is working correctly, we need to have it run on the CI. Let's add an e2e.yml file in the .github/workflows directory at the base of our repo. For brevity, I'll only include the relevant snippets below; the full workflow file lives in the Eidolon repo.

Specify Workflow Trigger

To ensure our E2E workflow file doesn't run unnecessarily, we specify exactly when it should run using GitHub's workflow syntax:

on:
  # Lets us trigger the workflow manually
  workflow_dispatch:
  # Triggered on pushes to main, certain paths, and pull requests
  push:
    branches: [main]
    paths:
      - '**'
      - '!k8s-operator/**'
  pull_request:
    paths:
      - '**'
      - '!k8s-operator/**'

Architecture of the test-e2e Job

We'll run just one job, and it includes runs-on, services, env, and steps. The basic architecture looks like this:

jobs:
    test-e2e:
        runs-on: ubuntu-latest
        services:
            mongo:
        env:
        steps:

Because the server depends on a healthy MongoDB database, we run the database service first and then run the Docker scripts to spin up the server and front end later. The env section includes variables related to NextAuth.js and an OpenAI key. Finally, we get to the main portion of the workflow, the steps.
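
As a rough sketch of how that skeleton might be filled in (the image tag, port mapping, and environment variable names here are assumptions for illustration, not copied from Eidolon's actual workflow):

jobs:
    test-e2e:
        runs-on: ubuntu-latest
        services:
            mongo:
                # Illustrative image and port mapping
                image: mongo
                ports:
                    - 27017:27017
        env:
            # Assumed variable names: an OpenAI key plus NextAuth.js settings
            OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
            NEXTAUTH_SECRET: ${{ secrets.NEXTAUTH_SECRET }}
            NEXTAUTH_URL: http://localhost:3000
        steps:
            # Checkout, install dependencies, build and start the Docker services,
            # then run the Playwright tests (see the snippets below)
            - uses: actions/checkout@v3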

Setting Up the Workflow Steps

We'll skip over the steps related to spinning up the Docker containers because this will likely be different for your own project. After installing dependencies in the workflow file, don't forget that we need to install the Playwright browser binaries in the CI as well:

      - name: Install Playwright browsers
        run: pnpm exec playwright install
        working-directory: ./webui/apps/eidolon-ui2

Remember that in the package.json we wrote two scripts for Playwright: one for debugging and one for the CI. Here's what the one for the CI looks like when added to the workflow file:

      - name: Run Playwright tests
        run: pnpm exec playwright test --config=playwright.config.js
        working-directory: ./webui/apps/eidolon-ui2

Now we've officially run our test, but we should still configure logging to ensure we know what went wrong when a test fails.

Logging

These two workflow steps at the end of the workflow file will be the most useful to you when logging your own tests. They save the test results and a screenshot of the front end (which we configured in Eidolon to be taken only on test failure).

      - name: Upload Playwright test results
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: playwright-results
          path: |
            webui/apps/eidolon-ui2/tests/test-results
          if-no-files-found: ignore

      - name: Upload Playwright screenshots
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: playwright-screenshots
          path: webui/apps/eidolon-ui2/tests/screenshots
          if-no-files-found: ignore

The trickiest part to remember about the above workflow steps is that we need to include if: always() to ensure the CI runs these steps even if a previous step fails (otherwise we would never reach the log steps when a test fails).

Now when a test fails during our workflow, we can view the Playwright test results by downloading playwright-results, a zip file that contains the relevant images.

A test failure screenshot taken by Playwright:

Conclusion

There are plenty of best practices that would take another article to cover in detail, like getting consistent LLM outputs, targeting DOM elements effectively, and adding retry/timeout flows in workflow files. For now, you can get started adding simple end-to-end tests for your own AI application using just Playwright and GitHub Actions.

If you're interested in contributing to Eidolon AI, the team is always looking for contributors; feel free to join their Discord.

If you'd like help adding end-to-end testing for your own application or just want to say hello, feel free to reach out!

Twitter (X): jahabeebs

LinkedIn: jacob-habib
