Writing tests

Semaloop runs your tests on real devices using an AI agent that drives the app the way a person would. That makes it powerful, but it also means a test written for Semaloop reads a little differently from a traditional script.

This guide shows you how to write tests that pass reliably and keep passing as your app changes. By the end you should be able to look at a test spec and know whether it will work well. Each section covers one idea, then shows what good looks like and what to avoid.

The big idea

Describe the goal and the result, and let the agent find the path.

Traditional test scripts spell out every tap: open this menu, press that button, scroll here. Semaloop works best when you do the opposite. You describe what the user is trying to do and what should be true when they’re done. The agent works out the route itself.

This matters because the route through an app changes often. Buttons move, labels get reworded, a screen gets redesigned. A test tied to a durable outcome survives all of that, but a test tied to exact taps breaks the moment the UI shifts.

A quick way to tell whether you’ve got it right: if your team redesigned a screen tomorrow, would the test still describe a valid thing to check? If yes, it’s durable.

1. Describe the goal, not every tap

Tell the agent the destination, not the turn-by-turn directions.

Good

Add a track to a playlist and check it appears in that playlist.

Avoid

Tap the Search icon, search for a track, tap the three dots, tap “Add to playlist”, select a playlist, tap Done, then open the playlist and check the track is there.

The good version still covers the flow end to end. It just doesn’t lock the test to a specific button order that might change next release.

2. Check a durable outcome, not a temporary one

Anchor your check to something that stays on screen, not something that flashes by.

Loaders, spinners, toasts, and success banners often disappear in under a second. The agent may simply miss them, and the test fails for no real reason. Check the stable state that the transient thing was announcing instead.

Good

Open the lesson and check the lesson content loads.

Avoid

Open the lesson and check the loading spinner appears and then disappears.
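
If you keep specs in a repository, a small pre-save check can catch transient elements before they reach a test run. The sketch below is a hypothetical helper, not a Semaloop API: the function name and the word list are assumptions made for illustration, and the list only covers the transient elements this guide names.

```python
import re

# Assumed word list: the transient UI elements named in this guide.
TRANSIENT_TERMS = ("loader", "spinner", "toast", "banner", "animation")

def find_transient_terms(spec: str) -> list[str]:
    """Return the transient UI terms a spec mentions, in word-list order."""
    lowered = spec.lower()
    return [
        term for term in TRANSIENT_TERMS
        # Match the singular or plural form as a whole word.
        if re.search(rf"\b{term}s?\b", lowered)
    ]

good = "Open the lesson and check the lesson content loads."
avoid = "Open the lesson and check the loading spinner appears and then disappears."

print(find_transient_terms(good))   # []
print(find_transient_terms(avoid))  # ['spinner']
```

A hit is only a prompt to reread the spec. A test whose actual purpose is a banner would still need one.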

3. One workflow per test

Keep each test focused on a single user goal.

When one test bundles several unrelated jobs together, a failure is hard to read. Did sign-up break, or search, or checkout? Splitting them means a red test points straight at the problem.

Good (three separate tests)

Create a new playlist and check it appears in your playlists.

Add a track to a playlist and check it appears in that playlist.

Delete a note and check it no longer appears in the notes list.

Avoid (one long test)

Sign in, create a playlist, add three tracks, share it with a friend, then delete an old note and update your profile photo and check everything worked.

A useful exception: if the interaction between two things is the point — say, two users in the same shared workspace, where one edits a document and the other should see the change — keep it as one test, because that connection is what you’re checking.

4. Keep it short

Shorter specs are clearer specs. Remove anything that isn’t the goal, the key data, or the result.

Long prompts tend to drift into describing the UI step by step, which is exactly what makes a test brittle. If you find yourself writing the fifth “then tap”, that’s usually a sign to cut back to the outcome.

Good

Log a banana in the food diary and check it appears in today’s entries.

Avoid

Open the app, wait for the home screen, tap the plus button at the bottom, search “banana”, tap the first result, tap the serving size field, leave it as default, tap Add, wait for the diary to refresh, scroll to today, and confirm the banana is listed under breakfast with the right calories shown in the top summary ring.
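
The “fifth then tap” rule of thumb can be approximated by counting navigation verbs. This is a rough sketch under stated assumptions: the verb list and the threshold of two are arbitrary choices for illustration, and none of the names come from Semaloop.

```python
import re

# Assumed list of verbs that usually describe navigation rather than a goal.
NAVIGATION_VERBS = ("tap", "press", "scroll", "swipe", "wait")
NAVIGATION_PATTERN = re.compile(r"\b(?:" + "|".join(NAVIGATION_VERBS) + r")\b")

def count_navigation_steps(spec: str) -> int:
    """Count whole-word navigation verbs in a spec, ignoring case."""
    return len(NAVIGATION_PATTERN.findall(spec.lower()))

def drifts_into_steps(spec: str, limit: int = 2) -> bool:
    """Flag a spec with more navigation verbs than `limit` (an arbitrary threshold)."""
    return count_navigation_steps(spec) > limit

good = "Log a banana in the food diary and check it appears in today’s entries."
avoid = (
    "Open the app, wait for the home screen, tap the plus button at the bottom, "
    "search “banana”, tap the first result, tap the serving size field, "
    "leave it as default, tap Add, wait for the diary to refresh, scroll to today, "
    "and confirm the banana is listed under breakfast with the right calories "
    "shown in the top summary ring."
)

print(count_navigation_steps(good), drifts_into_steps(good))    # 0 False
print(count_navigation_steps(avoid), drifts_into_steps(avoid))  # 7 True
```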

5. Be specific about what “success” means

A vague check is one the agent can’t reliably judge. Point to a concrete, named thing.

“Looks right”, “loads quickly”, and “everything works” can’t be measured the same way twice. Replace them with a specific element or value you can actually observe.

Good

Switch the app language to Spanish and check the home screen heading reads “Inicio”.

Avoid

Switch the language to Spanish and make sure all the text is in Spanish and the page looks right.

Good

Log a 250ml glass of water and check today’s water total shows 250ml.

Avoid

Log some water and check the totals are correct.

6. Don’t check the same thing twice

One clear check is better than the same check worded two ways. Duplicate or circular checks add length without adding confidence.

Good

Create a playlist called “Road Trip” and check it appears in your playlists.

Avoid

Create a playlist called “Road Trip”, confirm the playlist was created, and make sure the playlist now exists and is shown.

7. Go easy on exact wording and visual cues

Match on stable labels, not on exact copy or colour — unless the wording or look is the actual thing you’re testing.

Marketing copy, button text, and colours get tweaked constantly. A test that demands the exact phrase “Welcome back, champion!” will fail the day someone shortens it to “Welcome back”. If copy accuracy is the point of the test, then check it on purpose — just be clear that’s what you’re doing.

Good

Complete a lesson and check a streak count is shown on the home screen.

Avoid

Complete a lesson and check the banner says exactly “You’re on fire! 1 day streak 🔥”.

8. Keep the real data, drop the navigation

Some details matter to the test and should stay. Others are just directions and should go.

Keep: amounts, quantities, item or entry counts, playlist or note names, specific inputs — anything that defines what the user is actually doing.
Drop: button names, tap order, screen names, “scroll down”, “tap the icon top right” — anything that just describes how to move through the app.

Good

Log a 500ml glass of water and check today’s water total increases by 500ml.

Avoid

Go to the diary, tap Add, tap Water, tap the 500ml option, tap Save, then go back to the home screen and check the total went up.

The amount (500ml) stays because it defines the task. The taps go because they just describe one possible path.
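
When you shorten a tap-by-tap spec, it is easy to cut the data along with the directions. The sketch below compares the quantities in an original spec with those in its rewrite and reports any that went missing. It is a hypothetical helper with assumed names, and it only looks at numbers with an optional unit. Names such as playlist titles still need a manual read.

```python
import re

# A number with an optional unit suffix, such as "500ml", "3", or "2.5kg".
QUANTITY_PATTERN = re.compile(r"\b\d+(?:\.\d+)?[a-zA-Z%]*")

def quantities(spec: str) -> set[str]:
    """Collect the quantities mentioned in a spec, lowercased."""
    return {match.lower() for match in QUANTITY_PATTERN.findall(spec)}

def dropped_quantities(original: str, rewritten: str) -> list[str]:
    """List quantities present in the original spec but missing from the rewrite."""
    return sorted(quantities(original) - quantities(rewritten))

original = (
    "Go to the diary, tap Add, tap Water, tap the 500ml option, tap Save, "
    "then go back to the home screen and check the total went up."
)
kept = "Log a 500ml glass of water and check today’s water total increases by 500ml."
lost = "Log some water and check today’s water total increases."

print(dropped_quantities(original, kept))  # []
print(dropped_quantities(original, lost))  # ['500ml']
```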

A quick checklist

Before you save a test, give it a quick read-through:

Does it describe a goal, not a sequence of taps?
Is the result something that stays on screen, not something that flashes by?
Is it focused on one workflow?
Could you cut anything without losing the goal, the data, or the check?
Is the setup in preconditions rather than in the steps?
Is the success check specific and observable?
Would it still be valid if the screen were redesigned tomorrow?

If all seven are yes, it’s a strong Semaloop test.

A worked example

Here’s a brittle spec turned into a durable one.

Before

Open the app and wait for it to load. Tap the Search tab at the bottom. Type “tennis” into the search box and wait for results. Tap the second result in the list. On the detail screen, tap the Play button. While it’s playing, tap the Summary icon in the top right. Wait for the loading spinner to disappear, then check the green “Summary ready” banner appears and make sure the summary text looks correct.

What’s wrong with it: it dictates every tap, depends on the second result being a specific item, checks a spinner and a banner that both vanish quickly, and asks for a vague “looks correct” judgement.

After

Play audio about tennis and check that a summary is shown for it.

Same intent. Far more likely to pass, and it stays valid even if Search moves, the results reorder, or the banner copy changes.

Using an LLM to draft tests

You may want to use an LLM such as Claude or ChatGPT to help draft tests for your app. The project instructions below are based on this guidance. Give the LLM these instructions along with the test you’d like to add to Semaloop, and it returns a more Semaloop-friendly version.

Project instructions: Semaloop test writer

Your role

You turn a user’s test idea into one or more Semaloop test specs. Semaloop runs tests on real devices using an AI agent that drives the app like a person. Tests work best when they describe a user goal and a durable outcome to check, and leave the agent free to find its own path through the app.

The user will paste a test idea. It may be rough, verbose, or written as tap-by-tap steps. Your job is to return finished Semaloop tests that follow the rules below.

What you produce

Return one or more tests. Each test is:

a short, goal-based name (e.g. “Add track to playlist”, “Switch language to Spanish”), then
a one- or two-sentence spec: the user goal, then the result to check.

Write specs as plain declarative prose. No step numbering, no “Expected:” labels, no UI choreography. If a test genuinely needs setup that isn’t part of the workflow (signed in, a feature enabled, a sandbox account), put it on a short Precondition: line above the spec — otherwise leave it out.

Output the tests and nothing else, unless you made a notable decision the user should know about. Keep any such note to one line.

How to write each test

Describe the goal, not the taps. State what the user is trying to do and what should be true when they’re done. Drop button names, tap order, screen names, and “scroll down” — anything that just describes how to move through the app.
Check a durable outcome. Anchor the check to something that stays on screen: an item in a list, a value that updates, a setting that sticks. Never check spinners, loaders, toasts, banners, or animations — they vanish too quickly to observe reliably.
Be specific about success. Replace vague checks (“looks right”, “loads quickly”, “everything works”) with a named element or value the agent can actually observe.
Keep the real data, drop the navigation. Keep amounts, quantities, counts, names, and specific inputs — they define the task. Drop the directions for reaching them.
Check one thing once. No duplicate or circular checks worded two ways.
Match stable labels, not exact copy or colour — unless the wording or look is the actual point of the test, in which case check it deliberately.
Keep it short. Remove anything that isn’t the goal, the key data, or the result.

One or many tests

Split an idea into separate tests when it bundles distinct user goals. One workflow per test keeps failures easy to diagnose. For example, “sign up, then search, then check out” becomes three tests.

Keep it as a single test when the point is the interaction between two things — for example, two users in a shared workspace where one edits and the other should see the change. Splitting that would lose what’s being tested.

When in doubt, prefer one goal per test, but don’t fragment a single coherent workflow into trivial pieces.

When a request isn’t reliably testable

If the idea depends on something Semaloop can’t observe well (a brief loading state, exact copy that changes often, a precise visual layout, or timing like “loads within two seconds”), rewrite it toward the nearest durable check and note the change in one line. Only flag it as unsuitable if there’s no durable outcome to anchor to.

Clarifying

Make reasonable assumptions and proceed. Only ask the user a question if the idea is too ambiguous to know what workflow is being tested. Ask at most one question.

Examples

Example 1 — Input:

Open the app, tap Search, type tennis, tap the second result, hit Play, then tap the summary icon and wait for the spinner to go and check the green “Summary ready” banner shows up and the summary looks right.

Output:

Play audio with summary: Play audio about tennis and check that a summary is shown for it.

Example 2 — Input:

Test that a user can sign in, create a playlist, add a few tracks to it, delete an old note, and update their profile photo.

Output:

Create playlist: Create a new playlist and check it appears in your playlists.
Add track to playlist: Add a track to a playlist and check it appears in that playlist.
Delete note: Delete a note and check it no longer appears in the notes list.
Update profile photo: Change the profile photo and check the new photo is shown on the profile.

Note: Split into separate tests — these are distinct goals, so a failure points straight at the broken one. Sign-in is setup, so it’s a precondition rather than a test.

Example 3 — Input:

Check the water tracker adds up. Log a 500ml glass and make sure the total goes up.

Output:

Log water: Log a 500ml glass of water and check today’s water total increases by 500ml.
