Blawx-Powered Agentic Coding

How coding agents can use Blawx to write compliant software

Jan. 31, 2026

Now that I have a working version of Blawx-MCP, I'm able to start doing the kinds of experiments I had in mind a few weeks ago.

The Concept

As a quick reminder, here's the idea. What if:

  • We do an encoding of a set of rules in Blawx
  • We use Blawx's abducible reasoning features to generate a question that describes all the scenarios in which a legal fact might hold
  • We give a coding agent the ability to run that question and review the answers
  • We tell the coding agent to generate property tests for each of the scenarios returned
  • We tell the coding agent to write software that passes all of those tests

This would be a practical way to use a Blawx encoding to put what a subject matter expert knows about the rules in the hands of a software developer, without the software developer ever needing to know the intricacies of how the rule works.

If the code passes the tests, it is consistent with the understanding expressed in the Blawx encoding by the subject matter expert. The developer and the subject matter expert are on the same page, even if they've never met!

(This is in an idealized case, where a lot of things work perfectly that might not. But even if they don't, with the pace of change in LLMs, we should probably be working on tools they aren't quite ready to use, anyway.)

The Experiment

I'm re-using the Blawx encoding that I used in the G7 GovAI Challenge, which encodes a number of rules around procurement inside the Government of Canada, and detects violations of those rules.

I had my coding agent generate a data structure that includes all of the pieces of information that can be represented in the Blawx encoding. I then had it create a basic Python function to answer the question of whether there is a procurement violation. It accepts that data structure, and initially, it just always says "no".
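To make that starting point concrete, here is a minimal sketch of what such a data structure and stub might look like. The field names and the names check_procurement_violation, ContractFacts and ViolationResult are taken from the generated function shown later in this post. The types, the defaults, and the helper classes Entity and TreasuryBoardApproval are my assumptions, not the contents of the actual repo.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

# Return type used by the generated function: (violated, explanation).
ViolationResult = Tuple[bool, str]


@dataclass
class Entity:
    """Hypothetical wrapper for a contracting authority."""
    entity_id: str  # e.g. "transport_canada"


@dataclass
class TreasuryBoardApproval:
    """Hypothetical record of one Treasury Board approval."""
    occurred_on: date


@dataclass
class ContractFacts:
    # Field names come from the generated function; types and defaults are guesses.
    contracting_authority: Entity
    entered_into_on: date
    estimated_expenditure: float
    is_competitive: bool
    is_contracting_authority_crown_corporation: bool = False
    delay_from_soliciting_bids_injurious_to_public: bool = False
    not_in_public_interest_to_solicit_bids: bool = False
    only_one_person_can_perform: bool = False
    bid_solicitations: List[str] = field(default_factory=list)
    treasury_board_approvals: List[TreasuryBoardApproval] = field(default_factory=list)


def check_procurement_violation(facts: ContractFacts) -> ViolationResult:
    """Starting-point stub: always answers "no", whatever the facts are."""
    return (False, "No procurement violation detected.")
```

A stub like this gives the tests something to import and fail against before any rule logic exists.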

I then asked the agent to run the "Is there a violation" question against the "all inputs unknown" fact scenario, and generate property tests that represent each possible explanation for the answer.

So now, I have software that doesn't work (the function that always says no), and a definition of how working code would behave (the set of tests that represent all the ways Blawx says a violation might exist).
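For readers who haven't used property tests, here is a rough sketch of what a test derived from a single explanation could look like in Hypothesis. It is not one of the generated tests. Facts and violates are hypothetical, trimmed-down stand-ins for the real data structure and function, and the department list and the 40,000 threshold are copied from the generated function shown later in this post.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st

# Departments covered by this scenario, copied from the generated function.
COVERED = [
    "transport_canada",
    "public_services_and_procurement_canada",
    "environment_canada",
    "fisheries_and_oceans_canada",
]


@dataclass
class Facts:
    """Hypothetical, trimmed-down stand-in for the real facts structure."""
    entity_id: str
    is_competitive: bool
    estimated_expenditure: float
    bid_solicitation_count: int
    any_s6_exception: bool


def violates(facts: Facts) -> bool:
    """Stand-in for the function under test (the real one returns a tuple)."""
    return (
        facts.entity_id in COVERED
        and not facts.is_competitive
        and facts.estimated_expenditure > 40_000
        and facts.bid_solicitation_count == 0
        and not facts.any_s6_exception
    )


# Strategy that only produces facts inside the scenario from one explanation.
scenario_facts = st.builds(
    Facts,
    entity_id=st.sampled_from(COVERED),
    is_competitive=st.just(False),
    estimated_expenditure=st.floats(
        min_value=40_000, max_value=1e9, exclude_min=True, allow_nan=False
    ),
    bid_solicitation_count=st.just(0),
    any_s6_exception=st.just(False),
)


@given(scenario_facts)
def test_unsolicited_noncompetitive_contract_is_violation(facts: Facts) -> None:
    # Every fact pattern in the scenario must be reported as a violation.
    assert violates(facts)
```

Against a function that always says "no", a test of this shape fails on the first generated input, which is what makes the always-no stub a useful starting point.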

Then, I asked the coding agent to use the test results to make the function pass all the tests.

This is the function that was generated:

The Generated Function

def check_procurement_violation(facts: ContractFacts) -> ViolationResult:
    """Return (violated, explanation).

        This function evaluates a small set of currently implemented procurement rules
        against the provided contract facts.

        Implemented checks (in order):
        - GCR s5-style bid solicitation requirement for non-competitive contracts over
            40,000 entered into by a subset of departments. If no bid solicitations are
            present and no GCR s6-style exception flags are set (urgency/public interest/
            sole source), returns a violation.
        - DMP 4.6.1.1-style Treasury Board approval requirement when the estimated
            expenditure exceeds a basic contracting limit (limits depend on contracting
            authority and whether the contract is competitive). Approval must occur on or
            before the contract's entered-into date.

        If none of the implemented rules are triggered, returns (False, "No procurement
        violation detected.").
    """

    approved_before_entered_into = any(
        approval.occurred_on <= facts.entered_into_on for approval in facts.treasury_board_approvals
    )

    # Rule (Blawx-derived): GCR Section 5 bid solicitation requirement.
    # If a non-competitive contract over 40,000 was entered into without soliciting bids,
    # and no GCR Section 6 exception applies, the contract is in violation.
    if (
        facts.contracting_authority.entity_id
        in {
            "transport_canada",
            "public_services_and_procurement_canada",
            "environment_canada",
            "fisheries_and_oceans_canada",
        }
        and facts.is_contracting_authority_crown_corporation is False
        and facts.is_competitive is False
        and facts.estimated_expenditure > 40_000
        and len(facts.bid_solicitations) == 0
        and facts.delay_from_soliciting_bids_injurious_to_public is False
        and facts.not_in_public_interest_to_solicit_bids is False
        and facts.only_one_person_can_perform is False
    ):
        return (
            True,
            "Contracting violation (GCR s5): Non-competitive contracts over 40000 require bids to be solicited "
            "unless a GCR s6 exception applies.",
        )

    # Rule (Blawx-derived): Treasury Board approval required above basic contracting limits.
    # For the current model, we encode basic limits by contracting authority and competition.
    basic_contracting_limits = {
        # DMP-derived limits
        "national_defence": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        "canada_revenue_agency": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        "invest_in_canada_hub": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        # "Other entity" (represented in tests by a fixed set of departments)
        "transport_canada": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        "public_services_and_procurement_canada": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        "environment_canada": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
        "fisheries_and_oceans_canada": {"noncompetitive": 200_000.0, "competitive": 3_750_000.0},
    }

    limits = basic_contracting_limits.get(facts.contracting_authority.entity_id)
    if limits is not None:
        limit_key = "competitive" if facts.is_competitive else "noncompetitive"
        if facts.estimated_expenditure > limits[limit_key] and not approved_before_entered_into:
            return (
                True,
                "Contracting violation (DMP 4.6.1.1): Contract exceeds the basic contracting limit and requires "
                "Treasury Board approval before the contract is entered into.",
            )

    return (False, "No procurement violation detected.")

On its face, this seems like one of the (infinitely many) ways you could accurately implement the encoded rules in an imperative language. I haven't dug in too deep, but it appears correct.

If you'd like to see all the tests that the coding agent generated, take a look at the GitHub repo for the generated code here.

Some Impressions and Lessons Learned

First of all, there is no longer any doubt that coding agents can do each of the steps in the process described above. It's still an open question whether, and under what circumstances, they can do them reliably.

I anticipated that the property testing library I was using (Hypothesis) would always run each property test against a large number of inputs. In fact, it only runs a test until it finds one input that fails. So running the property tests was extremely fast at first, and slowest once the code was correct. Hypothesis runs 100 scenarios through each property by default. Running 100 scenarios across all 14 properties took a negligible amount of time. I raised that to 1000 per property, and the run took approximately 30 seconds.
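One way to change the number of inputs per property in Hypothesis is the settings decorator with max_examples. The sketch below is illustrative only: the test names, the trivial property, and the calls counter are made up, and it is not necessarily how the setting was changed in this experiment.

```python
from hypothesis import given, settings, strategies as st

# Counts how many inputs each test actually received.
calls = {"default": 0, "raised": 0}


@given(st.integers())
def test_with_default_budget(n: int) -> None:
    # No settings given, so Hypothesis uses its default example budget.
    calls["default"] += 1
    assert n + 0 == n


@settings(max_examples=1000)
@given(st.integers())
def test_with_raised_budget(n: int) -> None:
    # Same property, but with a budget of up to 1000 generated inputs.
    calls["raised"] += 1
    assert n + 0 == n
```

Both functions can be run under pytest or called directly with no arguments. Comparing the two counters afterwards shows how much more work the larger budget does when the property never fails.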

The Blawx question returned 13 explanations for the same answer. It was difficult to persuade the agent to generate property tests from these answers one at a time. Eventually, I was able to persuade it to do that, but it took a lot of coaxing, and I ended up needing to type "now do the next one" twelve times!

The coding agent seemed very interested in reducing the fact scenarios that would be run through the property tests to only those that it knew were inside, outside, or on the border of the covered area of facts. I can see where this might be useful if you have an extraordinarily large number of fact scenarios or properties, and for efficiency reasons you want to limit the inputs for a given property to "close calls". But what I wanted to do was to run a wide variety of fact scenarios across all of the property tests, because it is passing all of the property tests that shows equivalence to the Blawx encoding. Persuading the model to use the testing library that way took effort. That may be because I'm using property testing in a non-idiomatic way, or it may be because the model does not have a good intuition for how to use it. I'm not sure.
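The difference between the two styles is easier to see in code. In the sketch below, a single broad strategy generates facts with no knowledge of any scenario, and the property is written as an implication: if the facts fall inside the scenario, the function must report a violation. Facts, violates and any_facts are hypothetical stand-ins. The limits are copied from the generated function above. This is my guess at the shape of the approach, not the repo's actual tests.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

from hypothesis import given, strategies as st

# Basic contracting limits copied from the generated function.
LIMITS = {"noncompetitive": 200_000.0, "competitive": 3_750_000.0}


@dataclass
class Facts:
    """Hypothetical, trimmed-down facts for the Treasury Board approval rule."""
    is_competitive: bool
    estimated_expenditure: float
    entered_into_on: date
    approval_dates: List[date]


def violates(facts: Facts) -> bool:
    """Stand-in for the function under test."""
    limit = LIMITS["competitive" if facts.is_competitive else "noncompetitive"]
    approved = any(d <= facts.entered_into_on for d in facts.approval_dates)
    return facts.estimated_expenditure > limit and not approved


days = st.dates(min_value=date(2020, 1, 1), max_value=date(2030, 12, 31))

# One broad strategy shared by every property: nothing here is tuned to a scenario.
any_facts = st.builds(
    Facts,
    is_competitive=st.booleans(),
    estimated_expenditure=st.floats(min_value=0, max_value=10_000_000, allow_nan=False),
    entered_into_on=days,
    approval_dates=st.lists(days, max_size=3),
)


@given(any_facts)
def test_over_limit_without_approval_is_violation(facts: Facts) -> None:
    limit = LIMITS["competitive" if facts.is_competitive else "noncompetitive"]
    in_scenario = (
        facts.estimated_expenditure > limit
        and all(d > facts.entered_into_on for d in facts.approval_dates)
    )
    # The property is an implication: inputs outside the scenario pass trivially.
    if in_scenario:
        assert violates(facts)
```

The trade-off is that many generated inputs land outside any given scenario and pass trivially, which is presumably the inefficiency the agent was trying to avoid by narrowing the inputs.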

Perhaps most interestingly, I'm not sure that this task was difficult enough for running the tests to make a difference to the code quality. When I asked the agent to modify the function so that it passed the tests, it succeeded on the first try. But that is not to say that having the tests didn't help. Reading the test files, without ever running them, may have been enough for the agent to make the code work.

Some Stuff to Try Next

This raises a bunch of really interesting questions. Do the tests work better than the statutory text for telling the coding agent how to write the function? When? Is there a level of complexity for which the tests work better than the law? Or are the models so much better at reading code than at reading laws that the tests work better almost immediately? How complicated does the encoding task need to be before the agent can't get all the tests to pass on the first try? How much law? How many inputs?

I'm impressed, and excited to push the idea further. But that will probably have to wait until after I have given the coding agents a chance to try their hands at writing Blawx code.