The problem with evaluating audit software is that every vendor's demo uses cleaned, curated data that makes the product look better than it performs in production. The demo dataset was designed to showcase the features. Your client's data wasn't. The journal entry population you'll actually process has ERP-specific quirks, account structures the demo didn't model, and edge cases the vendor's team hasn't encountered yet.
By the time you discover these gaps, you've committed to the platform, trained your staff, and incorporated it into your engagement planning. Reversing that commitment mid-cycle is expensive. The evaluation process is where you find the gaps — but only if you ask the questions that reveal them.
The Demo Is a Sales Presentation, Not a Technical Review
This sounds obvious, but the evaluation processes at most firms don't reflect it. Partners attend a 45-minute product demonstration, the software looks impressive, references are checked, pricing is negotiated, and a purchase decision is made. The technical questions — about methodology, data handling, documentation, failure modes — aren't asked because nobody in the room is there to ask them.
The partner selecting the tool is primarily evaluating whether it will save time and whether the vendor seems credible. Both criteria are necessary, but neither is sufficient. The staff auditors who will use the tool daily and the quality control partners who will review the workpapers it produces should be involved in the evaluation. The questions they ask are different from the questions a managing partner asks, and they're the questions that matter for production use.
The Five Questions Vendors Don't Volunteer Answers To
1. What is the false positive rate on a typical engagement, and how is it defined?
Vendors readily share true positive rates (detection accuracy on anomalies). The false positive rate — the fraction of non-anomalous entries that the system flags as suspicious — is the number that determines how much time you'll spend clearing false alarms. A vendor who can't tell you their false positive rate, or who defines it in a way that minimizes its practical significance, is either not tracking it or knows it's not favorable.
The follow-up question: what threshold is used for the published accuracy figures, and what is the false positive rate at that threshold versus at the default threshold that auditors use in practice? These numbers may be different, and the difference is meaningful.
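To make the threshold question concrete, the sketch below computes the false positive rate at two cut-offs on a small labeled sample. The scores and labels are hypothetical; in a real evaluation they would come from the vendor's output and your own adjudication of each entry.

```python
# Minimal sketch: false positive rate at two thresholds on a labeled sample.
# Scores and labels are hypothetical placeholders; in practice they come
# from the platform's output and your own review of each entry.

def false_positive_rate(scores, labels, threshold):
    """Fraction of non-anomalous entries flagged at the given threshold."""
    flagged_normals = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    total_normals = sum(1 for y in labels if y == 0)
    return flagged_normals / total_normals

# labels: 1 = confirmed anomaly, 0 = normal entry
scores = [0.95, 0.40, 0.72, 0.15, 0.88, 0.30, 0.65, 0.10]
labels = [1,    0,    0,    0,    1,    0,    0,    0]

for threshold in (0.9, 0.6):  # published threshold vs. a looser default
    fpr = false_positive_rate(scores, labels, threshold)
    print(f"threshold={threshold}: FPR={fpr:.0%}")
```

On this toy sample the same model shows a 0% false positive rate at the strict threshold and 33% at the looser one. That gap is the practical point: every additional point of false positive rate at the threshold your staff actually use is cleared by hand.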
2. What happens when my client's chart of accounts doesn't match your training data?
Every anomaly detection model is trained on a specific population. The model's performance on populations that differ significantly from the training data may be substantially worse than the published benchmarks. Clients in niche industries — specialty finance, government contractors, healthcare providers with complex billing — often have account structures that are unusual relative to the training datasets used by most vendors.
Ask specifically: what industries and account structures are represented in the training data? What performance degradation has been observed on clients outside those industries? Does the model require a calibration period on new clients, and if so, what is the quality of results during that period?
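One way to put a number on "unusual relative to the training data" is a simple distribution comparison. The sketch below computes a population stability index (PSI) between a client's journal-entry mix by account category and a baseline mix. Both distributions here are made up; the baseline would have to come from whatever the vendor can disclose about its training population.

```python
import math

# Minimal sketch: population stability index (PSI) between a client's
# journal-entry mix by account category and a baseline distribution.
# Both distributions are hypothetical; the baseline would come from the
# vendor's description of its training population.

def psi(baseline, observed, eps=1e-6):
    """PSI over the baseline's categories; higher means a larger shift."""
    total = 0.0
    for category in baseline:
        b = max(baseline[category], eps)
        o = max(observed.get(category, 0.0), eps)
        total += (o - b) * math.log(o / b)
    return total

baseline = {"revenue": 0.30, "expense": 0.40, "accrual": 0.20, "intercompany": 0.10}
client   = {"revenue": 0.10, "expense": 0.25, "accrual": 0.15, "intercompany": 0.50}

print(f"PSI = {psi(baseline, client):.2f}")  # large value: ask about calibration
```

A PSI above roughly 0.25 is conventionally read as a large shift, the point at which the vendor's published benchmarks deserve skepticism for this client.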
3. Show me a workpaper produced for an engagement that was subsequently inspected by the PCAOB.
This is the most revealing request you can make. If the vendor has clients whose engagements have gone through PCAOB inspection and the workpapers generated by the platform have been reviewed, the vendor should be able to describe whether those workpapers satisfied the inspectors' documentation requirements. If the vendor has never had a client engagement inspected, or can't speak to how their documentation format has held up under inspection, that's material information about how new the product is in production.
The PCAOB has specific views about what IT-assisted JE testing documentation should contain. A product that has been in production through multiple inspection cycles has had its documentation format tested. A product that hasn't been through inspection is untested against the standard that matters most to PCAOB-registered firms.
4. What does the error handling look like when a data extraction fails mid-process?
The demo always completes without errors. Production use involves connectivity failures, authentication timeouts, ERP updates that break integrations, and data format changes that cause import failures. The question is not whether these failures occur — they do — but what the platform does when they occur and how completely they are logged and reported.
Specifically: if a SuiteQL extraction fails after retrieving 70% of the population and the failure is logged as a successful extraction, you've just tested 70% of your population and documented it as 100%. That's an audit quality problem. Ask the vendor to demonstrate or describe error handling for partial extraction failures and for import format mismatches.
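Whatever the platform logs, there is one safeguard you can apply on your side: reconcile the extracted row count against an independent count from the source system before treating the extraction as complete. A minimal sketch follows. The counts are placeholders, and in a NetSuite environment the source count might come from a separate SuiteQL COUNT query run outside the platform.

```python
# Minimal sketch: refuse to treat a partial extraction as complete.
# The counts are placeholders; in practice the source count would come
# from an independent query against the ERP, not from the export itself.

class IncompleteExtractionError(Exception):
    pass

def verify_extraction(extracted_rows: int, source_count: int) -> None:
    """Raise if the extract does not cover the full population."""
    if extracted_rows != source_count:
        raise IncompleteExtractionError(
            f"extracted {extracted_rows} of {source_count} rows "
            f"({extracted_rows / source_count:.1%} coverage); "
            "do not document this population as fully tested"
        )

# The failure mode described above: 70% retrieved, logged as success.
try:
    verify_extraction(extracted_rows=70_000, source_count=100_000)
except IncompleteExtractionError as err:
    print(f"extraction rejected: {err}")
```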
5. How is the model updated, and how are customers notified?
Anomaly detection models should improve over time as more training data becomes available. The question is what happens to engagements in progress when the model is updated. If the model version changes between the first and second scan of the same engagement, the scoring may produce different results for the same entries — which is a workpaper consistency problem if both scans are referenced in the file.
The vendor should have a model versioning policy that ensures the model version used for an engagement is pinned for the duration of that engagement and documented in the workpaper export. Model updates should be applied between engagements, not during them. Ask for the written policy, not a verbal description of how they think about it.
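The pinning rule is simple enough to state in code. A minimal sketch, with hypothetical field names, of the behavior you want the platform to guarantee:

```python
# Minimal sketch: pin the model version for the life of an engagement.
# The metadata fields are hypothetical; the point is that the version
# recorded at the first scan is the only version accepted afterwards.

class ModelVersionMismatch(Exception):
    pass

def check_pinned_version(engagement: dict, current_model_version: str) -> dict:
    """Record the version on first scan; reject any change mid-engagement."""
    pinned = engagement.get("pinned_model_version")
    if pinned is None:
        engagement["pinned_model_version"] = current_model_version
    elif pinned != current_model_version:
        raise ModelVersionMismatch(
            f"engagement pinned to {pinned}, but platform now runs "
            f"{current_model_version}; rescoring would be inconsistent"
        )
    return engagement

engagement = {"id": "FY25-0042"}
check_pinned_version(engagement, "anomaly-model-3.1")      # first scan pins 3.1
try:
    check_pinned_version(engagement, "anomaly-model-3.2")  # update mid-engagement
except ModelVersionMismatch as err:
    print(f"blocked: {err}")
```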
The Evaluation Protocol That Actually Works
The evaluation protocol that reveals production performance versus demo performance requires three things: real data, real criteria, and a comparison standard.
Real data: Run the evaluation on an anonymized extract from a completed prior-year engagement. Not a small test file — the full journal entry population from a real engagement. This is the only way to observe how the platform handles your clients' actual account structures and volume. Most vendors will support this under an NDA covering the data handling for the evaluation.
Real criteria: Before running the evaluation, document what a successful result looks like. Specifically: what is the expected range of flagged items (as a percentage of total entries) that would indicate appropriate sensitivity? What documentation format do you need the export to produce to satisfy your workpaper requirements? Who on your team will review the flagged items, and within what time frame? Defining the criteria before seeing the output prevents the rationalization that happens when a vendor demonstrates a result and you retroactively decide it was acceptable. These criteria are sketched as scripted checks, together with the comparison standard, after the next paragraph.
A comparison standard: Run the same evaluation on a prior-year engagement where you already know the outcome. What anomalies did your manual testing identify? Which of those does the platform also identify? Which does it miss? Which does it identify that your manual testing didn't? This comparison tells you whether the platform is a net improvement over your current methodology, not just whether it produces output.
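Both the criteria and the comparison reduce to checks you can script once the platform export and your prior-year findings are in hand. A minimal sketch, assuming both sides can be reduced to sets of journal entry IDs and that the flag-rate band was committed to in advance (all values below are toy placeholders):

```python
# Minimal sketch of the criteria check and the comparison standard.
# Entry IDs, the flag-rate band, and the findings are toy placeholders.

criteria = {"flag_rate_min": 0.005, "flag_rate_max": 0.03}  # committed up front

population_size = 400  # toy scale; a real population is far larger
platform_flags = {"JE-1042", "JE-2087", "JE-3311", "JE-4410"}
manual_findings = {"JE-1042", "JE-3311", "JE-5555"}  # known prior-year anomalies

flag_rate = len(platform_flags) / population_size
within_band = criteria["flag_rate_min"] <= flag_rate <= criteria["flag_rate_max"]

confirmed = platform_flags & manual_findings   # found by both approaches
missed = manual_findings - platform_flags      # known anomalies the platform missed
new_finds = platform_flags - manual_findings   # flags manual testing never raised

print(f"flag rate {flag_rate:.2%} within band: {within_band}")
print(f"confirmed: {sorted(confirmed)}")
print(f"missed:    {sorted(missed)}")
print(f"new finds: {sorted(new_finds)}")
```

The missed set is the one to take seriously: each item in it is a known anomaly the platform would not have surfaced. The new finds need adjudication before they count in the platform's favor, since each one is either a real anomaly your manual testing missed or a false positive.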
Red Flags That Should End the Evaluation
Several responses to evaluation questions should end the evaluation rather than trigger further due diligence:
The vendor can't share a SOC 2 report or can only share a Type I report (which attests to the design of controls at a point in time, not their operating effectiveness over a period). For a tool processing client financial data, Type II is the minimum bar.
The vendor's data handling involves client data leaving the firm's control without a documented legal basis for that transfer and a data processing agreement that specifically covers financial data. Accounting firms have professional confidentiality obligations under state CPA statutes and in some cases contractual obligations to clients — a vendor who is unclear about where data goes and what legal protections apply isn't a viable partner for this use case.
The vendor can't describe what the model tests for in plain language, without reference to proprietary methods. "We use proprietary AI" is not a description of methodology. Auditors need to be able to describe what procedure was performed and why it provides evidence relevant to the objective. If the vendor can't describe the procedure clearly enough for you to document it, you can't document it.
The vendor's workpaper export doesn't include the complete population with scores, only the flagged items. A complete-population workpaper documents the population, the scoring applied to every entry in it, and the results. An exceptions-only report documents only the results and leaves the population definition and coverage unstated.
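This red flag is testable during the evaluation: the export should contain one scored row per entry in the population, not just the exceptions. A minimal sketch, with hypothetical column names:

```python
import csv
import io

# Minimal sketch: verify a workpaper export covers the full population with
# a score on every row. Column names ("entry_id", "score") are hypothetical;
# substitute whatever the vendor's export actually uses.

def export_is_complete(export_csv: str, population_size: int) -> bool:
    rows = list(csv.DictReader(io.StringIO(export_csv)))
    all_scored = all((row.get("score") or "").strip() != "" for row in rows)
    return len(rows) == population_size and all_scored

# A toy exceptions-only export: two flagged rows out of a population of four.
exceptions_only = "entry_id,score\nJE-1042,0.91\nJE-3311,0.88\n"
print(export_is_complete(exceptions_only, population_size=4))  # False
```

An exceptions-only export fails immediately on the row count alone, before you ever argue about methodology.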
The Pilot Structure That Limits Your Exposure
If you've completed the evaluation protocol and the vendor has passed the key tests, a structured pilot limits financial and operational exposure before full commitment. The pilot should cover one complete engagement, using your data, with your staff as the primary users, over a defined period (60 to 90 days is typical). The pilot contract should specify the performance criteria that, if unmet, allow termination without penalty.
The goal of the pilot is not to see whether the software works in general — you've already established that in the evaluation. The goal is to verify that it works in your specific workflow, with your staff's technical capabilities, on your clients' actual ERP environments. Those are the conditions that determine whether the investment in training and process change will produce the intended benefit.
AuditPulsar's standard pilot program runs exactly this way: 60 days, one engagement, your data, our support team available throughout. We tell firms what we find and what we think the product is and isn't ready to do in their specific environment before they make a commitment. That's the evaluation process we'd recommend applying to any tool in this category, including ours.