Voiceform | Does AI Give You Different Advice Because of Your Accent? Case Study

About

Alejandro Salinas is an NLP Research Fellow at Stanford Law School's Lift Lab, where he works on model evaluation with a focus on algorithmic fairness, computational law, and AI governance. A Physics Engineer and Lawyer by training (Tec de Monterrey), he previously interned at Meta and now builds the first large-scale human-speech benchmark for audio-language models.

The gap in AI evaluation

Audio-language models are a new class of system. Instead of the traditional cascade pipeline (record audio, transcribe it to text, then hand the text to a language model), they take audio directly as input. That single change makes them fundamentally different from the speech systems that came before, and it means the evaluation tooling built for older models no longer tells you what you need to know.

Meet Alejandro Salinas. He has spent the last two and a half years as an NLP research fellow at Stanford Law School’s Lift Lab, where model evaluation is a central pillar of the work. The lab partners with legal-services organizations to study how technology actually affects the people who depend on it, which makes rigorous, fair evaluation more than an academic concern. When Salinas turned to audio-language models, he found an open problem: this generation of models had barely been benchmarked at all. The reason wasn’t lack of interest. It was data. Building a real benchmark requires real human speech, captured at scale, under controlled conditions, and most researchers assume that data is simply out of reach.

So the team set out to build what they believe is the first large-scale human-speech benchmark of its kind: voice recordings of advice-seeking prompts, captured from participants through Voiceform. The prompts mirror everyday situations where someone might naturally turn to a voice assistant, precisely the setting where audio-language models are used most. And the benchmark was designed, from the start, around fairness. Every participant reads the same prompts aloud, holding the text constant so the team can isolate the variables that matter:

Does the model give different advice depending on the speaker’s race, gender, or accent?

Why synthetic data wasn’t an option

Most researchers benchmarking these models reach for synthetic speech. It’s the path of least resistance, but it can’t capture the things that actually drive disparate model behavior: real accents, dialects, prosody, tone, and the incidental background noise of how people really talk. A benchmark built on synthetic voices measures a world that doesn’t exist.

Salinas’ view is that the reliance on synthetic data is less a deliberate choice than a blind spot: many teams don’t realize that collecting real voice data at scale is now a tractable problem.

“People tend to only use synthetic data and I think one of the reasons is that they don’t know tools like Voiceform exist.” — Alejandro Salinas

Going in, the team expected this to be the easy part. Surely plenty of vendors offered survey-based voice collection? In practice, only a handful of companies do it well - and almost none paired collection with the thing the project actually needed: a clean API to pull every recording, with its metadata, the moment it came in.

“We wanted a service that could also give us an API to fetch all of our data immediately and in a seamless way.” - Alejandro Salinas

Standing up the collection pipeline

Salinas, a non-developer, built the entire pipeline himself, working from Voiceform’s documentation. He authored a Voiceform voice survey of roughly 150 crafted prompts, then embedded it directly into the lab’s existing survey instrument so the voice-capture step lived inside a survey flow participants were already familiar with.

Rather than launch everything at once, he ran a deliberate pilot first. He selected a random subset of prompts, embedded them in the survey, tested the flow repeatedly, and shipped it to a small group of participants. The pilot had two jobs: confirm the participant experience held up, and let him learn exactly how the API behaved before scaling.
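Drawing a repeatable random subset for a pilot like this one might look like the sketch below. The prompt IDs, the pilot size of 20, and the seed are all illustrative assumptions; the case study does not say how many prompts the pilot used or how they were drawn.

```python
import random


def pick_pilot_prompts(prompt_ids, k, seed=0):
    """Return k prompt IDs chosen uniformly at random without replacement.

    A fixed seed makes the draw repeatable, so the same pilot subset can be
    rebuilt later when results are reviewed.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(prompt_ids), k))


# Hypothetical IDs for a bank of about 150 prompts; k=20 is an arbitrary choice.
all_prompts = [f"q{n:03d}" for n in range(1, 151)]
pilot = pick_pilot_prompts(all_prompts, k=20)
```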

The pilot validated the parts that usually break at scale:

• Voice capture worked cleanly: participants reported no problems with recording, and several said they enjoyed the survey.

• Controls behaved correctly: participants couldn’t skip questions, and could go back to re-record a previous answer when needed.

• The API was the real unlock: response IDs and metadata could be pulled and paired back to each participant, making large-scale data handling straightforward.
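The pairing step in the last bullet can be sketched roughly as follows. This is a minimal illustration, not Voiceform’s actual API: the field names (`response_id`, `participant_id`, `prompt_id`, `audio_url`) and the sample records are hypothetical, and the sketch assumes the responses have already been fetched as a list of dictionaries.

```python
from collections import defaultdict


def pair_responses(responses, participants):
    """Group fetched voice responses under the participant who recorded them.

    responses: list of dicts with hypothetical keys
        'response_id', 'participant_id', 'prompt_id', 'audio_url'
    participants: dict mapping participant_id -> metadata dict
    Returns (paired, orphans): paired maps participant_id -> list of responses,
    orphans lists responses whose participant_id is not in participants.
    """
    paired = defaultdict(list)
    orphans = []
    for r in responses:
        pid = r.get("participant_id")
        if pid in participants:
            paired[pid].append(r)
        else:
            orphans.append(r)
    return dict(paired), orphans


# Hypothetical sample data, for illustration only.
participants = {"p1": {"survey_row": 1}, "p2": {"survey_row": 2}}
responses = [
    {"response_id": "r1", "participant_id": "p1", "prompt_id": "q01", "audio_url": "a1.wav"},
    {"response_id": "r2", "participant_id": "p2", "prompt_id": "q01", "audio_url": "a2.wav"},
    {"response_id": "r3", "participant_id": "p1", "prompt_id": "q02", "audio_url": "a3.wav"},
    {"response_id": "r4", "participant_id": "zz", "prompt_id": "q01", "audio_url": "a4.wav"},
]
paired, orphans = pair_responses(responses, participants)
```

Keeping unmatched responses in a separate list, rather than dropping them silently, makes it easier to spot a broken survey link or a missing ID before the data reaches analysis.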

“It was really easy for me, as a non-developer, to build the survey, and use the API - just through the well-written documentation.” - Alejandro Salinas

With the pilot proven, he ran the full study - and the scale is what made it unprecedented. In a matter of hours, the survey reached hundreds of participants. In total the team collected more than 700 interviews, each one a single participant answering 30 to 40 questions, with recordings running from a few seconds to two minutes. Every participant contributed roughly 20 to 30 minutes of rich audio, all retrievable with full metadata through the API: the raw material for a benchmark that synthetic data could never produce.
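To put those figures in perspective, a back-of-the-envelope calculation gives the approximate size of the corpus. It uses only the numbers quoted above and treats “more than 700” as a floor, so the true totals would be somewhat higher.

```python
# Figures quoted in the case study; "more than 700" is treated as a floor.
INTERVIEWS = 700
MINUTES_PER_PARTICIPANT = (20, 30)  # rough range given in the text

# Convert total minutes to hours for the low and high ends of the range.
low_hours, high_hours = (INTERVIEWS * m / 60 for m in MINUTES_PER_PARTICIPANT)
print(f"Roughly {low_hours:.0f} to {high_hours:.0f} hours of audio")
```

On these assumptions the dataset holds somewhere between about 233 and 350 hours of human speech.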

That combination of real human speech, long-form responses, and genuine scale is what the field had been missing. Where state-of-the-art benchmarks lean on synthetic speech or tiny datasets of five to twenty participants, here the team had over 700, each one a real person.

“We collected more than 700 interviews - something that, with this set of characteristics, was unprecedented for building an audio-language-model benchmark.” - Alejandro Salinas

What the benchmark revealed

The dataset did exactly what a good benchmark is supposed to do: it exposed behavior that would otherwise stay hidden.

It confirmed a long-standing problem in newer models

For six to seven years, researchers have shown that automated speech-recognition systems transcribe differently depending on a speaker’s accent and demographic characteristics. Using the newly collected dataset, the team confirmed that this disparity persists in newer transcription models: evidence that the problem hasn’t been solved, only carried forward.
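One common way to quantify a transcription disparity is to compare word error rate (WER) across speaker groups. The case study does not say which metric the team used, so the sketch below is an assumption on that point, and the group labels and sentences are made up for illustration.

```python
from collections import defaultdict


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """samples: iterable of (group, reference_text, transcript) tuples."""
    scores = defaultdict(list)
    for group, ref, hyp in samples:
        scores[group].append(word_error_rate(ref, hyp))
    return {g: sum(v) / len(v) for g, v in scores.items()}


# Hypothetical data: every speaker reads the same prompt text.
samples = [
    ("group_a", "should i sign this lease", "should i sign this lease"),
    ("group_b", "should i sign this lease", "should i sing this least"),
]
wer = mean_wer_by_group(samples)
```

Because every participant reads the same text, the reference transcript is known in advance, which is what makes a per-group comparison like this possible.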

It surfaced a surprising result on audio models

When the team pushed audio-language models through transcription tasks — something they aren’t actually built for — the expected disparities largely disappeared. One hypothesis: the language model embedded in these systems can recognize and clean up errors that a pure speech model would propagate. It’s an early finding, but a genuinely interesting one.

It exposed disparities in the models’ actual job

On the task that matters — giving advice — the models behaved differently across speakers who read the exact same prompt. Because the text was held constant, those differences point to everything else the audio carries: accent, dialect, prosody, tone, even background noise. The mechanisms still need to be isolated, but the disparities are already measurable in today’s audio-language models.
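A simple first look at this kind of disparity is to compare, prompt by prompt, how often some feature appears in the advice given to each speaker group. The sketch below is a hypothetical illustration rather than the team’s method: it assumes each model response has already been annotated with a yes/no flag (for example, whether the model suggested seeking professional help), and the group labels and records are invented.

```python
from collections import defaultdict


def outcome_gap_by_prompt(records):
    """For each prompt, return the largest between-group difference in the
    share of responses that carry a binary flag.

    records: iterable of (prompt_id, group, flagged) tuples, where flagged is
    a bool produced by a hypothetical annotation step.
    """
    # counts[prompt_id][group] = [number flagged, total responses]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for prompt_id, group, flagged in records:
        cell = counts[prompt_id][group]
        cell[0] += int(flagged)
        cell[1] += 1
    gaps = {}
    for prompt_id, groups in counts.items():
        rates = [flagged / total for flagged, total in groups.values()]
        gaps[prompt_id] = max(rates) - min(rates)
    return gaps


# Hypothetical annotated responses, for illustration only.
records = [
    ("q01", "accent_a", True), ("q01", "accent_a", True),
    ("q01", "accent_b", True), ("q01", "accent_b", False),
    ("q02", "accent_a", False), ("q02", "accent_b", False),
]
gaps = outcome_gap_by_prompt(records)
```

A raw gap like this only flags where to look. Deciding whether a difference is real would still need enough samples per group and an appropriate statistical test.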

“Even though they are saying the same prompt, speakers are receiving different advice.” - Alejandro Salinas

Why it matters

The hardest part of evaluating modern AI usually isn’t the analysis. It’s getting data that reflects reality: real human voices, captured at scale, with every recording and its metadata available the moment it’s collected. That’s the bottleneck Voiceform removed. A single researcher, without an engineering team, built a novel benchmark dataset, from survey design to more than 700 fully retrievable interviews, in a fraction of the time the work would otherwise demand, with no need to bring participants on site. The benchmark that “didn’t exist yet” exists because collecting the data underneath it stopped being the hard part.

“We saved a lot of time and money because we didn’t have to bring participants here — we just deployed it seamlessly across the pipeline we had built.” — Alejandro Salinas

Building a benchmark, dataset, or evaluation that needs real human voice data? Voiceform turns large-scale speech collection into infrastructure - capture, controls, and an API to fetch it all.

Alejandro Salinas

NLP Research Fellow at Stanford Lift Lab

"In a matter of hours, we deployed to hundreds of participants and collected over 700 interviews - a dataset at a scale that was simply unprecedented for this kind of benchmark."