Most AI teams hire an annotation vendor after a demo and a pricing call. The vendor looks professional, the pricing is reasonable, and the timeline sounds fast. Then you get your first batch of labeled data — and realize the annotators have never seen your domain before. They're making judgment calls on edge cases, consistency is all over the place, and your ML engineers are spending two days a week reworking labels instead of training models.

We've seen this play out dozens of times. It's not a vendor quality problem — it's a procurement problem. The questions teams ask during vendor evaluation almost never surface the issues that actually matter.

This guide covers the 12 questions we'd ask before signing with any annotation vendor, organized across the four areas that determine whether an engagement succeeds or fails: domain expertise, quality and accuracy, speed and scale, and integration and operations.

Want the printable version? We turned this framework into a scored 2-page checklist you can use in vendor calls. Download it free here →


Section 1 — Domain Expertise

Domain expertise is the single biggest predictor of ramp speed and first-batch quality. An annotator who has never labeled spatial data, medical imagery, or retail shelf planograms will learn on your data — and your model pays the tuition.

Question 1

Have you annotated data in our specific domain? Can you show blinded examples?

Generic annotation experience does not transfer to specialized domains. You want a vendor who can show you blinded sample outputs from comparable projects — not tell you they "work across all domains."

✓ Strong answer

Provides blinded samples, names comparable engagements, and speaks fluently about domain-specific challenges and edge cases.

✗ Red flag

"We work across all industries." Enthusiasm is not expertise — push for specific examples or walk away.

Question 2

How do you train annotators on a new taxonomy? Walk me through the process step by step.

How a vendor onboards annotators reveals everything about how they handle knowledge transfer — and whether you'll get production-quality data in week one or week six. Strong vendors run structured domain immersion before a single label is delivered.

✓ Strong answer

A defined curriculum: documentation review → sample tasks → accuracy threshold test → go-live. Timeline under one week.

✗ Red flag

"Our annotators are fast learners" or "we iterate to quality." This means your first batches are their training run.

Question 3

What is your process when annotators encounter an ambiguous or edge-case label?

Edge cases are where model performance is won or lost. Inconsistent handling of ambiguous labels is more damaging than outright errors — it's harder to detect and harder to correct in training.

✓ Strong answer

A documented escalation chain: annotator flags → QA reviewer → client consultation → taxonomy update logged for consistency.

✗ Red flag

No documented process. "Annotators use their best judgment" means you'll discover the consequences at model eval time.


Section 2 — Quality & Accuracy

Quality claims are easy to make and hard to verify after the fact. These questions force vendors to be specific about how they measure and guarantee accuracy — before you're locked into a contract.

Question 4

What is your inter-annotator agreement (IAA) rate, and how do you measure it?

Inter-annotator agreement (IAA) measures consistency between annotators labeling the same task. It's the most objective quality signal available before you run a model eval. A strong vendor tracks this continuously, has domain-specific benchmarks, and monitors for drift over long engagements. If they can't define IAA, that tells you everything.

✓ Strong answer

Defines IAA clearly, shares benchmarks (e.g., Cohen's kappa above 0.92 for classification), and monitors it per sprint with drift alerts.

✗ Red flag

Can't define IAA, gives a single lifetime number with no context, or says "we don't track that formally."
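
To make the vendor's IAA number easier to sanity-check, here is a minimal sketch of Cohen's kappa for two annotators labeling the same items. The function name and the sample labels are our own illustration, not any vendor's tooling. Note that kappa is a coefficient that tops out at 1.0, not a percentage, and it is usually lower than raw percent agreement because it discounts agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: estimated from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten items; the annotators disagree on one of them.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]
annotator_2 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "cat", "cat", "dog"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

In this toy example the annotators agree on 9 of 10 items, yet kappa comes out lower than the raw agreement rate. When a vendor quotes an agreement number, ask which of the two they mean.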

Question 5

Is your QA layer separate from your annotation layer? What is the reviewer-to-annotator ratio?

Self-reviewed work is not QA — it's proofreading. You need annotators whose output is audited by a dedicated reviewer before delivery. The reviewer-to-annotator ratio signals how seriously they take it: 1:5 is strong, 1:15+ is a rubber stamp.

✓ Strong answer

Separate QA team, defined reviewer-to-annotator ratio, and per-annotator QA scores shared in delivery reports.

✗ Red flag

"Our annotators review each other's work." Peer review without a dedicated QA layer is not a quality system.

Question 6

What accuracy guarantees are in the contract, and what is the remediation process if you miss them?

Reputable vendors stand behind their output with written SLAs and clear remediation terms. "We always deliver quality" is a signal that accountability is negotiable — push for the specific threshold, measurement method, and rework policy in writing.

✓ Strong answer

Written SLA with specific accuracy threshold (e.g., ≥95%), defined measurement methodology, and free rework within a stated timeframe.

✗ Red flag

Oral commitments only, or accuracy defined as "we'll make it right" without a threshold, timeline, or escalation path in writing.

Midpoint check: are you scoring vendors as you go?

Our free checklist includes a 0–2 scoring rubric for each question and a threshold score to identify high-risk vendors before you sign.

Section 3 — Speed & Scale

Speed-to-first-delivery is the most honest measure of a vendor's operational readiness. These questions surface the flexibility you'll need once your training schedule stops being predictable — which it always does.

Question 7

How quickly can you deploy a team from contract signature to first labeled batch?

Every week your annotation capacity sits idle is a week your model training schedule slips. Best-in-class vendors deliver a first batch within 1–2 weeks, including domain onboarding and QA setup. If the answer is "4–6 weeks to staff up," expect that delay to compound throughout the engagement.

✓ Strong answer

First labeled batch within 1–2 weeks of contract signature, including domain training and pipeline integration.

✗ Red flag

"We'll need 4–6 weeks to staff up." This is almost always understated — build in another 2 weeks for scope changes.

Question 8

Can you scale capacity up or down mid-engagement without renegotiating the contract?

Model training doesn't run on a fixed schedule. You'll hit sprints that need surge capacity and quiet periods where you're waiting on eval results. A vendor locked into fixed headcount costs you money during slow periods and fails you during surges.

✓ Strong answer

Flexible headcount tiers built into the contract. Clear lead time for surge requests — ideally under one week.

✗ Red flag

Fixed team size with change-order requirements. Surge capacity requires a new SOW and another 2–4 week ramp.

Question 9

How do you align labeled data delivery cadence to our model training schedule?

Data delivered at random intervals is operationally useless for an ML team. Weekly sprint cycles tied to your training schedule ensure your engineers always have labeled batches ready when the model needs them — not sitting in a queue while training idles.

✓ Strong answer

Delivery cadence structured around your training schedule from day one. Weekly sprints with agreed batch sizes and QA sign-off before delivery.

✗ Red flag

"We deliver when batches are complete." Ad-hoc delivery means ad-hoc training — your release timeline will drift.


Section 4 — Integration & Operations

The operational relationship determines your day-to-day experience. These questions surface whether you're getting a true partner or a headcount-as-a-service arrangement with no continuity.

Question 10

Which annotation tools and pipelines do you support natively?

Switching your tooling to accommodate a vendor creates migration overhead, version control friction, and workflow disruption. The best vendors integrate into your existing stack — whether that's Labelbox, Scale AI, CVAT, V7, Roboflow, or a custom pipeline — rather than requiring you to adopt theirs.

✓ Strong answer

Native support for your stack, or a documented integration path with a clear timeline and technical point of contact.

✗ Red flag

"We use our proprietary platform." Unless it's demonstrably best-in-class, this is a lock-in play — and a migration risk when you outgrow them.

Question 11

Will we have a dedicated project manager, or are we triaging a shared client success queue?

The PM relationship defines your operational experience. A dedicated PM who knows your taxonomy, your edge cases, and your release schedule becomes an extension of your team. A shared queue means you're re-explaining context on every message and competing for attention with other clients.

✓ Strong answer

Named PM assigned at contract. PM participates in sprint planning and is your single point of escalation — not a rotating support desk.

✗ Red flag

"You'll work with our client success team." A pool of generalists with no domain continuity is a support model, not a partnership.

Question 12

How do you handle data security and confidentiality for proprietary training data?

Your training data is your competitive moat. You need clear, written answers on annotator access controls, data retention and deletion policies, NDA coverage at the individual annotator level, and whether the vendor retains any rights to data processed on your behalf. This is non-negotiable for any IP-sensitive AI team.

✓ Strong answer

SOC 2 compliance or equivalent, NDA covering individual annotator access, defined data retention/deletion policy, and zero data rights retained by the vendor.

✗ Red flag

No formal security policy, annotators access data without individual agreements, or vague assurances without documentation.


How to Score Your Vendor Conversations

After each vendor call, rate each response on a simple 0–2 scale, where 0 is a red-flag answer and 2 is a strong answer.

A vendor scoring below 18 out of 24 carries meaningful execution risk. Any score of 0 on questions 4, 5, or 12 should be a disqualifier — these are the questions where weak answers have direct, measurable consequences for your model.
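
If you track several vendors in a spreadsheet or notebook, the scoring rule above can be sketched in a few lines. This is an illustration under our own naming (`evaluate_vendor` and the example scores are hypothetical); the only rules taken from this guide are the 0–2 scale, the 18-out-of-24 threshold, and the disqualifying questions 4, 5, and 12.

```python
DISQUALIFYING_QUESTIONS = {4, 5, 12}  # a score of 0 on any of these disqualifies
RISK_THRESHOLD = 18                   # totals below this carry execution risk
QUESTION_COUNT = 12


def evaluate_vendor(scores):
    """Summarize one vendor call. `scores` maps question number (1-12) to 0, 1, or 2."""
    expected = set(range(1, QUESTION_COUNT + 1))
    if set(scores) != expected:
        raise ValueError("scores must cover questions 1 through 12 exactly")
    if any(value not in (0, 1, 2) for value in scores.values()):
        raise ValueError("each score must be 0, 1, or 2")
    total = sum(scores.values())
    return {
        "total": total,
        "high_risk": total < RISK_THRESHOLD,
        "disqualified": any(scores[q] == 0 for q in DISQUALIFYING_QUESTIONS),
    }


# Hypothetical vendor: strong everywhere except a red-flag answer on question 5.
vendor_a = {q: 2 for q in range(1, QUESTION_COUNT + 1)}
vendor_a[5] = 0
result_a = evaluate_vendor(vendor_a)
print(result_a)
```

The example shows why the disqualifier exists separately from the total: a vendor can clear the threshold comfortably and still fail on a question where a weak answer directly affects your model.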

See how this plays out in practice. Read our case study on scaling annotation pipelines for a $42M spatial AI company — including what made the difference between annotation as a bottleneck vs. a competitive advantage.


The Bottom Line

The annotation vendor decision is one of the highest-leverage choices your ML team makes. A bad vendor doesn't just cost you money; it costs you model training cycles, delays releases, and sometimes leaves you with data that has to be relabeled from scratch.

The 12 questions above won't guarantee a perfect vendor. But they will surface the red flags that matter, force specificity from vendors who hide behind vague commitments, and give you a defensible framework for comparing options side by side.

If you want the printable version with a scoring column and a threshold guide, download the free checklist here. It takes about five minutes to fill out during or after a vendor call.

Trinovation answers yes to all 12.

We'll walk you through our QA reports, IAA benchmarks, and connect you with a current client in your domain — before you sign anything.
