Battleground 1

AI Definitions

Nobody raises their voice over definitions. That is what makes this battleground dangerous. You lose it in silence at the top of the agreement, and the loss cascades into ownership, training rights, and data exit.

Legacy SaaS definitions cover what you upload: documents, files, data sent to the platform. That worked when the product was a database with a UI. AI products are different. Your input passes through a pipeline — it becomes a prompt, gets converted into a mathematical representation, lands in a retrieval index, gets processed by a model, generates an output, and leaves a trail of logs at every step. Each intermediate artifact carries your information. Some of it is more revealing than the original.
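To make the artifact trail concrete, here is a minimal Python sketch of the pipeline described above. Everything in it is hypothetical: `Artifact`, `toy_embedding`, and `run_pipeline` are illustrative names, and the hash-based vector is only a stand-in for a real embedding model. It also simplifies by writing one log record rather than one per step. The point is only that a single customer input leaves several distinct artifacts behind, each traceable to the same source.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Artifact:
    kind: str        # pipeline stage that produced the artifact
    source_id: str   # customer input the artifact traces back to
    payload: object  # the artifact content itself


def toy_embedding(text: str, dims: int = 4) -> list:
    """Stand-in for an embedding model (assumption): hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


def run_pipeline(source_id: str, text: str) -> list:
    """Return every artifact that one customer input leaves behind."""
    prompt = f"Answer using this context: {text}"
    vector = toy_embedding(text)
    index_entry = {"vector": vector, "chunk": text}
    output = f"[model response grounded in {len(text)} characters of context]"
    # Simplification: a single log record instead of one per pipeline step.
    log = {"prompt_chars": len(prompt), "output_chars": len(output)}
    stages = [
        ("prompt", prompt),
        ("embedding", vector),
        ("retrieval_index_entry", index_entry),
        ("output", output),
        ("log", log),
    ]
    return [Artifact(kind, source_id, payload) for kind, payload in stages]


for artifact in run_pipeline("doc-1", "Q3 pricing strategy"):
    print(artifact.kind, "<-", artifact.source_id)
```

In this sketch the index entry stores the original text chunk alongside its vector, which is one reason an index can be as revealing as the upload itself.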

The tension is simple. Vendors define Customer Data narrowly — the raw text you type and the final response you receive — and claim everything generated in between. Buyers want the whole pipeline: every derivative, every representation, every log that captures their content's statistical DNA.

Three flashpoints dominate:

  • Derivatives. When your document becomes a high-dimensional vector, that vector is a compression of everything that made the text valuable: semantic relationships, domain terminology, confidential patterns. If the vendor owns derivatives, they own a machine-readable version of your trade secrets.
  • Retrieval indices. The index is your knowledge base in searchable, operational form. If the contract does not define it as Customer Data, the vendor keeps it after termination.
  • Logs. Every prompt, response, and latency metric maps your organization's strategic priorities. Vendors classify logs as 'Usage Data' and claim broad rights.

The standard vendor line, 'Customer Data means data, information, and content uploaded or provided by Customer to the Services', draws a clean boundary: if you uploaded it, it's yours; if the system generated it, it's not. Everything valuable in an AI deployment sits on the wrong side of that boundary. The fix: define Customer Data by function, not by artifact type. A non-exhaustive list handles what today's technology produces; the functional catch-all handles what tomorrow's produces.
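The difference between a type-based and a function-based definition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not contract logic: the `Artifact` fields, the `ENUMERATED_TYPES` list, and the `attention_cache` artifact are all hypothetical names chosen to show how a catch-all reaches an artifact type nobody listed.

```python
from dataclasses import dataclass

# Non-exhaustive list: artifact types the contract names today (illustrative).
ENUMERATED_TYPES = {
    "upload",
    "embedding",
    "inference_log",
    "retrieval_index",
    "fine_tuning_data",
}


@dataclass(frozen=True)
class Artifact:
    kind: str
    provided_by_customer: bool        # uploaded or supplied on the customer's behalf
    derived_from_customer_data: bool  # incorporates, derives from, or is based on it


def covered_by_list(artifact: Artifact) -> bool:
    """Type-based test: covered only if the artifact type was named."""
    return artifact.kind in ENUMERATED_TYPES


def covered_by_function(artifact: Artifact) -> bool:
    """Function-based test: covered by provenance, whatever the type is called."""
    return artifact.provided_by_customer or artifact.derived_from_customer_data


def is_customer_data(artifact: Artifact) -> bool:
    """List plus catch-all, mirroring the two-part drafting approach."""
    return covered_by_list(artifact) or covered_by_function(artifact)


# A hypothetical future artifact type that no list anticipated.
future = Artifact("attention_cache", provided_by_customer=False,
                  derived_from_customer_data=True)
# A vendor metric with no link to customer content.
uptime = Artifact("uptime_metric", provided_by_customer=False,
                  derived_from_customer_data=False)

print(covered_by_list(future), is_customer_data(future))  # False True
print(is_customer_data(uptime))                           # False
```

The list alone misses the new artifact; the functional test catches it while still leaving the vendor's own operational metrics outside the definition.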

The nine definitional terms that do work in every agreement:

  • AI Feature / AI System: the umbrella trigger; every downstream obligation attaches here.
  • AI Customer Input: prompts, documents, context, and fine-tuning data; what the user types and what the system ingests on the customer's behalf.
  • AI Customer Output: content, predictions, classifications, or recommendations the AI generates in response to Input.
  • Data Derivatives: embeddings, inference logs, retrieval indices, and fine-tuned weights, plus the functional catch-all.
  • Usage Data: anonymous, aggregated metrics; the acceptable carve-out, but only where no specific customer can be identified.
  • Aggregated Statistics: pooled across customers and anonymized, but 'aggregated' is often a training pipeline wearing a metrics label.
  • Training Data: explicitly excludes Customer Data, AI Customer Input, AI Customer Output, and Data Derivatives.
  • Provider AI Policies: the vendor's documented governance, referenced in the MSA to keep it current.
  • General-Purpose AI Model Provider: the third party behind the API; the flowdown target for every no-training promise.

The negotiation sits in the 'hard middle'. The vendor's position is that embeddings are their technology because they built the embedding model and run the vector database. The buyer's counter is that the embeddings encode the buyer's content and represent the buyer's knowledge base, so if the vendor can repurpose them after termination, the buyer's operational memory is being retained without consent. The resolution, where most enterprise deals land, is that derivatives are Customer Data, but the vendor is not required to export them in a usable format. The buyer gets a deletion obligation and a certificate; the vendor keeps the vector database architecture.

Vendor View

"Our embeddings are our technology. We built the embedding model. We run the vector database. If we have to segregate every customer's embeddings, we cannot offer a multi-tenant architecture. The derivatives are a byproduct of our technology applied to your data — not your data."

Buyer View

"The embeddings encode our content. The retrieval index is our knowledge base. If the vendor can repurpose our embeddings after termination — to train their next model, to benchmark against other customers, to sell as aggregated insights — then the no-training promise means nothing. We need derivatives in the Customer Data definition."

Red Flags

  • Customer Data defined solely as 'documents, files, and data uploaded to the platform' — pre-2022 language that predates AI architecture
  • No mention of embeddings, inference logs, or retrieval indices in the Customer Data definition
  • Usage Data defined broadly enough to include customer-identifiable logs or prompt-level data
  • 'Aggregated Statistics' defined without an explicit exclusion of Customer Data or derivatives
  • Training Data definition that does not explicitly exclude Customer Data in all its forms
  • A no-training clause covering only the vendor's own models but not model provider training pipelines
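
The checklist above can be turned into a simple review aid. The following Python sketch assumes a human reviewer has already read the draft and answered six yes/no questions; `DefinitionsReview` and `red_flags` are hypothetical names, and nothing here parses contract text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefinitionsReview:
    """A reviewer's answers about a draft's definitions (hypothetical schema)."""
    customer_data_limited_to_uploads: bool
    customer_data_names_derivatives: bool
    usage_data_reaches_identifiable_logs: bool
    aggregated_stats_exclude_customer_data: bool
    training_data_excludes_customer_data: bool
    no_training_covers_model_providers: bool


def red_flags(review: DefinitionsReview) -> list:
    """Return one message per red flag the reviewer's answers trigger."""
    checks = [
        (review.customer_data_limited_to_uploads,
         "Customer Data is limited to uploaded documents, files, and data"),
        (not review.customer_data_names_derivatives,
         "Customer Data omits embeddings, inference logs, and retrieval indices"),
        (review.usage_data_reaches_identifiable_logs,
         "Usage Data can include customer-identifiable or prompt-level logs"),
        (not review.aggregated_stats_exclude_customer_data,
         "Aggregated Statistics do not exclude Customer Data or derivatives"),
        (not review.training_data_excludes_customer_data,
         "Training Data does not exclude Customer Data in all its forms"),
        (not review.no_training_covers_model_providers,
         "No-training clause does not reach model provider pipelines"),
    ]
    return [message for triggered, message in checks if triggered]


# Example: a draft on unmodified vendor paper (illustrative answers only).
vendor_paper = DefinitionsReview(True, False, True, False, False, False)
for flag in red_flags(vendor_paper):
    print("RED FLAG:", flag)
```

Keeping each flag as a separate answer makes it easy to record which definitions were fixed in a given redline round and which remain open.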

Sample Clause (Illustrative)

"Customer Data means all data, information, and content provided by or on behalf of Customer to the Services, and all derivatives thereof, including any data generated by the Services that incorporates, derives from, or is based on Customer Data. Without limiting the foregoing, Customer Data includes vector embeddings, inference logs, retrieval indices, fine-tuning data, and any other intermediate or derivative data representations. For clarity, Customer Data excludes Usage Data, which means anonymous, aggregated data that cannot be used to identify Customer or reconstruct the substance of Customer Data. Provider shall not combine or commingle Customer Data with data of Provider or any third party, including Training Data, and shall logically segregate and isolate Customer Data from all such Provider and third-party data. 'Training Data' means all data used to train, retrain, fine-tune, validate, or otherwise improve any machine learning models, and Training Data expressly excludes Customer Data in all forms."

This clause closes the three definitional gaps (derivatives, Usage Data as a backdoor, and training data exclusion) in a single architecture. The functional catch-all ('incorporates, derives from, or is based on') covers future artifacts. The non-exhaustive list covers today's. The segregation commitment converts a definition into an engineering obligation. The Training Data exclusion creates a clean, bidirectional boundary: Customer Data is never Training Data, and Training Data is never Customer Data. This is the strongest customer-favorable architecture. If the vendor cannot segregate (because embeddings are stored in a shared vector database), the fallback is to drop the segregation clause and substitute a binding deletion and non-reuse certification.

Illustrative only. Clause language requires adaptation to your jurisdiction, deal context, and risk profile. Not legal advice.

Drawn from the Enterprise AI MSA Playbook (June 2026) by Laith Sarhan, Sarhan Data Law. Educational content only — not legal advice.