In this post, I want to explore how we might establish AI attribution frameworks and increase the transparency of provenance.

You might have encountered the term provenance in a gallery heist film or read about missing artworks and forgeries. The more relatable application of provenance is traceability. So before we explore what this means for artificial intelligence, let’s start with a snack and a cuppa.

Food traceability

When you bite that sandwich at lunchtime, do you ever wonder where the food is from? Maybe you pondered the origin of the ingredients as you stood in the supermarket aisle, trying to figure out what to choose. I often wonder how far these products have travelled, and these questions have only increased over my lifetime.

Photo by Brett Jordan on Unsplash

Consumer demands for information about where food is from and what’s in it have increased considerably over the last two decades. The need for more food transparency continues to rise as we pay more attention to our health, sustainable practices, climate and food miles.

A 2020 report from the US-based FMI, The Food Industry Association and Label Insight found that 81% of shoppers think transparency is important or extremely important when shopping online or in-store, a jump of 12 percentage points from 69% in 2018.

The shifting consumer demands in the food industry are an interesting parallel to how we might respond in the coming months to the emergence of AI tools that support our creativity and productivity.

The call for greater transparency about what lies beneath the magic act will only get louder.

Just as I want to trust that the eggs I buy are ethically farmed, with investment in high levels of animal welfare and a commitment to sustainable industry standards, I want the same from the AI bots I collaborate with.

Coffee origins

La Linea – Commonfolk Coffee Co

Another place where provenance and transparency are vital is the $460bn coffee industry.

Coffee is my favourite drink. A family move to Melbourne, Australia, over a decade ago propelled this interest to another level. Rather than complex fancy brews, I enjoy exploring high-quality single-origin beans. I am lucky to have an excellent espresso machine, a home bean grinder, and a great local roastery.

The traceability of coffee beans is essential to the quality and standards roasters strive for. I have enjoyed learning the story behind the products I buy as much as drinking the coffee itself. Information is part of the consumer experience; it helps me make better decisions and understand the context and origin of beans.

I like to investigate further and pin the farm on a map, learn a little about the history of the coffee industry in the country or remind myself about the methods used to get the beans to me.

Commonfolk Roastery in Mornington provides a card of information about their coffee products, giving you a range of technical data (processing method, farm location, species, altitude).

The origin story is shared too. Let me show you what I mean; here is part of the information shared for a bag of Colombian beans called La Linea:

When we visited the Campos and Roa families in May 2022 the patriarch of the family, Don Elias, was particularly excited to show us a new farm that they had recently acquired just minutes down the road from Finca Tamana. The farm is positioned along a north facing ridge that extends out straight along the mountain range pointing in the direction of Huila’s highest peak. Thus it was given the name La Linea – translated, the Line. Rows of healthy Amarillo and Rosa colombia coffee shrubs adorn the finca’s terraces. When we visited the trees were heavily laden with ripe cherries ready for harvest. Elias is meticulous in his approach to coffee farming and production and was excited to implement his own techniques and rigour to the quality coffee already being produced at La Linea.

The cynic might see this as good marketing copy that builds trust in a brand, which I appreciate. But I also appreciate the commitment to transparency, information sharing and the effort to articulate the origin story clearly.

These beans are called single origin, meaning they come from a single geographic location, which allows roasters to tell stories like La Linea’s and lets you conjure the source of the product in your mind.

Some coffee cards from my collection

Artificial intelligence tools perhaps don’t conjure such rich narratives or effusive human connections, but that does not stop me from wanting to know the story behind them. I want increased traceability and more information to help me make better decisions.

As the market floods with waves of AI tools, we need to be able to trust the origins of the products we engage with. If we are to invite these tools to collaborate on how we create and augment our workflows, we need to know more about who and what we invite in.

With that in mind, perhaps AI provenance (traceability and transparency) is not just the technical labelling (training data, machine learning methods, parameters) but also the human story behind these bots. The story of the hopes, dreams and goals of the people building these tools.

A brand is the set of expectations, memories, stories and relationships that, taken together, account for a consumer’s decision to choose one product or service over another.

Seth Godin

Google Model Cards

Google’s Model Cards are a promising information structure that is being prototyped to share more technical details from under the hood of the AI tools.

Despite its potential to transform so much of the way we work and live, machine learning models are often distributed without a clear understanding of how they function. For example, under what conditions does the model perform best and most consistently? Does it have blind spots? If so, where? Traditionally, such questions have been surprisingly difficult to answer.

Each card attempts to provide simple overviews of both models’ ideal forms of input, visualize some of their key limitations, and present basic performance metrics. 

You will have noticed the connection to the coffee cards created by Commonfolk that offer story, origins, provenance and technical specs. There are some initial prototypes of the Model cards on the site, and OpenAI has just published one too.

If you explore OpenAI’s recently released Classifier tool, which attempts to distinguish between human-generated and AI-generated text, you will find a Model Card.

Scroll down to the expandable sections below the text window and look for “Where can I find a Model Card for the classifier?”

The Curse of Knowledge

The complexity of the information that needs to be shared in these Model Cards is a barrier we must overcome to make them a utility for all. I want to be able to use the information to decide whether I will use this tool or not.

An informed decision is not just about the right tool for the job; it is also about alignment with our values and the imperatives of our times.

model cards
Face Detection Model Card – proof of concept.

The authors of the Google Model Cards state they are a reference for all, regardless of expertise, believing that increased transparency for machine learning models can benefit everyone, which is why model cards are aimed at experts and non-experts alike. Yet if the information confuses me or makes the decision more complex, we have an unintended negative side effect.

Of course, the fix is relatively easy and is about accessible communication and descriptions.

If only we had a tool that could quickly rewrite copy and adapt it to different reading levels?!

For example, take the opening paragraph from the OpenAI example Model Card for their classifier and an adapted version at a Grade 5 reading level.

Adapted model card from OpenAI Classifier to Grade 5 reading level

(Prompt for image 2: Rewrite this model card into accessible language, simplify where possible, grade 5 level: [Model card details])

Perhaps Model Cards could include a slider to increase the accessibility of the text and a rollover glossary of terms for non-experts.

Another place we might publish clear provenance is in the tool itself. Perhaps the first responses outline the technical information we need as users.

This seems obvious. However, the tools struggle with understanding what they are, and the accuracy is variable. Here is a good comparison of responses to the same prompt from Claude, a new conversational model released by Anthropic, and OpenAI’s ChatGPT:

Introductory responses to the same prompt from ChatGPT and Claude

Note the transparency around methods, ethical alignment and intentions.

Blogging taught me attribution

Image from the CC 4.0 License

In the early 2000s, I started writing a blog for my Grade 6 class and powered up a professional blog that became a portfolio of my thinking.

When the education community – pre-Twitter – was connecting and learning from each other via self-published blog posts and reflections, there was a collective commitment to attribution.

In those days, if we used an idea or media that did not belong to us, we would ensure we had attributed those works to the original author. We used the Creative Commons (CC) licenses to help guide us. Here is a description of what the CC licenses did for creators:

Creative Commons licenses give everyone from individual creators to large institutions a standardized way to grant the public permission to use their creative work under copyright law. From the reuser’s perspective, the presence of a Creative Commons license on a copyrighted work answers the question, “What can I do with this work?” 

These standard attribution frameworks must be adapted and reintroduced when publishing in collaboration with AI tools.

Of course, it is not a single idea, piece of work, or article we are drawing on when we use generative AI tools. They are pre-trained on a vast dataset. OpenAI’s ChatGPT uses the GPT-3 series, which has been trained on about 45 TB of text data from multiple sources, including Wikipedia and books.

The whole dataset has, in some way, contributed to the content generated by the tool. So how do we signal attribution? At the large language model level?

The legal and ethical ramifications of scraping content without permission in service of creating new content are playing out as we speak. The lack of attribution for code generators like GitHub’s Copilot and the legal challenge to Stable Diffusion from Getty Images are just two examples worth tracking.

Citation is not enough

You might have noticed the academic community adjusting to this idea of authorship; recently, the journal Science and Springer Nature, publisher of Nature, have introduced stricter guidelines.

First, no LLM tool will be accepted as a credited author on a research paper. That is because any attribution of authorship carries with it accountability for the work, and AI tools cannot take such responsibility.

Second, researchers using LLM tools should document this use in the methods or acknowledgements sections. If a paper does not include these sections, the introduction or another appropriate section can be used to document the use of the LLM.

Tools such as ChatGPT threaten transparent science; here are our ground rules for their use – Nature

It is not just about citing. OpenAI has some simple frameworks for citing ChatGPT in more formal writing and when creating images.

The challenge comes from the deeper integration of AI assistive tools into our creative methods, where layers of editing, tweaks, and rewriting occur and the target of citation (in this case the words, sentences, paragraphs published) becomes much fuzzier.

OpenAI’s Sharing and Publication Policy addresses this fuzzier use case with some stock disclosure language:

“The author generated this text in part with GPT-3, OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.”

This aligns with the policy conditions OpenAI outlines: if a creator uses an AI tool to assist and support their process, they need to disclose such use clearly, in a way that no reader could possibly miss and that a typical reader would find sufficiently easy to understand.
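That “no reader could possibly miss” condition could be met mechanically by always placing the stock disclosure at the head of an assisted piece. A minimal sketch in Python (my own illustration, not an OpenAI tool):

```python
# OpenAI's stock disclosure language, quoted from their
# Sharing and Publication Policy.
DISCLOSURE = (
    "The author generated this text in part with GPT-3, OpenAI's "
    "large-scale language-generation model. Upon generating draft language, "
    "the author reviewed, edited, and revised the language to their own "
    "liking and takes ultimate responsibility for the content of this "
    "publication."
)

def with_disclosure(article: str) -> str:
    """Prepend the disclosure so it appears before the article body."""
    return f"{DISCLOSURE}\n\n{article}"
```

A publishing pipeline could run every AI-assisted draft through a step like this, so the disclosure never depends on the author remembering to paste it in.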

The API policy goes on to outline the following conditions for use:

People should not represent API-generated content as being wholly generated by a human or wholly generated by an AI, and it is a human who must take ultimate responsibility for the content being published.

This is buried in a document that has not been updated since the release of ChatGPT and refers to the way other AI tools are plugging into GPT-3 series to generate content. But it is helpful and still applies to many of the tools emerging into the market that are powered in the same way.

The line between what is mine and what is assisted

This type of assistive technology has been with us for a long time. The written word has long been improved by spell checkers and, more recently, tools like Grammarly.

Grammarly’s premium tools also extend beyond spelling and grammar, offering real-time feedback on:

  • Clarity-focused sentence rewrites
  • Tone adjustments
  • Plagiarism detection
  • Word choice
  • Formality level
  • Fluency
My January 2023 stats from a personalised analytics email

Most of these can also be adjusted in alignment with your writing goals, which you can set in Grammarly’s editor.

A key distinction is that a Grammarly checker has never contributed anything substantive to the quality of ideas I have written. It is a narrow technology application and so is not generative like ChatGPT.

Qualitatively speaking, although my grammar, tone and spelling may be improved by an assistive checker, it does not write new ideas or introduce concepts not already present.

Another way of thinking about this quantitatively is to measure alterations and assistance at the character, token or word level.

What I find fascinating is that Grammarly has this data on me and every now and then sends insights about the language it has checked.

With that sort of data, it would not be difficult to report how much of a whole text has been altered, edited or improved. And if that is possible for Grammarly, it would be possible for AI-assisted publishing too.


Assistance and adaptation analytics

Just imagine, at the start of an article, blog post or in the header of student-written work, a set of analytics about the writing: information about the edits and the proportion of AI-generated versus human-generated text and assisted editing.

The location of this data changes the reader’s experience. At the start of an article, it might create too much friction if the proportion of AI-generated text exceeds a reader’s personal limits; at the end of an article, it might pull the rug from under the reader in a ‘gotcha’ style and be discarded. I suppose the aim is to increase trust, not diminish it.
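Such a report could be approximated with standard tooling. As a minimal sketch (my own illustration, not how Grammarly computes its stats), Python’s difflib can estimate what share of a final text differs from the draft:

```python
import difflib

def assisted_proportion(draft: str, final: str) -> float:
    """Estimate the share of the final text that differs from the draft,
    using a character-level similarity ratio (0.0 means identical)."""
    similarity = difflib.SequenceMatcher(None, draft, final).ratio()
    return round(1 - similarity, 2)

# Hypothetical example: a lightly edited sentence.
draft = "The cat sat on the mat."
final = "The cat sat quietly on the woven mat."
share = assisted_proportion(draft, final)  # a small fraction of the text changed
```

A real implementation would work at the token level and track which spans came from a model and which from a human, but even a crude ratio like this would support the kind of header analytics imagined above.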


Grammarly is not a generative AI tool like ChatGPT, but it still uses machine learning and is trained on a large grammatical error correction (GEC) dataset. It also harnesses user inputs to refine the model.

I find it helpful to categorise Grammarly on a spectrum of AI assistive technology that makes a difference to what we create. Under the hood it is not the same as ChatGPT but it uses similar machine learning methods.

Viable signals of truth

I think using AI tools to assist in the creative process will require disclosure and attribution. I don’t want trust to be diminished.

This is how I see this playing out.

  1. Content creation becomes easier and more accessible with AI tool assistance, which leads to…
  2. A vast increase in published content, including content with malicious intent, leads to…
  3. The demands on our ability to verify truth, intent and humanity become stretched, leading to…
  4. Increased demand for provenance, traceability and attribution.

Maybe we are already at this point of diminishing trust, but I can’t see how this will improve. It will only become much harder to answer: is this real?


As a viable step for publishing and writing, mini-model cards that are easy to copy and paste could be generated for each AI tool.

Along with analytics, providing a hyperlinked information card as a standard part of the authorship method makes sense to me.

In mini-model cards, we can bring together the attribution – hat tips to the people and businesses who created the technology – and the provenance of the technology – the story behind the tech, links to ethics frameworks and technical specs.
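As a sketch of what that might look like, here is a hypothetical mini-model card as a small data structure; the field names are my own suggestion, not an established standard:

```python
from dataclasses import dataclass, field

@dataclass
class MiniModelCard:
    """A hypothetical mini-model card combining attribution and provenance."""
    tool: str
    organisation: str
    people: list[str]                      # hat tips: founders, board, builders
    ethics_links: list[str] = field(default_factory=list)  # charters, policies
    technical_specs: str = ""              # model family, training data notes
    use_in_this_work: str = ""             # how the tool shaped this piece

    def to_text(self) -> str:
        """Render a copy-and-paste block for the foot of an article."""
        return "\n".join([
            f"Tool: {self.tool} ({self.organisation})",
            f"People: {', '.join(self.people)}",
            f"Ethics: {', '.join(self.ethics_links) or 'None listed'}",
            f"Technical: {self.technical_specs}",
            f"Use: {self.use_in_this_work}",
        ])

# Example: the OpenAI entry from the table at the end of this article.
card = MiniModelCard(
    tool="ChatGPT",
    organisation="OpenAI",
    people=["Greg Brockman", "Ilya Sutskever", "Sam Altman"],
    ethics_links=["OpenAI Charter", "Sharing and Publication Policy"],
    technical_specs="Fine-tuned from a model in the GPT-3.5 series",
    use_in_this_work="Used to expand and explore initial ideas",
)
```

Rendered as text or hyperlinked HTML, a card like this could travel with the article the way Commonfolk’s coffee cards travel with the beans.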

I am doubtful this type of effortful attribution will gain traction; the productivity gains from AI assistance could be quickly lost from fudging around with the correct attribution.

But creators need something that communicates their commitment not to diminish trust.

Remember, the purpose of sharing provenance and attribution is to inform the reader (consumer) and help them make better decisions. When reading an article or student work, those decisions concern how we engage with the ideas expressed and the quality of relationships it impacts.

AI Mini Model Cards

Attribution and provenance of the AI tools used to create this article.

I partially generated this text with GPT-3.5, OpenAI’s large-scale language-generation model and other API tools. Upon generating the draft language, I reviewed, edited, and revised the language in part with Grammarly to my liking and take ultimate responsibility for the content of this publication.

Which AI tools were used to assist this article?

Grammarly
  • Organisation: founders Max Lytvyn, Alex Shevchenko and Dmytro Lider. Company info.
  • Ethics: nothing listed; the closest is the Trust Center and Security pages.
  • Technical: machine learning and natural language processing; user data and a large training corpus (UA-GEC 2.0).
  • Use: spelling, grammar and style checker. Edited all text using the Desktop or Editor tool.

Mem.ai
  • Organisation: founders Kevin Moody and Dennis Xu. Company info. Funded by the OpenAI startup fund.
  • Ethics: none listed on site; inherited, implied via the OpenAI Charter and Pinecone use?
  • Technical: OpenAI GPT-3 embeddings and Pinecone vector search. Mem-x info and limitations.
  • Use: notes and self-organising workspace. Smart write feature to expand and edit.

OpenAI
  • Organisation: board of Greg Brockman (Chairman & President), Ilya Sutskever (Chief Scientist) and Sam Altman (CEO), plus non-employees. Company info.
  • Ethics: OpenAI Charter; Sharing and Publication Policy.
  • Technical: fine-tuned from a model in the GPT-3.5 series (info here). Methods and limitations.
  • Use: used to expand and explore initial ideas, sequence the opening sections and expand notes.