The Data Hygiene Balance

Guest Blog: Shawn Curran, Head of Legal Technology at Travers Smith LLP; and Eleanor Hobson, Associate at Travers Smith LLP. #AIweek2021

We already hear that data is more valuable than oil, but how do you protect the value of that data and of any models that you might be training with it? If you're using data to train AI models, you need to know where it came from, what rights you had (and still have) to use it, and when you need to delete it.

The Travers Smith Legal Technology team has been working on how to extract and maintain the value of our own training datasets for building machine learning and rules-based AI tools, for example to use when reviewing and analysing contracts. As part of this, we want to democratise access to AI, and we are the first and only company in the world to have open sourced a data labelling and structuring platform, enabling businesses to prepare and pre-process their own AI training datasets. The platform, Etatonna, has now been shared with around 12 organisations. We've picked up some learnings along the way. We haven't managed to squeeze them all in here, but we're giving it a good go. (So we can keep this short, we'll refer to machine learning and deep learning algorithms here as AI.)


  1. Poor data hygiene could = AI model deletion

Regulators are getting interested in the training data used to produce AI models. Once training data makes it into your model, you might be tempted, or indeed required, to delete it. However, if you're still using the model, deleting the training data doesn't end the story. In fact, failing to retain some information about historic training data might land you in even hotter water if you can't evidence that you were entitled to use it. In addition, if you're using personal data and someone opts out of processing or withdraws their consent, you'll still want the remaining data so that you can retrain the model without theirs. If you're putting all your training data into your models without maintaining proper data hygiene, you run the risk that at some point some of that training data comes unstuck, and then you're stuck with a model you can't use.

In January 2021, the US Federal Trade Commission took enforcement action against Everalbum, which had used users' photos to train its facial recognition algorithms without the users' consent. Everalbum was required to delete not only the data but also the AI models trained on it.

There's also a possibility that poor data hygiene leaves an opening for competitor disruption. Competitors could query others' datasets. If everyone has poor hygiene, this might be a case of mutually assured destruction (MAD), but as larger companies invest in data hygiene and audit, and AI suppliers try to subsume their clients' market share, companies that lack auditability and traceability of their AI models may expose themselves to unnecessary risk of their IP being challenged. An ability to replay model creation after datasets (particularly personal datasets) are removed will be critical for the AI industry.
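To make the idea of replaying model creation a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how Etatonna or any particular platform works: the `TrainingRecord` fields, the `dataset_fingerprint` and `replay_without` helpers and the sample records are all hypothetical. Each record carries its source and legal basis, and the dataset is hashed so that a model version can be tied to the exact data it was trained on.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingRecord:
    # All field names are hypothetical; adapt them to your own audit needs.
    record_id: str
    subject_id: str   # the person or client the record relates to
    source: str       # where the data came from
    legal_basis: str  # e.g. "consent", "contract", "licence"
    text: str


def dataset_fingerprint(records):
    """Hash the records in a stable order so a model can be tied to its exact inputs."""
    ordered = sorted(records, key=lambda r: r.record_id)
    payload = json.dumps([asdict(r) for r in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replay_without(records, withdrawn_subjects):
    """Return the training set with every record for a withdrawn subject removed."""
    return [r for r in records if r.subject_id not in withdrawn_subjects]


records = [
    TrainingRecord("r1", "alice", "client upload", "consent", "clause A"),
    TrainingRecord("r2", "bob", "public filing", "licence", "clause B"),
    TrainingRecord("r3", "alice", "client upload", "consent", "clause C"),
]

original_fp = dataset_fingerprint(records)

# "alice" withdraws consent: rebuild the training set and fingerprint it again.
remaining = replay_without(records, {"alice"})
replay_fp = dataset_fingerprint(remaining)

# An audit entry like this could be stored alongside each retrained model.
audit_entry = {
    "previous_dataset": original_fp,
    "current_dataset": replay_fp,
    "removed_subjects": ["alice"],
    "records_used": [r.record_id for r in remaining],
}
print(json.dumps(audit_entry, indent=2))
```

Storing the fingerprint rather than the data itself is one way to keep evidence of what a model was trained on, although the fingerprint alone cannot rebuild the model, so the remaining records still need to be retained for as long as retraining may be required.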

  2. Balancing your data protection obligations

Balanced against the need for auditability (and maintaining lots of information about our datasets) is retention: storing large volumes of training data so that you can retrain AI models attracts its own risks. If you're using personal data to train your models, you have an obligation to delete it as soon as it is no longer required and, if relying on "consent", you need to keep an eye on the scope of that consent and if/when it might expire. Where this personal data gets more "sensitive" (e.g. special category data or financial information), you'll need to take extra steps to keep it secure. Therefore, if you keep your datasets around to disassemble and retrain (as above), you need to balance those benefits carefully against the risks of accumulating data that you no longer need or that creates a cyber risk for your business.
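One simple way to keep an eye on expiry is a periodic retention sweep. The Python sketch below is purely illustrative: the record layout (`consent_expires`, `still_needed`) and the `split_by_retention` helper are assumptions made for the example, and real retention rules will depend on your own legal basis and policies.

```python
from datetime import date


def split_by_retention(records, today):
    """Separate records that may be kept from those that are due for deletion.

    A record is due for deletion if its consent has expired or if it is
    no longer needed for retraining. Both fields are hypothetical.
    """
    keep, delete = [], []
    for rec in records:
        expires = rec["consent_expires"]
        expired = expires is not None and expires < today
        if expired or not rec["still_needed"]:
            delete.append(rec)
        else:
            keep.append(rec)
    return keep, delete


records = [
    {"id": "a", "consent_expires": date(2030, 1, 1), "still_needed": True},
    {"id": "b", "consent_expires": date(2020, 6, 30), "still_needed": True},
    {"id": "c", "consent_expires": None, "still_needed": False},
]

keep, delete = split_by_retention(records, today=date(2021, 7, 1))
print("keep:", [r["id"] for r in keep])
print("delete:", [r["id"] for r in delete])
```

Here record "b" is flagged because its consent has lapsed and record "c" because it is no longer needed, which leaves only "a" available for retraining.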

  3. Sharing AI models, reverse engineering and brute force attacks

AI software companies are already providing their models to customers, for example, in our industry, to carry out contract reviews. This gives enormous potential to get a better "market" model, using data from across businesses and industries. However, it does leave open the risk of competitors or malicious actors reverse engineering the model, or carrying out a brute force attack, to reveal the original training data. Reverse engineering proofs are still relatively nascent, but to stay ahead of the curve, potential solutions we've been looking at include (a) using synthetic training datasets, in whole or in part, to replace or supplement our existing data and (b) anonymising or pseudonymising our data and keeping any "key" outside the model, thereby removing any unique identifiers. If you are launching a product, there are also layers of technology that can sit above the AI model to help detect and block brute force attempts. However, where businesses share AI models, the sharing arrangement will need to build in protections to prevent or deter reverse engineering.
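As a rough illustration of option (b), the Python sketch below replaces known identifiers with keyed pseudonyms before the text goes anywhere near training. It is a sketch under assumptions rather than a recommended implementation: the `pseudonym` and `pseudonymise` helpers, the `PARTY_` token format and the example key are all hypothetical, and in practice the key and the lookup table would live in a secrets store that is kept apart from the training data and the model.

```python
import hashlib
import hmac


def pseudonym(identifier, key):
    """Derive a stable token from an identifier using a secret key (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PARTY_" + digest[:12]


def pseudonymise(text, identifiers, key):
    """Replace each known identifier in the text with its pseudonym.

    Returns the cleaned text plus a lookup table (token -> original) that
    should be stored separately from the training data and the model.
    """
    lookup = {}
    # Replace longer identifiers first so a shorter name nested inside a
    # longer one does not break up the longer match.
    for ident in sorted(identifiers, key=len, reverse=True):
        token = pseudonym(ident, key)
        text = text.replace(ident, token)
        lookup[token] = ident
    return text, lookup


# Hypothetical key: in practice, load it from a secrets store, not from source code.
SECRET_KEY = b"example-key-kept-outside-the-model"

clause = "Acme Ltd shall pay Acme Ltd Holdings within 30 days."
cleaned, lookup = pseudonymise(clause, ["Acme Ltd", "Acme Ltd Holdings"], SECRET_KEY)
print(cleaned)
```

Because the tokens are derived with a key, the same party maps to the same token across documents, which preserves patterns for training, while anyone holding only the model or the cleaned data cannot reverse the tokens without the key or the lookup table.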

  4. ExplAInability

We could spend the whole briefing on this, but building in a mechanism by which you can explain your model, whether by looking under the hood or providing counterfactuals, is important from the get-go. (Working with lawyers every day, we have to be able to explain how and why we ended up with our answers.) We have found that, sometimes, once we've built a machine learning model, we can use that model to revert to a rules-based version. For example, in an NLP context, the point of labelling data is simply to teach an algorithm the language patterns that the human eye might miss across hundreds of thousands of documents of a similar kind. Once those patterns have been detected, and the model has strong signals on specific words, it might be more sensible for an organisation to convert that "explainability" into a simple rules-based system. In doing so, we avoid having to retain the source labelled data or the original AI model. We think of this approach as training data > model > output > explAInability > key words and phrases > convert to rules system > delete model > delete source data > fully explainable!
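The "key words and phrases > convert to rules system" step can be sketched in a few lines of Python. Everything here is assumed for illustration: the `token_weights` values stand in for the signals a trained linear text classifier might expose, and the `extract_rules` and `build_rule_classifier` helpers and the threshold of 1.5 are hypothetical choices, not our production approach.

```python
import re

# Hypothetical per-term weights, as might be read off a trained linear
# classifier for "is this an indemnity clause?". Values are made up.
token_weights = {
    "indemnify": 2.4,
    "indemnity": 2.1,
    "hold harmless": 1.8,
    "party": 0.1,
    "notice": -0.3,
}


def extract_rules(weights, threshold):
    """Keep only the terms on which the model has a strong positive signal."""
    return sorted(term for term, weight in weights.items() if weight >= threshold)


def build_rule_classifier(terms):
    """Turn the extracted terms into a plain keyword rule (no model required)."""
    if not terms:
        return lambda text: False
    pattern = re.compile(
        "|".join(r"\b" + re.escape(term) + r"\b" for term in terms),
        re.IGNORECASE,
    )
    return lambda text: pattern.search(text) is not None


rules = extract_rules(token_weights, threshold=1.5)
is_indemnity_clause = build_rule_classifier(rules)

print(rules)
print(is_indemnity_clause("The Supplier shall indemnify the Customer."))
print(is_indemnity_clause("Notice must be given to each party."))
```

Once the rule set reproduces the model's answers well enough on held-out examples, the rules themselves become the explanation, and the model and its labelled training data no longer need to be retained.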

  5. Bias, fairness, justifiability and ethics

…and let's not forget that it isn't always enough to be able to explain how the AI model got its answer: we also need to be thinking about discrimination, bias and fairness! Again, this is one we could talk about for hours, but we appreciate this topic will be explored more in other techUK sessions today. Although our industry means that we don't tend to have personal data in our datasets, we are always thinking about whether we have enough diverse data, and whether we've accidentally excluded any situations or included any inappropriate labels.

  6. Power over datasets – competition, intermediaries, open source and access

The elephant in the room is that whoever owns the data will be able to train and retrain AI models, and let's not forget the (predominantly red, white and blue) elephants. We believe that it is vital that everyone has a seat at the table and an opportunity to create their own training data. That's why we open sourced Etatonna, so every business could build its own datasets. However, we also see a role for intermediaries in allowing other businesses and individuals access to data, building trust and reducing barriers to entry. Furthermore, "model deployment" is key here, as models will need to be plugged into a variety of systems to operate against specific use cases or workflows. We expect there might be some kind of model standardisation that would support this, whether the intermediary is simply a model store or a specific product that also acts as a model store.

  7. Skills gap and inequality

The Alan Turing Institute's report on the Gender AI Gap in March 2021 showed just how far the divergence has grown since the days when women made up the majority of NASA's "computers" (see Hidden Figures if you're missing us here). However, we're not just looking at a gender gap here: we're looking at a global economic divergence between those with the skills and resources to use AI and those without. We have a responsibility to be training up those whose jobs might be lost in the AI revolution and making sure that we're not automatically discarding whole populations along the way. We're tackling this through an internal technology training course that is open to everyone across the firm, including specific modules on machine learning and AI, and by having an open door policy on being involved in Etatonna's development and datasets.





