How to train your AI
One of the most challenging aspects of introducing AI techniques into an intelligence setting is training. In this blog, I’ll look specifically at one aspect of AI called Natural Language Processing (NLP), as this is an area that can yield great benefits in the intelligence arena.
Knowledge extraction
One of the key benefits of NLP is knowledge extraction. This is where we take written forms of data, commonly called unstructured data, and give it structure so that it can be queried, for example by putting it into a spreadsheet or database or, better still, a knowledge graph. A knowledge graph visually represents related entities (real-world objects), and because the data is now structured, questions can be asked of it in a structured query format. For example, on a County Lines operation a question such as ‘find all people who deliver drugs to Cambridge’ can be asked. This would not be possible if the data were left in an unstructured form such as WhatsApp messages.
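To make that idea concrete, here is a minimal sketch of a knowledge graph held as subject–predicate–object triples and queried for exactly that County Lines question. The names, relations, and helper function are illustrative assumptions, not taken from any real system or case.

```python
# A knowledge graph sketched as subject-predicate-object triples.
# All names and relations are invented for illustration.
triples = [
    ("David Jones", "delivers", "drugs"),
    ("David Jones", "delivers_to", "Cambridge"),
    ("Adam", "delivers", "drugs"),
    ("Adam", "delivers_to", "Cambridge"),
    ("Sara", "delivers", "drugs"),
    ("Sara", "delivers_to", "Norwich"),
]

def people_delivering_to(location):
    """Find everyone with both a 'delivers drugs' edge and a 'delivers_to' edge to the location."""
    drug_deliverers = {s for s, p, o in triples if p == "delivers" and o == "drugs"}
    to_location = {s for s, p, o in triples if p == "delivers_to" and o == location}
    return sorted(drug_deliverers & to_location)

print(people_delivering_to("Cambridge"))  # ['Adam', 'David Jones']
```

Once the chat data is structured like this, the question ‘find all people who deliver drugs to Cambridge’ becomes a simple set intersection; in a production graph store it would be an equivalent graph query.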
Traditionally in intelligence, written forms of data were assessed for inclusion in an investigation by time-consuming manual processing. Readers accessed and read data such as statements, interviews, and intelligence reports, but in a world where data volumes are increasing rapidly, the task can become overwhelming and prone to error.
Helpful NLP
This is where NLP can help. Extracting People, Objects, Locations, and Events (POLE) from any unstructured data is an essential part of the intelligence piece. Let’s consider the following chat downloaded from a suspected dealer’s device.
From David Jones 9:34 06/11/20
Adam and I to deliver the drugs to Cambridge next Tuesday
Here, we have all the elements of POLE.
- People - Adam, David Jones, I
- Object - Drugs
- Locations - Cambridge
- Events - Delivery with a timestamp
Using anaphoric referencing (a word or phrase that links back to another word or phrase used earlier in the same text) we can resolve ‘I’ to ‘David Jones’, and using temporal reasoning we can resolve ‘next Tuesday’ to ‘10th November 2020’. Temporal reasoning combines the exact date and time in the message’s timestamp with the relative offset of ‘next Tuesday’ to arrive at an exact date. As a knowledge graph, this would potentially look like this:

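The ‘next Tuesday’ resolution is straightforward to sketch in code. This sketch assumes the UK date format in the chat timestamp (06/11/20 is 6 November 2020, a Friday); the function name is mine, not from any particular NLP library.

```python
from datetime import datetime, timedelta

def resolve_next_weekday(timestamp, weekday_name):
    """Resolve a relative expression like 'next Tuesday' against a message timestamp."""
    weekdays = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]
    target = weekdays.index(weekday_name.lower())
    days_ahead = (target - timestamp.weekday()) % 7
    if days_ahead == 0:  # 'next Tuesday' said on a Tuesday means a week later
        days_ahead = 7
    return (timestamp + timedelta(days=days_ahead)).date()

# Message timestamp from the example chat: 9:34 on 06/11/20 (6 November 2020)
sent = datetime(2020, 11, 6, 9, 34)
print(resolve_next_weekday(sent, "Tuesday"))  # 2020-11-10
```

Real temporal reasoning has to handle many more expressions (‘tomorrow’, ‘end of the month’, ‘Tues’), but the principle is the same: an anchor timestamp plus a relative offset yields an absolute date.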
This is all very achievable because the English is plain: all the words are easily identifiable and have common English definitions. The collection of rules and machine learning algorithms used here is grouped together into a concept called a ‘model’.
Real-world issues
Meanwhile, in the real world, drugs chat messages will use slang and abbreviations, partly because that is the culture of messaging but also to obfuscate the language and avoid detection. The final difficulty is that people often don’t use real names in chat messages, but nicknames or screen names. Taking all these factors into account, the chat would probably look more like this:
From KrazyKane 9:34 06/11/20
Me n Skilz is dropping Cambztwn Tuesday
This is effectively a sublanguage, and it can also be highly dynamic, which means most traditional NLP built on a rule-based approach would not be able to produce meaningful knowledge from it. This is where more advanced machine learning becomes important.
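As a toy illustration of why a fixed rule set struggles here, consider a hand-coded normalisation pass over the second message. The slang lexicon and the screen-name mappings (for example, ‘Skilz’ resolving to ‘Adam’) are pure assumptions for this example, and the point of the blog is precisely that such a static dictionary would go stale almost immediately as the sublanguage shifts.

```python
# A hypothetical, hand-built lexicon mapping slang tokens to canonical forms.
# In reality this mapping is dynamic and would need to be learned, not hard-coded.
slang_lexicon = {
    "me": "I",
    "n": "and",
    "dropping": "delivering",
    "cambztwn": "Cambridge",
}

# Screen names resolved to identities from other intelligence (assumed for the example).
screen_names = {"krazykane": "David Jones", "skilz": "Adam"}

def normalise(message):
    """Replace known slang and screen names with canonical forms, token by token."""
    out = []
    for token in message.split():
        key = token.lower()
        out.append(screen_names.get(key, slang_lexicon.get(key, token)))
    return " ".join(out)

print(normalise("Me n Skilz is dropping Cambztwn Tuesday"))
# 'I and Adam is delivering Cambridge Tuesday'
```

The normalised text is something a standard model could begin to parse, but the moment the group coins a new term the dictionary fails silently, which is why learned models that generalise from context are needed instead.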
While there are numerous models available in the public domain for processing the first example, our second example is culturally and demographically specific, so we need to start training in this specific area or domain.
Training a model for a new domain would usually involve a large corpus of training data and human expertise in that domain. But as this is a domain using a dynamic sublanguage, a large corpus of training data just isn’t available, and even if it were, it would quickly become outdated. Here the next generation of machine learning comes into play. Advanced techniques such as neural networks, which mimic the way the human brain operates, will play a key part in enabling law enforcement to generate operationally useful insights from dynamic unstructured data.
Shared responsibility
These new advanced trained models can’t be built by industry alone. Law enforcement has the domain knowledge and the data that is required to effectively train machines. So, I propose a partnership to produce these next generation models which should be highly valued and protected in the intelligence community the same way data is currently treated.
One of the ambitions of the National Policing Digital Strategy is the empowerment of the private sector. Many of my Chorus colleagues are ex-law enforcement and the sense of shared responsibility for public safety is stronger than ever. They can see the ‘art of the possible’ using advanced techniques and technology, but we must first be empowered to help and perhaps this is our first training hurdle.