Why fraud prevention without Natural Language Processing is missing a trick

Machine learning for fraud detection and prevention is accurate, efficient and fast. But most models can only handle numeric inputs. Ravelin Machine Learning Engineer, Shayan Sadeghieh, and Data Scientist, Antons Tocilins-Ruberts, explain the importance of NLP and text signals in catching fraud.

29 November 2022

Machine learning systems are perfect for the dynamic and fast-paced nature of fraud. They can make real-time decisions and assess customer behavior as it happens. All you have to do is feed the model as much customer and transaction data as possible.

The only snag is that most machine learning models can only work with numbers, such as order value or size. But, as we know, fraud signals don't always appear in numeric form. Text information, like delivery notes or item descriptions, can be a key indicator of fraud.

So if your model is trained only on numeric features, you’re missing out on valuable information. How can you ensure that your machine learning model catches all fraudulent behavior?

How does a machine learning model learn to spot fraud?

A machine learning model is able to spot strange customer activity, which it then automatically blocks or flags for analyst review. But how does it know what "normal" customer behavior is to begin with?

Machine learning models go through training cycles. During these cycles, the model is fed examples of what genuine and fraudulent behavior looks like for your customers.

The more examples it receives, the better it becomes at telling the difference and, ultimately, at making accurate predictions.

When your customer is completing their checkout, we calculate thousands of features about that customer. These features can be broken down into: identity, orders, payment information, location and network.

This information is fed into your model to produce a risk score on a scale of 1 to 100. The higher the score, the higher the probability of fraud. This is extremely effective – but only to a certain point.
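
To make the idea of a risk score concrete, here is a minimal sketch of how a handful of numeric features could be turned into a score between 1 and 100. It is only an illustration: the feature names, the weights, the logistic model and the linear mapping from probability to score are all assumptions made for this example, not a description of Ravelin's production models.

```python
import math

# Hypothetical weights for three illustrative features. A real model learns
# its weights during training and uses far more features than this.
WEIGHTS = {"order_value": 0.004, "account_age_days": -0.01, "failed_payments": 0.8}
BIAS = -2.0


def fraud_probability(features: dict[str, float]) -> float:
    """Logistic model: a weighted sum of features squashed into (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def risk_score(probability: float) -> int:
    """Map a probability in [0, 1] onto an integer score from 1 to 100.

    The linear mapping is an assumption chosen for this example.
    """
    return 1 + round(probability * 99)


trusted = {"order_value": 30.0, "account_age_days": 400.0, "failed_payments": 0.0}
risky = {"order_value": 900.0, "account_age_days": 1.0, "failed_payments": 3.0}

print(risk_score(fraud_probability(trusted)))
print(risk_score(fraud_probability(risky)))
```

The new account with a large order and several failed payments scores much higher than the long-standing customer with a small order. Notice that every input here is a number, which is exactly the limitation discussed next.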

Why are text signals so important?

Let’s use the example of a large online marketplace. Marketplaces allow customers and sellers to create custom names and descriptions, so they have a lot of text data. And these free-text fields carry a lot of unique fraud signals.

First of all, some items are just more likely to be fraudulent because they're popular and high value. So the item name is a valuable indicator.

Secondly, specific features listed in the item’s name or description can raise flags. For instance, if the description for an iPhone says that it’s jailbroken.

Finally, the overall quality of text can often point to fraudulent suppliers. Typos, short sentences, suspicious links... All of these could suggest fraud.
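
Some of these quality signals can be approximated with very simple code. The sketch below counts links and measures sentence length in a free-text description. The function name, the regular expressions and the choice of features are assumptions for illustration. Typo detection is left out because it would need a dictionary or spellchecker.

```python
import re

# Rough pattern for links; real-world URL detection is more involved.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def text_quality_features(description: str) -> dict[str, float]:
    """Turn a free-text description into a few simple numeric quality signals."""
    links = URL_PATTERN.findall(description)
    # Strip links first so the dots inside URLs are not read as sentence ends.
    text_without_links = URL_PATTERN.sub(" ", description)
    sentences = [s for s in re.split(r"[.!?]+", text_without_links) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text_without_links)
    avg_words = len(words) / len(sentences) if sentences else 0.0
    return {
        "link_count": float(len(links)),
        "word_count": float(len(words)),
        "avg_words_per_sentence": avg_words,
    }


print(text_quality_features("Cheap phone. Buy now at http://example.test/deal !"))
```

Hand-written features like these are easy to explain, but they only capture what someone thought to count. That is the gap the NLP approach described below is meant to close.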

We want to ensure that fraudulent behavior in all forms is picked up by your machine learning model. And overlooking text data limits its performance and capacity to do so. We need to be able to feed all of this data into the machine learning model during training and in a production environment. But how?

How does Natural Language Processing apply to fraud prevention?

The solution, and the challenge, is converting these text signals into numeric form. This is where Natural Language Processing (NLP) comes into play.

NLP is a branch of artificial intelligence that works to give computers the ability to understand written text or spoken language. In our case, we send text fields to an NLP model during the feature extraction process. The NLP model returns numbers that represent those text fields.

Those numerically encoded text features can then be fed into your card-not-present (CNP) fraud model along with other features to get a recommendation. The process is illustrated below.

[Figure: feature extraction without the NLP model]

[Figure: feature extraction with the NLP model]
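
The flow with the NLP model can be sketched in a few lines. In this example a simple feature-hashing encoder stands in for the NLP model. This is a deliberately basic substitute, not the embedding model described in the next section, and the function names and vector size are assumptions. The point is only to show text becoming numbers and joining the other features.

```python
import hashlib


def hash_text(text: str, dimensions: int = 8) -> list[float]:
    """Feature hashing: each token increments one bucket of a fixed-length vector."""
    vector = [0.0] * dimensions
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    return vector


def build_feature_vector(numeric_features: list[float], text: str) -> list[float]:
    """Append the numerically encoded text to the existing numeric features."""
    return numeric_features + hash_text(text)


# Hypothetical order: value and item count, plus a free-text item name.
vector = build_feature_vector([120.0, 2.0], "jailbroken iphone 13")
print(len(vector), vector)
```

The downstream model sees one flat list of numbers, so it does not need to know which entries came from text. Unlike hashing, a learned embedding also tries to preserve meaning, which is what the next section covers.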

How does Ravelin’s NLP model work?

Under the hood, our NLP models use state-of-the-art embedding techniques. Word embeddings are number representations of text that encode the meaning of the words.

They do this by placing words with similar meanings close together and dissimilar words further apart. This allows us to take things like context and word ordering into consideration, which are factors a simple model might miss.
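
"Closer together" is usually measured with a similarity score between vectors. Here is a minimal sketch using cosine similarity. The three-dimensional vectors are made up by hand for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings (hypothetical values, not learned).
EMBEDDINGS = {
    "iphone": [0.9, 0.8, 0.1],
    "smartphone": [0.85, 0.75, 0.2],
    "socks": [0.1, 0.2, 0.9],
}

print(cosine_similarity(EMBEDDINGS["iphone"], EMBEDDINGS["smartphone"]))
print(cosine_similarity(EMBEDDINGS["iphone"], EMBEDDINGS["socks"]))
```

With these toy values, "iphone" scores far higher against "smartphone" than against "socks", which is the behavior a trained embedding is expected to show for related words.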

Let’s imagine that our NLP model has learned that the two most important features of an item are its price and the popularity of the item category. The item embeddings might look something like this:

Using this two-dimensional embedding method, we’re able to easily separate the most fraudulent items. In this case, iPhones and sneakers.

Of course, real-life embeddings are more complicated and have higher dimensionality. But the motivation is the same. Using these embeddings we’re able to meaningfully encode text and then use it in our models.
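
The two-dimensional example above can be sketched as follows. Each item gets a (price, category popularity) pair scaled between 0 and 1. All of the coordinates, the extra item names and the threshold are invented for illustration. Only the outcome, that iPhones and sneakers stand out, follows the example in the text.

```python
# Hypothetical 2D embeddings: (price, category popularity), both scaled to [0, 1].
ITEM_EMBEDDINGS = {
    "iphone": (0.95, 0.90),
    "sneakers": (0.70, 0.85),
    "designer lamp": (0.80, 0.15),
    "phone case": (0.10, 0.80),
    "socks": (0.05, 0.30),
}


def high_risk_items(
    embeddings: dict[str, tuple[float, float]], threshold: float = 0.6
) -> list[str]:
    """Return items that are both high value and in a popular category."""
    return sorted(
        name
        for name, (price, popularity) in embeddings.items()
        if price >= threshold and popularity >= threshold
    )


print(high_risk_items(ITEM_EMBEDDINGS))
```

In a real system the fraud model learns where to draw the boundary from the embedding values, rather than relying on a fixed threshold like this one.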

What does it look like in practice?

For our gaming merchants, we can now distinguish between harder-to-sell items and those that are easier to shift. For example, prepaid cards raise a bigger red flag than game activation codes because they’re easier to sell or cash out. So fraudsters are big fans.

For our retail merchants, we can factor in the popularity of an item. We’ve found this to be quite an important signal for fraud.

For food delivery merchants, fraudsters love to order expensive alcohol and junk food (who knew!). So item names are incredibly useful signals.

Across industries, discount and shipping type information is frequently provided to us in the text fields. Now we can efficiently use this information to catch bad actors.

Future-proof fraud prevention

There are, of course, challenges when introducing an extra model into your fraud detection solution. But Ravelin has the necessary infrastructure in place to handle multiple parallel calls to different models. So latency isn’t an issue and we’re still able to support fast and frictionless predictions.

Fraud detection is an ever-evolving field and we’re constantly improving our models. NLP massively expands the capabilities and effectiveness of machine learning. But the work doesn’t stop there.

Adding new languages and increasing the number of text fields it can process are just a couple of developments on the horizon. Fraudsters are smart and adaptable, so staying one step ahead is not enough!

Learn more about machine learning fraud detection.

Antons Tocilins-Ruberts, Data Scientist

Shayan Sadeghieh, Machine Learning Engineer
