Machine Learning: higher accuracy at lower cost

Machine Learning (ML) is the most efficient way for a business to predict which of its transactions are likely to result in chargebacks. This efficiency means substantially lower costs and higher accuracy than any other approach to fraud prediction and detection.

However, to get these results, machine learning requires a great deal of data. Online businesses generate a lot of data by the nature of their operations. This makes them very well suited to the approach and, generally speaking, the more data a merchant has, the more accurate the result.

Online businesses also have past experience of chargebacks. These provide a very useful training set for the algorithms and models. Most, if not all, online businesses should be able to see an extremely positive impact from a machine learning approach to fraud.

Sending data to an external party is not always an easy decision to make, especially when, by its nature, a great deal of that data is sensitive. There is a degree of trust involved. In order to earn that trust, a vendor should be held to certain standards:

  • To treat the data with the utmost security
  • To use the data with the utmost efficiency
  • To provide results with the utmost accuracy
  • To explain how those results were reached
  • To enable those results to be improved by expert input

This paper explores the real-world experience of taking an ML approach: how it works, how the results are optimised and how a fraud team interacts with an ML fraud detection system.


Feature Extraction Service: data into fraud predictions

Ravelin has a feature extraction service that is accessed via API. Simply put, this service turns the various features contained within the data into numbers that a machine can read. By turning them into numbers, it is possible for the algorithms to experiment and assign a weight or importance to each feature. The algorithms in effect run millions of experiments to judge what the right weights might be in order to generate the most accurate result.

Ravelin breaks the data down into three main categories:

Behaviour - what someone does

Identity - who someone is

Network - who someone is connected to.

Within these very broad categories are hundreds, or perhaps thousands, of features. Extracting these features is a significant undertaking and one that a machine learning fraud vendor needs to spend considerable effort doing well.

Feature: A feature is an individual measurable property or characteristic of a data set. For fraud it will usually be a subset of the identity, network or behaviour of a customer.
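
As a rough illustration of what feature extraction means in practice, the sketch below turns a raw customer record into a numeric feature vector spanning identity, behaviour and network signals. The field names (such as account_age_days or customers_sharing_device) are invented for the example and are not Ravelin's actual feature set or API.

```python
# Minimal sketch of feature extraction: raw customer/transaction data in,
# numeric feature vector out. Field names and values are illustrative only.

def extract_features(customer: dict) -> list[float]:
    return [
        # Identity: who someone is
        float(customer["account_age_days"]),
        1.0 if customer["email_domain"] in {"gmail.com", "outlook.com"} else 0.0,
        # Behaviour: what someone does
        float(customer["orders_last_24h"]),
        float(customer["avg_order_value"]),
        # Network: who someone is connected to
        float(customer["customers_sharing_card"]),
        float(customer["customers_sharing_device"]),
    ]

example = {
    "account_age_days": 3,
    "email_domain": "gmail.com",
    "orders_last_24h": 5,
    "avg_order_value": 180.0,
    "customers_sharing_card": 4,
    "customers_sharing_device": 2,
}
print(extract_features(example))  # -> [3.0, 1.0, 5.0, 180.0, 4.0, 2.0]
```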


ML models: from features to a prediction

A machine learning model is effectively a number of algorithms that combine all the features together to produce a score. You can think of a model as something that continually queries the extracted features in order to make a fraud prediction. A model runs these algorithms to provide as accurate a score as it possibly can.

Deploying machine learning models is the equivalent of analysts running hundreds of thousands of queries and comparing the outcomes to find the best result. With machine learning this is done in milliseconds.
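
To make this concrete, here is a minimal sketch of a single model that maps feature vectors to a fraud score. It uses scikit-learn's gradient boosting classifier purely as a stand-in, with invented data; it is not a description of Ravelin's actual models or algorithms.

```python
# Sketch: a single model that turns feature vectors into a fraud score.
# GradientBoostingClassifier is used only as an illustrative stand-in.
from sklearn.ensemble import GradientBoostingClassifier

# Historical feature vectors (e.g. from the extraction step) and labels:
# 1 = resulted in a chargeback, 0 = good.
X_train = [[3, 1, 5, 180.0, 4, 2],
           [900, 0, 1, 25.0, 1, 1],
           [2, 1, 8, 320.0, 6, 3],
           [400, 0, 0, 40.0, 1, 1]]
y_train = [1, 0, 1, 0]

model = GradientBoostingClassifier().fit(X_train, y_train)

# Score a new transaction: probability of fraud, expressed on a 0-100 scale.
new_features = [[5, 1, 4, 150.0, 3, 2]]
score = model.predict_proba(new_features)[0][1] * 100
print(f"fraud score: {score:.0f}")
```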

The service a merchant buys from a fraud vendor can be thought of as the skills and knowledge to get the models to ask the right questions, and the ability to assess whether the answers are correct.

To get these correct answers, a multitude of techniques is used to build different models, because different techniques suit different problems. For example, the model used for email verification would be different from the one used for payment types. Ravelin calls the combination of these techniques and approaches a micro-model architecture.

Micromodel architecture: the combination of a variety of machine learning models to provide a more accurate result. Each model can deploy a different ML technique that is best suited to the features it is querying.
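
A micro-model architecture might be sketched as follows: several specialised models each score the signals they understand best, and a combining step turns their outputs into one prediction. The individual models, thresholds and weights below are illustrative placeholders, not Ravelin's.

```python
# Sketch of a micro-model architecture: specialised models score different
# feature groups and their outputs are combined into a single prediction.

def email_model(features: dict) -> float:
    # A model specialised in email/identity signals (illustrative rule).
    return 0.9 if features["email_age_days"] < 7 else 0.1

def payment_model(features: dict) -> float:
    # A model specialised in payment signals (illustrative rule).
    return 0.8 if features["cards_on_account"] > 3 else 0.2

def network_model(features: dict) -> float:
    # A model specialised in network/connection signals (illustrative rule).
    return 0.7 if features["customers_sharing_device"] > 2 else 0.1

def combined_score(features: dict) -> float:
    # A simple weighted combination; in practice the combining step can
    # itself be a learned model.
    scores = [email_model(features), payment_model(features), network_model(features)]
    weights = [0.4, 0.35, 0.25]
    return 100 * sum(w * s for w, s in zip(weights, scores))

print(combined_score({"email_age_days": 2, "cards_on_account": 5,
                      "customers_sharing_device": 1}))
```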


Fraud Score: distill insight into a single number

A fraud score is the probability of a merchant's customer being a fraudster, expressed on a scale of 0 to 100. How this score is reached is the distillation of an enormous amount of maths.

The score is a result of how the machines are trained to analyse the features that are provided to them. Machines are trained by a labelled set of data, often called a training set. This training set is usually the chargebacks or confirmed fraud that a merchant has seen in the past. These are incredibly valuable labels and can accelerate the accuracy of prediction for a merchant.

Labels and labelled data: Labels are how machines are taught the significance of a piece of data; in fraud, they are how a machine is trained to predict whether a new customer or transaction is likely to be good or bad.
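
In practice, building such a training set is often a matter of joining historical transactions to the chargebacks later raised against them. A rough sketch, with invented identifiers and features:

```python
# Sketch: turning historical transactions plus chargeback records into a
# labelled training set. Identifiers, features and data are invented.

transactions = [
    {"id": "t1", "features": [3, 1, 5, 180.0, 4, 2]},
    {"id": "t2", "features": [900, 0, 1, 25.0, 1, 1]},
    {"id": "t3", "features": [2, 1, 8, 320.0, 6, 3]},
]
chargebacks = {"t1", "t3"}  # transaction ids that later resulted in chargebacks

X = [t["features"] for t in transactions]
y = [1 if t["id"] in chargebacks else 0 for t in transactions]  # labels
print(y)  # -> [1, 0, 1]
```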

It is also possible to give accurate predictions in the absence of historical data. This will often mean applying a model built on data similar to the merchant's. This usually provides a good result that is then rapidly improved as verified fraud (usually chargebacks) arrives in the business. The ability to maximise the value of this labelled data, to ensure a significant reduction in future chargebacks and improved conversion, is again something that a merchant should challenge a fraud vendor to provide.


Risk Threshold: determine the priority for a business

The fraud score gives a probability on a scale of 0 to 100, and then there is a decision for the merchant: where on that scale is the right risk threshold for my business?

There are two core concepts in ML called precision and recall. Without going into detail here (there is a good article on Wikipedia), precision is how many of a model's positive predictions were right, and recall is how many of the total target set the model caught.

It's complicated, but translated into the fraud world it can be thought of as follows. Precision: of the transactions the model recommended stopping, how many would genuinely have resulted in chargebacks, and how many were flagged wrongly (false positives)? Recall: of all the transactions that would have resulted in chargebacks, how many did the model catch?
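
For readers who want the mechanics, the sketch below computes precision and recall for a set of fraud predictions, where 1 means the model flagged the transaction, or the transaction genuinely resulted in a chargeback.

```python
# Sketch: precision and recall for fraud predictions.
# predicted[i] = 1 if the model flagged transaction i as fraud,
# actual[i]    = 1 if transaction i really resulted in a chargeback.

def precision_recall(predicted: list[int], actual: list[int]) -> tuple[float, float]:
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(not p and a for p, a in zip(predicted, actual))
    precision = true_pos / (true_pos + false_pos)  # flagged and actually fraud
    recall = true_pos / (true_pos + false_neg)     # fraud that was caught
    return precision, recall

print(precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # -> (0.667, 0.667)
```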

This is the background calculation and modelling that goes on in order to provide a merchant with a choice: where do they want to set the fraud threshold in their business, to optimise for stopping fraud or for minimising false positives?

N.B. It is important to add a little scale here: the acceptance rate for a Ravelin merchant is usually higher than 98 or 99%. That is to say, almost all transactions are approved. It is within the small band of rejected transactions that the optimisation occurs. One could frame the problem as: how close to 100% acceptance can a business get without the cost of fraud becoming too high? That balance, in effect, is the threshold.
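
One way to picture that balance is to sweep candidate thresholds over scored historical transactions and compare the acceptance rate with the share of chargebacks that would have been prevented. The scores and labels below are invented for illustration.

```python
# Sketch: sweeping risk thresholds to see the acceptance rate vs the share
# of would-be chargebacks blocked. Scores and outcomes are invented.

scored = [  # (fraud score 0-100, did it result in a chargeback?)
    (5, False), (12, False), (18, False), (35, True), (8, False),
    (60, True), (22, False), (91, True), (3, False), (15, False),
]

for threshold in (20, 40, 80):
    accepted = [cb for score, cb in scored if score < threshold]
    blocked = [cb for score, cb in scored if score >= threshold]
    acceptance_rate = len(accepted) / len(scored)
    fraud_blocked = sum(blocked) / sum(cb for _, cb in scored)
    print(f"threshold {threshold}: accept {acceptance_rate:.0%}, "
          f"chargebacks prevented {fraud_blocked:.0%}")
```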

Figure: precision and recall in ML.

Of course, the best answer would be to stop all fraud and never have a false positive. But this is not reality, which is why there is a curve.

So a choice is made based on data, and a threshold is set at, say, a score of 20. This means that all transactions below that risk threshold will be allowed, and those above it will be blocked or challenged. A challenge might be a 3DS check or an additional ID check in some scenarios. That is a policy choice for the user.
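
A sketch of what that policy might look like as code, with invented thresholds (challenge from a score of 20, block from 80):

```python
# Sketch: turning a fraud score into an action under a merchant's policy.
# The thresholds and the challenge mechanism (e.g. 3DS) are illustrative.

def decide(score: float, challenge_threshold: float = 20, block_threshold: float = 80) -> str:
    if score >= block_threshold:
        return "block"
    if score >= challenge_threshold:
        return "challenge"  # e.g. step up to 3DS or an additional ID check
    return "allow"

for s in (5, 35, 95):
    print(s, decide(s))  # 5 allow / 35 challenge / 95 block
```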

This threshold and its impact are under constant observation and are continually discussed between Ravelin and its merchants' analysts.


The role of the analyst: unpicking uncertainty

Making the job of the analyst more effective is at the core of Ravelin’s value proposition. The goal of a successful implementation is to remove the heavy burden of data analysis and display the results in a way that aids investigation, insights and reporting.

One of the key tenets of the obligations Ravelin assumes around a client’s data is explainability. That’s to say any merchant needs to know how and why a decision was reached by a fraud vendor. Without this information it is not possible to improve the performance of the machines, quite apart from any customer service considerations.

This, then, is how analysts engage with, improve and optimise machine learning fraud detection systems. Machines are exceptionally good at doing the heavy lifting in data analysis, number crunching and output. They work tirelessly through the night and never complain at working weekends.

Machines are less good at dealing with uncertainty. There are cases that are new, difficult or somehow different. These edge cases require more attention and may be difficult to determine.

Ideally, the expert human intervention here is not at the point of approval for a transaction. It is more a case of analysing after the event and labelling the data in a way that gives rapid feedback to the machine. It is worth re-emphasising that labelled data is the ultimate training set for a machine, so the more confirmed behaviour labels it can receive, the more accurate the result is likely to be.


Custom models per client: the key to optimal accuracy

Ravelin builds custom models for each client. As discussed, Ravelin uses a micro-model architecture. One value of this approach is the ability to use different ML techniques to query features. Another is that it allows a model to inherit some learnings and observations from one data set and apply them to another. To be clear, this is not about sharing data, but about the re-use of algorithms.

This is often the approach used when first engaging with a new dataset. With new labelled data the models quickly change and adapt, in effect making the model set used with any merchant custom-built for them. This is a key determinant in getting the most accurate results for a specific merchant. While a lot can be gained from the general-model approach used by many machine learning fraud vendors (where all clients are served by a shared model), this will only go so far and often has to be heavily supported by a large number of manual reviews and second-guessed by an extensive ruleset.
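
One way to read "re-use of algorithms, not data" is that the same model-building recipe, i.e. the choice of algorithm and settings, is applied separately to each merchant's own labelled data, so every merchant ends up with a model fitted only to its own transactions. The sketch below is an illustration of that idea under those assumptions, not a description of Ravelin's pipeline.

```python
# Sketch: the same algorithm and configuration are re-used across merchants,
# but each model is trained only on that merchant's own labelled data.
from sklearn.ensemble import GradientBoostingClassifier

def build_merchant_model(X_merchant, y_merchant):
    # Shared, re-usable choice of algorithm and hyperparameters (illustrative).
    model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    return model.fit(X_merchant, y_merchant)

# Each merchant's model sees only that merchant's transactions and labels.
model_a = build_merchant_model(
    [[1, 200.0], [400, 20.0], [3, 150.0], [900, 30.0]], [1, 0, 1, 0])
model_b = build_merchant_model(
    [[5, 90.0], [700, 15.0], [2, 110.0], [365, 25.0]], [1, 0, 1, 0])
```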


Validating results: how to ensure all is well

The validation process is assisted by the use of a control or golden set of data. This allows us to take a counterfactual view of "what would have happened" had we allowed a sample set of transactions to go through. This statistically valid sample validates the chargeback and false positive rates and allows the detection team to confirm that performance is normal.
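
The idea behind a control or golden set can be sketched as follows: a small random slice of traffic bypasses the model's decision, and the outcomes observed on that slice give an unbiased estimate of the chargeback rate the model is protecting against. The routing function and numbers here are invented.

```python
# Sketch: a small random slice of traffic (the control or "golden" set) is
# allowed through regardless of the model's score, so outcomes on that slice
# estimate what would have happened without intervention. Values are invented.
import random

random.seed(0)

def route(score: float, control_fraction: float = 0.01) -> str:
    if random.random() < control_fraction:
        return "allow (control set)"             # bypasses the model's decision
    return "block" if score >= 80 else "allow"   # normal model-driven decision

# Later: chargebacks observed on the control set estimate the underlying
# fraud rate, and false positive rates can be checked against it.
control_outcomes = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = chargeback (invented)
print(f"estimated chargeback rate: {sum(control_outcomes) / len(control_outcomes):.1%}")
```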

Additionally, proposed changes to any model are run side by side with the existing model to ensure that the new model will positively impact the conversion and/or fraud rates. One secondary effect of this is that it can show the wider impact of a knee-jerk response to a fraud issue. In ML terms this would be akin to "over-fitting", i.e. giving too much weight to a single feature. This is an issue that plagues rules-based systems, even weighted ones, as it is not possible to run the number of scenarios that a machine learning model does to assess impact.
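
A side-by-side comparison might look like the following sketch: both the current and the proposed model score the same historical traffic, and the candidate is promoted only if it would catch more fraud without hurting acceptance. All scores, labels and criteria are illustrative.

```python
# Sketch: comparing a current model and a proposed model on the same traffic
# before promoting the new one. Scores, outcomes and criteria are illustrative.

def evaluate(scores, chargebacks, threshold=20):
    blocked = [s >= threshold for s in scores]
    acceptance = 1 - sum(blocked) / len(scores)
    caught = sum(b and cb for b, cb in zip(blocked, chargebacks)) / max(sum(chargebacks), 1)
    return acceptance, caught

chargebacks = [0, 0, 1, 0, 1, 0, 0, 0]          # observed outcomes
current = evaluate([5, 12, 35, 8, 15, 3, 22, 9], chargebacks)
proposed = evaluate([4, 10, 62, 7, 41, 2, 18, 8], chargebacks)

print("current:", current, "proposed:", proposed)
# Promote only if the proposed model catches more fraud without lowering acceptance.
```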

Model improvement is done through the combination of chargeback data, confirmed fraud and good transactions from manual review, as well as continual monitoring of model performance against expected norms. Improving, and maintaining that improvement, is key to the interaction between the detection team and the client's analysts.