Machine Learning and Fraud Prediction: what data is required to make it work?

There is more noise than signal when it comes to machine learning (ML) and its role in fraud detection, or more accurately, fraud prediction. Opinions vary from the deeply cynical to the almost magical and the net result is a great deal of confusion over what it is and is not capable of doing.  

For an introduction to ML and how we use it at Ravelin, listen to my colleague Dr. Eddie Bell’s excellent podcast. In this blog post I’ll attempt to tackle one aspect of ML: what data does it need to be effective, why does it need this data, and what does ‘effective’ mean anyway?

To start at the end, effective means accurate, and accurate means getting most predictions right most of the time about which orders and customers are likely to result in chargebacks. To make those predictions accurately we need access to a merchant’s data.
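As a rough illustration of what "getting most predictions right" can mean, here is a minimal sketch that scores a set of fraud predictions against the chargebacks that actually happened. This post does not say which metric Ravelin uses; precision and recall are a common choice assumed here, and the function name and sample data are hypothetical.

```python
def precision_recall(predicted_fraud, actual_chargeback):
    """Return (precision, recall) for two equal-length lists of booleans."""
    pairs = list(zip(predicted_fraud, actual_chargeback))
    true_pos = sum(1 for p, a in pairs if p and a)       # flagged, and charged back
    false_pos = sum(1 for p, a in pairs if p and not a)  # flagged, but a good customer
    false_neg = sum(1 for p, a in pairs if not p and a)  # missed chargeback

    flagged = true_pos + false_pos
    chargebacks = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / chargebacks if chargebacks else 0.0
    return precision, recall


# Hypothetical data: one entry per customer.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]
precision, recall = precision_recall(predicted, actual)
```

Precision tells you how many of the flagged customers really were fraudulent; recall tells you how many of the real chargebacks were caught. A model needs to do well on both to be useful.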

I’ll say straight away that in this blog post we’re simplifying matters. Any merchant will have data specific to their own business, and that’s great: more data is always better than less. However, there are general truths that we can talk about.

Types of Data

One consequence of magical thinking around ML is the belief that the models will somehow come up with an accurate prediction from minimal inputs. Unfortunately, this is not the case. Ravelin requires a reasonable amount of data to make good predictions, and the better the data, the better the prediction. This is managed through the integration process at the start of an engagement, where the data is consumed through the API.

Identity, Behaviour and Networks

 

[Diagram: the identity, behaviour and network models, each with an indicative percentage contribution to a prediction]

Ravelin uses a micro-model architecture: lots of small, discrete models that, in aggregate, combine to give a prediction. For clarity, we can bundle them into three categories.

Identity - who the customer is
Behaviour - what the customer does
Network - who the customer associates with

The percentages in the diagram are purely indicative of how much each part of a merchant's data would contribute to a prediction and a determination. We can dig into a little more detail on each.

Identity Model


The identity model is everything a merchant can tell us about the customer on their system, from the initial sign-up onwards: email, location, device, timestamps. This can be anything up to 100 attributes, but usually far fewer. You can read the API here.

Ensemble Machine Learning

It is tempting to see these models as discrete and atomic, but it is important to realise that they are not. Each model can predict on its own, but not as effectively as when the models are combined into what is called an ensemble. Combined models are many times more effective than models working in isolation.
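To make the idea of combining models concrete, here is a minimal sketch of one simple ensembling technique, a weighted average of per-model scores. It is not a description of how Ravelin actually combines its micro-models; the weights, the scores and the function name are all hypothetical.

```python
def ensemble_score(scores, weights):
    """Combine per-model fraud scores (0 to 1) into one weighted average."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical weights: how much each category contributes to the final score.
weights = {"identity": 0.2, "behaviour": 0.5, "network": 0.3}

# Hypothetical customer: the identity looks fine, but the behaviour and
# network connections look suspicious.
scores = {"identity": 0.10, "behaviour": 0.80, "network": 0.90}

combined = ensemble_score(scores, weights)
```

Taken alone, the identity score would wave this customer through. The combined score reflects the evidence from all three categories, which is the point of an ensemble.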

Behaviour Model

This is the big one. It covers everything a customer does on the site, including what they order and how they pay. For the technically-minded there is more API documentation to explore here.

This is a rich seam for fraud detection. It is where we find the most variety in the data types available, and equally where we find the most compelling contributions to fraud prediction accuracy. This can easily reach 200 or so attributes, and within those attributes the models can mine thousands of features.
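The difference between an attribute and a feature is easiest to see in a small example. The sketch below takes a handful of raw order attributes and derives features from them. The field names, the features chosen and the function name are all hypothetical and are not taken from the Ravelin API.

```python
from datetime import datetime, timedelta


def behaviour_features(orders, now):
    """Derive a few illustrative features from a customer's raw order attributes."""
    amounts = [order["amount"] for order in orders]
    recent = [o for o in orders if now - o["placed_at"] <= timedelta(hours=24)]
    return {
        "order_count": len(orders),
        "avg_order_value": sum(amounts) / len(amounts) if amounts else 0.0,
        "distinct_cards": len({order["card_id"] for order in orders}),
        "orders_last_24h": len(recent),
    }


# Hypothetical order history for one customer.
now = datetime(2018, 4, 6, 12, 0)
orders = [
    {"amount": 20.0, "card_id": "card_a", "placed_at": datetime(2018, 4, 1, 9, 0)},
    {"amount": 35.0, "card_id": "card_b", "placed_at": datetime(2018, 4, 6, 10, 0)},
    {"amount": 65.0, "card_id": "card_c", "placed_at": datetime(2018, 4, 6, 11, 30)},
]
features = behaviour_features(orders, now)
```

Three attributes per order (amount, card and timestamp) already yield four features here, and many more could be derived by varying the time windows and aggregations.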

Orders versus Customers
An important point to make here is that while this blog post has focused on a customer-centric view of predictions, it is equally possible to take an order-centric view. It uses slightly different models and a different dashboard, but we’ve found the prediction results to be very good. Customer history, however, definitely provides a richer resolution in fraud prediction.

Network Model 

Finally we have the network model. This is especially well developed in Ravelin, as it is something we know adds significant marginal gain when combined with the other models. Here we pull information such as device IDs and location information, and quickly map out connections in the data that look highly suspicious. This model is less data-intensive as it pulls from other sources. There are also JS snippets available that pull data from your site and app, making it a very straightforward part of the integration process.
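As a simplified illustration of mapping out connections, the sketch below groups customers who are linked, directly or indirectly, by a shared device ID. It is a plain graph traversal over hypothetical data, not Ravelin's network model, and the customer and device identifiers are invented.

```python
from collections import defaultdict


def connected_customers(customer_devices):
    """Group customers linked, directly or indirectly, by a shared device ID."""
    # Index customers by device so shared devices are easy to find.
    by_device = defaultdict(set)
    for customer, devices in customer_devices.items():
        for device in devices:
            by_device[device].add(customer)

    seen, groups = set(), []
    for start in customer_devices:
        if start in seen:
            continue
        # Walk outwards from this customer through every shared device.
        group, stack = set(), [start]
        while stack:
            customer = stack.pop()
            if customer in group:
                continue
            group.add(customer)
            for device in customer_devices[customer]:
                stack.extend(by_device[device] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical data: cust_1 and cust_3 never share a device directly,
# but both are linked to cust_2.
customer_devices = {
    "cust_1": {"dev_a"},
    "cust_2": {"dev_a", "dev_b"},
    "cust_3": {"dev_b"},
    "cust_4": {"dev_c"},
}
groups = connected_customers(customer_devices)
```

Here the first three customers end up in one group and cust_4 stands alone. An unusually large group of accounts sharing devices is the kind of connection that would merit a closer look.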

Why does Ravelin need this data? 

The algorithms that underpin Ravelin are built on the experiences of our existing clients. They evolve constantly, and we continually update the models for our clients based on the chargeback data we receive, the review feedback from the client’s analysts, and what we are learning across our client base in general.

We also employ investigations analysts to look into specific anomalies or errors and in aggregate their findings can result in model adaptations. 

In short, Ravelin is a powerful chargeback prediction engine from the get-go. However, for a new client, the engine is highly reliant on the quality and quantity of the data that is fed into it. Where there are gaps, the performance simply is not as good.

Trust. Then analyse.

Now, data quality is a hard thing to define. No one wants to admit that their baby is ugly, and there are often hard conversations about how and why certain data is missing. This is something that the integrations team is very used to. Equally, the detection team at Ravelin is creative about working around gaps and ensuring optimal performance (and therefore recommending the optimal block and acceptance rates for a business).

This is an open, productive and valuable process, and it is worth investing the time and energy to get it right. We believe the integration phase is the bedrock of trust in engaging with Ravelin. It’s essential for a successful relationship that the client trusts the predictions that Ravelin makes.
 

 

 
