Let’s imagine we’re building a machine learning model to detect fraud for a food delivery business. Our fictional business is called DeliverDinner.
When DeliverDinner joins Ravelin as a new client, they start to send live transaction traffic to our API.
Every time a customer registers, adds an item to their basket, or does anything else on the DeliverDinner website, the site sends a JSON request to the API. This means we store lots of data about DeliverDinner customers and everything they’ve ever done in their account. We bundle this data into customer profiles.
To use this data for machine learning we need to do three things:
- Label the customers as fraud/not fraud
- Describe the customers in computer language
- Train the model
Step 1: assign labels
We look at any customer who has had a chargeback, or who has been manually reviewed and marked as fraudulent by the merchant, and label them as fraud.
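As a rough illustration, the labelling rule could be sketched like this. The `CustomerProfile` class and its field names are hypothetical, not Ravelin’s actual schema, and treating every other customer as "not fraud" is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CustomerProfile:
    # Hypothetical fields for illustration only.
    customer_id: str
    has_chargeback: bool = False
    manually_reviewed_as_fraud: bool = False


def label_customer(profile: CustomerProfile) -> int:
    """Return 1 (fraud) if the customer has a chargeback or was manually
    marked as fraudulent by the merchant; otherwise 0 (assumed not fraud)."""
    return int(profile.has_chargeback or profile.manually_reviewed_as_fraud)


profiles = [
    CustomerProfile("c1"),
    CustomerProfile("c2", has_chargeback=True),
    CustomerProfile("c3", manually_reviewed_as_fraud=True),
]
labels = {p.customer_id: label_customer(p) for p in profiles}
print(labels)  # {'c1': 0, 'c2': 1, 'c3': 1}
```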
Step 2: create features
Creating features is basically describing each customer in a way that the computer can understand. We want to describe the characteristics of a customer which indicate whether they are likely to be fraudy or genuine. These are based on the same aspects that a fraud analyst would look at to make the decision.
Examples of features which could be good indicators of fraud are:
- Order rate - fraudsters order at a much more rapid pace. We quantify this as the number of orders per week.
- Email - a fraudster might have a dodgy-looking email address. We quantify this as the % of digits in the email address.
- Delivery location - it could be somewhere typically genuine and unlikely to be fraud, like a penthouse apartment, or it could be somewhere fraudy, like a park. We quantify this as the location fraud rate %.
All features are expressed as numbers, as the model can’t absorb raw text. We build up our features and categorize them into groups.
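To make the three example features concrete, here is a minimal sketch of how they might be computed. The function names, the one-week minimum account age, and the made-up location counts are all assumptions for illustration; a real feature pipeline would be more involved.

```python
def orders_per_week(order_count: int, account_age_days: float) -> float:
    """Order rate: number of orders per week of account age."""
    # Assumption: accounts younger than one week are treated as one week old.
    weeks = max(account_age_days / 7.0, 1.0)
    return order_count / weeks


def email_digit_pct(email: str) -> float:
    """Email: percentage of characters in the address that are digits."""
    if not email:
        return 0.0
    digits = sum(ch.isdigit() for ch in email)
    return 100.0 * digits / len(email)


def location_fraud_rate_pct(location: str, fraud_counts: dict, total_counts: dict) -> float:
    """Delivery location: % of past customers at this location labelled as fraud."""
    total = total_counts.get(location, 0)
    if total == 0:
        return 0.0  # Assumption: unseen locations get a rate of 0.
    return 100.0 * fraud_counts.get(location, 0) / total


# Made-up historical counts per delivery location.
fraud_counts = {"park": 3}
total_counts = {"park": 10, "penthouse": 50}

# One customer described purely as numbers.
features = [
    orders_per_week(order_count=12, account_age_days=28),
    email_digit_pct("ab12@cd.ef"),
    location_fraud_rate_pct("park", fraud_counts, total_counts),
]
print(features)  # [3.0, 20.0, 30.0]
```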
Step 3: train the model
We need to feed the algorithm the data so that it can learn how to solve the problem. At this stage, we feed in the training data.
The training data is a bunch of DeliverDinner data about customers, described in terms of their features and labels to let the algorithm know if they are a fraudster or a genuine customer. This helps the model learn how to tell the difference between genuine/fraudulent.
Within DeliverDinner’s dataset, this might show that genuine customers tend to order around once a week, tend to use the same card each time, and often have the same billing and delivery address. Fraudsters, by contrast, might order several times a week, use lots of different cards, have cards that failed registration, and have billing and delivery addresses that often don’t match.
The algorithm will take this at face value, and learn how to check whether a customer’s features look more like the genuine customer pile or the fraudulent customer pile.
When we show the model a new customer it hasn’t seen before, it compares it with the genuine/fraudy customers it has seen before and produces a fraud score. This score represents how likely the new customer is to be fraudulent.
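The "which pile does this customer look like" idea can be sketched with a toy nearest-centroid classifier. This is a deliberately simple stand-in, not the algorithm Ravelin uses; the two features, the training rows and the 0-100 score scale are assumptions for illustration, and real features would normally be scaled before measuring distances.

```python
import math


def centroid(rows):
    """Average each feature column to get the 'centre' of a pile."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train(features, labels):
    """Learn one centroid for the genuine pile (0) and one for the fraud pile (1)."""
    genuine = [f for f, y in zip(features, labels) if y == 0]
    fraud = [f for f, y in zip(features, labels) if y == 1]
    return {"genuine": centroid(genuine), "fraud": centroid(fraud)}


def fraud_score(model, customer):
    """Score from 0 to 100: higher means closer to the fraud pile."""
    d_genuine = math.dist(customer, model["genuine"])
    d_fraud = math.dist(customer, model["fraud"])
    if d_genuine + d_fraud == 0:
        return 50.0  # Both piles coincide with the customer; no information.
    return 100.0 * d_genuine / (d_genuine + d_fraud)


# Made-up training data: [orders per week, number of different cards used].
train_features = [[1, 1], [1, 2], [2, 1], [5, 4], [6, 5], [7, 6]]
train_labels = [0, 0, 0, 1, 1, 1]
model = train(train_features, train_labels)

low = fraud_score(model, [1.0, 1.0])   # looks like the genuine pile
high = fraud_score(model, [6.0, 5.0])  # looks like the fraud pile
print(round(low, 1), round(high, 1))
```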
For the majority of customers, the fraud score will be quite low, as there are many more genuine customers than fraudsters. When it’s a low score, we recommend allowing the customer and the transaction to go through. If it’s a medium score, we recommend a Review of the transaction, e.g. sending the customer a 3D Secure challenge to authenticate. If the score is very high, we’d recommend blocking the customer from making the transaction.
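Turning a score into a recommendation is a simple comparison against two thresholds. In this sketch the 0-100 scale and the default threshold values of 50 and 90 are hypothetical placeholders, not recommended settings.

```python
def recommend(score: float, review_threshold: float = 50.0,
              prevent_threshold: float = 90.0) -> str:
    """Map a fraud score to an Allow / Review / Prevent recommendation."""
    if score >= prevent_threshold:
        return "PREVENT"  # very high score: block the transaction
    if score >= review_threshold:
        return "REVIEW"   # medium score: e.g. send a 3D Secure challenge
    return "ALLOW"        # low score: let the transaction through


for score in (12, 65, 97):
    print(score, recommend(score))
```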
How do you decide the limits to allow/review/prevent?
Setting the right limits for Allow/Review/Prevent thresholds depends on precision/recall.
Precision asks: of all the prevented customers, what proportion were fraudsters?
Recall asks: of all the fraudsters, what proportion did we prevent?
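These two questions translate directly into counts. The sketch below assumes a 0-100 score, a label of 1 for fraud and 0 for genuine, and made-up example data; returning 0.0 when there is nothing to divide by is a convention chosen for this example.

```python
def precision_recall(scores, labels, prevent_threshold):
    """Precision and recall of the 'prevent' decision at a given threshold."""
    prevented = [s >= prevent_threshold for s in scores]
    # Prevented customers who really were fraudsters.
    tp = sum(1 for p, y in zip(prevented, labels) if p and y == 1)
    # Prevented customers who were actually genuine (false positives).
    fp = sum(1 for p, y in zip(prevented, labels) if p and y == 0)
    # Fraudsters we failed to prevent.
    fn = sum(1 for p, y in zip(prevented, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and labels for seven customers.
scores = [2, 10, 40, 60, 85, 97, 99]
labels = [0, 0, 0, 1, 0, 1, 1]

precision, recall = precision_recall(scores, labels, prevent_threshold=80)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Here three customers are prevented, two of whom are fraudsters (precision 2/3), and two of the three fraudsters are caught (recall 2/3).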
Putting precision & recall in context
If your prevent threshold is at 95, you’re blocking a very small % of customers. You’d have very high precision - you’re only blocking a few customers that you’re fairly sure are fraudsters. You’ll have a very low false-positive rate. However, recall is likely to be low as there are likely to be fraudsters with scores under 95 which you’re not preventing.
Now look at the opposite situation: a prevent threshold of 5. You’re preventing a huge amount of your traffic, so you’re likely to have very poor precision and end up with lots of false positives. You will have high recall, as you’re going to block most if not all of the fraudsters.
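The trade-off between the two extremes can be seen by sweeping the prevent threshold over a small dataset. The scores and labels below are invented toy data, chosen only to illustrate the pattern, and the helper name `sweep` is hypothetical.

```python
def sweep(scores, labels, thresholds):
    """Return {threshold: (precision, recall)} for the 'prevent' decision."""
    total_fraud = sum(labels)
    results = {}
    for t in thresholds:
        # Labels of every customer whose score reaches the threshold.
        prevented = [y for s, y in zip(scores, labels) if s >= t]
        caught = sum(prevented)  # prevented customers who are fraudsters
        precision = caught / len(prevented) if prevented else 0.0
        recall = caught / total_fraud if total_fraud else 0.0
        results[t] = (precision, recall)
    return results


# Toy data: 11 customers, 4 of them fraudsters (label 1).
scores = [1, 3, 8, 12, 20, 35, 55, 70, 90, 96, 98]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

results = sweep(scores, labels, thresholds=[5, 50, 95])
for t, (p, r) in results.items():
    print(f"threshold={t:>2} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a threshold of 95 gives perfect precision but catches only half of the fraudsters, while a threshold of 5 catches all of them at the cost of blocking more genuine customers than fraudsters.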
Setting the right risk threshold
It’s a bit of a balancing act between the two, and where you set your thresholds depends on your individual business priorities. It’s easy to tweak these depending on your risk appetite, or on whether you are more concerned about chargebacks or false positives.
Understanding precision, recall and risk thresholds is important for assessing our model’s accuracy and making sure it is improving.
To learn more about why we use custom models for every business, check out our guide to Machine Learning at Ravelin here.