At Ravelin, we train and assemble individual micromodels for every business - resulting in each client having a solution suited to their specific data and fraud trends. The solution is made up of a micromodel architecture - meaning it's a combination of different models for different data in one. Each individual business will have a different weighting on the different models depending on their data.
These individually specialized models learn the best representations of specific parts of the dataset. Here's a simple representation of this:
Three expert micromodels we use
Three prominent micromodel architectures we use are:
- TabNet
- Natural Language Processing (NLP)
- Anomaly detection autoencoders
Besides these three models, we use a variety of customisable experts to build a ‘mixture of experts’ model. Each of the models is a specialist in its own right - we’ll explain the three above in more detail…
To introduce the TabNet architecture model, it’s useful to look at the more well-known random forest architecture first.
Random forest & decision trees
A random forest architecture is made up of many decision trees - so first, what is a decision tree?
A decision tree is a series of questions with yes/no answers which aims to divide data into different classes. The goal is for the tree to divide the data into groups which are as different from each other as possible, and for the members of each group to be as similar to each other as possible.
Random Forest is an ensemble of decision trees which are blended at the classification level.
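To make the tree-and-forest idea concrete, here's a toy sketch in Python. The three "trees" are hand-written chains of yes/no threshold questions over hypothetical transaction features (the feature names and thresholds are invented for illustration, not Ravelin's rules), and the forest blends them by majority vote at the classification level:

```python
# Toy illustration: three hand-written "decision trees", each a chain of
# yes/no threshold questions, blended by majority vote like a random forest.

def tree_a(txn):
    # Hypothetical rule: large amounts on new accounts look risky.
    if txn["amount"] > 500:
        return 1 if txn["account_age_days"] < 30 else 0
    return 0

def tree_b(txn):
    # Hypothetical rule: many recently added cards is a risk signal.
    return 1 if txn["cards_last_week"] >= 3 else 0

def tree_c(txn):
    # Hypothetical rule: mismatched billing/shipping country.
    return 1 if txn["billing_country"] != txn["shipping_country"] else 0

def random_forest(txn, trees=(tree_a, tree_b, tree_c)):
    # Each tree votes independently; the forest takes the majority.
    votes = [t(txn) for t in trees]
    return 1 if sum(votes) > len(votes) / 2 else 0

txn = {"amount": 900, "account_age_days": 5, "cards_last_week": 4,
       "billing_country": "GB", "shipping_country": "GB"}
print(random_forest(txn))  # two of three trees vote "fraud" -> 1
```

Note that the vote treats every tree equally, which is exactly the limitation discussed below: the blend happens at the classification level, so no tree's opinion can be weighted above another's.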
Advantages for fraud detection
Random forest architecture is already commonly used in fraud detection. The benefits of this architecture are:
- Rapid deployment
- Multiple trees reduce overall bias
- Easy to interpret and understand the internal workings
- Stable algorithm - not disrupted by adding a new datapoint
Disadvantages for fraud detection
Random forest is blended at the classification level, not representation level. This means the classification model doesn’t know which trees carry more/less weight in the decision, which gives it less predictive power.
One problem with random forest is that it can be prone to overfitting. Overfitting occurs when the model fits the training data too well - eg. divides the data into too many overly specific groups.
This means it learns too much about specific customers in the training set rather than a more general classification that applies to many customers. This can lead to poor performance in the live environment on customers that aren't in the training dataset.
Why is TabNet an improvement?
TabNet is great for explainability: unlike a random forest, it allows us to understand which features carry the most weight in a prediction and enables further analysis.
The purpose of this micromodel for Ravelin is to learn about:
- Numerical data
- Low cardinality categorical features (features with low number of categories eg. continents rather than countries)
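To illustrate the low-cardinality point, here's a small sketch: a high-cardinality feature (country, with ~200 possible values) is reduced to a low-cardinality one (continent), which is then one-hot encoded for the model. The country-to-continent mapping below is a tiny illustrative subset, not a full table:

```python
# Reduce a high-cardinality feature (country) to a low-cardinality one
# (continent), then one-hot encode it. Mapping is an illustrative subset.
CONTINENT = {"GB": "Europe", "FR": "Europe", "US": "North America",
             "BR": "South America", "JP": "Asia"}
CATEGORIES = ["Europe", "North America", "South America", "Asia"]

def one_hot(country):
    continent = CONTINENT.get(country, "Other")
    return [1 if c == continent else 0 for c in CATEGORIES]

print(one_hot("FR"))  # [1, 0, 0, 0] -- Europe
```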
TabNet results in a high performance model which outperforms other neural network and decision tree variants. What makes it so efficient?
TabNet model design
More than yes/no
Another key difference with TabNet is that it is not limited to yes/no questions. Instead of simply classifying data into true/false classes at each stage, the model can split according to a value, eg. transaction amount. It has two key elements:
The attentive transformer directs the model’s attention. It’s a powerful way of prioritising which features to look at for each ‘decision step’. Attentive transformers may ask questions about hundreds of different features at each step. It also has long-term memory built in - it remembers the outcome of previous decision steps and the actual data behind the decisions.
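As a rough sketch of what a feature mask does: given a learned importance score per feature, a softmax turns the scores into weights that say how much attention the next decision step pays to each feature. (TabNet itself uses sparsemax plus a prior that discounts already-used features; the scores below are invented for illustration.)

```python
import math

# Simplified feature mask: softmax over per-feature importance scores.
# The scores are hypothetical, standing in for what an attentive
# transformer would learn from the data.
def attention_mask(scores):
    exps = {f: math.exp(s) for f, s in scores.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}

scores = {"amount": 2.0, "account_age": 0.5, "item_count": -1.0}
mask = attention_mask(scores)
# "amount" gets the largest share of attention at this decision step.
print(max(mask, key=mask.get))  # amount
```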
The feature transformer looks at all the features assessed and decides which ones are indicative of fraudulent/genuine behavior. The feature transformer has decision-making processes internally built into its architecture.
Architecture limits overfitting
TabNet architecture can prevent overfitting issues which occur in random forest. It does this in two ways: through its loss function and through the feature transformer.
We can limit the granularity of the model so it learns the recurring patterns of fraud rather than the individual features of a single fraudulent transaction. This makes the model more general so it can make predictions on new data which doesn't look identical to what it has seen before.
The feature transformer emphasises reusable decision-making processes - in other words it ‘remembers’ how it makes decisions. This means that if the feature transformer sees the same feature data more than once it will try to make a decision in the same way each time - preventing further granularity down the decision-making chain.
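As a loose analogy only, the "same data, same decision" behavior can be sketched as memoisation: the real feature transformer shares learned layers across decision steps rather than caching outputs, but the effect on repeated feature data is similar.

```python
from functools import lru_cache

# Loose analogy: repeated feature data reuses the same decision rather
# than being recomputed, preventing ever-finer special cases.
calls = 0

@lru_cache(maxsize=None)
def decide(feature_value):
    global calls
    calls += 1              # count how often the decision is recomputed
    return feature_value > 0.5

decide(0.9); decide(0.9); decide(0.9)
print(calls)  # 1 -- the same input reuses the remembered decision
```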
Natural language processing (NLP) micromodel
NLP models are often used in applications such as voice recognition, email spam filters or translation services.
At Ravelin, the NLP model purpose is to assess text features, such as email, order basket content, delivery notes and other text items.
NLP model design
An example NLP architecture for a food delivery order basket could look like this:
Tokenization/padding is how we convert the text into numbers that the model can understand as data. In our model we tokenize at the character level to enhance model flexibility towards new items as well as new features.
This means it’s easier to find repetitions in the data - for example a repetition of a string of random letters in fraudulent email addresses with incremental changes.
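Character-level tokenization with padding can be sketched as follows. This is a minimal version, assuming a small fixed character vocabulary with reserved ids for padding and unknown characters; a production tokenizer would build its vocabulary from the training text:

```python
# Minimal character-level tokenizer with padding. Ids 0 and 1 are
# reserved for padding and unknown characters.
PAD, UNK = 0, 1
vocab = {ch: i + 2 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz0123456789@.")}

def tokenize(text, max_len=12):
    ids = [vocab.get(ch, UNK) for ch in text.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))   # pad to a fixed length

print(tokenize("latte"))
```

Padding to a fixed length means every order item becomes a same-shaped sequence of numbers, which is what the downstream embedding and LSTM layers expect.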
Embedding and item BiLSTM - items are individually embedded and passed through a bidirectional long short-term memory (biLSTM) block to encourage the model to learn similarities between items, eg. latte and cappuccino. At the item level, the LSTM adds extra context to the text, grouping text where appropriate. For example, it may group ‘black’ with ‘coffee’ and ‘tea’ with ‘milk’.
Order BiLSTM - a final order biLSTM learns typical order baskets, eg. a food item and a drink. Having a bidirectional LSTM means the chronology of the order doesn't have an adverse effect: the model isn't thrown off by whether someone orders a drink first then food, or food first then a drink. This also promotes learning of orders and other user behaviors.
Advantages of our NLP architecture
NLP can remove the element of human bias which goes into building text features - for example an English-speaker might only be familiar with English text and build features based on their own knowledge and the rules of English language. NLP can encompass many different languages without the human resource overhead and reduce this type of bias.
With conventional text feature extraction, a new feature or new text field requires a person to manually build this feature into the model. NLP will do this automatically, reducing the human burden when introducing new products, email domains etc.
Anomaly detection model
Another architecture we use is a deep autoencoding Gaussian mixture model (DAGMM). The model condenses the data and learns a compressed representation, allowing it to confirm new customers as typical of the existing dataset or identify outliers.
This type of model is sometimes used in image compression such as for live sports/video streaming. If the stream is poor quality and pixels are lost, the model can try to construct the image from the learned representation.
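The compress-and-reconstruct idea can be sketched in a few lines. In this toy version "compression" is just keeping the mean of the features and reconstruction error is the anomaly score; a real DAGMM learns the encoder, the decoder, and a Gaussian mixture over the compressed codes:

```python
# Toy autoencoder sketch: compress features to a smaller code,
# reconstruct, and use reconstruction error as an anomaly score.
def encode(features):
    return sum(features) / len(features)        # 4 numbers -> 1 number

def decode(code, n):
    return [code] * n                           # best guess from the code

def anomaly_score(features):
    rebuilt = decode(encode(features), len(features))
    return sum(abs(a - b) for a, b in zip(features, rebuilt))

typical = [0.5, 0.5, 0.5, 0.5]   # survives compression perfectly
outlier = [0.0, 0.0, 0.0, 2.0]   # information is lost in compression
print(anomaly_score(typical), anomaly_score(outlier))  # 0 vs 3.0
```

Customers that look like the bulk of the dataset reconstruct cleanly and score low; unusual combinations lose information in the bottleneck and score high.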
In the encoder, the model asks questions that compress the data; the decoder must then ask the questions in reverse so that the data comes out looking the same. The model will try to distill the key signals in the customer information.
The anomaly detection model assimilates all the data and based on the combination of all features, it decides if it should allow or prevent the transaction. We continuously update the anomalous behavior model to ensure it is reflecting the current data.
This model’s purpose is to classify outliers and anomalous new user behavior. It asks the question - “Does this new customer fit the typical customer profile from my dataset?”
If the answer is no, it doesn't necessarily mean the customer is fraudulent. This is interpreted as a signal which is passed to the classification model (along with all the other model outcomes) and blended, before giving the customer a fraud score.
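The blending step described here can be sketched as a single logistic layer over the micromodel signals. The weights and bias below are invented for illustration; in practice they would be learned by the classification model:

```python
import math

# Sketch of the blending step: each micromodel emits a signal, and a
# simple logistic layer combines them into one fraud score.
def fraud_score(signals, weights, bias):
    z = sum(w * signals[name] for name, w in weights.items()) + bias
    return 1 / (1 + math.exp(-z))               # squash to a 0-1 score

signals = {"tabnet": 0.9, "nlp": 0.7, "anomaly": 0.2}   # hypothetical outputs
weights = {"tabnet": 2.0, "nlp": 1.0, "anomaly": 1.5}   # hypothetical weights
score = fraud_score(signals, weights, bias=-2.0)
print(round(score, 2))
```

Unlike the equal-vote blend of a random forest, the weights here let the classifier learn that some models' signals matter more than others for a given business.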
Blending the models
As mentioned above, these are our main three architectures, but we do use other models on top of these.
Each model contributes signals to the classification model, and these are blended before producing a final score. No single model makes the total decision on its own; only the classification model does, using a simple neural network structure to assess all the signals from the various models. Learn more about the machine learning practices and model architectures we use in our guides here or get in touch to find out more.