Generator of large transaction graphs with patterns of criminal activity

Tomcat

Good day.
A couple of years ago, our team (compliance at a Swiss bank) faced a very interesting task: generate a large graph of transactions between clients, companies and ATMs, add patterns to it that resemble money laundering and other criminal activity, and attach minimal information to the nodes of the graph - names, addresses, times, etc. Of course, all data had to be generated from scratch, without using existing customer data.
To solve this problem we wrote a generator, which I would like to share with you. Below you will find the story of why we needed it and a description of how the generator works. For the impatient, here is the code. I will be glad if our experience is useful to someone.

Why are we doing such nonsense?​

Our team decided to take part in the LauzHack hackathon as a sponsor.

One of the conditions for participation in the sponsor format was the provision of a real business task for the participants. Just at that time, we had a very interesting project related to the automation of the search for financial crimes and money laundering among our clients’ transactions, and without hesitation we decided to offer the same task to the hackathon participants.
For obvious reasons we couldn't use real data, so we had to create it. To keep the task as close to reality as possible, we studied the statistics of the real data and tried, as best we could, to bring the generated data close to the real distributions. We also did not skimp on the volume and complexity of the data: we did not need a solution that works on a graph of 100 nodes and 200 connections. We were looking for one that could handle graphs with millions of nodes and billions of connections and take into account all available information about the nodes and connections.

What we got​

What we ended up with is a fairly fast (given the amount of data), interesting and configurable generator. Let's take a closer look.

Data types​

We want to have a graph of financial transactions, so the possible participants in this graph are:
  • Client - an account of an abstract bank client. Described by name, email, age, occupation, political views, nationality, education and residential address.
  • Company - a business entity in the financial system. Described by company type, name and country.
  • ATM - roughly speaking, the point where money leaves the graph under our control. Described by geographic coordinates.
  • Transaction - the fact of transferring money from one graph node to another. Described by start and end node, amount, currency and time.
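To make the four entity types concrete, here is a minimal sketch of how they could be modelled as plain records. The class layout, the field names and the string `id` fields are my assumptions for illustration, not the generator's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Client:
    id: str               # hypothetical identifier, not mentioned in the text
    name: str
    email: str
    age: int
    occupation: str
    political_views: str
    nationality: str
    education: str
    address: str


@dataclass(frozen=True)
class Company:
    id: str
    type: str
    name: str
    country: str


@dataclass(frozen=True)
class ATM:
    id: str
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Transaction:
    source: str           # id of the sending node
    target: str           # id of the receiving node
    amount: float
    currency: str
    time: datetime


# A client withdrawing cash: the money leaves the graph through an ATM node.
tx = Transaction("client-1", "atm-7", 200.0, "CHF", datetime(2017, 11, 11, 12, 30))
```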

To create this data we use Mimesis, an excellent library for creating fake data.

Creating a Graph: Basic Entities​

First, we create all the basic entities - clients, companies and ATMs. The script takes the number of clients to create and derives the number of companies and ATMs from it. According to our data, the number of companies with a significant number of transactions with clients is approximately 2.5% of the number of clients, and the number of ATMs is 0.05% of the number of clients. These values are rough estimates and are not configurable (they are hard-coded in the generator).
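The arithmetic is simple, but a small sketch shows the scale involved. The function name and the `max(1, ...)` floor (so that tiny test graphs still get one company and one ATM) are my assumptions; only the two ratios come from the text.

```python
COMPANY_RATIO = 0.025   # 2.5% of the number of clients
ATM_RATIO = 0.0005      # 0.05% of the number of clients


def node_counts(n_clients: int) -> dict[str, int]:
    """Derive the number of companies and ATMs from the number of clients."""
    return {
        "clients": n_clients,
        # max(1, ...) is an assumption, not something the text specifies
        "companies": max(1, round(n_clients * COMPANY_RATIO)),
        "atms": max(1, round(n_clients * ATM_RATIO)),
    }


counts = node_counts(1_000_000)
```

For one million clients this gives 25,000 companies and 500 ATMs.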

All information is saved in .csv files. The files are written in batches of k lines at a time; k is set through a script argument. The three types of nodes are generated in parallel.
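Batched writing could look roughly like the sketch below. It uses only the standard csv module; the function names and the way k is passed are assumptions, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, Sequence


def batched(rows: Iterable[Sequence], k: int) -> Iterator[list]:
    """Yield lists of at most k rows until the input is exhausted."""
    it = iter(rows)
    while batch := list(islice(it, k)):
        yield batch


def write_in_batches(out, header: Sequence[str], rows: Iterable[Sequence], k: int) -> int:
    """Write rows to the open text file `out`, k rows per writerows() call."""
    writer = csv.writer(out)
    writer.writerow(header)
    written = 0
    for batch in batched(rows, k):
        writer.writerows(batch)
        written += len(batch)
    return written


# Rows come from a generator, so only k of them are held in memory at once.
buffer = io.StringIO()
count = write_in_batches(buffer, ["id", "name"],
                         ((i, f"client-{i}") for i in range(7)), k=3)
```

Because the rows are consumed lazily, memory use stays proportional to k rather than to the size of the graph.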

Creating a Graph: Connections Between Entities​

After creating the basic entities, we begin to connect them together. At this stage we do not yet generate the transactions themselves, only the fact that a connection exists between two nodes. This speeds up generation of the whole graph and works approximately as follows: if two nodes are connected, we generate a certain number of transactions between them, spread out over time. If they are not connected, there are no transactions between them.

The probability of a connection between two nodes is configured through arguments; the default values are listed below.

Possible connection types:
  • Client -> Client (p = 0.4%)
  • Client -> Company (p = 1%)
  • Client -> ATM (p = 3%)
  • Company -> Client (p = 0.5%)

Like nodes, all connection types are generated in parallel and written to their files in batches.
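One straightforward reading of these probabilities is an independent coin flip for every ordered pair of nodes. The sketch below assumes exactly that; the function name and the dictionary of defaults are hypothetical. Note that the naive double loop is quadratic in the number of nodes, so a graph of millions of nodes would need a smarter sampling scheme than this.

```python
import random
from typing import Iterator, Sequence

# Default probabilities from the list above, keyed by (source type, target type)
DEFAULT_P = {
    ("client", "client"): 0.004,
    ("client", "company"): 0.01,
    ("client", "atm"): 0.03,
    ("company", "client"): 0.005,
}


def connections(sources: Sequence[str], targets: Sequence[str], p: float,
                rng: random.Random) -> Iterator[tuple[str, str]]:
    """Yield (source, target) pairs, each kept independently with probability p."""
    for s in sources:
        for t in targets:
            if s != t and rng.random() < p:   # no self-connections
                yield s, t


rng = random.Random(0)
clients = [f"client-{i}" for i in range(200)]
atms = ["atm-0", "atm-1"]
client_to_atm = list(connections(clients, atms, DEFAULT_P[("client", "atm")], rng))
```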

Graph creation: transactions​

Once the nodes and the connections between them follow the desired distribution, we can start generating transactions. The process itself is simple, but it is hard to parallelize. Therefore, at this stage there are only two independent streams: transactions originating from clients and transactions originating from companies.

Nothing particularly interesting happens at this stage: the script runs through the list of connections and generates a random number of transactions for each connection. This is all written in exactly the same way - in .csv files in batches.
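As an illustration of this step, here is a sketch that turns one connection into a random number of transactions spread over a time window. The count range, the amount range, the currency and the uniform spread over time are all assumptions; the text only says that the number of transactions is random.

```python
import random
from datetime import datetime, timedelta


def transactions_for(source, target, rng, start, end, max_count=10, max_amount=5000.0):
    """Generate a random number of transactions for one connection."""
    span = (end - start).total_seconds()
    rows = []
    for _ in range(rng.randint(1, max_count)):       # assumed range 1..max_count
        when = start + timedelta(seconds=rng.uniform(0, span))
        amount = round(rng.uniform(1.0, max_amount), 2)
        rows.append((source, target, amount, "CHF", when))
    return sorted(rows, key=lambda row: row[4])      # chronological order


rng = random.Random(42)
rows = transactions_for("client-1", "company-3", rng,
                        start=datetime(2017, 1, 1), end=datetime(2018, 1, 1))
```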

Graph creation: patterns​

This is where it gets interesting. The types of behavior patterns we wanted in the final graph:
  • Flow - a large amount goes from one node to m others, each of these m nodes forwards the money to the next level of n nodes, and so on until the last level sends all the money to one recipient.
  • Circular - the amount of money goes in a circle and returns to the source.
  • Time - a certain amount of money moves from one node to another with some fixed frequency.

Let's look at each of these patterns in more detail:

Flow​

To begin with, we select the number of levels through which the money will pass. In our implementation this is a random number between 2 and 6, hard-coded and not configurable. Next, two nodes of the graph are selected: the sender and the recipient. A random amount that the sender will send to the recipient is also selected (according to the very tricky formula 50000 * random() + 50000 * random()).

Each participant in this network charges some kind of fee for their services. In our implementation, the maximum price for passing money through the network will be 10% of the amount transferred by the sender.

The generated transactions are shifted in time relative to the transactions of the previous level: money first arrives at level n-1 and only then moves on to level n. Delays are chosen randomly in the range of 4-5 days. The generated transactions also have pseudo-random amounts (bounded by the original amount and taking into account the fee paid to each node).
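The description above can be sketched as follows. This is my reconstruction, not the generator's code: the level width, the way each node splits its money across the whole next level, and the even split of the 10% fee budget between levels are all assumptions. Only the 2-6 levels, the amount formula, the 10% cap and the 4-5 day delay come from the text.

```python
import random
from datetime import datetime, timedelta


def split(total, parts, rng):
    """Split `total` into `parts` positive random shares that sum to `total`."""
    weights = [rng.random() + 0.1 for _ in range(parts)]
    scale = total / sum(weights)
    return [w * scale for w in weights]


def flow_pattern(sender, recipient, pool, rng, start, max_fee=0.10, max_width=4):
    """Return (amount, transactions) for one Flow pattern.

    `pool` must not contain sender or recipient and needs at least
    6 * max_width nodes, since intermediaries are drawn without repetition.
    """
    n_levels = rng.randint(2, 6)                          # from the text
    amount = 50000 * rng.random() + 50000 * rng.random()  # from the text
    fee_rate = max_fee / n_levels   # assumption: fee budget shared evenly by the levels

    widths = [rng.randint(1, max_width) for _ in range(n_levels)]
    picked = iter(rng.sample(pool, sum(widths)))
    layers = [[sender]] + [[next(picked) for _ in range(w)] for w in widths] + [[recipient]]

    balances = {sender: amount}
    when, txs = start, []
    for depth, (current, following) in enumerate(zip(layers, layers[1:])):
        incoming = dict.fromkeys(following, 0.0)
        for node in current:
            # the sender forwards everything, every intermediary keeps a random fee
            keep = 0.0 if depth == 0 else balances[node] * rng.uniform(0, fee_rate)
            shares = split(balances[node] - keep, len(following), rng)
            for target, share in zip(following, shares):
                txs.append((node, target, share, when))
                incoming[target] += share
        balances = incoming
        when += timedelta(days=rng.uniform(4, 5))  # next level fires 4-5 days later
    return amount, txs


rng = random.Random(7)
pool = [f"node-{i}" for i in range(50)]
amount, txs = flow_pattern("sender", "recipient", pool, rng, datetime(2017, 3, 1))
received = sum(a for _, target, a, _ in txs if target == "recipient")
```

With a per-level fee of at most 10% / levels, the recipient always gets at least 90% of the original amount, which matches the cap described above.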

Circular​

It is generated on the same principle as Flow, except that instead of a distinct sender and recipient with several levels between them, the money goes around a circle and returns to the original node. All intermediate nodes charge a fee, just as in Flow, and the transactions are also time-shifted.
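A minimal sketch of the circular case, under the same caveats: the ring length, the reuse of the Flow amount formula, the 10% fee cap and the 4-5 day delay between hops are assumptions carried over from the Flow description, and the function name is hypothetical.

```python
import random
from datetime import datetime, timedelta


def circular_pattern(origin, pool, rng, start, max_fee=0.10, max_len=6):
    """Send money from `origin` through a ring of nodes and back to `origin`.

    `pool` must not contain `origin` and needs at least `max_len` nodes.
    """
    ring = rng.sample(pool, rng.randint(2, max_len))        # assumed ring length
    current = 50000 * rng.random() + 50000 * rng.random()   # assumed: same as Flow
    fee_rate = max_fee / len(ring)                          # total fee stays <= 10%
    path = [origin, *ring, origin]

    txs, when = [], start
    for i, (src, dst) in enumerate(zip(path, path[1:])):
        if i > 0:   # intermediate nodes keep a fee, the origin sends the full amount
            current *= 1 - rng.uniform(0, fee_rate)
        txs.append((src, dst, round(current, 2), when))
        when += timedelta(days=rng.uniform(4, 5))
    return txs


rng = random.Random(11)
pool = [f"node-{i}" for i in range(20)]
ring_txs = circular_pattern("origin", pool, rng, datetime(2017, 3, 1))
```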

Time​

The simplest pattern. A certain amount is sent from the sender to the recipient a random number of times (from 5 to 50, not configurable) with pseudo-random time shifts.
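One way to read "fixed frequency with pseudo-random shifts" is a fixed period plus a small random jitter on each payment. The sketch below assumes that reading; the weekly period, the six-hour jitter and the amount range are made-up defaults. Only the 5-50 repetitions come from the text.

```python
import random
from datetime import datetime, timedelta


def time_pattern(sender, recipient, rng, start,
                 period=timedelta(days=7), jitter=timedelta(hours=6)):
    """Repeat the same payment 5 to 50 times at a roughly fixed interval."""
    count = rng.randint(5, 50)                     # from the text
    amount = round(rng.uniform(1000, 10000), 2)    # assumed range
    txs = []
    for i in range(count):
        # each payment lands within +/- jitter of its scheduled time
        shift = timedelta(seconds=rng.uniform(-1, 1) * jitter.total_seconds())
        txs.append((sender, recipient, amount, start + i * period + shift))
    return txs


start = datetime(2017, 1, 2)
regular = time_pattern("client-5", "company-2", random.Random(1), start)
```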

All new transactions are written in the same way to .csv files in batches.

Randomization of the graph and collection of all transactions into one file​

At this stage we have several .csv files:
  • 3 files with nodes (clients, companies and ATMs)
  • 4 files with transactions: one for regular transactions and 3 containing patterns.

An additional script mixes pattern transactions together with regular transactions so that it is not possible to see patterns in the graph based on the order in which the transactions were recorded in the file.
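The mixing step can be sketched as a merge followed by a shuffle. This version loads every row into memory, which is an assumption that only holds for small graphs; at the scale of billions of transactions an on-disk shuffle would be needed. The function name and the three-column layout are illustrative, and in-memory buffers stand in for the real files.

```python
import csv
import io
import random


def merge_and_shuffle(sources, out, rng):
    """Merge CSV files that share one header into `out`, in random row order.

    Expects at least one source; every source starts with the same header row.
    """
    header, rows = None, []
    for src in sources:
        reader = csv.reader(src)
        file_header = next(reader)
        header = header or file_header
        rows.extend(reader)
    rng.shuffle(rows)   # the order no longer reveals which file a row came from
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return len(rows)


regular = io.StringIO("source,target,amount\nc1,c2,10\nc2,k1,25\n")
flow = io.StringIO("source,target,amount\nc9,c8,50000\n")
merged = io.StringIO()
total = merge_and_shuffle([regular, flow], merged, random.Random(3))
```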

And what to do with all this?​

In the end we have 4 beautiful files with graph nodes and the transactions between them. You can import them into Neo4j, serve them over REST, or do whatever your heart desires with them.

As for us, we received very positive feedback from the hackathon participants, and several very interesting solutions for finding patterns in massive graphs.
 