ATM Fraud Detection with Kafka Streams API

By Sameer Kulkarni, April 16, 2019 | Kafka

Overview:

Stream processing means processing data as it arrives. Unlike batch processing, businesses do not have to wait for a fixed interval (typically hours to a day, depending on the volume of data) to store, analyze, and get results from incoming data. This is particularly useful for fraud detection, system monitoring, stock market analysis, website activity tracking, and many similar cases. Early detection is key in all of them, which makes stream processing a natural fit.

Apache Kafka (http://kafka.apache.org/), a distributed streaming platform, has long served as a reliable source of data for stream processing frameworks such as Apache Spark Streaming, Apache Storm, Apache Samza, and Apache Flink. Starting with version 0.10.0, the Kafka client libraries also include a powerful stream processing library of their own. This means there is no need for an external processing framework just to process data that is already available in Kafka.

Here we will see how to implement fraud detection using the Kafka Streams API. ATM transactions can be streamed to a Kafka topic as they happen and then processed to detect any suspicious activity.

Prerequisites:

Before continuing, you should be familiar with the following concepts, which are core to stream processing:

  • Stream – Table Duality
  • Time windows in stream processing
  • Stream Repartitioning
  • Stream-Stream, Stream-Table Join
  • Topology

If you are not familiar with these concepts, you can read Kafka: The Definitive Guide, a book that confluent.io makes available for free.

Approach:

We will consider a transaction suspicious if:

  • Two or more transactions happen on the same account within 10 minutes.
  • They are made at different ATMs.
  • The ATMs are too far apart to travel between in 10 minutes or less.

A typical ATM transaction record contains the following fields:

  • Account number
  • ATM ID
  • Time of transaction
  • Amount
  • Transaction ID

The first two criteria above are fairly easy to check with this data. For the last criterion, we will calculate the speed required to get from one ATM to the next. If that speed is implausibly high, we will treat the activity as suspicious.

To calculate the speed between ATMs, we also need the geolocations (coordinates) of the ATMs. We can use them to find the distance between the two ATMs “as the crow flies”. That distance, together with the timestamps on the two transactions, gives us the required speed.
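The distance and speed calculation described above can be sketched in plain Java. This is a minimal sketch, not the project's actual code: it assumes the haversine formula for the great-circle distance and a mean Earth radius of 6371 km, and the class and method names (AtmDistance, distanceKm, requiredSpeedKmh) are hypothetical.

```java
final class AtmDistance {
    // Assumption: mean Earth radius in kilometres.
    private static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle ("as the crow flies") distance between two points, in km (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** Speed in km/h needed to cover distanceKm within elapsedMillis. */
    static double requiredSpeedKmh(double distanceKm, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            // No time elapsed: any non-zero distance is unreachable.
            return distanceKm > 0 ? Double.POSITIVE_INFINITY : 0.0;
        }
        double elapsedHours = elapsedMillis / 3_600_000.0;
        return distanceKm / elapsedHours;
    }

    public static void main(String[] args) {
        // Two made-up points one degree of longitude apart on the equator.
        double km = distanceKm(0.0, 0.0, 0.0, 1.0);
        System.out.println("distance km: " + km);
        System.out.println("speed km/h over 5 min: " + requiredSpeedKmh(km, 5 * 60 * 1000L));
    }
}
```

The straight-line distance is a lower bound on the real travel distance, so a speed that is already implausible along this line is implausible by road as well.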

Sample Data:

Here is a sample of the data we expect to receive on our topic. It should include the geo location of the ATM in addition to the data mentioned above.
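In Java, a record of this shape could be modelled roughly as follows. This is a sketch only: the field names and types are assumptions made for illustration, and the project's actual AtmTransaction class may differ.

```java
import java.math.BigDecimal;

// Hypothetical value class: one ATM transaction plus the ATM's coordinates.
final class AtmTransaction {
    final String transactionId;
    final String accountNo;
    final String atmId;
    final long timestampMillis;   // time of transaction, epoch milliseconds (assumed representation)
    final BigDecimal amount;
    final double latitude;        // ATM location
    final double longitude;

    AtmTransaction(String transactionId, String accountNo, String atmId,
                   long timestampMillis, BigDecimal amount,
                   double latitude, double longitude) {
        this.transactionId = transactionId;
        this.accountNo = accountNo;
        this.atmId = atmId;
        this.timestampMillis = timestampMillis;
        this.amount = amount;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    @Override
    public String toString() {
        return "AtmTransaction{txn=" + transactionId + ", account=" + accountNo
                + ", atm=" + atmId + ", ts=" + timestampMillis + ", amount=" + amount
                + ", lat=" + latitude + ", lon=" + longitude + "}";
    }
}
```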

Implementation:

Create Project:

First, we need to create a project for our application. I am using Maven as my build tool, so I created a pom.xml file inside my project directory and added kafka-streams as a dependency.

You can find the entire project on GitHub.

Main Class:

ATMFraud.java contains the main method for the application. It starts off by creating the set of properties needed to connect to the message broker.
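As a rough illustration of such a property set, the sketch below fills in only the two settings Kafka Streams requires, application.id and bootstrap.servers. The helper name and the example values are assumptions; the original project may set additional properties, and in Kafka Streams code these keys are usually referenced through the StreamsConfig constants APPLICATION_ID_CONFIG and BOOTSTRAP_SERVERS_CONFIG rather than as string literals.

```java
import java.util.Properties;

final class StreamsProps {
    /** Builds the minimal configuration a Kafka Streams application needs. */
    static Properties build(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        // Identifies this application; also used as a prefix for internal topics.
        props.setProperty("application.id", applicationId);
        // Comma-separated host:port list of brokers to connect to.
        props.setProperty("bootstrap.servers", bootstrapServers);
        return props;
    }

    public static void main(String[] args) {
        // Assumed values: a made-up application id and a broker on the default local port.
        Properties props = build("atm-fraud-detector", "localhost:9092");
        props.list(System.out);
    }
}
```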

Note that I am running Kafka separately on my machine. The application code here does not start or stop Kafka; that is something I plan to add later.

The StreamsBuilder class lets us connect to a Kafka topic and read its contents as a stream. Below, we connect to the topic my_atm_txns_gess as a stream and filter out any null messages. We then repartition the stream so that it has a useful key, i.e. the account ID.

The stream method above accepts:

  • The name of the topic to connect to.
  • Key and value Serdes (Kafka's short name for a serializer and deserializer pair).
  • An object of a class implementing TimestampExtractor, required for correct windowing of the messages.
  • An AutoOffsetReset, which lets you choose where to start reading messages from, i.e. the beginning or the end.
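A timestamp extractor has to hand Kafka Streams the event time of each record as epoch milliseconds, because windowing is based on that value rather than on arrival time. The core of such an extractor can be sketched in plain Java as below. The sketch assumes the transaction time arrives as an ISO-8601 string, which the source does not confirm, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

final class TxnTimestamps {
    /**
     * Converts a transaction time to epoch milliseconds.
     * Assumption: the time is an ISO-8601 instant such as "2019-04-16T10:15:30Z".
     * Falls back to fallbackMillis when the value is missing or malformed.
     */
    static long toEpochMillis(String isoTimestamp, long fallbackMillis) {
        if (isoTimestamp == null) {
            return fallbackMillis;
        }
        try {
            return Instant.parse(isoTimestamp).toEpochMilli();
        } catch (DateTimeParseException e) {
            return fallbackMillis;
        }
    }
}
```

A real implementation would call logic like this from the extract method of its TimestampExtractor, using the timestamp Kafka attached to the record as the fallback.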

AtmTransaction is the class representing the data in the topic, and AtmTransactionSerde is the serde for this class. To find two transactions on the same account, we join the stream with a copy of itself. We use the map method to create the copy.

Now we can join these two streams to get a JoinedAtmTransactions. For each transaction in the first stream, the join should consider:

  • all transactions in the second stream up to 10 minutes after the current transaction
  • no transactions before the timestamp of the current transaction.

This ensures that each earlier transaction is paired with the later one only once. We ignore the same pair joined the other way round.
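The one-sided window described above boils down to a simple timestamp check, sketched here in plain Java. In the Kafka Streams DSL a window like this is typically described with a JoinWindows object, whose before and after bounds can be set separately; the exact call used in the project is not shown here, and the class and method names below are hypothetical. Treating both bounds as inclusive is an assumption of this sketch.

```java
import java.time.Duration;

final class JoinWindowSketch {
    static final Duration WINDOW = Duration.ofMinutes(10);

    /**
     * True if a record with timestamp otherTs (second stream) may be joined with
     * a record with timestamp currentTs (first stream): it must not be earlier,
     * and must be at most 10 minutes later. Timestamps are epoch milliseconds.
     */
    static boolean joinable(long currentTs, long otherTs) {
        long diff = otherTs - currentTs;
        return diff >= 0 && diff <= WINDOW.toMillis();
    }
}
```

Because the stream is joined with a copy of itself, a difference of zero also matches each transaction with itself. Such pairs share the same ATM, so the "different ATMs" criterion discards them later.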

Joined Atm Transactions:

I structured the JoinedAtmTransactions class as below. Its calculateDistanceInKM method calculates the distance between the given coordinates.

Filter the join results:

We are now ready to filter the joinedStream to find the suspicious transactions. We will use the criteria mentioned in the approach section.
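A minimal sketch of such a filter predicate, in plain Java, is shown below. It takes the distance as an input, on the assumption that it has already been computed (for example by a method like calculateDistanceInKM). The speed threshold of 100 km/h is an arbitrary value chosen for illustration, and the class and method names are hypothetical.

```java
final class SuspicionFilter {
    static final long WINDOW_MILLIS = 10 * 60 * 1000L;
    // Assumption: anything faster than this between two ATMs is implausible.
    static final double MAX_SPEED_KMH = 100.0;

    /** Applies the three criteria to a pair of transactions on the same account. */
    static boolean isSuspicious(String atmA, long tsA, String atmB, long tsB, double distanceKm) {
        long elapsedMillis = Math.abs(tsB - tsA);
        if (atmA.equals(atmB)) {
            return false;                       // same ATM: not suspicious
        }
        if (elapsedMillis > WINDOW_MILLIS) {
            return false;                       // more than 10 minutes apart
        }
        if (elapsedMillis == 0) {
            return distanceKm > 0;              // two places at the same instant
        }
        double requiredSpeedKmh = distanceKm / (elapsedMillis / 3_600_000.0);
        return requiredSpeedKmh > MAX_SPEED_KMH;
    }

    public static void main(String[] args) {
        // 50 km in 5 minutes needs 600 km/h, so this pair is flagged.
        System.out.println(isSuspicious("ATM-1", 0L, "ATM-2", 300_000L, 50.0));
    }
}
```

A predicate of this kind can be passed to the filter operation on the joined stream, so that only pairs meeting all three criteria pass through.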

Entries in the filtered stream are suspicious transactions. We will post them to the my_atm_txns_fraudulent topic. Some other application can listen to this topic and take the necessary action. For demo purposes, I will print these transactions before posting them to the my_atm_txns_fraudulent topic. I can do that using the mapValues operation, as below.

Connect and start listening to messages:

This last piece of code is very important: it is what makes the application start listening to the input topic. We first build the topology of the streams and print it to the console. Then we use the topology and the properties to create an instance of KafkaStreams. The shutdown hook lets us close the streams before the application exits. Lastly, we start the KafkaStreams instance and wait until the program is asked to stop.
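The shutdown-hook part of this step follows a common pattern that can be sketched without any Kafka dependency. In the sketch below, the AutoCloseable parameter stands in for the KafkaStreams instance and the CountDownLatch is what the main thread waits on. The class and method names are hypothetical, and the project's own code may be arranged differently.

```java
import java.util.concurrent.CountDownLatch;

final class GracefulShutdown {
    /**
     * Registers a JVM shutdown hook that closes the given resource and then
     * releases the latch, so a main thread blocked in latch.await() can exit.
     * Returns the hook thread so callers can remove it if needed.
     */
    static Thread register(AutoCloseable streams, CountDownLatch latch) {
        Thread hook = new Thread(() -> {
            try {
                streams.close();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                latch.countDown();
            }
        }, "streams-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

With this in place, the main method would create a latch with a count of one, register the hook, start the streams, and then call latch.await() so that the process keeps running until it receives a stop signal such as Ctrl+C.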

Result:

The application can be run using the command mvn clean package exec:java -Dexec.mainClass=myapps.ATMFraud. The image below shows the output you should get when the ATM transactions shown are posted to Kafka. The console at the top is running the Kafka console consumer, listening to the source topic, i.e. my_atm_txns_gess. The console at the bottom is doing the same for the result topic, i.e. my_atm_txns_fraudulent.

Four transactions are posted to the source topic, of which the last two are on the same account and fit the suspicious activity criteria. As you can see in the bottom console, our application correctly identified the suspicious activity and posted it to the result topic.


Author Sameer Kulkarni
