Fraudulent Account Detection
The file users.csv contains data on a subset of fictional banking users. The abbreviation 'KYC' stands for 'Know Your Customer' - a process of identifying and verifying the client's identity when opening an account and periodically over time. The variable IS_FRAUDSTER from this dataset is your target variable. The file transactions.csv contains details of fictional transactions of these users.
Project and data procured from StrataScratch.
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics, preprocessing
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.neighbors import KNeighborsClassifier
from datetime import datetime
#date time
now = datetime.now()
print('Analysis on', now.strftime('%Y-%m-%d'), 'at', now.strftime('%H:%M'))
Analysis on 2024-07-27 at 09:09
Data Exploration
As shown through our initial exploration below, we are starting with two dataframes derived from the provided CSV files. I'm also including a third dataframe that maps currencies to their countries of origin, generated from the UPS Country/Territory and Currency Codes document. This gives us three starting dataframes:
Dataframe | Source | Description |
---|---|---|
transactions | transactions.csv | Each row represents an individual transaction with various details, including user, currency type, amount, and other relevant information. |
users | users.csv | Each row represents a single user with various details related to the user's account, including the IS_FRAUDSTER identifier, which is our target. |
ccy | ups_currencies.csv | Each row represents a single currency type used in different countries. |
The goal of this section and the next is to understand the relevant features in each dataframe and to combine those details into a single resource for analysis.
# load data
transactions = pd.read_csv('data/FRAUD-transactions.csv')
users = pd.read_csv('data/FRAUD-users.csv')
ccy = pd.read_csv('data/FRAUD-ups_currencies.csv')
# set names for dataframes for reference
transactions.name = 'Transactions'
users.name = 'Users'
ccy.name = 'CCY'
# create a list of dataframes for iteration
dataframes = [transactions, users, ccy]
# get info for each dataframe
for df in dataframes: df.info(); print()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 688651 entries, 0 to 688650
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CURRENCY 688651 non-null object
1 AMOUNT 688651 non-null int64
2 STATE 688651 non-null object
3 CREATED_DATE 688651 non-null object
4 MERCHANT_CATEGORY 223065 non-null object
5 MERCHANT_COUNTRY 483055 non-null object
6 ENTRY_METHOD 688651 non-null object
7 USER_ID 688651 non-null object
8 TYPE 688651 non-null object
9 SOURCE 688651 non-null object
10 ID 688651 non-null object
11 AMOUNT_USD 635328 non-null float64
dtypes: float64(1), int64(1), object(10)
memory usage: 63.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9944 entries, 0 to 9943
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 9944 non-null object
1 HAS_EMAIL 9944 non-null int64
2 PHONE_COUNTRY 9944 non-null object
3 IS_FRAUDSTER 9944 non-null bool
4 TERMS_VERSION 8417 non-null object
5 CREATED_DATE 9944 non-null object
6 STATE 9944 non-null object
7 COUNTRY 9944 non-null object
8 BIRTH_YEAR 9944 non-null int64
9 KYC 9944 non-null object
10 FAILED_SIGN_IN_ATTEMPTS 9944 non-null int64
dtypes: bool(1), int64(3), object(7)
memory usage: 786.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 COUNTRY 250 non-null object
1 UPS_Code 249 non-null object
2 IATA_Code 249 non-null object
3 CCY 250 non-null object
dtypes: object(4)
memory usage: 7.9+ KB
# function to get unique and null values for each column in a dataframe
def matrix(dataframe):
    temp = [[dataframe[column].nunique() for column in dataframe.columns],
            [dataframe[column].isnull().sum() for column in dataframe.columns]]
    print(dataframe.name, 'has', len(dataframe), 'total values; column details:\n')
    print(pd.DataFrame(temp, columns=dataframe.columns,
                       index=['Unique Values', 'Null Values']).transpose())
# iterate through dataframes to get unique and null values for each column
for df in dataframes: matrix(df); print('\n--------------------------\n')
Transactions has 688651 total values; column details:
Unique Values Null Values
CURRENCY 29 0
AMOUNT 32257 0
STATE 7 0
CREATED_DATE 688640 0
MERCHANT_CATEGORY 115 465586
MERCHANT_COUNTRY 344 205596
ENTRY_METHOD 6 0
USER_ID 8021 0
TYPE 5 0
SOURCE 11 0
ID 688651 0
AMOUNT_USD 37402 53323
--------------------------
Users has 9944 total values; column details:
Unique Values Null Values
ID 9944 0
HAS_EMAIL 2 0
PHONE_COUNTRY 83 0
IS_FRAUDSTER 2 0
TERMS_VERSION 7 1527
CREATED_DATE 9944 0
STATE 2 0
COUNTRY 46 0
BIRTH_YEAR 69 0
KYC 4 0
FAILED_SIGN_IN_ATTEMPTS 5 0
--------------------------
CCY has 250 total values; column details:
Unique Values Null Values
COUNTRY 250 0
UPS_Code 249 1
IATA_Code 224 1
CCY 139 0
--------------------------
transactions.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
CURRENCY | GBP | PLN | GBP |
AMOUNT | 4420 | 1500 | 191 |
STATE | COMPLETED | COMPLETED | COMPLETED |
CREATED_DATE | 2017-12-10 16:38:55.577 | 2017-12-10 16:37:24.792 | 2017-12-10 16:37:16.234 |
MERCHANT_CATEGORY | NaN | point_of_interest | airport |
MERCHANT_COUNTRY | NLD | POL | PRT |
ENTRY_METHOD | chip | manu | chip |
USER_ID | 3ff52b92-d416-4e22-8cad-018f500d4bbc | 76cbaad3-4721-4a3b-92b9-3eb9e9319565 | 7bcaa34e-b889-4582-9c29-0b3bab34fb8c |
TYPE | ATM | CARD_PAYMENT | CARD_PAYMENT |
SOURCE | GAIA | GAIA | GAIA |
ID | 367bf5f9-7cce-4683-90b9-d3c011bf4c87 | ff6802b9-360d-4efe-b09b-f99c6cac3383 | ddb4a930-7d8a-4f38-9079-ddc4b0db980e |
AMOUNT_USD | 3268.0 | NaN | 141.0 |
users.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
HAS_EMAIL | 1 | 1 | 1 |
PHONE_COUNTRY | GB||JE||IM||GG | GB||JE||IM||GG | ES |
IS_FRAUDSTER | False | False | False |
TERMS_VERSION | 2018-05-25 | 2018-01-01 | 2018-09-20 |
CREATED_DATE | 2017-08-06 07:33:33.341000 | 2017-03-07 10:18:59.427000 | 2018-05-31 04:41:24.672000 |
STATE | ACTIVE | ACTIVE | ACTIVE |
COUNTRY | GB | GB | ES |
BIRTH_YEAR | 1971 | 1982 | 1973 |
KYC | PASSED | PASSED | PASSED |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
ccy.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
COUNTRY | Afghanistan | Aland Islands | Albania |
UPS_Code | AF | AX | AL |
IATA_Code | AF | AX | AL |
CCY | USD | EUR | EUR |
Before getting too much further, I'm going to check the user's account state, number of failed sign-in attempts, and KYC status against whether the user is a fraudster. The goal is to identify whether any of these maps one-to-one to the target, in which case the variable would simply restate the label rather than help predict it.
As shown below, account state maps 1:1 to the target (all fraud accounts are LOCKED and all non-fraud accounts are ACTIVE), so we won't use that variable. The other two (KYC and failed sign-in attempts) remain potential options.
# check for connection between STATE and IS_FRAUDSTER
fraudsters = users[users['IS_FRAUDSTER'] == True]
not_fraudsters = users[users['IS_FRAUDSTER'] == False]
print('Account states for fraud accounts:',fraudsters['STATE'].unique(),
'\nAccount states for non-fraud accounts:', not_fraudsters['STATE'].unique(),
'\n\nKYC for fraud accounts:',fraudsters['KYC'].unique(),
'\nKYC for non-fraud accounts:', not_fraudsters['KYC'].unique(),
'\n\nFailed Sign In for fraud accounts:',fraudsters['FAILED_SIGN_IN_ATTEMPTS'].unique(),
'\nFailed Sign In for non-fraud accounts:', not_fraudsters['FAILED_SIGN_IN_ATTEMPTS'].unique())
Account states for fraud accounts: ['LOCKED']
Account states for non-fraud accounts: ['ACTIVE']
KYC for fraud accounts: ['NONE' 'PASSED' 'PENDING' 'FAILED']
KYC for non-fraud accounts: ['PASSED' 'NONE' 'FAILED' 'PENDING']
Failed Sign In for fraud accounts: [0 1 2]
Failed Sign In for non-fraud accounts: [0 1 3 2 6]
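The same check can be summarized at a glance with a cross-tabulation; this is a small sketch for illustration rather than part of the pipeline above.
# sketch: cross-tabulate account state against the fraud label
print(pd.crosstab(users['STATE'], users['IS_FRAUDSTER']))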
From here we'll begin feature engineering using the three remaining dataframes, and continue unifying all relevant data into a single dataframe.
Feature Engineering
For the analysis, I'm proposing the following initial features as potential predictors:

Column | Type | Description |
---|---|---|
IS_FRAUDSTER | bool | user identified as a fraudster (our target) |
HAS_EMAIL | bool | user has a registered email |
FAILED_SIGN_IN_ATTEMPTS | cont | count of failed sign-in attempts |
KYC_[STATUS] | dummy/bool | user's stage in identity verification |
PHONE_COUNTRY_MATCH | bool | user's listed country matches their phone's country code |
PHONE_MULTIPLE_COUNTRIES | bool | user has phone numbers in multiple countries |
RECENT_TERMS | bool | user has accepted the most recent terms of service |
CREATED_TO_FIRST | cont | minutes between account creation and first transaction |
TOTAL_TRANSACTION_COUNT | int | count of user's total transactions |
TOTAL_TRANSACTION_SUM | cont | sum of user's total transaction amounts |
MEDIAN_TRANSACTION_AMOUNT | cont | median transaction amount |
AVG_DAILY_TRANSACTION_COUNT | cont | average transactions made per day by the user |
INTL_TRANSACTION_PCT | cont | percent of the user's transactions that don't match their listed country's currency |
TYPE_[TYPE] | cont | percent of transactions of each transaction type |
METHOD_[METHOD] | cont | percent of transactions attributed to each entry method |
Note: each sample represents a single user. Additionally, not all of these features will be used in the final analysis; we are simply generating them as a starting point for identifying the most relevant ones.
This first section takes care of the low-hanging fruit:
* Updating data types and renaming columns as needed
* Creating the analysis dataframe with features that are already ready for use
* Generating new features according to the following:
  * If the user's country and the country code for their phone match
  * If the user has more than one phone country listed
  * If the user has accepted the most recent terms of service
# update date types to datetime for calculating
users['TERMS_VERSION'] = pd.to_datetime(users['TERMS_VERSION'])
users['CREATED_DATE'] = pd.to_datetime(users['CREATED_DATE'])
transactions['CREATED_DATE'] = pd.to_datetime(transactions['CREATED_DATE'], format='mixed')
# begin dataframe with columns that don't need modification
df = pd.DataFrame(users.loc[:,['ID', 'IS_FRAUDSTER', 'HAS_EMAIL',
'KYC', 'FAILED_SIGN_IN_ATTEMPTS']])
# rename ID to simplify later merging
df.rename(columns={'ID':'USER_ID'}, inplace=True)
# update data types & create dummy variables
df['HAS_EMAIL'] = df['HAS_EMAIL'].astype(bool)
df = pd.get_dummies(df, columns=['KYC'], prefix='KYC', drop_first=False)
# get details for columns that need simple adjustments
df['PHONE_COUNTRY_MATCH'] = users['COUNTRY'] == users['PHONE_COUNTRY']
df['PHONE_MULTIPLE_COUNTRIES'] = users['PHONE_COUNTRY'].str.len() > 2
df['RECENT_TERMS'] = users['TERMS_VERSION'] == users['TERMS_VERSION'].max()
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
USER_ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
IS_FRAUDSTER | False | False | False |
HAS_EMAIL | True | True | True |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
KYC_FAILED | False | False | False |
KYC_NONE | False | False | False |
KYC_PASSED | True | True | True |
KYC_PENDING | False | False | False |
PHONE_COUNTRY_MATCH | False | False | True |
PHONE_MULTIPLE_COUNTRIES | True | True | False |
RECENT_TERMS | False | False | True |
This next section deals with slightly more complex, yet still straightforward, feature engineering, used either directly or to assist with other features:
* Calculating the time from account creation to the first transaction
* Getting the total number of transactions for each user
* Getting the total sum (in USD) of each user's transactions (note: we're using USD to avoid the unit discrepancies of mixing multiple currencies)
* Identifying the user's median transaction amount (note: we're avoiding the mean transaction amount since it can be derived from the sum and count of transactions and would therefore be collinear with those features)
* Getting the average number of transactions made per day by the user
# created to first transaction
create_to_first = transactions.groupby('USER_ID')['CREATED_DATE'].min().reset_index()
create_to_first['USER_CREATED'] = create_to_first['USER_ID'].map(users.set_index('ID')['CREATED_DATE'])
create_to_first['CREATED_TO_FIRST'] = \
((create_to_first['CREATED_DATE'] - create_to_first['USER_CREATED']).dt.total_seconds())/60
create_to_first = create_to_first.drop(columns=['CREATED_DATE', 'USER_CREATED'])
df = pd.merge(df, create_to_first, on='USER_ID', how='left')
# total transaction count
total_transaction_count = transactions.groupby('USER_ID')['ID'].count().reset_index()
total_transaction_count.rename(columns={'ID': 'TOTAL_TRANSACTION_COUNT'}, inplace=True)
df = pd.merge(df, total_transaction_count, on='USER_ID', how='left')
# sum of transaction amounts
total_transaction__sum = transactions.groupby('USER_ID')['AMOUNT_USD'].sum().reset_index()
total_transaction__sum.rename(columns={'AMOUNT_USD': 'TOTAL_TRANSACTION_SUM'}, inplace=True)
df = pd.merge(df, total_transaction__sum, on='USER_ID', how='left')
# median transaction amount
median_transaction_amount = transactions.groupby('USER_ID')['AMOUNT_USD'].median().reset_index()
median_transaction_amount.rename(columns={'AMOUNT_USD': 'MEDIAN_TRANSACTION_AMOUNT'}, inplace=True)
df = pd.merge(df, median_transaction_amount, on='USER_ID', how='left')
# avg daily transaction count
# (note: this groups on the full CREATED_DATE timestamp rather than the calendar date,
# so most groups contain a single transaction)
avg_daily_transaction_count = transactions.groupby(['USER_ID', 'CREATED_DATE'])['ID'].count().reset_index()
avg_daily_transaction_count = avg_daily_transaction_count.groupby('USER_ID')['ID'].mean().reset_index()
avg_daily_transaction_count.rename(columns={'ID': 'AVG_DAILY_TRANSACTION_COUNT'}, inplace=True)
df = pd.merge(df, avg_daily_transaction_count, on='USER_ID', how='left')
# review results of all the merges
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
USER_ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
IS_FRAUDSTER | False | False | False |
HAS_EMAIL | True | True | True |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
KYC_FAILED | False | False | False |
KYC_NONE | False | False | False |
KYC_PASSED | True | True | True |
KYC_PENDING | False | False | False |
PHONE_COUNTRY_MATCH | False | False | True |
PHONE_MULTIPLE_COUNTRIES | True | True | False |
RECENT_TERMS | False | False | True |
CREATED_TO_FIRST | 8923.9711 | 1.894117 | 15479.922883 |
TOTAL_TRANSACTION_COUNT | 6.0 | 11.0 | 60.0 |
TOTAL_TRANSACTION_SUM | 61876.0 | 43922.0 | 114328.0 |
MEDIAN_TRANSACTION_AMOUNT | 1458.0 | 750.0 | 156.0 |
AVG_DAILY_TRANSACTION_COUNT | 1.0 | 1.0 | 1.0 |
Now we move into the more in-depth feature engineering, starting with identifying the percent of international transactions per user. To do so, we first need to get counts of the purchases made per user, divided by unique currency codes.
# group transactions by user and currency
international_transactions = transactions.groupby(['USER_ID', 'CURRENCY'])['ID'].nunique().reset_index()
international_transactions.rename(columns={'ID': 'TRANSACTION_COUNT', 'CURRENCY': 'CCY'}, inplace=True)
# merge with relevant columns from users dataframe
international_transactions = pd.merge(international_transactions,
users[['ID', 'COUNTRY']], left_on='USER_ID', right_on='ID', how='left')
international_transactions.rename(columns={'COUNTRY': 'USER_COUNTRY'}, inplace=True)
# check for null values
international_transactions.isnull().sum()
USER_ID 0
CCY 0
TRANSACTION_COUNT 0
ID 581
USER_COUNTRY 581
dtype: int64
# get unique user_ids for grouped transactions with null values
no_id = international_transactions[international_transactions['ID'].isnull()]
no_id = no_id['USER_ID'].unique()
# compare to users dataframe
users[users['ID'].isin(no_id)]
ID | HAS_EMAIL | PHONE_COUNTRY | IS_FRAUDSTER | TERMS_VERSION | CREATED_DATE | STATE | COUNTRY | BIRTH_YEAR | KYC | FAILED_SIGN_IN_ATTEMPTS |
---|---|---|---|---|---|---|---|---|---|---|
(no matching rows)
Based on the above, there are 581 grouped transactions whose listed user IDs don't match the provided list of users. Since we have no fraud labels for these accounts, we're dropping those rows.
# drop null values; transactions without registered user information
international_transactions.dropna(inplace=True)
# merge with ccy data for country information, and update column name
# (note: currencies shared by several countries, e.g. EUR, match multiple ccy rows,
# so this merge can yield more than one row per user/currency pair)
international_transactions = pd.merge(international_transactions, ccy[['CCY', 'UPS_Code']], on='CCY', how='left')
international_transactions.rename(columns={'UPS_Code': 'CCY_COUNTRY'}, inplace=True)
# check for null values
international_transactions.isnull().sum()
USER_ID 0
CCY 0
TRANSACTION_COUNT 0
ID 0
USER_COUNTRY 0
CCY_COUNTRY 341
dtype: int64
# identify CCY for which CCY_COUNTRY is null
ccy_fix = list(international_transactions[international_transactions['CCY_COUNTRY'].isnull()]['CCY'].unique())
ccy_fix
['RON', 'BTC', 'ETH', 'LTC', 'XRP']
A quick Google search identified RON as the Romanian new leu and the other four as cryptocurrencies.
# update country codes for cryptocurrencies
crypto = ['BTC', 'ETH', 'LTC', 'XRP']
international_transactions.loc[international_transactions['CCY'].isin(crypto), 'CCY_COUNTRY'] = 'Crypto'
# update country codes for RON
international_transactions.loc[international_transactions['CCY'] == 'RON', 'CCY_COUNTRY'] = 'RO'
# check for null values
international_transactions.isnull().sum()
USER_ID 0
CCY 0
TRANSACTION_COUNT 0
ID 0
USER_COUNTRY 0
CCY_COUNTRY 0
dtype: int64
Now that the transactions are grouped and all needed information is connected, we can figure out the percent of international transactions per user.
# check for matching country codes, group
international_transactions['MATCHES_USER_COUNTRY'] = \
international_transactions['USER_COUNTRY'] == international_transactions['CCY_COUNTRY']
# group by user and country match bool
international_transactions = \
international_transactions.groupby(['USER_ID', 'MATCHES_USER_COUNTRY'])['ID'].count().reset_index()
international_transactions.rename(columns={'ID': 'INTL_TRANSACTIONS'}, inplace=True)
# limit to international transactions
international_transactions = \
international_transactions[international_transactions['MATCHES_USER_COUNTRY'] != True]
# integrate with total transaction count
international_transactions = \
pd.merge(international_transactions, total_transaction_count, on='USER_ID', how='left')
# calculate international transaction percent
international_transactions['INTL_TRANSACTION_PCT'] = \
international_transactions['INTL_TRANSACTIONS'] / international_transactions['TOTAL_TRANSACTION_COUNT']
international_transactions = \
international_transactions[['USER_ID', 'INTL_TRANSACTION_PCT']]
# merge with main dataframe
df = pd.merge(df, international_transactions, on='USER_ID', how='left')
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
USER_ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
IS_FRAUDSTER | False | False | False |
HAS_EMAIL | True | True | True |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
KYC_FAILED | False | False | False |
KYC_NONE | False | False | False |
KYC_PASSED | True | True | True |
KYC_PENDING | False | False | False |
PHONE_COUNTRY_MATCH | False | False | True |
PHONE_MULTIPLE_COUNTRIES | True | True | False |
RECENT_TERMS | False | False | True |
CREATED_TO_FIRST | 8923.9711 | 1.894117 | 15479.922883 |
TOTAL_TRANSACTION_COUNT | 6.0 | 11.0 | 60.0 |
TOTAL_TRANSACTION_SUM | 61876.0 | 43922.0 | 114328.0 |
MEDIAN_TRANSACTION_AMOUNT | 1458.0 | 750.0 | 156.0 |
AVG_DAILY_TRANSACTION_COUNT | 1.0 | 1.0 | 1.0 |
INTL_TRANSACTION_PCT | 8.333333 | 0.545455 | 0.716667 |
This next section deals with getting the percent of transactions per user of each transaction type.
# group by user and transaction type
transaction_types = transactions.groupby(['USER_ID', 'TYPE'])['ID'].count().reset_index()
# pivot table to get transaction type counts
transaction_types = transaction_types.pivot(index='USER_ID',
columns='TYPE', values='ID')
transaction_types.columns = [f'TYPE_{col}'
for col in transaction_types.columns]
transaction_types.fillna(0, inplace=True) # in case no transactions of a certain type
# calculate the sum of all types for each user
row_totals = transaction_types.sum(axis=1)
# calculate the percentage for each type per user
transaction_types = transaction_types.div(row_totals, axis=0).reset_index()
# merge with main dataframe
df = pd.merge(df, transaction_types, on='USER_ID', how='left')
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
USER_ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
IS_FRAUDSTER | False | False | False |
HAS_EMAIL | True | True | True |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
KYC_FAILED | False | False | False |
KYC_NONE | False | False | False |
KYC_PASSED | True | True | True |
KYC_PENDING | False | False | False |
PHONE_COUNTRY_MATCH | False | False | True |
PHONE_MULTIPLE_COUNTRIES | True | True | False |
RECENT_TERMS | False | False | True |
CREATED_TO_FIRST | 8923.9711 | 1.894117 | 15479.922883 |
TOTAL_TRANSACTION_COUNT | 6.0 | 11.0 | 60.0 |
TOTAL_TRANSACTION_SUM | 61876.0 | 43922.0 | 114328.0 |
MEDIAN_TRANSACTION_AMOUNT | 1458.0 | 750.0 | 156.0 |
AVG_DAILY_TRANSACTION_COUNT | 1.0 | 1.0 | 1.0 |
INTL_TRANSACTION_PCT | 8.333333 | 0.545455 | 0.716667 |
TYPE_ATM | 0.166667 | 0.0 | 0.0 |
TYPE_BANK_TRANSFER | 0.0 | 0.0 | 0.0 |
TYPE_CARD_PAYMENT | 0.333333 | 0.454545 | 0.5 |
TYPE_P2P | 0.0 | 0.181818 | 0.383333 |
TYPE_TOPUP | 0.5 | 0.363636 | 0.116667 |
Similar to how we generated the transaction type percentages, we now do the same for entry methods.
# group by user and entry method
transaction_method = transactions.groupby(['USER_ID', 'ENTRY_METHOD'])['ID'].count().reset_index()
# pivot table to get transaction method counts
transaction_method = transaction_method.pivot(index='USER_ID', columns='ENTRY_METHOD', values='ID')
transaction_method.columns = [f'METHOD_{col}' for col in transaction_method.columns]
transaction_method.fillna(0, inplace=True) # in case no transactions of a certain entry method
# calculate the sum of all types for each user
row_totals = transaction_method.sum(axis=1)
# calculate the percentage for each type per user
transaction_method = transaction_method.div(row_totals, axis=0).reset_index()
# merge with main dataframe
df = pd.merge(df, transaction_method, on='USER_ID', how='left')
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
USER_ID | 1872820f-e3ac-4c02-bdc7-727897b60043 | 545ff94d-66f8-4bea-b398-84425fb2301e | 10376f1a-a28a-4885-8daa-c8ca496026bb |
IS_FRAUDSTER | False | False | False |
HAS_EMAIL | True | True | True |
FAILED_SIGN_IN_ATTEMPTS | 0 | 0 | 0 |
KYC_FAILED | False | False | False |
KYC_NONE | False | False | False |
KYC_PASSED | True | True | True |
KYC_PENDING | False | False | False |
PHONE_COUNTRY_MATCH | False | False | True |
PHONE_MULTIPLE_COUNTRIES | True | True | False |
RECENT_TERMS | False | False | True |
CREATED_TO_FIRST | 8923.9711 | 1.894117 | 15479.922883 |
TOTAL_TRANSACTION_COUNT | 6.0 | 11.0 | 60.0 |
TOTAL_TRANSACTION_SUM | 61876.0 | 43922.0 | 114328.0 |
MEDIAN_TRANSACTION_AMOUNT | 1458.0 | 750.0 | 156.0 |
AVG_DAILY_TRANSACTION_COUNT | 1.0 | 1.0 | 1.0 |
INTL_TRANSACTION_PCT | 8.333333 | 0.545455 | 0.716667 |
TYPE_ATM | 0.166667 | 0.0 | 0.0 |
TYPE_BANK_TRANSFER | 0.0 | 0.0 | 0.0 |
TYPE_CARD_PAYMENT | 0.333333 | 0.454545 | 0.5 |
TYPE_P2P | 0.0 | 0.181818 | 0.383333 |
TYPE_TOPUP | 0.5 | 0.363636 | 0.116667 |
METHOD_chip | 0.333333 | 0.181818 | 0.15 |
METHOD_cont | 0.166667 | 0.272727 | 0.033333 |
METHOD_mags | 0.0 | 0.0 | 0.016667 |
METHOD_manu | 0.0 | 0.0 | 0.3 |
METHOD_mcon | 0.0 | 0.0 | 0.0 |
METHOD_misc | 0.5 | 0.545455 | 0.5 |
The last steps in this portion are to:
* Remove USER_ID, which up to now was used for merging purposes but is now superfluous
* Drop rows with missing values, which arise for users with no matching transaction records, and fill any remaining gaps with 0
* Convert bool values to int to ensure we can calculate appropriately
# identify bool columns
bool_columns = df.select_dtypes(include=bool).columns
# convert bool columns to integers
df[bool_columns] = df[bool_columns].astype(int)
# clean up dataframe for next steps
df.drop('USER_ID', axis=1, inplace=True)
df = df.dropna(how='any', axis=0) # drop rows with missing values (users with no matching transaction records)
df.fillna(0, inplace=True) # safeguard for any remaining gaps (a no-op after the dropna above)
df.isna().sum()
IS_FRAUDSTER 0
HAS_EMAIL 0
FAILED_SIGN_IN_ATTEMPTS 0
KYC_FAILED 0
KYC_NONE 0
KYC_PASSED 0
KYC_PENDING 0
PHONE_COUNTRY_MATCH 0
PHONE_MULTIPLE_COUNTRIES 0
RECENT_TERMS 0
CREATED_TO_FIRST 0
TOTAL_TRANSACTION_COUNT 0
TOTAL_TRANSACTION_SUM 0
MEDIAN_TRANSACTION_AMOUNT 0
AVG_DAILY_TRANSACTION_COUNT 0
INTL_TRANSACTION_PCT 0
TYPE_ATM 0
TYPE_BANK_TRANSFER 0
TYPE_CARD_PAYMENT 0
TYPE_P2P 0
TYPE_TOPUP 0
METHOD_chip 0
METHOD_cont 0
METHOD_mags 0
METHOD_manu 0
METHOD_mcon 0
METHOD_misc 0
dtype: int64
# check head as a final precaution
df.head(3).transpose()
0 | 1 | 2 | |
---|---|---|---|
IS_FRAUDSTER | 0.000000 | 0.000000 | 0.000000 |
HAS_EMAIL | 1.000000 | 1.000000 | 1.000000 |
FAILED_SIGN_IN_ATTEMPTS | 0.000000 | 0.000000 | 0.000000 |
KYC_FAILED | 0.000000 | 0.000000 | 0.000000 |
KYC_NONE | 0.000000 | 0.000000 | 0.000000 |
KYC_PASSED | 1.000000 | 1.000000 | 1.000000 |
KYC_PENDING | 0.000000 | 0.000000 | 0.000000 |
PHONE_COUNTRY_MATCH | 0.000000 | 0.000000 | 1.000000 |
PHONE_MULTIPLE_COUNTRIES | 1.000000 | 1.000000 | 0.000000 |
RECENT_TERMS | 0.000000 | 0.000000 | 1.000000 |
CREATED_TO_FIRST | 8923.971100 | 1.894117 | 15479.922883 |
TOTAL_TRANSACTION_COUNT | 6.000000 | 11.000000 | 60.000000 |
TOTAL_TRANSACTION_SUM | 61876.000000 | 43922.000000 | 114328.000000 |
MEDIAN_TRANSACTION_AMOUNT | 1458.000000 | 750.000000 | 156.000000 |
AVG_DAILY_TRANSACTION_COUNT | 1.000000 | 1.000000 | 1.000000 |
INTL_TRANSACTION_PCT | 8.333333 | 0.545455 | 0.716667 |
TYPE_ATM | 0.166667 | 0.000000 | 0.000000 |
TYPE_BANK_TRANSFER | 0.000000 | 0.000000 | 0.000000 |
TYPE_CARD_PAYMENT | 0.333333 | 0.454545 | 0.500000 |
TYPE_P2P | 0.000000 | 0.181818 | 0.383333 |
TYPE_TOPUP | 0.500000 | 0.363636 | 0.116667 |
METHOD_chip | 0.333333 | 0.181818 | 0.150000 |
METHOD_cont | 0.166667 | 0.272727 | 0.033333 |
METHOD_mags | 0.000000 | 0.000000 | 0.016667 |
METHOD_manu | 0.000000 | 0.000000 | 0.300000 |
METHOD_mcon | 0.000000 | 0.000000 | 0.000000 |
METHOD_misc | 0.500000 | 0.545455 | 0.500000 |
Feature Selection
# establish features and target
all_features = list(df.columns[1:])
target = df.columns[0]
# set X and y
y = df[target]
X = df[all_features]
# get variance inflation factor for each feature
vif = pd.DataFrame()
vif['Predictor'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
cr = df.corr()[target].round(3)
vif['Relevance'] = [cr[col] for col in X.columns] # index by column name to align correlations with the predictors
round(vif, 2).sort_values(by='Relevance', ascending=False)
Predictor | VIF | Relevance | |
---|---|---|---|
0 | HAS_EMAIL | 1.40 | 1.00 |
17 | TYPE_CARD_PAYMENT | inf | 0.17 |
16 | TYPE_BANK_TRANSFER | inf | 0.16 |
8 | RECENT_TERMS | 1.13 | 0.16 |
6 | PHONE_COUNTRY_MATCH | 8.35 | 0.11 |
13 | AVG_DAILY_TRANSACTION_COUNT | 1.00 | 0.11 |
3 | KYC_NONE | inf | 0.05 |
24 | METHOD_mcon | inf | 0.05 |
1 | FAILED_SIGN_IN_ATTEMPTS | 1.01 | 0.04 |
14 | INTL_TRANSACTION_PCT | 1.67 | 0.03 |
12 | MEDIAN_TRANSACTION_AMOUNT | 1.08 | 0.02 |
2 | KYC_FAILED | inf | 0.01 |
20 | METHOD_chip | inf | 0.00 |
5 | KYC_PENDING | inf | -0.00 |
10 | TOTAL_TRANSACTION_COUNT | 1.33 | -0.01 |
21 | METHOD_cont | inf | -0.01 |
25 | METHOD_misc | inf | -0.02 |
11 | TOTAL_TRANSACTION_SUM | 1.06 | -0.05 |
19 | TYPE_TOPUP | inf | -0.05 |
4 | KYC_PASSED | inf | -0.06 |
23 | METHOD_manu | inf | -0.07 |
15 | TYPE_ATM | inf | -0.08 |
22 | METHOD_mags | inf | -0.08 |
18 | TYPE_P2P | inf | -0.10 |
9 | CREATED_TO_FIRST | 1.02 | -0.12 |
7 | PHONE_MULTIPLE_COUNTRIES | 8.38 | -0.15 |
# multivariate feature selection via recursive feature elimination (RFE)
estimator = DecisionTreeClassifier()
selector = RFE(estimator, n_features_to_select=10, step=1).fit(X,y)
rnk = pd.DataFrame()
rnk['Feature'] = X.columns
rnk['Rank']= selector.ranking_
rnk.sort_values('Rank')
Feature | Rank | |
---|---|---|
12 | MEDIAN_TRANSACTION_AMOUNT | 1 |
23 | METHOD_manu | 1 |
20 | METHOD_chip | 1 |
19 | TYPE_TOPUP | 1 |
16 | TYPE_BANK_TRANSFER | 1 |
15 | TYPE_ATM | 1 |
9 | CREATED_TO_FIRST | 1 |
10 | TOTAL_TRANSACTION_COUNT | 1 |
11 | TOTAL_TRANSACTION_SUM | 1 |
14 | INTL_TRANSACTION_PCT | 1 |
17 | TYPE_CARD_PAYMENT | 2 |
21 | METHOD_cont | 3 |
2 | KYC_FAILED | 4 |
25 | METHOD_misc | 5 |
18 | TYPE_P2P | 6 |
5 | KYC_PENDING | 7 |
6 | PHONE_COUNTRY_MATCH | 8 |
7 | PHONE_MULTIPLE_COUNTRIES | 9 |
8 | RECENT_TERMS | 10 |
3 | KYC_NONE | 11 |
1 | FAILED_SIGN_IN_ATTEMPTS | 12 |
24 | METHOD_mcon | 13 |
4 | KYC_PASSED | 14 |
13 | AVG_DAILY_TRANSACTION_COUNT | 15 |
0 | HAS_EMAIL | 16 |
22 | METHOD_mags | 17 |
For this next section, I manually adjusted the number of retained features until all of the extremely high VIF scores were removed, keeping as many candidate variables as possible prior to hyperparameter tuning.
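As an aside, the same trial-and-error pruning could be automated by repeatedly dropping the feature with the highest VIF until every score falls below a chosen threshold. The sketch below is illustrative only; the threshold of 10 is an assumption, and this loop was not used for the results that follow.
# sketch: iteratively drop the highest-VIF feature until all scores fall below a threshold
def prune_by_vif(frame, threshold=10.0):
    cols = list(frame.columns)
    while True:
        scores = pd.Series([variance_inflation_factor(frame[cols].values, i)
                            for i in range(len(cols))], index=cols)
        if scores.max() < threshold:
            return cols
        cols.remove(scores.idxmax())  # drop the most collinear feature and re-check
# example usage (not run here): reduced_features = prune_by_vif(X)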
# isolate top features
selected_features = [feature for feature in
all_features if feature in
list(rnk.sort_values('Rank').head(8)['Feature'])]
# reset X
X = df[selected_features]
# rerun VIF with the reduced feature set
vif = pd.DataFrame()
vif['Predictor'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
cr = df.corr()[target].round(3)
vif['Relevance'] = [cr[col] for col in X.columns] # index by column name to align correlations with the predictors
round(vif, 2).sort_values(by='Relevance', ascending=False)
Predictor | VIF | Relevance | |
---|---|---|---|
0 | CREATED_TO_FIRST | 1.08 | 1.00 |
6 | METHOD_chip | 2.21 | 0.11 |
3 | TYPE_ATM | 1.84 | 0.05 |
1 | TOTAL_TRANSACTION_COUNT | 1.28 | 0.04 |
2 | MEDIAN_TRANSACTION_AMOUNT | 1.11 | 0.01 |
5 | TYPE_TOPUP | 1.27 | -0.00 |
4 | TYPE_BANK_TRANSFER | 1.18 | -0.06 |
7 | METHOD_manu | 1.13 | -0.15 |
# establish correlation matrix for intercorrelation
corr_matrix = X.corr().abs() # absolute values to make visualizing easier
# visualize a simple correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='Blues', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()
# review correlation between target and features
df[selected_features + [target]].corr()[target].sort_values().round(2)
TOTAL_TRANSACTION_COUNT -0.05
METHOD_chip -0.01
CREATED_TO_FIRST -0.01
TYPE_TOPUP 0.00
METHOD_manu 0.05
MEDIAN_TRANSACTION_AMOUNT 0.11
TYPE_ATM 0.16
TYPE_BANK_TRANSFER 0.17
IS_FRAUDSTER 1.00
Name: IS_FRAUDSTER, dtype: float64
There are a few variables that appear to be intercorrelated:
* The groupings of TYPE and METHOD of transaction: since TYPE provides more specificity and has higher correlations with the target, we'll keep TYPE for the analysis.
* KYC_PASSED and KYC_NONE: we didn't drop the first dummy for KYC, so we could see which would be most useful. Since KYC_NONE has the stronger relationship with our target, we're removing KYC_PASSED.
* PHONE_COUNTRY_MATCH and PHONE_MULTIPLE_COUNTRIES: these strongly correlate with each other and also have decent correlations with the target. However, PHONE_MULTIPLE_COUNTRIES has a slightly stronger correlation, so we'll keep that variable.
# update features to remove multicollinear features
selected_features = [feature for feature in
selected_features if feature not in
['METHOD_chip', 'METHOD_cont', 'METHOD_mags',
'METHOD_manu', 'METHOD_mcon', 'METHOD_misc',
'KYC_PASSED', 'PHONE_COUNTRY_MATCH']]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X[selected_features],
y, test_size=0.4,
random_state=17)
# determine hyperparameters
param_grid = {
'max_depth': list(range(5,20)),
'min_samples_leaf': [10],
'max_features': list(range(2,10))
}
gridSearch = GridSearchCV(DecisionTreeClassifier(random_state=17),
param_grid, cv=5, n_jobs=-1)
gridSearch.fit(X_train, y_train)
print('Score: ', gridSearch.best_score_)
best_params = gridSearch.best_params_
print('Parameters: ', best_params)
Score: 0.9596387177949902
Parameters: {'max_depth': 13, 'max_features': 2, 'min_samples_leaf': 10}
# train on tree model using best parameters
tree_model = DecisionTreeClassifier(random_state=17,
max_depth=best_params['max_depth'],
min_samples_leaf=best_params['min_samples_leaf'],
max_features=best_params['max_features'])
tree_model.fit(X_train, y_train)
# get feature importances & list of features used
feature_importances = tree_model.feature_importances_
# generate a list of features used
features_used = X_train.columns[feature_importances > 0]
list(features_used)
['CREATED_TO_FIRST',
'TOTAL_TRANSACTION_COUNT',
'MEDIAN_TRANSACTION_AMOUNT',
'TYPE_ATM',
'TYPE_BANK_TRANSFER',
'TYPE_TOPUP']
# isolate top features
selected_features = [feature for feature in
all_features if feature in
list(features_used)]
# reset X with top features
X = df[selected_features]
# resplit the data
X_train, X_test, y_train, y_test = train_test_split(X[selected_features],
y, test_size=0.4,
random_state=17)
# scale the data
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# show distribution
print('Training samples:',len(X_train),'\n',
y_train.value_counts())
print('\n')
print('Testing samples:',len(X_test),'\n',
y_test.value_counts())
Training samples: 4311
IS_FRAUDSTER
0 4128
1 183
Name: count, dtype: int64
Testing samples: 2875
IS_FRAUDSTER
0 2761
1 114
Name: count, dtype: int64
Modeling
For the modeling, I began with a Random Forest, Logistic Regression, a Decision Tree Classifier, a KNN classifier, and a stacked ensemble combining several of these. Each model performed fairly well; however, the best performer was the KNN classifier. The stacked ensemble performed similarly well, but since it didn't improve on the KNN classifier, I limited the final modeling to KNN classification to minimize computational effort.
This all assumes we want to focus on precision, minimizing false positives to protect the customer experience.
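For reference, here is a minimal sketch of how the stacked ensemble comparison might look using scikit-learn's StackingClassifier; the base estimators, meta-learner, and settings are assumptions rather than the exact configuration used in that earlier comparison.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
# sketch: stack several candidate models behind a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=17)),
                ('tree', DecisionTreeClassifier(random_state=17)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_train, y_train)
print('Stacked ensemble test precision: %.3f' % precision_score(y_test, stack.predict(X_test)))
Since precision is the stated priority, scoring='precision' is also a valid option for the GridSearchCV below in place of accuracy.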
# Define the parameter grid to search
param_grid = {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'leaf_size': [10, 30, 50]
}
# Initialize the k-NN classifier
knn_classifier = KNeighborsClassifier()
# Initialize GridSearchCV with k-NN classifier and parameter grid
grid_search = GridSearchCV(knn_classifier, param_grid, cv=5,
scoring='accuracy')
# Perform GridSearchCV to find the best combination of hyperparameters
grid_search.fit(X_train, y_train)
# Get the best k-NN classifier model
best_knn_classifier = grid_search.best_estimator_
# Use the best model to make predictions
train_predictions = best_knn_classifier.predict(X_train)
test_predictions = best_knn_classifier.predict(X_test)
# Accuracy and precision scores
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)
train_precision = round(precision_score(y_train, train_predictions), 3)
test_precision = round(precision_score(y_test, test_predictions), 3)
print('Training Scores:')
print(' Accuracy: %.3f' % train_accuracy)
print(' Precision: %.3f' % train_precision)
print('\nTesting Scores:')
print(' Accuracy: %.3f' % test_accuracy)
print(' Precision: %.3f' % test_precision)
Training Scores:
Accuracy: 0.960
Precision: 0.647
Testing Scores:
Accuracy: 0.959
Precision: 0.381
I also want to compare this to the null model, since that baseline determines the overall benefit of the KNN model. Note that the null accuracy below corresponds to always predicting the majority class (not fraud), while the null precision corresponds to flagging every account as fraud, so it equals the fraud prevalence.
# compare to null model
target_mean = y.mean()
max_target_mean = np.max([y.mean(), 1-y.mean()]) # establish accuracy of null model
print('Proportion of 0\'s (Not Fraud Acct): %.3f' % (1-target_mean))
print('Proportion of 1\'s (Fraud Acct): %.3f' % target_mean)
print('Null model accuracy: %.3f' % max_target_mean)
print('Null model precision: %.3f' % target_mean)
Proportion of 0's (Not Fraud Acct): 0.959
Proportion of 1's (Fraud Acct): 0.041
Null model accuracy: 0.959
Null model precision: 0.041
Conclusion
While the overall accuracy of the null model is comparable to that of the KNN classifier, the KNN classifier has much better precision. This would reduce the occurrence of false positives, thereby supporting the customer experience. However, there is a noted disadvantage to using this classifier: it is not as transparent in its decision making as, say, a decision tree classifier would be. As such, while it would be beneficial for implementation, it may not support business needs if there's a desire to better understand the factors that help identify fraudulent accounts.
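If interpretability becomes a priority, one option is to fall back on the tuned decision tree from the feature-selection step and print its learned rules. A small sketch, assuming scikit-learn 1.0+ so the fitted model exposes feature_names_in_:
# sketch: surface the decision rules of the tuned tree fitted earlier (tree_model)
rules = export_text(tree_model, feature_names=list(tree_model.feature_names_in_))
print(rules)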