AggJoiner on a credit fraud dataset#
Many problems involve tables whose entities have a one-to-many relationship. To simplify aggregate-then-join operations for machine learning, we can include the AggJoiner in our pipeline.

In this example, we tackle a fraudulent loan detection use case. Because fraud is rare, this dataset is extremely imbalanced, with a prevalence of around 1.4%.

The data consists of two distinct entities: e-commerce “baskets” and “products”. Baskets can be tagged fraudulent (1) or not (0), and are essentially lists of products of variable size. Each basket is linked to at least one product; e.g., basket 1 can contain products 1 and 2.

Our aim is to predict which baskets are fraudulent.

The products dataframe can be joined onto the baskets dataframe using the "basket_ID" column. A toy sketch of this structure follows the list of attributes below.
Each product has several attributes:

- a category (marked by the "item" column),
- a model ("model"),
- a brand ("make"),
- a merchant code ("goods_code"),
- a price per unit ("cash_price"),
- a quantity selected in the basket ("Nbr_of_prod_purchas").
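As mentioned above, here is a toy illustration of the one-to-many structure (the values are made up for the sketch and are not taken from the real dataset, which we load right after):

# Toy illustration only: each basket ID appears once in the baskets table
# and possibly several times in the products table.
import pandas as pd

toy_baskets = pd.DataFrame({"ID": [1, 2], "fraud_flag": [0, 1]})
toy_products = pd.DataFrame(
    {
        "basket_ID": [1, 1, 2],
        "item": ["computer", "phone", "television"],
        "cash_price": [1100, 700, 500],
        "Nbr_of_prod_purchas": [1, 2, 1],
    }
)
toy_products.merge(toy_baskets, left_on="basket_ID", right_on="ID")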
from skrub import TableReport
from skrub.datasets import fetch_credit_fraud
bunch = fetch_credit_fraud()
products, baskets = bunch.products, bunch.baskets
TableReport(products)
TableReport(baskets)
Naive aggregation#
Let’s explore a naive solution first.
Note: you can skip this section and jump ahead to the AggJoiner section below to see it in action.
The first idea that comes to mind to merge these two tables is to aggregate the product attributes into lists, using their basket IDs.
products_grouped = products.groupby("basket_ID").agg(list)
TableReport(products_grouped)
Then, we can expand all lists into columns, as if we were “flattening” the dataframe. We end up with a products dataframe ready to be joined onto the baskets dataframe, using "basket_ID" as the join key.
import pandas as pd
products_flatten = []
for col in products_grouped.columns:
    cols = [f"{col}{idx}" for idx in range(24)]
    products_flatten.append(pd.DataFrame(products_grouped[col].to_list(), columns=cols))
products_flatten = pd.concat(products_flatten, axis=1)
products_flatten.insert(0, "basket_ID", products_grouped.index)
TableReport(products_flatten)
Look at the “Stats” section of the TableReport above. Does anything strike you?

Not only did we create 144 columns (6 product attributes × 24 positions), but most of these columns are filled with NaN, which is very inefficient for learning! This is because each basket contains a variable number of products, up to 24, and we created one column per product attribute and per position (up to 24) in the dataframe, as the quick check below illustrates.
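For instance, an illustrative way to measure how sparse the flattened table is, reusing the products_flatten dataframe built above:

# Fraction of missing cells across the whole flattened dataframe
# (the exact value depends on the data).
products_flatten.isna().to_numpy().mean()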
Moreover, if we wanted to replace text columns with encodings, we would create \(d \times 24 \times 2\) columns (an encoding of dimensionality \(d\), for 24 products, for the "item" and "make" columns), which would explode the memory usage.
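For instance, with an encoding dimensionality of, say, \(d = 30\) (a typical MinHashEncoder setting), that would already amount to \(30 \times 24 \times 2 = 1440\) columns.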
AggJoiner#
Let’s now see how the AggJoiner can help us solve this. We begin by splitting our baskets dataset into a training and a testing set.
from sklearn.model_selection import train_test_split
X, y = baskets[["ID"]], baskets["fraud_flag"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1)
X_train.shape, y_train.shape
Before aggregating our product dataframe, we need to vectorize our categorical columns. To do so, we use:

- MinHashEncoder on the "item" and "model" columns, because both contain typos and textual similarities;
- OrdinalEncoder on the "make" and "goods_code" columns, because they consist of orthogonal categories.
We bring this logic into a TableVectorizer to vectorize these columns in a single step. See this example for more details about these encoding choices.
from sklearn.preprocessing import OrdinalEncoder
from skrub import MinHashEncoder, TableVectorizer
vectorizer = TableVectorizer(
    high_cardinality=MinHashEncoder(),  # encode ["item", "model"]
    specific_transformers=[
        (OrdinalEncoder(), ["make", "goods_code"]),
    ],
)
products_transformed = vectorizer.fit_transform(products)
TableReport(products_transformed)
Our objective is now to aggregate this vectorized product dataframe by "basket_ID", then to merge it onto the baskets dataframe, still using "basket_ID".

AggJoiner can help us achieve exactly this. We need to pass the product dataframe as an auxiliary table argument to AggJoiner in __init__. The aux_key argument represents both the columns used to group by and the columns used to join on.

The baskets dataframe is our main table, and we indicate the columns to join on with main_key. Note that we pass the main table during fit, and we discuss the limitations of this design in the conclusion at the bottom of this notebook.

The minimum (“min”) is the most appropriate operation to aggregate encodings from MinHashEncoder, for reasons that are out of the scope of this notebook.
from skrub import AggJoiner
from skrub import _selectors as s
# Skrub selectors allow us to select columns using glob patterns, which
# reduces the boilerplate.
minhash_cols_query = s.glob("item*") | s.glob("model*")
minhash_cols = s.select(products_transformed, minhash_cols_query).columns
agg_joiner = AggJoiner(
    aux_table=products_transformed,
    aux_key="basket_ID",
    main_key="ID",
    cols=minhash_cols,
    operations=["min"],
)
baskets_products = agg_joiner.fit_transform(baskets)
TableReport(baskets_products)
Now that we understand how to use the AggJoiner, we can assemble our pipeline by chaining two AggJoiners together:

- the first one to deal with the MinHashEncoder vectors, as we just saw;
- the second one to deal with all the other columns.

For the second AggJoiner, we use the sum, mean, standard deviation, minimum and maximum operations to extract a representative summary of each distribution.
DropCols is another skrub transformer; here, it removes the "ID" column, which doesn’t bring any information after the join.
from scipy.stats import loguniform, randint
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import DropCols
model = make_pipeline(
    AggJoiner(
        aux_table=products_transformed,
        aux_key="basket_ID",
        main_key="ID",
        cols=minhash_cols,
        operations=["min"],
    ),
    AggJoiner(
        aux_table=products_transformed,
        aux_key="basket_ID",
        main_key="ID",
        cols=["make", "goods_code", "cash_price", "Nbr_of_prod_purchas"],
        operations=["sum", "mean", "std", "min", "max"],
    ),
    DropCols(["ID"]),
    HistGradientBoostingClassifier(),
)
model
We tune the hyper-parameters of the HistGradientBoostingClassifier to get good performance.
from time import time
from sklearn.model_selection import RandomizedSearchCV
param_distributions = dict(
    histgradientboostingclassifier__learning_rate=loguniform(1e-3, 1),
    histgradientboostingclassifier__max_depth=randint(3, 9),
    histgradientboostingclassifier__max_leaf_nodes=[None, 10, 30, 60, 90],
    histgradientboostingclassifier__max_iter=randint(50, 500),
)
tic = time()
search = RandomizedSearchCV(
    model,
    param_distributions,
    scoring="neg_log_loss",
    refit=False,
    n_iter=10,
    cv=3,
    verbose=1,
).fit(X_train, y_train)
print(f"This operation took {time() - tic:.1f}s")
The best hyper-parameters are:
pd.Series(search.best_params_)
To benchmark our performance, we compare the cross-validation log loss of our model against the log loss, on the test set, of a dummy model that always outputs the observed prior probabilities of the two classes. As this dataset is extremely imbalanced, this dummy model is a reasonable baseline.

The vertical bar represents one standard deviation around the mean of the cross-validation log loss.
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss
results = search.cv_results_
best_idx = search.best_index_
log_loss_model_mean = -results["mean_test_score"][best_idx]
log_loss_model_std = results["std_test_score"][best_idx]
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
y_proba_dummy = dummy.predict_proba(X_test)
log_loss_dummy = log_loss(y_true=y_test, y_pred=y_proba_dummy)
fig, ax = plt.subplots()
ax.bar(
    height=[log_loss_model_mean, log_loss_dummy],
    x=["AggJoiner model", "Dummy"],
    color=["C0", "C4"],
)
for container in ax.containers:
    ax.bar_label(container, padding=4)
ax.vlines(
    x="AggJoiner model",
    ymin=log_loss_model_mean - log_loss_model_std,
    ymax=log_loss_model_mean + log_loss_model_std,
    linestyle="-",
    linewidth=1,
    color="k",
)
sns.despine()
ax.set_title("Log loss (lower is better)")
Conclusion#
With AggJoiner, you can bring aggregation and joining operations into a scikit-learn pipeline and train models more efficiently.
One known limitation of both the AggJoiner and the Joiner is that the auxiliary data to join is passed during the __init__ method instead of the fit method, and is therefore fixed once the model has been trained.
This limitation causes two main issues:

1. Bigger model serialization: since the dataset has to be pickled along with the model, it can result in a massive file size on disk.
2. Inflexibility with new, unseen data in a production environment: to use new auxiliary data, you would need to replace the auxiliary table in the AggJoiner that was used during fit with the updated data (a rough sketch of this workaround is shown below), which is a rather hacky approach.
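For illustration, here is a hedged sketch of that workaround, not an official API: it assumes the pipeline built above, that make_pipeline named the two joiners "aggjoiner-1" and "aggjoiner-2", and it uses a hypothetical new_products dataframe standing in for fresh auxiliary data.

# Vectorize the new auxiliary data the same way as before
# (new_products is a hypothetical dataframe of fresh products).
new_products_transformed = vectorizer.transform(new_products)

# Overwrite the aux_table parameter of both joiners, then refit the pipeline.
model.set_params(**{
    "aggjoiner-1__aux_table": new_products_transformed,
    "aggjoiner-2__aux_table": new_products_transformed,
})
model.fit(X_train, y_train)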
These limitations will be addressed later in skrub.