Clustering for Instacart Users

Daria K.
8 min read · Dec 14, 2020
Photo by Karolina Grabowska from Pexels

Dataset

I will be using the “Instacart Market Basket Analysis” dataset from a Kaggle competition. It contains anonymized data on customers’ orders from Instacart. Each user is identified by a unique number user_id, each order by order_id, and each product by product_id. The products are grouped into categories (aisles) or into larger groups (departments).

Motivation

Grouping users based on their purchase history might help us understand their behaviour and what they are inclined to buy. It is also interesting to see which products are the most popular among different groups of users.

While analyzing this dataset, I found this notebook where the author performed PCA (principal component analysis) to group the users based on the aisles of the purchased products, and then ran K-means in the 2D space of two chosen PCs (principal components). This article takes a similar approach. While I learnt some new things from these analyses, I was left wondering about the parameter choices, i.e. the number of PCs and the value of K in K-means.

Grouping by aisles

In the dataset, there are 134 aisles. Below, you can see the rearranged dataset where the rows correspond to the users and the columns correspond to the aisles. The value at location (i, j) shows the total number of times user i bought a product from aisle j.

Number of purchases of users in different aisles

We can consider this as a space whose number of dimensions is the number of aisles (134), and try to see whether we could reduce this number (maybe there are aisles A and B such that people who buy products from A also buy products from B; then we could keep only aisle A).

I performed PCA with sklearn’s PCA on the normalized data without specifying the desired number of components, so that I could get all 134 components and analyse them.

I need to emphasize that I normalized the data, which was not done in the notebooks that I found. A good example of a dataset that yields a “useless” PC when the data is not normalized can be found here. Thus, I decided to normalize the dataset using normalize from sklearn.preprocessing.
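
For reference, a minimal sketch of this step might look as follows (user_aisle is a placeholder name I use here for the users vs aisles table shown above):

```python
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA

# Assumption: user_aisle is the users x aisles table
# (rows = users, columns = aisles, values = purchase counts).
X = normalize(user_aisle.values)  # scales each user's row to unit norm

pca = PCA()   # no n_components given, so all 134 components are kept
pca.fit(X)
```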

The number of PCs can be chosen in different ways. Here, I am using the cumulative variance. In the plot below, the cumulative variance is shown on the y-axis vs the corresponding number of PCs on the x-axis. It thus shows the number of PCs needed to explain a certain share of the variance. I decided to use an 85% cut-off threshold. We can see that for this threshold, shown by the red line, we need to take around 45 components.

Cumulative variance vs Number of PCs
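
A plot like this one can be reproduced from the fitted PCA object above, roughly as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance over all 134 components
cumvar = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumvar) + 1), cumvar)
plt.axhline(0.85, color='red')              # the 85% cut-off threshold
plt.xlabel('Number of PCs')
plt.ylabel('Cumulative explained variance')
plt.show()

# Smallest number of PCs whose cumulative variance reaches 85%
print(np.argmax(cumvar >= 0.85) + 1)
```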

If we look at the plot of eigenvalues vs eigenvector number, we will see that the first few values differ from the others, with the drop occurring around eigenvector ~6. Nevertheless, the remaining eigenvectors still contribute substantially.

Eigenvalues vs eigenvector number

I decided to see whether we could get better results (a lower number of needed PCs) if we used the departments instead of the aisles. Using aisles might also not be very helpful because there are very few highly correlated variables among the aisles. I calculated the correlations between the aisles (a sketch of this computation follows the list), and the highest were:

  • ‘fresh herbs’ vs ‘fresh vegetables’ = 0.65;
  • ‘fresh fruits’ vs ‘fresh vegetables’ = 0.68;
  • ‘fresh fruits’ vs ‘packaged vegetables fruits’ = 0.73;
  • ‘packaged vegetables fruits’ vs ‘fresh vegetables’ = 0.68.

But there were no other pairs with correlations higher than 0.65, so perhaps we should instead try the larger groups, i.e. the departments.
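
These pairwise correlations can be computed with pandas, for example with a sketch like this (again assuming the user_aisle DataFrame introduced above):

```python
import numpy as np

# Assumption: user_aisle is the users x aisles table from above.
corr = user_aisle.corr()

# Keep only the upper triangle so each aisle pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))
```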

Grouping by departments

In this part, we would like to group users based on the departments of the purchased products.

We will perform PCA on the normalized dataset. We will represent the data as users vs departments: we have 206209 users and 21 departments, and the value at (i, j) shows the number of purchases of user i in department j.

Number of purchases of users in different departments

PCA is used to reduce the dimensionality. It is helpful when there are correlated features. Let’s look at the correlations in our table:

Correlation matrix for the departments

In this plot, only the values with an absolute correlation greater than 0.5 are shown. As we can see, there are multiple correlations between the departments. Thus, a user who, for example, buys products in the department ‘frozen’ often also buys products in the department ‘dry goods and pasta’.
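
A masked correlation plot of this kind can be drawn, for example, with seaborn (user_dept is my placeholder name for the users vs departments table):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assumption: user_dept is the users x departments table.
corr = user_dept.corr()

# Hide weak correlations, keeping only |corr| > 0.5 as in the plot above
sns.heatmap(corr.where(corr.abs() > 0.5), annot=True, cmap='coolwarm')
plt.show()
```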

First, we need to decide how many PCs we should use.

Cumulative variance vs Number of PCs

We will make the decision based on the explained variance. In this plot, the number of components is shown on the x-axis (from 0 to 20, corresponding to the 21 departments), and the cumulative variance on the y-axis. The red line shows the 0.85 or 85% threshold, so if we take 9 PCs (from 0 to 8), the corresponding explained variance will be 85%.

Another way is to look at the eigenvalues.

Eigenvalues vs eigenvector numbers

Here, all the eigenvalues are less than 1 because the data was normalized. We see that if we take the eigenvectors from 0 to 8, the eigenvalues will be greater than ~0.01.

Now we have 9 PCs, thus reducing the dimensions of the data from 21 to 9. We want to see what departments “contribute” to those dimensions.
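
A sketch of this reduction, reusing the normalization step described earlier:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

# Reduce the 21 department dimensions to the 9 chosen PCs
X = normalize(user_dept.values)
pca = PCA(n_components=9)
scores = pca.fit_transform(X)   # user coordinates in the 9-dimensional PC space
```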

Contribution of different departments in the first 9 dimensions

In this plot (made similarly to this code), the contributions of the departments to each of the considered dimensions are shown, together with the corresponding explained variances. For example, ‘snacks’, ‘beverages’ and ‘produce’ are the three largest contributors to dimension #0.
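
The underlying contributions (loadings) can be read directly from pca.components_; one possible sketch:

```python
import pandas as pd

# pca.components_ has shape (9, 21): one row per PC, one column per department
loadings = pd.DataFrame(pca.components_,
                        columns=user_dept.columns,
                        index=[f'Dim {i}' for i in range(9)])

# Three largest absolute contributions to dimension 0
print(loadings.loc['Dim 0'].abs().sort_values(ascending=False).head(3))
```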

Another way to see this is to look at the 2D projections (also based on this code).

2D projection: Dimension 0 vs dimension 1

This is the projection onto dimensions 0 and 1 (only the largest contributions are shown, i.e. those whose “length” in the corresponding 2D space is greater than 0.2). As seen before, ‘snacks’, ‘beverages’ and ‘produce’ are the three largest contributors to dimension 0, and ‘dairy and eggs’ is the largest contributor to dimension 1.

2D projection: Dimension 3 vs dimension 4

This is the projection onto dimensions 3 and 4. ‘Frozen’, ‘snacks’, and ‘dairy and eggs’ are the largest contributors to dimension 3, while ‘frozen’, ‘beverages’, ‘pantry’, ‘dairy and eggs’, ‘snacks’, and ‘produce’ are the largest for dimension 4.

Now, we would like to choose two PCs, let them be 0 and 1, and perform K-means clustering to separate the points into groups.

To choose the number of clusters, we will use the elbow method. This method is based on calculating the Within-Cluster Sum of Squares (WCSS), i.e. the sum of the squared distances from each data point to the centroid of its cluster, for each candidate number of clusters.

Elbow method for K-means clustering
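
A sketch of the elbow computation, assuming scores holds the user coordinates in PC space from the reduction above (sklearn’s KMeans exposes the WCSS as inertia_):

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# scores[:, :2] are the user coordinates in PC0 and PC1 (see above)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42).fit(scores[:, :2])
    wcss.append(km.inertia_)    # inertia_ is exactly the WCSS

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()
```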

A similar plot was obtained for other PCs. We can thus take K=4 for the K-means method.
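
The clustering itself then reduces to a few lines, reusing KMeans and scores from above:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(scores[:, :2])   # cluster label for every user
centroids = kmeans.cluster_centers_          # the 4 red points in the figure below
```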

PC0 vs PC1: obtained clusters. Red points indicate the centroids

In this figure, the clusters in the PC0-PC1 plane are shown. The red dots represent the clusters’ centroids, i.e. the means of the data points assigned to the corresponding clusters.

As we want to group the customers, we can take a cluster’s centre point as the average customer of that cluster. We can assign every user to a cluster by transforming the original table into the new dimensions, which gives us the user-to-cluster correspondence.

We can take the original table (users vs departments) and split it into 4 parts corresponding to our clusters. Then, for each of the 4 resulting tables/clusters, we can find the average number of purchases in each department, i.e. its popularity among the users of that cluster.
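
One possible sketch of this split-and-average step, assuming user_dept and the labels array from the clustering above:

```python
# Attach the per-user cluster label and average each department within clusters
dept_popularity = user_dept.assign(cluster=labels).groupby('cluster').mean()

# Ten most popular departments in each of the 4 clusters
for c in range(4):
    top10 = dept_popularity.loc[c].sort_values(ascending=False).head(10)
    print(c, top10.index.tolist())
```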

10 most popular departments in each of the four obtained clusters

This table shows the popularity of the departments in the different clusters (0 to 3 are the cluster numbers).

We can see that departments such as ‘dairy and eggs’, ‘produce’, ‘snacks’, etc. are popular in all 4 clusters. However, the ‘household’ and ‘breakfast’ departments are more popular in clusters #0 and #3, while ‘canned goods’ and ‘dry goods and pasta’ are popular in clusters #1 and #2. The ‘personal care’ department is more popular in cluster #3.

Conclusions

A way of exploring the Instacart users’ data was shown. PCA was performed based on the departments where the products were purchased, since the departments are more correlated than the aisles. This allowed us to reduce the dimensionality and choose a 2D space in which K-means clustering was performed. The users were grouped into 4 main clusters, and the top departments for those clusters were shown.

References

[1] Instacart dataset

[2] PCA and K-means for Instacart dataset on Kaggle

[3] Similar analysis for the same dataset on Towardsdatascience

[4] Code for plotting the contributions to each dimension

I did this work as part of my final project for the Jedha Lyon bootcamp in Sept-Dec 2020, where my colleague and I worked on the Instacart dataset. Her work on data visualization in Dash, as well as other results from our common work, can be found here.
