The efficient matrix and DataFrame functions in Python used in recommendation system data processing

np.setdiff1d, np.where and unstack etc.

Published in

Geek Culture

4 min read · Mar 26, 2021

In my previous story, I used some NumPy functions to process data for a recommendation system. Because that story focused on the content-based recommendation system itself, I did not have room to show how to use these functions efficiently. In this story, I would like to explain them in detail.

For functions 3, 4, and 5, you may want to check the DataFrame I used. Here is the code:

1. np.setdiff1d(a1, a2, assume_unique=False)

This function finds the difference of two arrays and returns the unique values in a1 that are not in a2. It can be used to compare lists and arrays. Let’s take some examples:

Compare lists:

From the result, you can see the difference:

The comparison is based on the first argument, a1: the first two cases return Nemo, while the case with the other list in first position returns the other three movies.
If assume_unique is True, the function assumes that both input arrays already contain no duplicates, which can speed up the calculation. If it is False (the default), the inputs are de-duplicated first.
With assume_unique=False the result is also sorted in ascending order. The NumPy documentation states this: the result is sorted when assume_unique=False, but otherwise only sorted if the input is sorted.
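The points above can be sketched with a small example. The movie lists here are made up for illustration and are not the data from the previous story:

```python
import numpy as np

# Hypothetical movie lists, invented for this sketch.
new_movies = ["Nemo", "Up", "Coco", "Up"]          # note the duplicate "Up"
seen_movies = ["Up", "Coco", "Brave", "Cars", "Moana"]

# Values in the first argument that are missing from the second.
only_new = np.setdiff1d(new_movies, seen_movies)
print(only_new)   # ['Nemo']

# Swapping the arguments changes the answer, because the comparison
# is always based on the first argument.
only_seen = np.setdiff1d(seen_movies, new_movies)
print(only_seen)  # ['Brave' 'Cars' 'Moana']

# assume_unique=True skips de-duplication, so it should only be used
# when neither input contains repeated values.
unique_new = ["Nemo", "Up", "Coco"]
fast = np.setdiff1d(unique_new, seen_movies, assume_unique=True)
print(fast)       # ['Nemo']
```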

Compare arrays:

It also works.

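How about sets?

I expected it to compare the elements in the two sets, but the result doesn't match that expectation. NumPy does not treat a Python set as a sequence, so the function cannot see the individual elements. My conclusion is that this function isn't suitable for sets unless they are converted to lists first.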
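A minimal sketch of two possible workarounds, again with made-up movie names: convert the sets to lists before calling np.setdiff1d, or use Python's built-in set difference.

```python
import numpy as np

# Hypothetical sets, invented for this sketch.
a1 = {"Nemo", "Up", "Coco"}
a2 = {"Up", "Coco", "Brave"}

# NumPy wraps the whole set in a zero-dimensional object array,
# so it never sees the individual elements.
print(np.asarray(a1).shape)  # ()

# Option 1: convert both sets to lists first.
diff_np = np.setdiff1d(list(a1), list(a2))
print(diff_np)  # ['Nemo']

# Option 2: use the set difference operator from plain Python.
diff_py = a1 - a2
print(diff_py)  # {'Nemo'}
```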

2. np.unique(np.concatenate([list1, list2], axis=0))

This expression combines two lists, such as newly recommended movies and previously recommended movies, and returns the unique movies. The order of the lists doesn't matter.
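Here is a short sketch with hypothetical lists. Note that np.unique also returns the values sorted, which is why swapping the two lists gives the same result:

```python
import numpy as np

# Hypothetical lists, invented for this sketch.
new_recs = ["Up", "Coco", "Nemo"]
previous = ["Coco", "Brave"]

# Join the two lists, then drop duplicates (the result is sorted).
all_movies = np.unique(np.concatenate([new_recs, previous], axis=0))
print(all_movies)  # ['Brave' 'Coco' 'Nemo' 'Up']

# The order of the lists does not change the result.
swapped = np.unique(np.concatenate([previous, new_recs], axis=0))
print(swapped)     # ['Brave' 'Coco' 'Nemo' 'Up']
```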

3. np.where(condition)[0][0]: return the index of the first element that satisfies the condition

In the previous story, we have a movies_df as below. If we want to find the row index from a known column value, such as the movie name, how does that work?
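As a minimal sketch, assume a small stand-in for movies_df with hypothetical columns movie_id and movie (the real DataFrame in the previous story may look different). When np.where is called with only a condition, it returns a tuple of index arrays, so the first [0] picks the array and the second [0] picks the first match:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies_df; column names are assumptions.
movies_df = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "movie": ["Up", "Coco", "Nemo"],
})

# With only a condition, np.where returns a tuple of index arrays.
matches = np.where(movies_df["movie"] == "Coco")

# First [0]: the array of matching positions; second [0]: the first match.
idx = matches[0][0]
print(idx)  # 1

# The position can then be used to read other columns of that row.
movie_id = movies_df.iloc[idx]["movie_id"]
print(movie_id)  # 20
```

If nothing matches the condition, the index array is empty and the second [0] raises an IndexError, so it is worth checking the length first when the value may be missing.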

You might wonder when this function is used. Let’s check the example in the previous story:

4. np.dot(A, np.transpose(A)), where A is the movie feature matrix: mathematically, each entry of the result is the scalar (dot) product of two rows of A. In business terms, it measures the similarity of the movies (how close they are to each other), which can be used for further recommendation.

This method works if the values have been one-hot encoded.
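A small sketch with a hypothetical one-hot genre matrix (rows are movies, columns are genres) shows the idea. Each entry of the product counts how many genres two movies share:

```python
import numpy as np

# Hypothetical one-hot genre matrix: rows are movies, columns are genres.
genres = np.array([
    [1, 1, 0],  # movie 0: comedy, family
    [1, 0, 1],  # movie 1: comedy, drama
    [0, 0, 1],  # movie 2: drama
])

# Entry (i, j) is the dot product of movie i and movie j,
# i.e. the number of genres they have in common.
similarity = np.dot(genres, np.transpose(genres))
print(similarity)
# [[2 1 0]
#  [1 2 1]
#  [0 1 1]]

# Most similar movie to movie 0, ignoring movie 0 itself.
scores = similarity[0].copy()
scores[0] = -1
best = int(np.argmax(scores))
print(best)  # 1
```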

What if the values are not one-hot encoded, like below?

If I want to know whether a user and a movie have interacted, the method below can be used:

5. DataFrame.unstack()

Unlike the previous four functions, unstack is a DataFrame method. The official documentation describes it as ‘Pivot a level of the (necessarily hierarchical) index labels’, which left me lost at first. In plain terms, it moves one level of a multi-level row index into the columns. It is very useful in a recommendation system for checking how user-user or item-user pairs interact.

I use two methods to show different uses of this function:

Method 1 shows whether the user and the movie have interacted; method 2 shows what the interactions are. Is it efficient?
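The following is only a sketch of the two ideas, not necessarily the exact code of the two methods. It assumes a hypothetical interactions table with columns user_id, movie_id and rating:

```python
import pandas as pd

# Hypothetical interaction data; the column names are assumptions.
interactions = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 2],
})

# Group by (user, movie) to get a two-level index, then move the
# movie level into the columns. Missing pairs become NaN.
ratings = (
    interactions.groupby(["user_id", "movie_id"])["rating"]
    .max()
    .unstack()
)
print(ratings)  # what the interaction is (the rating)

# Turn the same matrix into 1/0 flags: did the user interact at all?
has_interaction = ratings.notnull().astype(int)
print(has_interaction)
```

The first matrix keeps the rating values, while the second only records whether an interaction exists, which is often enough for user-item similarity.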

In this story, I explained the following 5 functions in detail and showed how they are used in recommendation system data processing:

np.setdiff1d: gets the difference between two lists or arrays, but does not work directly on sets.
np.unique(np.concatenate([list1, list2], axis=0)): combines two lists and returns the unique values.
np.where(condition)[0][0]: gets the index of the first element that satisfies the condition.
np.dot(A, np.transpose(A)): gets the scalar products between the rows of an array, which can be used for similarity or for giving weight to the parameters.
unstack(): can be used to check the interaction between item-item or item-user.

If you want to see how these functions are used in a recommendation system, this is the story link. Thank you for reading.


Written by Annie Wang