sort_values issue



0

Hi All,

I am running into an issue while executing the following command for a content-based recommendation system:

>> sorted_df = similarity_df.sort_values(by=['both', 'selected', 'other'], ascending=[False, True, True])

The error is: TypeError: sort_values() got an unexpected keyword argument 'by'

When I check the type of similarity_df, it is a Series. I think it should be a DataFrame for sort_values() to accept 'by', but I am not sure why we are getting a Series instead of a DataFrame.

Anybody got any idea?

Thanks!
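
For reference, here is a minimal sketch of the difference the error message points at: Series.sort_values() has no 'by' parameter, while DataFrame.sort_values() does. The toy data and the names df, s and series_error are made up for illustration and are not from the course notebook.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
df = pd.DataFrame({"both": [19, 17, 19], "selected": [0, 1, 0], "other": [1, 0, 0]})

# DataFrame.sort_values accepts `by`, with one `ascending` flag per column
sorted_df = df.sort_values(by=["both", "selected", "other"], ascending=[False, True, True])

# A Series has only one set of values to sort, so `by` is rejected
s = pd.Series([3, 1, 2])
try:
    s.sort_values(by=["both"])
    series_error = None
except TypeError as exc:
    series_error = str(exc)

print(sorted_df)
print(series_error)
```

So a TypeError about 'by' is usually a sign that the object being sorted is a Series, which matches what type(similarity_df) reports.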

 


7 Answer(s)


0

With respect to the 2 cells preceding the cell in question:

Can you share how you completed the cell that originally looked like:

# select the genre columns (column names start with 'genre_')
cols = 
def find_similar_genre_count(row):
    # Subtracting the values for the present and the selected movies
    diff = row[cols] - movie[cols]
    # Count how many 0s: This is the number of cases where both have the same value
    # This means either both movies belong to the genre or neither does
    n_both = (diff == 0).sum()
    # Now hunt for -1: This is the genres selected movie belongs to but not the other movie
    n_selected = 
    # You know what to do (hint: 1 = present movie but not selected movie)
    n_other = 
    return({
        'movieId': row['movieId'],
        'title': row['title'],
        'both': n_both,
        'selected': n_selected,
        'other': n_other
    })
similarity_df = movies_df.apply(find_similar_genre_count, axis=1)

and also can you share the result of:

similarity_df.head()

 

We may be able to look at the problem based on this.


0

Sure!

Here is the completed cell that calculates the similarity data frame. First I tried the commented-out line to compute 'cols', then changed it to what was given in the solution.

# select the genre columns (column names start with 'genre_')
#cols = ['genre_{}'.format(x.lower()) for x in genres if x != '(no genres listed)']
cols = [col for col in movies_df.columns if col[:6] == 'genre_']
def find_similar_genre_count(row):
    # Subtracting the values for the present and the selected movies
    diff = row[cols] - movie[cols]
#     print(diff)
    # Count how many 0s: This is the number of cases where both have the same value
    # This means either both movies belong to the genre or neither does
    n_both = (diff == 0).sum()
    # Now hunt for -1: This is the genres selected movie belongs to but not the other movie
    n_selected =  (diff == -1).sum()
    # You know what to do (hint: 1 = present movie but not selected movie)
    n_other =  (diff == 1).sum()
    return({
        'movieId': row['movieId'],
        'title': row['title'],
        'both': n_both,
        'selected': n_selected,
        'other': n_other
    })
similarity_df = movies_df.apply(find_similar_genre_count, axis=1)

and the output of the following code is:

similarity_df.head()

 
0    {'movieId': 1, 'title': 'Toy Story (1995)', 'b...
1    {'movieId': 2, 'title': 'Jumanji (1995)', 'bot...
2    {'movieId': 3, 'title': 'Grumpier Old Men (199...
3    {'movieId': 4, 'title': 'Waiting to Exhale (19...
4    {'movieId': 5, 'title': 'Father of the Bride P...
dtype: object

 

type(similarity_df)

pandas.core.series.Series

 

 


1

Can you update the return statement to:

    return(pd.Series({
        'movieId': row['movieId'],
        'title': row['title'],
        'both': n_both,
        'selected': n_selected,
        'other': n_other
    }))

That should solve the problem.


0

Yes, it did work. Thanks a bunch!!

So my understanding is: movies_df.apply(...) was applied row-wise, and the Series returned by the function becomes one row of the 'similarity_df' DataFrame.

Without pd.Series(), the result was just a Series of dictionaries.
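
A small sketch of that behaviour, assuming a made-up two-row frame (df, as_dict and as_series are hypothetical names used only for illustration): with axis=1, a function that returns a dict gives back a Series of dicts, while a function that returns a pd.Series gets expanded into DataFrame columns.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
df = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})

def as_dict(row):
    # Plain dict: apply keeps each dict as a single object per row
    return {"movieId": row["movieId"], "title": row["title"]}

def as_series(row):
    # pd.Series: apply expands the keys into columns
    return pd.Series({"movieId": row["movieId"], "title": row["title"]})

from_dict = df.apply(as_dict, axis=1)      # Series whose elements are dicts
from_series = df.apply(as_series, axis=1)  # DataFrame with movieId and title columns

# A Series of dicts can also be turned into a DataFrame after the fact
rebuilt = pd.DataFrame(from_dict.tolist())

print(type(from_dict), type(from_series), type(rebuilt))
```

The last line is only an alternative way to get the same shape if changing the return statement is not convenient.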

Thank you GPS for helping to resolve this issue.


0

Now I am getting a different error from the sort call:

sorted_df = similarity_df.sort_values(by=['both', 'selected', 'other'], ascending=[False, True, True])

Error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

1

n_both = (diff == 0).values.sum()
n_selected = (diff == -1).values.sum()
n_other = (diff == 1).values.sum()

Adding .values in the statements above may help. It forces the similarity_df output to look like:

   both  movieId  other  selected  title
0    19        1      0         0  Toy Story (1995)
1    19        2      0         0  Jumanji (1995)
2    19        3      0         0  Grumpier Old Men (1995)
3    19        4      0         0  Waiting to Exhale (1995)
4    19        5      0         0  Father of the Bride Part II (1995)

I am assuming that at present, depending on your pandas version, you may be getting similarity_df as:

   both                                    movieId  other                                   selected                                title
0  genre_comedy 1 genre_documentary 1 ...        1  genre_comedy 0 genre_documentary 0 ...  genre_comedy 0 genre_documentary 0 ...  Toy Story (1995)
1  genre_comedy 1 genre_documentary 1 ...        2  genre_comedy 0 genre_documentary 0 ...  genre_comedy 0 genre_documentary 0 ...  Jumanji (1995)
2  genre_comedy 1 genre_documentary 1 ...        3  genre_comedy 0 genre_documentary 0 ...  genre_comedy 0 genre_documentary 0 ...  Grumpier Old Men (1995)
3  genre_comedy 1 genre_documentary 1 ...        4  genre_comedy 0 genre_documentary 0 ...  genre_comedy 0 genre_documentary 0 ...  Waiting to Exhale (1995)
4  genre_comedy 1 genre_documentary 1 ...        5  genre_comedy 0 genre_documentary 0 ...  genre_comedy 0 genre_documentary 0 ...  Father of the Bride Part II (1995)

I hope adding .values does it.
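
One possible explanation for the two views, sketched under an assumption that is not confirmed anywhere in this thread: if 'movie' was selected with a boolean mask, it is a one-row DataFrame rather than a Series. In that case row[cols] - movie[cols] is a DataFrame, .sum() returns one count per genre column (a Series), and .values.sum() collapses everything to a single number. The toy data and the names movie_as_frame and movie_as_series are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_df
movies_df = pd.DataFrame({
    "movieId": [1, 2],
    "genre_action": [1, 1],
    "genre_comedy": [1, 0],
    "genre_drama": [0, 1],
})
cols = [col for col in movies_df.columns if col.startswith("genre_")]
row = movies_df.iloc[1]  # the "present" movie, a Series

# Assumption: `movie` came from a boolean mask, so it is a one-row DataFrame
movie_as_frame = movies_df[movies_df["movieId"] == 1]
diff_frame = row[cols] - movie_as_frame[cols]  # Series - DataFrame gives a DataFrame
per_column = (diff_frame == 0).sum()           # Series: one count per genre column
total = (diff_frame == 0).values.sum()         # scalar: count over the whole array

# If `movie` is a Series instead, .sum() is already a scalar
movie_as_series = movies_df.iloc[0]
diff_series = row[cols] - movie_as_series[cols]
scalar_sum = (diff_series == 0).sum()

print(per_column)
print(total, scalar_sum)
```

If that assumption holds, each cell of 'both', 'selected' and 'other' would contain a whole Series, which sort_values cannot hash or compare, and that would be consistent with the error above.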


0

Yes, it certainly did resolve the issue. Thank you GPS!

I am going to unpack this code because I didn't understand why adding .values resolved it or why the two views differ. I will reach out again if I am unable to figure it out.

Thanks a bunch!!