## 基于内容的电影推荐：用户画像



用户画像构建步骤：

- 根据用户的评分历史，结合物品画像，将有观影记录的电影的画像标签作为初始标签反打到用户身上
- 通过对用户观影标签的次数进行统计，计算用户的每个初始标签的权重值，排序后选取TOP-N作为用户最终的画像标签

#### 用户画像建立

```python
import pandas as pd
import numpy as np
from gensim.models import TfidfModel

from functools import reduce
import collections

from pprint import pprint

# ......

'''
user profile画像建立：
1. 提取用户观看列表
2. 根据观看列表和物品画像为用户匹配关键词，并统计词频
3. 根据词频排序，最多保留TOP-k个词，这里K设为100，作为用户的标签
'''

def create_user_profile():
    watch_record = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(2), dtype={"userId":np.int32, "movieId": np.int32})

    watch_record = watch_record.groupby("userId").agg(list)
    # print(watch_record)

    movie_dataset = get_movie_dataset()
    movie_profile = create_movie_profile(movie_dataset)

    user_profile = {}
    for uid, mids in watch_record.itertuples():
        record_movie_prifole = movie_profile.loc[list(mids)]
        counter = collections.Counter(reduce(lambda x, y: list(x)+list(y), record_movie_prifole["profile"].values))
        # 取出出现次数最多的前50个词
        interest_words = counter.most_common(50)
        # 取出出现次数最多的词 出现的次数
        maxcount = interest_words[0][1]
        # 利用次数计算权重 出现次数最多的词权重为1
        interest_words = [(w,round(c/maxcount, 4)) for w,c in interest_words]
        user_profile[uid] = interest_words

    return user_profile

user_profile = create_user_profile()
pprint(user_profile)
```

- reduce函数

  - **描述**

    **reduce()** 函数会对参数序列中元素进行累积。

    函数将一个数据集合（链表，元组等）中的所有数据进行下列操作：用传给 reduce 中的函数 function（有两个参数）先对集合中的第 1、2 个元素进行操作，得到的结果再与第三个数据用 function 函数运算，最后得到一个结果。

  - **语法**

    reduce() 函数语法：

    ```python
    reduce(function, iterable[, initializer])
    ```

  - **参数**

    - function -- 函数，有两个参数
    - iterable -- 可迭代对象
    - initializer -- 可选，初始参数

  - **返回值**

    返回函数计算结果。

  - **示例**

    ```python
    >>>def add(x, y) :            # 两数相加
    ...     return x + y
    ... 
    >>> reduce(add, [1,2,3,4,5])   # 计算列表和：1+2+3+4+5
    15
    >>> reduce(lambda x, y: x+y, [1,2,3,4,5])  # 使用 lambda 匿名函数
    15
    ```

- 使用collections.Counter类统计列表元素出现次数

  ```python
  from collections import Counter
  names = ["Stanley", "Lily", "Bob", "Well", "Peter", "Bob", "Well", "Peter", "Well", "Peter", "Bob","Stanley", "Lily", "Bob", "Well", "Peter", "Bob", "Bob", "Well", "Peter", "Bob", "Well"]
  names_counts = Counter(names)
  ```