CoST: Contrastive Quantization based Semantic Tokenization for Generative Recommendation
Aug. 1st, 2024: Got accepted by RecSys ‘24.
Jieming ZHU*, Mengqun JIN*, Qijiong LIU, Zexuan QIU, Zhenhua DONG, and Xiu LI#
*Equal contribution (co-first authors); #Corresponding author
[Code] [Paper]
Abstract
Embedding-based retrieval serves as a dominant approach to candidate item matching in industrial recommender systems. With the success of generative AI, generative retrieval has recently emerged as a new retrieval paradigm for recommendation, which casts item retrieval as a generation problem. Its model consists of two stages: semantic tokenization and autoregressive generation. The first stage constructs discrete semantic codes to index items, while the second stage autoregressively generates the semantic codes of candidate items. Semantic tokenization is therefore a crucial preliminary step for training generative recommendation models. Existing research usually adopts a quantizer trained with a reconstruction loss (e.g., RQ-VAE) to obtain semantic codes of items, but such a method fails to capture the proximity information among items that is essential for modeling item relationships in recommender systems. In this paper, we propose a contrastive quantization based semantic tokenization approach (dubbed CoST), which leverages both item relationships and semantic information to learn semantic codes. Our experimental results show that semantic tokenization has a large effect on generative recommendation, and that CoST brings up to 40% improvements in NDCG@5 and Recall@5 on the MIND dataset over previous baselines.
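To make the distinction concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two ingredients the abstract contrasts: residual quantization in the style of RQ-VAE, which maps an item embedding to a tuple of discrete code indices, and an InfoNCE-style contrastive loss over the quantized embeddings of related item pairs. All module names, dimensions, and the temperature value are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch; all hyperparameters are illustrative.
import torch
import torch.nn.functional as F

class ResidualQuantizer(torch.nn.Module):
    """Quantize an embedding into num_levels code indices via nested codebooks."""
    def __init__(self, num_levels=3, codebook_size=256, dim=64):
        super().__init__()
        self.codebooks = torch.nn.Parameter(
            torch.randn(num_levels, codebook_size, dim) * 0.02)

    def forward(self, x):
        residual = x
        codes, quantized = [], torch.zeros_like(x)
        for level in range(self.codebooks.shape[0]):
            book = self.codebooks[level]         # (K, D) codebook at this level
            dists = torch.cdist(residual, book)  # (B, K) distances to codes
            idx = dists.argmin(dim=-1)           # nearest code index per item
            picked = book[idx]                   # (B, D) selected code vectors
            quantized = quantized + picked
            residual = residual - picked         # quantize what remains next level
            codes.append(idx)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = x + (quantized - x).detach()
        return torch.stack(codes, dim=-1), quantized

def contrastive_loss(q_anchor, q_positive, temperature=0.1):
    """InfoNCE over quantized embeddings: positives are related item pairs
    (e.g., co-interacted items); other in-batch items act as negatives."""
    a = F.normalize(q_anchor, dim=-1)
    p = F.normalize(q_positive, dim=-1)
    logits = a @ p.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, labels)

# Usage: quantize two related items, then apply the contrastive objective
# (in practice combined with the usual reconstruction term of RQ-VAE).
rq = ResidualQuantizer()
item_a, item_b = torch.randn(32, 64), torch.randn(32, 64)
codes_a, q_a = rq(item_a)
codes_b, q_b = rq(item_b)
loss = contrastive_loss(q_a, q_b)
```

The straight-through estimator keeps the quantization step differentiable, so a contrastive signal of this kind can shape the codebooks toward preserving item proximity rather than reconstruction fidelity alone.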
Citation
@misc{zhu2024cost,
  title  = {CoST: Contrastive Quantization based Semantic Tokenization for Generative Recommendation},
  author = {Zhu, Jieming and Jin, Mengqun and Liu, Qijiong and Qiu, Zexuan and Dong, Zhenhua and Li, Xiu},
  year   = {2024}
}