c***z Posts: 6348 | 1 [The following text is reposted from the DataSciences board]
Posted by: chaoz (Facing the sea, eating a bowl of liangpi), Board: DataSciences
Subject: [Data Science Project Case] Fuzzy matching on names
Posted from: BBS 未名空间站 (Fri Apr 4 13:04:18 2014, US Eastern)
We have two data sets, one for product views and one for actual
purchases. The shopping cart information is incomplete, and we need to
infer the missing parts.
To build a training set we need to join the two data sets, and the
cart id and item name are the only available keys. The problem is that
the same item can appear under many names in both sets, e.g.
'Dell 17" XPS' and 'Dell XPS Laptop 17 inch' refer to the same item.
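To make the naming problem concrete, here is a minimal Python sketch,
not a definitive solution, that normalizes item names and measures
their token overlap. The helper names `normalize` and `jaccard`, and
the rule for expanding the inch mark, are assumptions made for
illustration only.

```python
import re


def normalize(name: str) -> frozenset:
    """Lowercase a name, expand inch notation, and return its token set.

    Assumption: a number followed by '"', 'in', 'inch' or 'inches'
    denotes a size in inches.
    """
    text = name.lower()
    # Rewrite 17", 17in, 17 inch, 17 inches as "17 inch".
    text = re.sub(r'(\d+)\s*(?:"|in\b|inch(?:es)?\b)', r'\1 inch ', text)
    # Keep alphanumeric tokens only; punctuation is dropped.
    return frozenset(re.findall(r'[a-z0-9]+', text))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    left = normalize('Dell 17" XPS')
    right = normalize('Dell XPS Laptop 17 inch')
    print(sorted(left), sorted(right), jaccard(left, right))
```

For the two names above, the token sets share four of five distinct
tokens (only "laptop" differs), so the score is 0.8. A threshold on
this score would have to be tuned on real data.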
I am considering two approaches: using tf-idf to identify the first
three words of each item name, or clustering the names by edit distance.
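The two ideas above can be sketched in plain Python as follows. This
is a rough illustration under stated assumptions: the function names
are hypothetical, the tf-idf weighting uses one common smoothed idf
variant, tokens are split on whitespace, and the sample names in the
usage block are made up rather than taken from the real data.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tfidf_vectors(names):
    """Return one sparse tf-idf vector (dict token -> weight) per name."""
    docs = [name.lower().split() for name in names]
    n = len(docs)
    # Document frequency: in how many names each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed variant): ln((1 + n) / (1 + df)) + 1
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Hypothetical item names, for illustration only.
    names = ["dell xps 17 inch", "dell xps laptop 17 inch",
             "hp pavilion 15 inch"]
    vecs = tfidf_vectors(names)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
    print(levenshtein(names[0], names[1]))
```

Edit distance works on characters, so it tolerates typos but is
sensitive to word order; tf-idf works on tokens, so it ignores word
order and down-weights words that appear in many item names.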
This is my first text analysis project, so I am wondering whether I
need a lot of data rather than just a small sample, and what the best
approach and tools would be. I am familiar with R, Matlab, Pig and
some Scala, and am willing to learn other languages as well.
Thanks a lot!