[Data Science Project Case] Fuzzy matching on names (repost) - Statistics board


c***z
Posts: 6348

[The following text is reposted from the DataSciences board]
Author: chaoz (facing the sea, eating a bowl of liangpi), Board: DataSciences
Subject: [Data Science Project Case] Fuzzy matching on names
Site: BBS 未名空间站 (Fri Apr 4 13:04:18 2014, US Eastern)
We have two data sets, one for product views and one for actual
purchases. We don't have all the shopping cart information and need to
infer the missing parts.
To make a training case we need to join the two sets, and the cart id
and item names are the only available keys. The problem is the items
can have many names in both sets, e.g. Dell 17" XPS and Dell XPS
Laptop 17 inch mean the same item.
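To make the naming problem concrete, here is a minimal Python sketch of one possible preprocessing step: normalize each item name into an order-independent set of tokens before any fuzzy comparison. The filler-word list, the rule that rewrites `"` as `inch`, and the names `normalize` and `jaccard` are all assumptions made up for illustration, not something from the original data.

```python
import re

# Assumption: filler words that rarely distinguish one product from another.
FILLER_WORDS = {"laptop", "notebook", "the", "with"}

def normalize(name):
    """Map a raw item name to an order-independent set of tokens."""
    text = name.lower()
    text = text.replace('"', " inch ")          # 17" -> 17 inch
    text = re.sub(r"\binches\b", "inch", text)  # plural -> singular
    text = re.sub(r"[^a-z0-9]+", " ", text)     # drop remaining punctuation
    return frozenset(t for t in text.split() if t not in FILLER_WORDS)

def jaccard(a, b):
    """Share of tokens two normalized names have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

left = normalize('Dell 17" XPS')
right = normalize('Dell XPS Laptop 17 inch')
print(sorted(left), jaccard(left, right))
```

With these particular rules the two names from the example collapse to the same token set, so an exact join on the normalized key would already match them. Real catalogues will need a longer alias list, and fuzzy scoring is still needed for whatever the rules miss.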
I am thinking of two ways: tf-idf to identify the first three words of
item names; or clustering using edit distance.
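Both ideas can be tried on a small sample with nothing but the Python standard library. The sketch below is one reading of them, not a recommended design: TF-IDF weights are computed over all tokens (rather than picking the first three words) and compared by cosine similarity, and the clustering is a greedy single pass that compares each name to the first member of every existing cluster by Levenshtein distance. The smoothed IDF formula, the `max_ratio=0.35` threshold, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokens(name):
    """Lowercase, rewrite the inch mark, split on whitespace."""
    return name.lower().replace('"', " inch ").split()

def tfidf_vectors(names):
    """One sparse {token: weight} dict per name, using a smoothed IDF."""
    docs = [Counter(tokens(n)) for n in names]
    doc_freq = Counter(tok for d in docs for tok in d)  # documents containing tok
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cluster_by_edit_distance(names, max_ratio=0.35):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for name in names:
        key = " ".join(sorted(tokens(name)))  # sorting makes word order irrelevant
        for c in clusters:
            limit = max_ratio * max(len(key), len(c["key"]))
            if levenshtein(key, c["key"]) <= limit:
                c["members"].append(name)
                break
        else:
            clusters.append({"key": key, "members": [name]})
    return [c["members"] for c in clusters]

# Made-up sample; only the two Dell names come from the post.
sample = ['Dell 17" XPS', 'Dell XPS Laptop 17 inch',
          'HP Pavilion 15 inch', 'Apple iPad Air']
vecs = tfidf_vectors(sample)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
print(cluster_by_edit_distance(sample))
```

The greedy pass depends on input order and on the threshold, and comparing every name against every cluster gets expensive on large catalogues, so it is worth tuning on a labeled sample first. Restricting comparisons to rows that share a cart id would cut the number of pairs considerably.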
This would be the first time I am doing a text analysis project, so I
am wondering if I need a lot of data, instead of just a smaller
sample, as well as what would be the best approach and tools. I am
familiar with R, Matlab, Pig and some Scala, and am willing to learn
other languages as well.
Thanks a lot!
