c***z Posts: 6348 | 1 Hi all,
I am currently working on building a unified product category scheme for
products listed on various websites.
I can think of several approaches:
1. clustering using Jaccard index
2a. decision tree based on a manually built dictionary
2b. decision tree based on entropy (real machine learning)
3. neural network (I am least familiar with this approach, but my boss is
very keen on it)
It would be great if you could give some suggestions or comments! I can
provide more details if needed.
Thanks a lot! |
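For approach 2b above, here is a minimal sketch of how an entropy-based tree would score a candidate split. The titles, labels, and the `information_gain` helper are made up for illustration; a real tree learner evaluates many candidate features and recurses on each branch.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(titles, labels, keyword):
    """Entropy reduction from splitting on 'keyword appears in the title'."""
    with_kw, without_kw = [], []
    for title, label in zip(titles, labels):
        words = title.lower().split()
        (with_kw if keyword in words else without_kw).append(label)
    # Weighted entropy of the two branches (empty branches contribute nothing).
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (with_kw, without_kw) if part)
    return entropy(labels) - weighted


# Hypothetical toy data, not from the actual project.
titles = ["red cotton shirt", "blue denim shirt",
          "usb phone charger", "wireless phone case"]
labels = ["apparel", "apparel", "electronics", "electronics"]

gain_shirt = information_gain(titles, labels, "shirt")  # separates the classes
gain_red = information_gain(titles, labels, "red")      # barely informative
print(gain_shirt, gain_red)
```

A keyword that cleanly separates categories ("shirt") gets a much higher gain than an incidental one ("red"), which is how the tree would choose its splits without a hand-built dictionary.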
c****t Posts: 19049 | 2 I don't quite follow. What kind of NN? Does your boss want to do deep learning? |
c***z Posts: 6348 | 3 Yes, my boss wants to do deep learning, and wants it done in one week... |
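c****t Posts: 19049 | 4 Most of the ready-made code seems to be in MATLAB. Either port it to Python
yourself, or use the one from LISA lab. I think the LISA lab code is for Boltzmann
machines and has no Bayesian networks. It is all API programming, so nothing to be afraid of.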
|
c***z Posts: 6348 | 5 Thanks a lot for the information! I will check it out and keep you updated. |
c***z Posts: 6348 | 6 An update: clustering didn't work well.
I knew that k-means wouldn't work well with Jaccard: k-means is built around
squared Euclidean distance, and the arithmetic mean of a cluster does not
minimize Jaccard distance, so the centroid update is no longer guaranteed to
reduce the clustering objective.
I tried hierarchical agglomerative clustering and it didn't work well either.
I believe the reason is feature selection: I should have used trigrams and the
like instead of single words, since trivial words led to mis-clustering.
I am working on trigrams as well as the NN, and will keep updating here. Thanks a
lot casact and guys! |
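To illustrate the feature-selection point above, here is a small sketch comparing word features with trigram features under Jaccard distance, using a naive average-linkage agglomerative clustering. It assumes "trigrams" means character trigrams (word trigrams are another possible reading), and the titles, the `threshold` value, and all function names are hypothetical.

```python
def word_set(title):
    """Set of lowercase words in a title."""
    return set(title.lower().split())


def char_trigrams(title):
    """Set of character trigrams of the whitespace-normalized title."""
    text = " ".join(title.lower().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0 for two empty sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def agglomerative(feature_sets, threshold):
    """Naive average-linkage clustering: repeatedly merge the closest pair
    of clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(feature_sets))]

    def linkage(c1, c2):
        total = sum(jaccard_distance(feature_sets[i], feature_sets[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical titles: 0 and 1 are shirts, 2 and 3 are phone cases.
titles = ["the red shirt for men", "red shirts mens",
          "the phone case for men", "phone cases"]

by_words = agglomerative([word_set(t) for t in titles], threshold=0.7)
by_trigrams = agglomerative([char_trigrams(t) for t in titles], threshold=0.7)
print(by_words)     # [[0, 2], [1], [3]]
print(by_trigrams)  # [[0, 1], [2, 3]]
```

With word features, the shared filler words ("the", "for", "men") pull titles 0 and 2 together, while "shirt"/"shirts" and "case"/"cases" do not match at all. Character trigrams pick up the shared stems, so the shirts and the phone cases end up in separate clusters. Stop-word removal would be another way to attack the same problem.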
l*******m Posts: 1096 | 7 Is this problem supervised or unsupervised?
|
c***z Posts: 6348 | 8 It is unsupervised. Even though we have Y labels, there is no easy way to
check accuracy, if I understand it correctly. But I could be wrong, and I
would be very glad to be corrected! :) |
c***z Posts: 6348 | 9 An update:
The requirement has changed to matching NPD categories, so it became a
supervised learning problem.
I used a hand-coded dictionary of keywords and hand-labeled items. The model
was a decision tree with two stages: first filter out irrelevant items
(90% accuracy), then assign labels (82% accuracy).
The good thing is that this can be iterative: we can improve the dictionary
using the confusion matrix, and repeat until the accuracy is high enough.
Thanks a lot guys! |
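The iterative step described above could look roughly like the sketch below: tally a confusion matrix from hand labels and predictions, then list the most frequent mistakes to decide which dictionary entries to revise first. The category names and predictions are invented for illustration (they are not actual NPD categories or results), and the model that produces the predictions is left out.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; off-diagonal entries are errors."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Fraction of items whose predicted label equals the true label."""
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / sum(matrix.values())


def worst_confusions(matrix, top=3):
    """Most frequent (true, predicted) mistakes, most common first."""
    errors = Counter({pair: n for pair, n in matrix.items()
                      if pair[0] != pair[1]})
    return errors.most_common(top)


# Hypothetical hand labels and model output for eight items.
true_labels = ["toys", "toys", "toys", "apparel",
               "apparel", "footwear", "footwear", "footwear"]
predicted = ["toys", "toys", "apparel", "apparel",
             "apparel", "apparel", "apparel", "footwear"]

matrix = confusion_matrix(true_labels, predicted)
print(accuracy(matrix))             # 0.625
print(worst_confusions(matrix, 1))  # [(('footwear', 'apparel'), 2)]
```

Here footwear items are most often mislabeled as apparel, so the footwear keywords would be the first dictionary entries to look at before retraining and repeating.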