c***z posts: 6348 | 1 This is something I am working on, and I would like to hear any ideas you
have.
Say we have millions of product names, such as "Xbox 360", "Playstation 4",
etc.
We want to extract (tokenize) meaningful information from billions of URLs (
click history), and want to distinguish the 360 in "Xbox 360" (useful) and
the 360 in session ids (garbage).
For example, given
www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
The first 09 is size (keep) and the second 09 is garbage (drop)
We want: amazon nike running shoes 09 mens buy hello there; but we want to
drop: abc 123, as well as the second 09
Due to the size of the data, manually checking the names is impossible. Does
anyone have a clue?
I am thinking about a hash table, but that means the parsing time rises
from O(1) to O(N), and N is in the millions!
Thanks! | I******y posts: 176 | 2 Not sure I understand this correctly, so just a couple of rough thoughts:
I think you could classify the URLs by pattern and then extract the things you want.
Taking your example, if URLs under the amazon domain all follow the pattern domain/brand/item%size/
..., then knowing that pattern lets you pull out what you need. | c***z posts: 6348 | 3 Sounds good, will take a look at the patterns. Thanks a lot! | l******n posts: 9344 | 4 Just use regular expression matching.
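A rough sketch of the per-site pattern idea, in Python. The URL layout (host/brand/item-words-size/...) and the names AMAZON_LIKE and extract_fields are hypothetical, invented for illustration; each real site would need its own pattern, and this is not a description of how any actual site lays out its URLs.

```python
import re

# Hypothetical pattern: host/brand/item-words-size/...
# Real URL layouts differ per site; this one is invented for illustration.
AMAZON_LIKE = re.compile(
    r"^(?:www\.)?(?P<site>[a-z0-9-]+)\.com"   # site name
    r"/(?P<brand>[a-z0-9]+)"                  # first path segment
    r"/(?P<item>[a-z-]+?)-(?P<size>\d+)"      # item words, then a numeric size
    r"(?:/|$)"
)

def extract_fields(url):
    """Return the named fields if the URL fits the pattern, else None."""
    match = AMAZON_LIKE.match(url.lower())
    if match is None:
        return None
    fields = match.groupdict()
    fields["item"] = fields["item"].replace("-", " ")
    return fields

print(extract_fields("www.amazon.com/nike/running-shoes-09/buy?q=abc"))
# {'site': 'amazon', 'brand': 'nike', 'item': 'running shoes', 'size': '09'}
```

URLs that do not fit the pattern return None, so they can be routed to another pattern or set aside for inspection.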
| c***z posts: 6348 | 5 Can you give more details?
I did regular expression matching in R on company names. It was a pain in the
butt, and that was only 10k names... | r*******y posts: 626 | 6 One way is to build clickstreams that lead to a sale, then use the item
sold to make sense of the URLs clicked during the process.
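| b**L posts: 646 | 7 The 09 in running-shoes%09mens is not size 9. %dd is a percent-encoded ASCII code (%09 is a tab),
so a regex should be able to parse these URLs easily. | c***z posts: 6348 | 8 Thanks for the input, everyone. I really am unfamiliar with URLs, haha, and still have a lot to learn. |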
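| b*****o posts: 715 | 9 I don't understand what you are asking at all.
In your example, every %09 is a percent-encoded (URL-escaped) tab character:
>>> from urllib.parse import unquote  # Python 3; in Python 2 this was urllib.unquote
>>> unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
Also, why drop q=... and x=... but keep ref=...? Functionally they are no different:
they are all params in a GET request. Or do you have a whitelist/blacklist of params?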
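If there is a param whitelist, the tokenizing step could look roughly like the sketch below. KEEP_PARAMS and tokenize_url are hypothetical names, and the choice to keep only ref is an assumption taken from the example in the first post, not a recommendation.

```python
import re
from urllib.parse import urlsplit, parse_qsl, unquote

# Assumption: a hand-maintained whitelist of query params worth keeping.
KEEP_PARAMS = {"ref"}

def tokenize_url(url, keep_params=KEEP_PARAMS):
    """Split a URL into lowercase word tokens, dropping non-whitelisted params."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = parts.hostname or ""
    host_words = [w for w in host.split(".") if w not in ("www", "com")]
    # parse_qsl percent-decodes values, so %09 becomes a tab here
    kept_values = [v for k, v in parse_qsl(parts.query) if k in keep_params]
    text = " ".join([unquote(parts.path)] + kept_values)
    return host_words + re.findall(r"[a-z0-9]+", text.lower())

print(tokenize_url(
    "www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"
))
# ['amazon', 'nike', 'running', 'shoes', 'mens', 'buy', 'hello', 'there']
```

Note that no 09 token comes out at all: once decoded, both %09 sequences are just whitespace between words.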
| c***z posts: 6348 | 10 Ah, I just realized that I know too little about URL parsing. I will ask the
engineers so that I can ask the question more intelligently.
| l******0 posts: 244 | 11 we have millions of product names, such as "Xbox 360"
--- Are the 'millions of product names' known and already in your database, or
unknown?
Do you want to extract company name --> product name from the URL, or something else?
My first impression is to sort all the URL lines alphabetically, which
would make it much easier to identify the different URL patterns from different sites.
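If the names are known and stored, one possible approach is a dictionary lookup. The sketch below assumes the URL has already been split into lowercase tokens; build_index and find_products are hypothetical helper names. Each set lookup is O(1) on average, so the cost per URL depends on the number of tokens and the length of the longest name, not on how many names are in the index.

```python
def build_index(product_names):
    """Store each known name as a tuple of lowercase words in a hash set."""
    index = {tuple(name.lower().split()) for name in product_names}
    max_len = max((len(key) for key in index), default=0)
    return index, max_len

def find_products(tokens, index, max_len):
    """Greedy longest-match scan of a token list against the name index."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in index:      # average O(1) hash lookup
                found.append(" ".join(candidate))
                i += n
                break
        else:
            i += 1                      # no known name starts at this token
    return found

index, max_len = build_index(["Xbox 360", "Playstation 4"])
print(find_products(["buy", "xbox", "360", "sid", "360"], index, max_len))
# ['xbox 360']
```

In this toy run the 360 next to xbox is kept as part of a known name, while the stray 360 after sid matches nothing and is dropped.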
| l*******s posts: 1258 | 12 This is a sequence labeling task:
a URL is a sequence, and your task is to find the terms within it.
It is similar to the named entity recognition task.
You can read some papers about it.
Models: CRF, MEMM, HMM
Training data: label it manually | l*******s posts: 1258 | 13 cont:
Use the tags B, I, and O to indicate the beginning, inside, and outside of a word.
Each character in the URL is assigned one tag: B, I, or O.
This then becomes a classification task with just 3 class labels: B, I, O.
Grab any classifier you want; mine is MaxEnt.
Feature engineering:
convert each character to a feature vector. The most helpful features will
be: character n-grams before or after the current character, length of the URL,
whether there is a digit or letter in neighboring characters, and of course
the current character.
Model training and decoding:
This step is pretty simple, exactly the same as in any other classification
task.
tips: use some post-processing rules to improve.
|
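A minimal sketch of the character-level B/I/O setup described in the last posts, in plain Python. The sample URL, the hand-labeled spans, and the names bio_tags and char_features are made up for illustration; the feature set is only a small subset of what was suggested, and the resulting dicts would still need to be vectorized and passed to whatever classifier you choose.

```python
def bio_tags(url, spans):
    """Tag each character B/I/O given (start, end) spans of labeled words."""
    tags = ["O"] * len(url)
    for start, end in spans:            # end is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def char_features(url, i, window=2):
    """Feature dict for character i: itself, its neighbors, and simple flags."""
    feats = {"char": url[i], "is_digit": url[i].isdigit(), "url_len": len(url)}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        neighbor = url[j] if 0 <= j < len(url) else "<pad>"
        feats[f"char[{offset:+d}]"] = neighbor
        feats[f"digit[{offset:+d}]"] = neighbor.isdigit()
    return feats

url = "/xbox-360?sid=360"
spans = [(1, 5), (6, 9)]   # hand labels: "xbox" and the first "360"
tags = bio_tags(url, spans)
X = [char_features(url, i) for i in range(len(url))]
print("".join(tags))       # OBIIIOBIIOOOOOOOO
```

The second 360 (after sid=) is tagged O in the hand labels, which is exactly the distinction the classifier would have to learn from its neighboring characters.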