[Data Science Project Case] Parsing URLs - DataSciences board

c***z
Posts: 6348

This is something I am working on, and I would like to hear any ideas you
have.
Say we have millions of product names, such as "Xbox 360", "Playstation 4",
etc.
We want to extract (tokenize) meaningful information from billions of URLs
(click history), and we want to distinguish the 360 in "Xbox 360" (useful)
from the 360 in session IDs (garbage).
For example, given
www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
the first 09 is a size (keep) and the second 09 is garbage (drop).
We want to keep: amazon nike running shoes 09 mens buy hello there; and we
want to drop: abc 123, as well as the second 09.
Given the size of the data, checking the names manually is impossible. Does
anyone have an idea?
I am thinking about a hash table, but I worry that would raise the parsing
time from O(1) to O(N), and N is in the millions!
Thanks!
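
On the hash-table worry: a membership test against a hash set takes O(1) time on average no matter how many names it holds, so the cost per URL grows with the number of tokens in that URL, not with N. Below is a minimal sketch, assuming the product vocabulary fits in memory. `PRODUCT_NAMES`, `KNOWN_TOKENS`, and `keep_known_tokens` are hypothetical names made up for illustration.

```python
import re

# Hypothetical vocabulary; in practice it would be loaded from the
# product-name database mentioned in the post.
PRODUCT_NAMES = ["Xbox 360", "Playstation 4", "Nike running shoes"]
KNOWN_TOKENS = {tok for name in PRODUCT_NAMES for tok in name.lower().split()}

def keep_known_tokens(url):
    """Split a URL on non-alphanumeric characters and keep known tokens."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    # Each `in` test is an average O(1) hash lookup.
    return [t for t in tokens if t in KNOWN_TOKENS]
```

This only addresses lookup cost. A session ID that happens to be exactly "360" would still match, so telling the two kinds of 360 apart needs context such as the position in the URL.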

I******y
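Posts: 176

Not sure I understand this correctly, but here are my two cents:
I think you could classify the URLs by pattern and then extract the things
you want.
Taking your example, suppose URLs under the amazon domain all follow the
pattern domain/brand/item%size/... Once you know the pattern, you can pull
out the parts you need.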
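
The per-domain pattern idea could be sketched as below. This is only an illustration: the template for www.amazon.com is the hypothetical domain/brand/item layout from the post, not Amazon's real URL scheme, and `DOMAIN_PATTERNS` and `extract_fields` are made-up names. It assumes URLs come without a scheme, as in the thread's example.

```python
import re

# Hypothetical per-domain path templates; real ones would come from
# inspecting each site's URLs.
DOMAIN_PATTERNS = {
    "www.amazon.com": re.compile(r"^/(?P<brand>[^/]+)/(?P<item>[^/?]+)"),
}

def extract_fields(url):
    """Return the named fields for a URL whose domain has a known template."""
    host, _, rest = url.partition("/")
    pattern = DOMAIN_PATTERNS.get(host)
    if pattern is None:
        return None  # no template registered for this domain
    match = pattern.match("/" + rest)
    return match.groupdict() if match else None
```

For the thread's example URL this yields the brand `nike` and the item segment `running-shoes%09mens`, which can then be tokenized further.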

c***z
Posts: 6348

Sounds good, will take a look at the patterns. Thanks a lot!

l******n
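Posts: 9344

Just use regular expression matching.

[In reply to I******y's post:]

: Not sure I understand this correctly, but here are my two cents:
: I think you could classify the URLs by pattern and then extract the things
: you want.
: Taking your example, suppose URLs under the amazon domain all follow the
: pattern domain/brand/item%size/... Once you know the pattern, you can pull
: out the parts you need.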

c***z
Posts: 6348

Can you give more details?
I did regular expression matching in R on company names. It was a pain in
the butt, and that was only 10k names...

r*******y
Posts: 626

One way is to establish a clickstream that leads to a sale, then use the
item sold to make sense of the URLs clicked during the process.

[In reply to c***z's post:]

: This is something I am working on and would like to hear if you have any
: clue.
: Say we have millions of product names, such as "Xbox 360", "Playstation 4",
: etc.
: We want to extract (tokenize) meaningful information from billions of URLs (
: click history), and want to distinguish the 360 in "Xbox 360" (useful) and
: the 360 in session ids (garbage).
: For example, given
: www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
: The first 09 is size (keep) and the second 09 is garbage (drop)

b**L
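Posts: 646

In running-shoes%09mens, the %09 is not size 9. %dd is a percent-encoded
character code (two hex digits; %09 is the ASCII tab).
So it should be easy to parse these URLs with regex.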

c***z
Posts: 6348

Thanks for everyone's input. I am really not familiar with URLs, haha. I
still have a lot to learn.

b*****o
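Posts: 715

I don't really follow what you are saying.
In the example you gave, both %09 are percent-escaped characters:
>>> from urllib.parse import unquote  # Python 2: urllib.unquote
>>> unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
Also, why drop q=... and x=... but keep ref=...? Functionally they are no
different: they are all params in the GET request. Or do you have a
whitelist/blacklist of params?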
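
A param whitelist like the one asked about could be sketched with the standard library as follows. The whitelist contents (`KEEP_PARAMS`) and the function name `url_tokens` are assumptions for illustration; which params are worth keeping is exactly the open question in the thread.

```python
import re
from urllib.parse import parse_qsl, unquote, urlsplit

# Hypothetical whitelist: query parameters whose values are worth keeping.
KEEP_PARAMS = {"ref"}

def url_tokens(url, keep_params=KEEP_PARAMS):
    """Tokenize host and path, plus the values of whitelisted query params."""
    # urlsplit only recognizes the host when a scheme or '//' is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    pieces = [parts.hostname or "", unquote(parts.path)]
    # parse_qsl percent-decodes parameter names and values.
    pieces += [v for k, v in parse_qsl(parts.query) if k in keep_params]
    text = " ".join(pieces).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]
```

On the thread's example URL, neither 09 survives: once decoded, each %09 is a tab character and simply acts as a separator between tokens.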

[In reply to c***z's post]

c***z
Posts: 6348

Ah, I just realized that I know too little about URL parsing. I will talk to
the engineers so that I can ask the question more intelligently.

[In reply to b*****o's post:]

: I don't really follow what you are saying.
: In the example you gave, both %09 are percent-escaped characters:
: >>> from urllib.parse import unquote  # Python 2: urllib.unquote
: >>> unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
: 'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
: Also, why drop q=... and x=... but keep ref=...? Functionally they are no
: different: they are all params in the GET request. Or do you have a
: whitelist/blacklist of params?


l******0
Posts: 244

"we have millions of product names, such as "Xbox 360""
--- Are the 'millions of product names' known and in your database, or
unknown?
Do you want to extract company name --> product name from the URL, or
something else?
My first impression is to sort all the URL lines alphabetically, which would
make it much easier to identify the different URL patterns used by different
sites.
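
One way to act on the sorting suggestion is to sort the URLs, group them by host, and count coarse path "shapes" per host. This is a rough sketch under assumptions: URLs have no scheme (as in the thread), and the shape encoding (W for a word-like segment, N for an all-digit one) and the helper names are invented here for illustration.

```python
import re
from collections import Counter
from itertools import groupby

def host_of(url):
    """Everything before the first '/' (the thread's URLs have no scheme)."""
    return url.split("/", 1)[0]

def path_shape(url):
    """Replace each path segment with a coarse class: W (word) or N (number)."""
    segments = url.split("?", 1)[0].split("/")[1:]
    return "/".join("N" if re.fullmatch(r"\d+", s) else "W" for s in segments)

def shapes_by_host(urls):
    """Sort URLs by host, then count path shapes within each host."""
    ordered = sorted(urls, key=lambda u: (host_of(u), u))
    return {
        host: Counter(path_shape(u) for u in group)
        for host, group in groupby(ordered, key=host_of)
    }
```

The most frequent shapes for a host are candidates for hand-written extraction patterns.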

[In reply to c***z's post]

l*******s
Posts: 1258

This is a sequence labeling task:
a URL is a sequence, and your task is to find the terms within it.
It's similar to the named entity recognition task.
You can read some papers about it.
Models: CRF, MEMM, HMM
Training data: label them manually
l*******s
Posts: 1258

cont:
Use the tags B, I, and O to indicate the beginning, inside, and outside of a
word.
Each character in the URL is assigned one tag: B, I, or O.
Then this becomes a classification task with just 3 class labels: B, I, O.
Grab any classifier you want; mine is MaxEnt.
Feature engineering:
Convert each character to a feature vector. The most helpful features will
be: the character n-grams before and after the current character, the length
of the URL, whether the neighboring characters are digits or letters, and of
course the current character itself.
Model training and decoding:
This step is pretty simple, exactly the same as in any other classification
task.
Tips: use some post-processing rules to improve the results.
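
The character-level B/I/O labeling and the features listed above can be sketched as follows. This is a minimal illustration, not the poster's actual setup: the function names, the padding symbols, and the window size of 2 are assumptions, and the labeled spans would come from manual annotation.

```python
def bio_tags(url, spans):
    """Label each character B, I, or O given (start, end) spans of terms."""
    tags = ["O"] * len(url)
    for start, end in spans:          # end is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def char_features(url, i, window=2):
    """Feature dict for the character at position i."""
    feats = {
        "char": url[i],
        "is_digit": url[i].isdigit(),
        "is_alpha": url[i].isalpha(),
        "url_len": len(url),
    }
    for off in range(1, window + 1):
        # '^' and '$' are padding symbols for positions past either end.
        feats[f"prev{off}"] = url[i - off] if i - off >= 0 else "^"
        feats[f"next{off}"] = url[i + off] if i + off < len(url) else "$"
    # Character n-gram context on each side of position i.
    feats["left_ngram"] = url[max(0, i - window):i]
    feats["right_ngram"] = url[i + 1:i + 1 + window]
    return feats
```

Each (feature dict, tag) pair is one training example. The dicts can be vectorized and fed to whatever classifier is at hand, for example a MaxEnt (multinomial logistic regression) model.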
