This page is an excerpt and archive of the corresponding mitbbs (未名空间) thread; posts less than one week old show at most 50 characters, older posts at most 500, so several replies below are cut off mid-sentence.
Database board - question on large tables (>= 800 million records, 10 GB of data)
b*****y
Posts: 26
1
Hello, this is a data set for data mining.
I believe the experiences on this case should be helpful in general.
The question is: how can we make fast queries on large tables
(>= 800 million records, 10 GB of data)
on ordinary machines?
Below are some details:
There is only one table, with the following fields:
cabId CHAR(8), timestamps DATETIME, longitude FLOAT, latitude FLOAT, status
CHAR(1)
We want to be able to query on cabId, timestamps, latitude, and longitude.
We also need to find out a
t*****g
Posts: 1275
2
Have you ever considered a non-DB approach?
The data structure looks pretty straightforward. You may find a way (with
some compression) to put all the data, including a unique key (8 bytes), into a
64-bit machine with 16 GB of RAM.
Now think about the reverse index. You need to build an index on cabId/timestamp
/longitude/latitude; each index could fit into a single machine.
When queries come in, get the sets of unique keys from all indices and find their
intersection.
Do your final query on the data host to return the result.
I wou

【Quoting b*****y's post #1 above】
b*****y
Posts: 26
3
Thanks! You are so kind and helpful.
We have considered this approach.
We also considered the "user-defined storage engine" approach.
Both MySQL and PostgreSQL support user-defined storage engines.
That is to say, one can use C/C++ to write a dynamically linked module,
implementing some interface functions like fetching a record, fetching a
field,
handling the keys, etc., and put the module into a particular server.
We normally use either S-PLUS/R or MATLAB to do machine learning /
sta
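For readers unfamiliar with what such a module looks like, here is a hypothetical Python sketch of the shape of the interface being described. Real MySQL/PostgreSQL engines are C/C++ modules written against each server's fixed handler API; every name and the row layout below are illustrative assumptions only.

    import struct

    # One fixed-width row: cabId (8 bytes), epoch timestamp, lon, lat, status.
    ROW = struct.Struct("<8sdddc")

    class ReadOnlyEngine:
        # The server would call hooks like these; the module alone
        # decides the on-disk layout.
        def __init__(self, path):
            self.f = open(path, "rb")

        def fetch_record(self, rownum):
            # Random access by row number: seek and unpack one row.
            self.f.seek(rownum * ROW.size)
            return ROW.unpack(self.f.read(ROW.size))

        def fetch_field(self, record, field_index):
            # Project a single column out of a fetched record.
            return record[field_index]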
a*******t
Posts: 891
4
How is the data collected, or updated/inserted? Do you get a data feed at
a certain time of the day, or is this going to be a static set of data you are
working with?
Have you considered breaking the data into smaller groups of files, and loading
only the needed data into a temp table when requested?
cabID_0000.txt
cabID_0001.txt
....
Each of those files would contain all the data associated with that cabID,
and when a request comes in, read the file(s) and write them to a table for
querying.
you can do
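A minimal sketch of this split-then-load-on-demand idea, assuming a comma-separated master dump with the five columns named earlier (the file layout and helper names are hypothetical):

    import csv
    import os

    def split_by_cab(master_csv, out_dir):
        # Append each row to a per-cabId file, so a later request
        # only has to read the one file it needs.
        os.makedirs(out_dir, exist_ok=True)
        with open(master_csv, newline="") as f:
            for row in csv.reader(f):  # cabId, ts, lon, lat, status
                cab_id = row[0]
                with open(os.path.join(out_dir, f"cabID_{cab_id}.txt"),
                          "a", newline="") as out:
                    csv.writer(out).writerow(row)

    def load_for_query(cab_id, out_dir):
        # Read one cab's file back; the suggestion above would
        # bulk-load these rows into a temp table instead of a list.
        with open(os.path.join(out_dir, f"cabID_{cab_id}.txt"), newline="") as f:
            return list(csv.reader(f))

Opening the output file once per row keeps the sketch short; a real pass would cache open handles, subject to the OS file-descriptor limit.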
b*****y
Posts: 26
5
Thanks! I like this suggestion.
This is actually the approach we are currently using.
It is pretty ad hoc; however, it saves a lot of software-engineering time.
We dislike software engineering since we get no credit for doing it.
The current query time is normally 2-5 minutes.
This query time is not good for web apps,
but it is acceptable for data mining.
Since we do not update/insert,
the data-integrity issue of having several copies of the same data
is not a problem.
If a database storage en

【Quoting a*******t's post #4 above】
b*****y
Posts: 26
6
1.
Well, as I said, indexing large tables is not a good idea.
If an index cannot be loaded entirely into memory,
then we pay O(log n) random disk accesses per lookup,
which can easily lose to an O(n) sequential scan in memory.
This explains why database servers favor small index files and small
tables.
I can cite this link if you do not understand
why we normally do table scans on large tables:
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
2.
You severely underestimated the si
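A back-of-envelope Python calculation of point 1. The disk numbers are assumed, not measured, and the one-seek-per-matching-row model of a non-clustered index is a simplification:

    # Assumed commodity-disk numbers for the era: ~10 ms per random
    # seek, ~80 MB/s sequential read, 10 GB of table data.
    SEEK_S = 0.010
    SCAN_BYTES_PER_S = 80e6
    TABLE_BYTES = 10e9

    full_scan_s = TABLE_BYTES / SCAN_BYTES_PER_S  # ~125 s, whatever the query

    def index_lookup_s(rows_hit):
        # Roughly one random seek per matching row when neither the
        # index nor the table fits in memory.
        return rows_hit * SEEK_S

    print(full_scan_s)             # ~125 s
    print(index_lookup_s(1_000))   # ~10 s   -> the index wins
    print(index_lookup_s(50_000))  # ~500 s  -> the sequential scan wins

Under these assumptions the crossover is around 12,500 matching rows, which is tiny against 800 million records.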
b*****y
Posts: 26
7
There were several ambiguities in my post:
1.
"...
then we are going to have O(log n) disk access,
compared with O(n) memory access...."
Should be
"compared with O(n) memory access when we do a table scan.
Records can be loaded block by block for table scans.
In comparison, B-trees are normally accessed not sequentially"
2.
"...
Of course an index file of cabId can be much smaller by suffix compression.
But we need to uncompress the file in memory when we use it..."
The reason why an index on
w*r
Posts: 2421
8
I do not quite catch what you guys are discussing here;
you are just saying that you have an 800M-record table with no active insert/
update/delete and you want to query it by PK?
800M rows does not sound like an extremely huge table, considering your data
file is
only 10 GB.
If you just want to find results, then you may build a customized B-tree to
store your PK. You can easily implement partitioning or hashing here to help
compress the tree, right? With this implementation, you can shrink the
physical size of

【Quoting b*****y's post #5 above】
b*****y
Posts: 26
9
Question # 1:
Why build a customized B-tree/hash table?
How is it different from the B-tree implementation in a database server?
Why is the B-tree/hash table implemented in the MySQL server not good?
How would a customized B-tree/hash table help?
That somebody cannot drive a car from Boston to S.F. in one hour
does not necessarily mean you can do it if you drive yourself.
Question # 2:
How will upgrading hardware make the application faster,
say from 5 minutes per query to 1 minute per query?

【Quoting w*r's post #8 above】
w*r
Posts: 2421
10
a customized hash algorithm to help you partition your data based on the
features of your query. I do not know how MySQL implements its algorithms.
Here are my 2 cents based on my understanding of Teradata.
If you have a primary index (unique or not unique), you start by trying
to distribute your data evenly into several segments using these
columns (I am assuming your query conditions are primarily based on these
columns), let's say 10.
Then, based on your PK, you build a hash algorithm to

【Quoting b*****y's post #9 above】
b*****y
Posts: 26
11
Thanks for the help offered and the clarification. I appreciate it!
Yes, splitting large data sets into smaller files based on keys helps
greatly.
In this way, we have the keys implicitly implemented (in terms of application-
specific semantics)
without using any storage.
It also helps parallelism on an SMP machine or a cluster.
This is exactly what Assailant pointed out (see previous posts).
Assailant also suggested splitting the data set in different ways by
different keys.
Both suggestions are act

【Quoting w*r's post #10 above】
w*r
Posts: 2421
12
Teradata is by far the best platform for a terabyte-scale data warehouse. The
bottom-up parallelism implementation in Teradata ensures quite linear
performance.
The product is quite expensive in terms of license and service. I doubt
academic institutions can easily afford the product and service;
however, an academic license might be fairly cheap, and you can always find
students willing to learn who will maintain the software/hardware.
What I wanted to clarify is how the parallelism is designed

【Quoting b*****y's post #11 above】
w******n
Posts: 692
13
I would consider Berkeley DB.

【Quoting b*****y's post #1 above】

b*****y
Posts: 26
14
Can you give us the reason why you would consider BDB?
I would consider BDB the worst choice. Here is my reasoning:
According to the features provided by BDB,
http://dev.mysql.com/doc/refman/5.0/en/bdb-characteristics.html
* It provides transactional tables.
Transactions are not needed at all for data mining.
What is more, I have to pay the price for having transactions,
and I cannot disable the transaction feature.
* Each table has to have a primary key, which is stored with the table data
(to

【Quoting w******n's post #13 above】

c**t
Posts: 2744
15
I would recommend this: http://en.wikipedia.org/wiki/Skip_list

【Quoting b*****y's post #14 above】
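For reference, a minimal Python skip list along the lines of the Wikipedia article recommended above (insert and search only; expected O(log n) per operation):

    import random

    class SkipNode:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * (level + 1)

    class SkipList:
        # Each node is promoted to the next level with probability P,
        # giving a hierarchy of express lanes over the base linked list.
        MAX_LEVEL = 16
        P = 0.5

        def __init__(self):
            self.head = SkipNode(None, self.MAX_LEVEL)
            self.level = 0

        def _random_level(self):
            lvl = 0
            while random.random() < self.P and lvl < self.MAX_LEVEL:
                lvl += 1
            return lvl

        def search(self, key):
            # Descend level by level, moving right while the next key
            # is still smaller than the target.
            node = self.head
            for i in range(self.level, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
            node = node.forward[0]
            return node is not None and node.key == key

        def insert(self, key):
            # Record the rightmost node visited at each level, then
            # splice the new node in at a random height.
            update = [self.head] * (self.MAX_LEVEL + 1)
            node = self.head
            for i in range(self.level, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            lvl = self._random_level()
            self.level = max(self.level, lvl)
            new = SkipNode(key, lvl)
            for i in range(lvl + 1):
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new

    sl = SkipList()
    for k in (30, 10, 20):
        sl.insert(k)
    print(sl.search(20), sl.search(25))  # True False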
