An interview question: data store schema design - Database版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Database版 - An interview question: data store schema design

相关主题
● 如何决定index	● file based database
● 紧急求助，关于SQL Server	● 我经常问的几道SQL SERVER DBA的面试题，图省事不问编程
● indexing就是设置primary key吗？	● 做DW的，如何估计/衡量一项任务的工作量？
● SQL, recruiter发过来的面试题 (转载)	● 被人鄙视了
● Question about SQL server lock	● 出错了
● [转载] 问个datamining clustering的问题。	● 请教三个Key的property,
● 请教 sql server index问题	● 包子：SQL Server如何将一个字段的内容复制到另外一个字段？
● Re: sql server 面试题（5）	● 数据 migrating中的风险，注意事项

相关话题的讨论汇总
话题: null话题: userid话题: key话题: files

进入Database版参与讨论

(共1页)

y****3
发帖数: 131

Design a backend store for storing metadata about files and directories
belonging to different users as part of an online storage service. The
logical data model is as follows:
--a user has zero or more files or directories
--a directory contains zero or more files or sub-directories
--a file or directory has a relative path that is up to 32KB in length
--a file has a size and last modified time
This store should handle the following online queries efficiently:
--lookups on single files or sub-directory given a user and relative path
--listing all files and sub-directories belonging to a user or at a given
level of the hierarchy inserting new files and directories
It should also support the following batch operations that run on a daily
basis:
--count the total number of files per user, and the aggregate file size.
--count the number of files and sub-directories added per day for each user.
I can think of creating such a schema:
1. User table (userName, userId, etc)
2. File table (userId, file/dirName, relativePath, size, lastModified,
isFile) whereas as key, and is a foreign key
to User table.
Not sure if this the right answer. Also, if the data grow to say, billions
of rows, would sharding a good enough solution?

i****a
发帖数: 36252

What database?
Sql server 2008 and up has organization structure data type, can use for
this to store each folder and file as nodes.

[发表自未名空间手机版 - m.mitbbs.com]

【在 y****3 的大作中提到】

: Design a backend store for storing metadata about files and directories
: belonging to different users as part of an online storage service. The
: logical data model is as follows:
: --a user has zero or more files or directories
: --a directory contains zero or more files or sub-directories
: --a file or directory has a relative path that is up to 32KB in length
: --a file has a size and last modified time
: This store should handle the following online queries efficiently:
: --lookups on single files or sub-directory given a user and relative path
: --listing all files and sub-directories belonging to a user or at a given

g***l
发帖数: 18555

是个FILE MANAGEMENT SYSTEM啊,我是这样DESIGN的，注意FOLDER和FILE都可以改名字
，所以要用ID,不要说FOLDER改了名字，就找不到FILE了。有一个SUBFOLDER存储FOLDER
和SUBFOLDER之间的HIERACHY关系，可以有好多LELVEL。
FILE
FILE_ID, FILE_NAME, FILE_ATTRIBUTES（SIZE, DATE,OWNER）..
FILE_FOLDER
FILE_ID, FOLDER_ID(last folder)
FOLDER
FOLDER_ID, FOLDER_NAME, FOLDER_ATTRIBUTES..
SUBFOLDER
FOLDER_ID SUB_FLODER_ID, LEVEL
USER
USER_ID, USER_NAME ETC
USER_FILE
USEER_ID, FILE_ID,TIME,ACTION(CREATE, MODIFY,DELETE，RENAME,MOVE)

y****3
发帖数: 131

多谢指点。
如果把每一个folder都加为一个row，那么insert和lookup时，是否都要把file path都
parse开，每个folder都要在folder table 里做一个lookup？

FOLDER

【在 g***l 的大作中提到】

: 是个FILE MANAGEMENT SYSTEM啊,我是这样DESIGN的，注意FOLDER和FILE都可以改名字
: ，所以要用ID,不要说FOLDER改了名字，就找不到FILE了。有一个SUBFOLDER存储FOLDER
: 和SUBFOLDER之间的HIERACHY关系，可以有好多LELVEL。
: FILE
: FILE_ID, FILE_NAME, FILE_ATTRIBUTES（SIZE, DATE,OWNER）..
: FILE_FOLDER
: FILE_ID, FOLDER_ID(last folder)
: FOLDER
: FOLDER_ID, FOLDER_NAME, FOLDER_ATTRIBUTES..
: SUBFOLDER

B*****g
发帖数: 34098

这个和department-manager-employee设计有神马区别？

【在 y****3 的大作中提到】

y****3
发帖数: 131

多谢回复。我是初涉DB design，所以如果说的不对，请多指教。
我想主要区别还是在scale上吧。这个系统要求支持billion files with millions
users。而且path最大值是32KB，理论上可以有上千层的folder。所以如果用
department-manager-employee，在lookup 和 insert时是否会太expensive?

【在 B*****g 的大作中提到】

: 这个和department-manager-employee设计有神马区别？

i****a
发帖数: 36252

hierarchy data type is more efficient.
you may have to do some research on the limit:
http://msdn.microsoft.com/en-us/library/bb677290(v=sql.100).asp
The encoding used in the hierarchyid type is limited to 892 bytes.
Consequently, nodes which have too many levels in their representation to
fit into 892 bytes cannot be represented by the hierarchyid type.

【在 y****3 的大作中提到】

: 多谢回复。我是初涉DB design，所以如果说的不对，请多指教。
: 我想主要区别还是在scale上吧。这个系统要求支持billion files with millions
: users。而且path最大值是32KB，理论上可以有上千层的folder。所以如果用
: department-manager-employee，在lookup 和 insert时是否会太expensive?

g***l
发帖数: 18555

是的，因为LEVEL可以有很多的

【在 i****a 的大作中提到】

: hierarchy data type is more efficient.
: you may have to do some research on the limit:
: http://msdn.microsoft.com/en-us/library/bb677290(v=sql.100).asp
: The encoding used in the hierarchyid type is limited to 892 bytes.
: Consequently, nodes which have too many levels in their representation to
: fit into 892 bytes cannot be represented by the hierarchyid type.

y****3
发帖数: 131

Thank you very much!
HierarchyId does seems to be a nice choice. However, as you said, if the
levels cannot be fit into 892 bytes, it will not work.
In my case, I guess with a relative path size limit of 32KB, it could be the
case that the levels will exceed 892 bytes.

【在 i****a 的大作中提到】

B*****g
发帖数: 34098

不明白，你在design阶段，为啥要有限制

the

【在 y****3 的大作中提到】

: Thank you very much!
: HierarchyId does seems to be a nice choice. However, as you said, if the
: levels cannot be fit into 892 bytes, it will not work.
: In my case, I guess with a relative path size limit of 32KB, it could be the
: case that the levels will exceed 892 bytes.

相关主题
● [转载] 问个datamining clustering的问题。	● file based database
● 请教 sql server index问题	● 我经常问的几道SQL SERVER DBA的面试题，图省事不问编程
● Re: sql server 面试题（5）	● 做DW的，如何估计/衡量一项任务的工作量？
进入Database版参与讨论

y****3
发帖数: 131

因为如果在已知数据里有无法用HierarchyId的，那design里就不能用了吧。否则可能
会有Data Corruption。

【在 B*****g 的大作中提到】

: 不明白，你在design阶段，为啥要有限制
:
: the

g***l
发帖数: 18555

其实就是个TREE结构吧

y****w
发帖数: 3747

更灵活，没那么多"现实"限制。假如我添一个需求，可以很方便的把一个有1000个sub
folder，2000000个files的目录挪到另一个子目录下去。这算是经常地用法吧，典型
tree问题，再去检查一下大家的设计是不是有比较好的支持。

【在 B*****g 的大作中提到】

: 这个和department-manager-employee设计有神马区别？

B*****g
发帖数: 34098

一个director，手下10个manager，100个team lead，1000个developer，要从一个VP手
下转到另外一个VP手下，就是数量级的区别。

sub

【在 y****w 的大作中提到】

: 更灵活，没那么多"现实"限制。假如我添一个需求，可以很方便的把一个有1000个sub
: folder，2000000个files的目录挪到另一个子目录下去。这算是经常地用法吧，典型
: tree问题，再去检查一下大家的设计是不是有比较好的支持。

y****w
发帖数: 3747

我觉得这个题目出的蛮有水平的：
1. tree结构设计。效率与灵活性如何兼顾。
2. 注意那个32k，可以考察下对dbms大字段支持的了解，是不是可以inline一部分，80
/20准则。
3. 接着可以考怎么写sql,一般的到recusrive的。基本上developer需要知道的query写
法基本上都能考察到。
4. 这个model涉及的东西是每个人都非常熟悉的。

【在 y****3 的大作中提到】

y****w
发帖数: 3747

tree model里面只要改director的parent就行了.
假如reporting chain短了或长了一截，有没有用level区别就很大。o(1) vs. O(n**level).

【在 B*****g 的大作中提到】

: 一个director，手下10个manager，100个team lead，1000个developer，要从一个VP手
: 下转到另外一个VP手下，就是数量级的区别。
:
: sub

B*****g
发帖数: 34098

我是没明白这两个设计有什么区别。

level).

【在 y****w 的大作中提到】

: tree model里面只要改director的parent就行了.
: 假如reporting chain短了或长了一截，有没有用level区别就很大。o(1) vs. O(n**level).

y****w
发帖数: 3747

level会变啊. 这个update代价可能会很大。
看挪动目录的case,你比较下和下面这个model的区别.假设你 mv /home/beijing/.../...<1000level>/somefolder /home/yhangw/somefolder
user: user_id, user_root_dir_id, other_attributes
directiory: dir_id, dir_name, parent_dir, other_attributes
file: file_id, file_name, parent_dir, size, mtime, ...
无论如何，这个需求interviewer没提出来，所以这种优点也未必被采纳。

【在 B*****g 的大作中提到】

: 我是没明白这两个设计有什么区别。
:
: level).

B*****g
发帖数: 34098

我是说manager-employee和directory-folder的设计区别

/...<1000level>/somefolder /home/yhangw/somefolder

【在 y****w 的大作中提到】

: level会变啊. 这个update代价可能会很大。
: 看挪动目录的case,你比较下和下面这个model的区别.假设你 mv /home/beijing/.../...<1000level>/somefolder /home/yhangw/somefolder
: user: user_id, user_root_dir_id, other_attributes
: directiory: dir_id, dir_name, parent_dir, other_attributes
: file: file_id, file_name, parent_dir, size, mtime, ...
: 无论如何，这个需求interviewer没提出来，所以这种优点也未必被采纳。

y****w
发帖数: 3747

奥~ 在我看来没区别，但对用level来说就有区别。
n久前在这儿讨论tree时候说的就是dept模型。一哥当时提到有些情况不太现实，比如每个人级别都得动啥的。但挪挪目录就无所谓了。
可以更复杂些，比如考虑link文件,这样的文件系统允许同一个文件在不同的级别上,而且有可能出现循环的情况。

【在 B*****g 的大作中提到】

: 我是说manager-employee和directory-folder的设计区别
:
: /...<1000level>/somefolder /home/yhangw/somefolder

相关主题
● 被人鄙视了	● 包子：SQL Server如何将一个字段的内容复制到另外一个字段？
● 出错了	● 数据 migrating中的风险，注意事项
● 请教三个Key的property,	● Can you cluster two different sql server version together?
进入Database版参与讨论

y****3
发帖数: 131

Again, thank you very much for everyone who kindly helped me in this. I am
already learning a lot as a db design newbie....
Yes, I totally agree with you.
2 is a very interesting point.Sql Server supports varchar(8000), and for
larger data you have to use varchar(max), which may not be in row.
I am thinking about adding a hash column so that I can do easy comparison
and also index.
1. is also very interesting. I am debating if I should add a table to
capture the sub-directory information like gekjl suggested.
SUBFOLDER FOLDER_ID SUB_FLODER_ID, LEVEL.
Currently, here is what I have, do you think this is a good design?
CREATE TABLE Users
(
UserID int PRIMARY KEY NOT NULL,
UserName varchar(256) NOT NULL
)
go
CREATE TABLE Directories
(
DirectoryID bigint PRIMARY KEY NOT NULL,
UserId int NOT NULL FOREIGN KEY REFERENCES Users (UserId),
RelativePath varchar(max) NOT NULL,
LastModified datetime NOT NULL,
CreateTime datetime NOT NULL
)
CREATE TABLE SubDirectories
(
DirectoryID bigint NOT NULL FOREIGN KEY REFERENCES Directories(DirectoryID),
SubDirectoryId bigint NOT NULL FOREIGN KEY REFERENCES Directories(
DirectoryID)
)
CREATE TABLE Files
(
FileId bigint PRIMARY KEY NOT NULL,
FileName varchar(256) NOT NULL,
UserID int NOT NULL FOREIGN KEY REFERENCES Users (UserId),
DirectoryId bigint NOT NULL FOREIGN KEY REFERENCES Directories(DirectoryID),
FileSize int NOT NULL,
LastModified datetime NOT NULL,
CreateTime datetime NOT NULL
)

80

【在 y****w 的大作中提到】

: 我觉得这个题目出的蛮有水平的：
: 1. tree结构设计。效率与灵活性如何兼顾。
: 2. 注意那个32k，可以考察下对dbms大字段支持的了解，是不是可以inline一部分，80
: /20准则。
: 3. 接着可以考怎么写sql,一般的到recusrive的。基本上developer需要知道的query写
: 法基本上都能考察到。
: 4. 这个model涉及的东西是每个人都非常熟悉的。

y****w
发帖数: 3747

subdir没必要。用不用level掌握90/10原则，问问会有多少更新，这个设计需不需要极
端照顾层次查询---而且，要想真正拿到level的好处,还得加上root。

【在 y****3 的大作中提到】

: Again, thank you very much for everyone who kindly helped me in this. I am
: already learning a lot as a db design newbie....
: Yes, I totally agree with you.
: 2 is a very interesting point.Sql Server supports varchar(8000), and for
: larger data you have to use varchar(max), which may not be in row.
: I am thinking about adding a hash column so that I can do easy comparison
: and also index.
: 1. is also very interesting. I am debating if I should add a table to
: capture the sub-directory information like gekjl suggested.
: SUBFOLDER FOLDER_ID SUB_FLODER_ID, LEVEL.

y****3
发帖数: 131

非常感谢各位的指点。
我最后的design是：
/****** Object: Table [dbo].[Users] ******/
CREATE TABLE [dbo].[Users](
[UserID] [int] NOT NULL,
[UserName] [varchar](256) NOT NULL,
PRIMARY KEY CLUSTERED
(
[UserID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
/****** Object: Table [dbo].[Directories] ******/
CREATE TABLE [dbo].[Directories](
[DirectoryID] [bigint] NOT NULL,
[UserId] [int] NOT NULL,
[RelativePath] [nvarchar](max) NULL,
[LastModified] [datetime] NOT NULL,
[CreateTime] [datetime] NOT NULL,
[ParentDirectoryId] [bigint] NULL,
[PathHash] [varchar](50) NOT NULL,
PRIMARY KEY CLUSTERED
(
[DirectoryID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[Directories] WITH CHECK ADD FOREIGN KEY([UserId])
REFERENCES [dbo].[Users] ([UserID])
/****** Object: Table [dbo].[Files] ******/
CREATE TABLE [dbo].[Files](
[FileId] [bigint] NOT NULL,
[FileName] [varchar](256) NOT NULL,
[UserID] [int] NOT NULL,
[DirectoryId] [bigint] NOT NULL,
[FileSize] [bigint] NOT NULL,
[LastModified] [datetime] NOT NULL,
[CreateTime] [datetime] NOT NULL,
CONSTRAINT [PK_Files] PRIMARY KEY CLUSTERED
(
[FileId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF
, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[Files] WITH CHECK ADD FOREIGN KEY([DirectoryId])
REFERENCES [dbo].[Directories] ([DirectoryID])
ALTER TABLE [dbo].[Files] WITH CHECK ADD FOREIGN KEY([UserID])
REFERENCES [dbo].[Users] ([UserID])
然后在join的column上建立non-clustered index。把solution交上去后好像他们还算
满意，所以给了on-site。再次感谢各位的帮助。

【在 y****w 的大作中提到】

: subdir没必要。用不用level掌握90/10原则，问问会有多少更新，这个设计需不需要极
: 端照顾层次查询---而且，要想真正拿到level的好处,还得加上root。

(共1页)

进入Database版参与讨论

相关主题
● 数据 migrating中的风险，注意事项	● Question about SQL server lock
● Can you cluster two different sql server version together?	● [转载] 问个datamining clustering的问题。
● 什么时候不用索引	● 请教 sql server index问题
● 找duplicate row的最有效的sql statement?	● Re: sql server 面试题（5）
● 如何决定index	● file based database
● 紧急求助，关于SQL Server	● 我经常问的几道SQL SERVER DBA的面试题，图省事不问编程
● indexing就是设置primary key吗？	● 做DW的，如何估计/衡量一项任务的工作量？
● SQL, recruiter发过来的面试题 (转载)	● 被人鄙视了

相关话题的讨论汇总
话题: null话题: userid话题: key话题: files

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天