Roundup of discussions about hdf5 - 话题女王

z**********i
Posts: 12276

From thread: Statistics board - anyone has experience with hdf5

I'm not familiar with it; this is the first I've heard of it.
Current Release: HDF5-1.8.10
HDF5 is a data model, library, and file format for storing and managing data.
It supports an unlimited variety of datatypes, and is designed for
flexible and efficient I/O and for high volume and complex data. HDF5 is
portable and is extensible, allowing applications to evolve in their use of
HDF5. The HDF5 Technology suite includes tools and applications for managing,
manipulating, viewing, and analyzing data in the HDF5 format.

p**z
Posts: 65

From thread: _Python board - Example: store data in an HDF5 file, then read it with Matlab

I won't translate my notes into Chinese; I'll just paste them as they are.
Example Python code below for creating the HDF5 file. Note that both the
uncompressed and the 'gzip' compression types can be understood by both Matlab and HDFView.
from __future__ import division, print_function
import h5py
import numpy as np
# Structured array: fixed-width byte string, int, float.
# 'S10' is the same type as the older 'a10' code, which NumPy 2.0 removed.
data = np.array([('John', 35, 160.5), ('Mary', 20, 150)], dtype=[('Name', 'S10'), ('Age', 'i'), ('Weight', 'f')])
##alternative:
#data = np.array([('John', 35, 160.5), ('Mary', 20, 150)], dtype = {'names': ['Name','Age','Weight'], 'formats':['a10','i','f'... (read full post)
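The excerpt is cut off before the write step, so the following is only a minimal sketch of how such a structured array could be written with gzip compression and read back using h5py. The file name `people.h5` and the dataset name `people` are assumptions for illustration, not taken from the post.

```python
import os
import tempfile

import h5py
import numpy as np

# Same record layout as in the post: fixed-width byte string, int32, float32.
data = np.array(
    [("John", 35, 160.5), ("Mary", 20, 150.0)],
    dtype=[("Name", "S10"), ("Age", "i4"), ("Weight", "f4")],
)

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "people.h5")

with h5py.File(path, "w") as f:
    # gzip is one of the two options the post says Matlab and HDFView accept.
    f.create_dataset("people", data=data, compression="gzip")

with h5py.File(path, "r") as f:
    restored = f["people"][...]              # read the whole dataset into memory
    used_compression = f["people"].compression

print(restored["Name"], restored["Age"], used_compression)
```

On the Matlab side, a file like this should be readable with `h5read(path, '/people')`, though that has not been checked here.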

gw
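Posts: 2175

From thread: Linux board - centos 6.9: what causes the relink error during make install?

The message says it needs to rename a library file, and the result is permission denied. In fact the file it wants to rename doesn't exist at all,
for example: mv hdf5.so.8.0.0.1 hdf5.so.8.0.0.1U
I looked at some of the debug output and it seems to mention "relink before install", which I don't understand.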

r****t
Posts: 10904

From thread: Programming board - read matlab .mat file, by C++???

.mat files saved in the newer v7.3 format are hdf5, so you can read them directly with the C++ API that hdf5 provides.

B********e
Posts: 1062

From thread: Programming board - Question: how to design a storage file format for a complex data class

json and hdf5 are both our friends. They are both very flexible and can be
used to represent complex structures.
Normally, we use json to define high-level concepts and hdf5 to store
the intermediate results.

D***h
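Posts: 183

From thread: Programming board - Reading pytables output written by python from java

These are hdf5 files written by python pandas, and I'd like to access that python-written hdf5
output directly from java. Is that feasible?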

p*******e
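Posts: 125

From thread: Programming board - Time series big data: how do you all think it's best stored?

Hdf5 on Hadoop? My feeling is that, apart from high-frequency data, most of it isn't that big, so wouldn't
storing hdf5 files split by time period (one file per year) work well? Hadoop hdfs may offer fault tolerance
as a benefit, but when a file is corrupted you can usually just load it again. What other benefits does a
distributed file system have for time series data? Discussion welcome. I thought of this because I heard
some fintech companies use Hadoop and spark to process this kind of data.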
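A rough sketch of the "one file per year" idea, using only the standard library. The naming scheme `ticks_<year>.h5` and the row layout are assumptions for illustration; the actual writing of each bucket to hdf5 is left out.

```python
from collections import defaultdict
from datetime import datetime


def yearly_file(ts, prefix="ticks"):
    """Hypothetical naming scheme: one file per calendar year."""
    return f"{prefix}_{ts.year}.h5"


def partition_by_year(rows):
    """Group (timestamp, value) rows by the yearly file they belong to."""
    buckets = defaultdict(list)
    for ts, value in rows:
        buckets[yearly_file(ts)].append((ts, value))
    return dict(buckets)


rows = [
    (datetime(2013, 12, 31, 23, 59), 1.0),
    (datetime(2014, 1, 1, 0, 0), 2.0),
    (datetime(2014, 6, 1, 9, 30), 3.0),
]
parts = partition_by_year(rows)
for name in sorted(parts):
    print(name, len(parts[name]))
```

With this split, a corrupted file only forces a reload of that one year, which is the trade-off the post weighs against hdfs fault tolerance.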

w********w
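Posts: 4

From thread: Computation board - Has anyone installed the meep package on ubuntu?

During simulation it keeps complaining that the hdf5 library can't find files such as h5f.c.
Could there be something wrong with the hdf5 serial package?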

w********w
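Posts: 4

From thread: EE board - Has anyone installed the meep package on ubuntu?

During simulation it keeps complaining that the hdf5 library can't find files such as h5f.c.
Could there be something wrong with the hdf5 serial package?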

r*****s
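Posts: 590

From thread: EE board - Has anyone installed the meep package on ubuntu?

Is hdf5 installed on your machine? As I recall it's required; I thought installing meep brings in hdf5 automatically.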

o**n
Posts: 1249

From thread: Mathematics board - Problem reading and writing large files in a loop in matlab

btw, the reason I'm using hdf5 instead of mat is that hdf5 can handle
multiple levels of data, such as my data structure: /Data001/X,Y,Z,.. /
Data002/X,Y,Z.. (X, Y, Z are matrices). You've got to use a structure in
this case for mat data, but I think a structure is not efficient in terms of
either time or space consumption.
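For illustration, the /Data001/X,Y,Z layout described above can be sketched with h5py groups. This is a minimal Python example rather than the poster's Matlab code; the file name `runs.h5` and the matrix contents are made up.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "runs.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    for i in (1, 2):
        # One group per data set, holding the X, Y and Z matrices.
        grp = f.create_group(f"Data{i:03d}")
        for name in ("X", "Y", "Z"):
            grp.create_dataset(name, data=np.full((2, 3), float(i)))

with h5py.File(path, "r") as f:
    groups = sorted(f.keys())
    x2 = f["/Data002/X"][...]  # read one matrix without loading the others

print(groups, x2.shape)
```

Each matrix is addressed by its path, so a loop can read or write one of them at a time.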

a*****9
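Posts: 153

From thread: JobHunting board - Rant: please be careful with your resume

Since cvmastersonline launched I've revised about 40 cs resumes, and I can't help coming to this board to
vent. I'll skip the various inaccuracies and unprofessional touches today and mainly complain about the basic mistakes I've seen:
There is no ubuntu version numbered 11.02..
It's HDFS, not HDF5
It's Demo, not demon. A demon is a devil...
Everyone knows that grinding practice problems matters, but I can't resist a reminder: practice does matter a lot for interviews, but
only if your resume gets you the interview in the first place. If I were the interviewer, my view of the resume would be simple:
if you can write your resume this carelessly, there's no need to look at the code you write...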

c*******y
Posts: 1630

From thread: Stock board - my understand of IB data

I will show you some samples I collected. I spent hours on programming
tricks/usages and potentially outdated packages, asking around on stackoverflow
to get something working.
Originally I thought IB disconnects frequently, but I will give it a second
thought.
Here's a test.
Time range:
2014-03-03 23:30:30 to 2014-03-05 19:41:41, almost 2 days.
In [40]: e.head(1)
Out[40]:
                               Bid  n
Time
2014-03-03 23:30:30.224323  0.8925  0
In [42]: e.tail(2)
Ou... (read full post)

v*********w
Posts: 7

From thread: CS board - Postdoctoral Scholar Research Opportunity at UCSD (SDSC)

Academic Division:
Engineering, Mathematics, Natural and Physical Sciences

Academic Department/Research Unit:
Computer Science and Engineering, Electrical and Computer Engineering,
Mathematics, San Diego Supercomputer Center

Disciplinary Specialty of Research:
Computational Sciences and Scalable IO Library Development

Description:
High Performance GeoComputing Laboratory at San Diego Supercomputer Center,
the University of California at San Diego (UCSD), invites applications for a
postdoct... (read full post)

gw
Posts: 2175

From thread: Linux board - centos 6.9: what causes the relink error during make install?

It's not make install of centos 6.9 itself;
I'm installing other software, such as hdf5 and graphviz.

r****t
Posts: 10904

From thread: Programming board - Question about the differences in data-processing capability between Matlab and IDL

I heard that matlab .mat is now hdf5; is that right?

y**b
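Posts: 10166

From thread: Programming board - vector

Recently I found that when an MPI parallel program works on large HDF5-format files, using arrays is an absolute nightmare.
First, the same block of memory may be allocated and freed by different functions, which puts a very heavy burden on the programmer.
Second, each process (e.g. sender and receiver) has to be extremely clear about whether it has allocated or freed a given block of memory;
when point-to-point and collective communication are mixed, it's very easy to get wrong and hard to debug.
After switching to vector things got a lot easier.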

y**b
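Posts: 10166

From thread: Programming board - A design question about reading in and passing parameters

Many thanks to pptwo and goodbug! I followed this approach and it feels good. A few more questions:
1. The hashmap maintained by this singleton is like a global variable: there's no need to pass function arguments,
and any object or function can fetch from it, which is convenient, but it still feels a bit unusual. Is doing it this way
common? A large object-oriented package developed by one lab, after reading in the data, splits and passes it along
countless times, until every object that uses (a different part of) the data maintains what it needs in entirely local
data structures. The upside is that each object looks high cohesion; the downside is that it's very tedious and
there's a lot of data redundancy. Which design do you think is better?
2. pptwo: You got great flexibility by not hard-coding all the parameters
in that singleton class. How should I understand this sentence? I want to read all the data in one go into
that singleton class; does that lose flexibility?
3. Having a large number of processes read (once) a small file (such as the contents stored in the singleton class) costs little,
but reading those very large data files could be expensive. For example, in that singleton cla... (read full post)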
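The thread's own code is not shown here, so the following is only a sketch of the singleton-plus-hashmap idea in Python. The class name `Parameters`, the "key value" line format and the sample keys are all hypothetical. Because no keys are hard-coded in the class, new parameters can be added in the input without touching the code.

```python
class Parameters:
    """Process-wide parameter store (hypothetical stand-in for the post's singleton)."""

    _instance = None

    def __new__(cls):
        # Create the single shared instance on first use, then keep returning it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._values = {}
        return cls._instance

    def load(self, lines):
        """Parse 'key value' lines; blank lines and '#' comments are skipped."""
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, value = line.split(None, 1)
            self._values[key] = value

    def get(self, key, cast=str):
        """Look up a parameter and convert it with `cast`."""
        return cast(self._values[key])


# Any function or object can reach the same store without passing arguments.
Parameters().load(["# demo input", "timeStep 1e-3", "gridFile mesh.h5"])
dt = Parameters().get("timeStep", float)
same = Parameters() is Parameters()
print(dt, same)
```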

B********e
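Posts: 1062

From thread: Programming board - Question: how to design a storage file format for a complex data class

hdf5: a lot of people doing computation use this format
bson is fairly simple and easy to convert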

B********e
Posts: 1062

From thread: Programming board - Question: how to design a storage file format for a complex data class

hdf5

m***r
Posts: 359

From thread: Programming board - Python日报, February 2015 thread

Python日报 2015-02-23
Produced by @好东西传送门; back issues at
http://py.memect.com
Subscribe: send an empty email to [email protected] with the subject: 订阅Python日报
A nicer-looking HTML version
http://py.memect.com/archive/2015-02-23/short.html
1) [A Python-based Nearest Neighbors Search library] by @路遥_机器学习
Keywords: data science, blog, code, machine learning
A Python-based Nearest Neighbors Search library [1]. A blog post introducing it [2]. In addition, a
recommender system built on this library [3] will appear at WWW as a demo presentation. The author is a
Cambridge postdoc, @唧唧歪歪de计算机博士
[1] https://github.com/ryanrhymes/panns
[2] http://ryanrhymes.blogspot.fi/2015/02/about-panns-naive-tool-for-approximate.h... (read full post)

g*******u
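Posts: 3948

From thread: Programming board - Question about data storage

One file per record? In binary? No need to compress?
Wouldn't reading them one record at a time waste time going back and forth? Are you serious?
I had originally thought of putting lots of records into one table and storing it as hdf5. That seems pointless too, right?
The main question: is reading one record at a time efficient?
thx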

w***g
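Posts: 5958

From thread: Programming board - Question about data storage

You need to describe the use case; then maybe I can offer some other suggestions.
HDF5 isn't worth it. If you really must have a database, consider leveldb.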

g*******u
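Posts: 3948

From thread: Programming board - Question about data storage

I have two use cases, both with time series data
1 Fixed-length records that I've organized for training, e.g. 1 minute each. There may be more than
ten million records. Each record isn't large.
2 Very long time series, e.g. one file may hold 1 month of data and a single record may be around 60G;
I need to query and slice by time range conveniently, e.g. today's data from 10:00 to
11:00
For now the focus is on 1
I feel 1 and 2 call for different approaches?
I also know hdf5 is very old, but I don't know what else to use
Thanks a lot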
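For use case 2, one storage-independent way to pull out a window such as 10:00 to 11:00 is a binary search over a sorted timestamp index. This is a small standard-library sketch; the sample timestamps and values are made up. With an on-disk format the two indices would bound the slice to read, rather than a Python list.

```python
from bisect import bisect_left
from datetime import datetime


def slice_by_time(timestamps, values, start, end):
    """Return values whose timestamp falls in [start, end); timestamps must be sorted."""
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, end)
    return values[lo:hi]


# Hypothetical samples, one every 30 minutes.
timestamps = [
    datetime(2020, 1, 1, 9, 30),
    datetime(2020, 1, 1, 10, 0),
    datetime(2020, 1, 1, 10, 30),
    datetime(2020, 1, 1, 11, 0),
]
values = [1.0, 2.0, 3.0, 4.0]

window = slice_by_time(
    timestamps, values, datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 11, 0)
)
print(window)
```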

f*******a
Posts: 80

From thread: Computation board - Has anyone installed the meep package on ubuntu?

I installed MEEP on cygwin with HDF5 1.8. No problem.

v*********w
Posts: 7

From thread: Computation board - Postdoctoral Scholar Research Opportunity at UCSD (SDSC)

v*********w
Posts: 7

From thread: GeoSpace board - Postdoctoral Scholar Research Opportunity at UCSD (SDSC)

r***6
Posts: 401

From thread: Quant board - stock ticks data storage

For the last n ticks in memory, use a circular buffer. For a whole day's data, use hdf5,
binary, or an R dataset.
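A minimal sketch of the "last n ticks" circular buffer, assuming Python and using `collections.deque` with `maxlen`. The class name `LastTicks` and the sample prices are made up for illustration.

```python
from collections import deque


class LastTicks:
    """Keep only the most recent n ticks; older ones are dropped automatically."""

    def __init__(self, n):
        self._buf = deque(maxlen=n)

    def push(self, tick):
        # When the buffer is full, appending evicts the oldest tick.
        self._buf.append(tick)

    def snapshot(self):
        """Return the buffered ticks, oldest first."""
        return list(self._buf)


ticks = LastTicks(3)
for price in [10.0, 10.1, 10.2, 10.3, 10.4]:
    ticks.push(price)
latest = ticks.snapshot()
print(latest)
```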

mw
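Posts: 525

From thread: Quant board - Question: system architecture for receiving medium/high-frequency data and storing it in real time

If you have money, kdb
If you don't, hdf5
hiahia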

z****g
Posts: 1978

From thread: Quant board - [database] Databases for storing market data

hdf5

k*******d
Posts: 1340

From thread: Quant board - How does everyone store time series data?

If every entry has the same length, that still works: just seek directly.
HDF5 probably won't do
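The "same length, just seek" idea can be sketched as follows. The record layout (an int64 timestamp plus a float64 price) is an assumption for illustration, and an in-memory `io.BytesIO` stands in for a file opened in binary mode.

```python
import io
import struct

# Hypothetical layout: little-endian int64 timestamp followed by float64 price.
RECORD = struct.Struct("<qd")


def write_records(stream, records):
    """Append each (timestamp, price) pair as one fixed-size record."""
    for ts, price in records:
        stream.write(RECORD.pack(ts, price))


def read_record(stream, index):
    """Jump straight to record `index`; this works because every record is RECORD.size bytes."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


stream = io.BytesIO()  # stands in for open(path, "r+b")
write_records(stream, [(1000, 0.8925), (1001, 0.8926), (1002, 0.8927)])
third = read_record(stream, 2)
print(third)
```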

c*******g
Posts: 695

From thread: Statistics board - How do I install R's xgobi Package on windows?

On the Package page:
http://cran.r-project.org/web/packages/xgobi/index.html
Windows binary: not available, see ReadMe
The ReadMe says:
ADaCGH, GDD, PermuteNGS, RDieHarder, RScaLAPACK, Rcplex, Rmpi, SV,
cudaBayesreg, doMPI, gputools, hdf5, magma, ncdf4, rpud, rpvm, xgobi
or their dependencies also require additional libraries / software to
build on Windows I do not have (and may not even exist in versions
for Windows).
The manual says:
SystemRequirements xgobi must be installed additionally, see file README, or
INST... (read full post)

s*********e
Posts: 1051

From thread: Statistics board - anyone has experience with hdf5

how is the performance?

s*********e
Posts: 1051

From thread: Statistics board - anyone has experience with hdf5

It's that first sentence.
Two features are somewhat interesting:
- stores big data
- very efficient reads

D******n
Posts: 2836

From thread: Statistics board - anyone has experience with hdf5

It seems to be an old concept; back in the day matlab seemed quite fond of this format,
if I remember correctly.

s*********e
Posts: 1051

From thread: Statistics board - HDF5 is really fast!

2x faster than importing from csv and 4x faster than sqlite.
http://statcompute.wordpress.com/2012/12/22/data-import-efficie

s*********e
Posts: 1051

From thread: Statistics board - HDF5 is really fast!

Here is the case for R: http://statcompute.wordpress.com/2012/12/23/data-import-efficiency-a-case-in-r/

D**u
Posts: 288

From thread: Statistics board - HDF5 is really fast!

Looks like the rhdf5 package was just released today.
http://www.bioconductor.org/packages/devel/bioc/manuals/rhdf5/m
Thanks for sharing.

D**u
Posts: 288

From thread: Statistics board - data.table is amazing

Ok, I am going to try the hdf5 + data.table combination and compare it to rsqlite +
sqldf. That is the best approach I can think of for now.

s*********e
Posts: 1051

From thread: Statistics board - R is a bit disappointing

It depends. SQLite has good portability, but hdf5 reads faster.
