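o****o Posts: 8077 | 1 Borrowing this thread to ask: how to efficiently read a large CSV or arbitrary TXT file
For example, reading a 700+ MB CSV into R is very slow, even when the class of each column is preset as follows: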
trainset<-read.csv("train_set.csv", nrows=1000)
colClasses<-sapply(trainset, class);
trainset<-read.csv("train_set.csv", sep=",", header=TRUE,
colClasses=colClasses)
It still takes a very long time, roughly 30 times as long as SAS: SAS takes one minute, while R took more than 30 minutes. |
|
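A minimal sketch of the approach in the post above, run on a small temporary file rather than train_set.csv. The sample data and the names path, head_rows and col_classes are made up for illustration. One caveat: classes inferred from the first rows can be wrong for later rows (an "integer" column that later contains decimals will fail to parse), so a larger sample may be needed. Passing a mild over-estimate of nrows is suggested in the read.table help as a way to reduce memory use.

```r
# Stand-in data written to a temporary CSV (hypothetical, for illustration)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     x  = c(1.5, 2.5, 3.5, 4.5, 5.5),
                     g  = letters[1:5]),
          path, row.names = FALSE)

# Infer column classes from the first few rows only
head_rows   <- read.csv(path, nrows = 3, stringsAsFactors = FALSE)
col_classes <- sapply(head_rows, class)

# Re-read the whole file with the classes preset;
# nrows is a deliberate over-estimate of the row count
full <- read.csv(path, colClasses = col_classes, nrows = 10)
str(full)
```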
o****o Posts: 8077 | 2 Nice. I need to read the help file more carefully; it is
already documented there.
#-------------------------------------------------------------
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
|
|
D*********2 Posts: 535 | 3 Data structure:
1 character column
3 numeric columns
I tried the following:
1) .sas7bdat to .csv, then read.csv
Problem: the character column gets trimmed, e.g. 007 becomes 7. I really need the trial number
to be 3 digits.
2) .sas7bdat to .txt, then read.table
Problem: numeric variables are read as factors; for example, 167.400 became 1674 after
as.numeric.
I also tried colClasses, but unless I set the numeric columns to "factor" or
"character", I get an error message.
Thanks a lot! |
|
D*******a Posts: 207 | 4
Just use colClasses to read the column as "character"; then you will have 007.
1. If a column cannot be read as anything other than "factor" or "character", it is because
it contains non-numeric entries, for example "Missing", "-",
etc. Find them and correct them.
2. You can't use as.numeric(factor) in this case. It returns
the internal level codes (very dangerous! You will get burnt by this in the future if
you are not careful). For example,
as.numeric(factor(c("a",2)))
[1] 2 1
You got 1674 because |
|
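To illustrate the pitfall described in the post above, here is a small sketch with made-up values. Converting a factor through as.character first recovers the printed values, whereas as.numeric alone returns the level codes. The vector f is hypothetical.

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("167.400", "3.5", "167.400"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(codes)
print(values)
```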
D*********2 Posts: 535 | 5 Yes, it works when I use
> dat.corr <- read.csv(file="xxx.csv", colClasses=rep("factor", 4), header=TRUE)
However, I had not noticed that even plain read.csv gives me "factor". Why?
I simply exported to .csv from SAS.
> dat.corr <- read.csv(file="xxx.csv", header=TRUE)
> sapply(dat.corr, class)
Y X1 X2 X3
"integer" "factor" "factor" "factor"
Y is the one that should be a factor, e.g. "007". X1:X3 should be numeric.
Thanks again! |
|
D*******a Posts: 207 | 6
First change the dots to NA in the X1, X2, X3 columns. Then:
dat.corr <- read.csv(file="xxx.csv", header=TRUE,
colClasses=c("character", rep("numeric", 3))) |
|
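As an alternative to editing the file by hand, the dots that SAS writes for missing values can likely be mapped to NA at read time through the na.strings argument. The sketch below uses a made-up two-row file with the same column layout as the thread (Y, X1, X2, X3); the data values are assumptions.

```r
# Made-up file: Y has leading zeros, "." marks a missing numeric value
path <- tempfile(fileext = ".csv")
writeLines(c("Y,X1,X2,X3",
             "007,167.400,.,2.5",
             "012,.,1.25,3"), path)

# Treat "." as NA while reading, keep Y as text, and force X1:X3 to numeric
dat <- read.csv(path, header = TRUE,
                na.strings = c("NA", "."),
                colClasses = c("character", rep("numeric", 3)))
str(dat)
```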
q****o Posts: 37 | 7 read.table is the "devil": it yields a data.frame, and so it assumes all
string variables are factors (the stringsAsFactors default prior to R 4.0.0).
It is safer to predefine the classes you want to read where possible, e.g.
dat<-read.table("test.txt",colClasses=c(rep('numeric',2),rep('character',4)),
header=TRUE,sep="\t") |
|
D******n Posts: 2836 | 8 Either as.character after read.csv,
or colClasses= inside the read.csv call |
|
c*****m Posts: 4817 | 9 From thread: Statistics board - a small R question. 0000123456 = 123456 if this is a number; if you want 0000123456, then set
colClasses = "character" in read.table() |
|
o****o Posts: 8077 | 10 Thanks, will study it.
I have now found that I can use save() and load() when I need to use the file many
times in the future; this cuts more than 50% of the time compared with read.csv(...,
colClasses=colAttr) or with the scan function.
Reading a zipped CSV file directly shows no time saving so far. Has anyone had any
luck? |
|
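A minimal sketch of the save()/load() round trip mentioned in the post above, using a tiny made-up data frame and a temporary file. The object name df and the file name rda are assumptions; any timing benefit depends on the data and is not measured here.

```r
# Made-up data frame standing in for the parsed CSV
df <- data.frame(id = 1:3, x = c(0.1, 0.2, 0.3))

# Save the parsed object once in R's binary format
rda <- tempfile(fileext = ".RData")
save(df, file = rda)

# Later sessions can restore it without re-parsing any text
rm(df)
load(rda)   # recreates the object under its original name, df
str(df)
```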
o****o Posts: 8077 | 11 It looks like the ff package helps when the file is TOO
large to fit in memory, as the bigmemory package does, but it doesn't help
with speed here since it maps the data to disk.
Am I missing anything here?
>
> library(ff)
>
> system.time(
+ dsnff<-read.csv.ffdf(file="c:\\_data\\MNISTtrain.csv")
+ )
user system elapsed
22.44 9.30 42.17
>
> system.time(
+ dsn1<-read.csv(file="c:\\_data\\MNISTtrain.csv")
+ )
user system elapsed
13.71 0.04 13.77
>
>
> t<-Sys.t... |
|
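o****o Posts: 8077 | 12 The problem with scan is that it cannot read columns of different types, e.g. a file that mixes string and numeric variables. A purely numeric
matrix is fine, but even then it is not much faster than the table-reading functions with colClasses= preset; in my experience, about 5-10
%.
For now I do all the data processing in SAS, and port to R only when I need an algorithm that SAS does not have. |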
|
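For what it's worth, scan() does accept a list for its what argument, in which case each list element gives the type of one field, so mixed string and numeric columns can be read. The sketch below uses a made-up two-line file; the field names name and value are assumptions. Whether this is any faster than read.table with colClasses is not tested here.

```r
# Made-up whitespace-separated file with one text and one numeric field
path <- tempfile(fileext = ".txt")
writeLines(c("a 1.5", "b 2.5"), path)

# what = list(...) declares one type per field: "" for character, 0 for numeric
cols <- scan(path, what = list(name = "", value = 0), quiet = TRUE)

# The result is a list of vectors; convert it if a data.frame is needed
dat <- as.data.frame(cols, stringsAsFactors = FALSE)
str(dat)
```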
q*******l Posts: 36 | 13 Try this:
require(xlsx)
read.xlsx {xlsx}
read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
startRow=NULL, endRow=NULL, colIndex=NULL,
as.data.frame=TRUE, header=TRUE, colClasses=NA,
keepFormulas=FALSE, encoding="unknown", ...)
|
|