Discussion roundup on colClasses - 话题女王


o****o
Posts: 8077

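Borrowing this thread to ask how to efficiently read a large CSV or an arbitrary TXT file.
For example, reading a CSV of over 700 MB is very slow in R, even when the class of each column is preset as follows:
trainset <- read.csv("train_set.csv", nrows=1000)
colClasses <- sapply(trainset, class)
trainset <- read.csv("train_set.csv", sep=",", header=TRUE,
                     colClasses=colClasses)
It still takes a long time, roughly 30 times as long as SAS: SAS takes one minute, while R took more than 30 minutes.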
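A minimal sketch of the two-pass approach described above, made self-contained by generating a small temporary file (the file, its columns, and the names `path`, `demo`, `col_classes` are all made up for illustration). It also passes `nrows`, which the read.table help page suggests as a mild over-estimate to help memory usage. Whether this closes the gap with SAS on a 700 MB file is not something this sketch can show.

```r
# Build a small CSV in a temporary file so the sketch runs on its own.
# (Hypothetical data; the real train_set.csv is not available here.)
path <- tempfile(fileext = ".csv")
demo <- data.frame(g = sample(letters, 5000, replace = TRUE),
                   x = runif(5000),
                   n = 1:5000)
write.csv(demo, path, row.names = FALSE)

# Pass 1: read only a sample of rows to learn the column classes.
sample_rows <- read.csv(path, nrows = 1000, stringsAsFactors = FALSE)
col_classes <- sapply(sample_rows, class)

# Pass 2: read the whole file with the classes fixed in advance.
# nrows is an upper bound on the number of rows, not an exact count.
full <- read.csv(path, header = TRUE,
                 colClasses = col_classes,
                 nrows = 6000)

str(full)
```

One caveat: classes inferred from the first 1000 rows can be wrong for later rows (for example a column that looks integer at the top but contains decimals further down), in which case the second read stops with an error.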

o****o
Posts: 8077

From thread: Statistics board - The ultimate tool for flooding the board with R.

Nice. I need to read the help file more carefully; it is already documented
there.
#-------------------------------------------------------------
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", row.names, col.names,
as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",

D*********2
Posts: 535

From thread: Statistics board - Help: Import .sas7bdat to R

Data structure:
1 col of character
3 cols of numeric
I tried the following ways:
1) .sas7bdat to .csv, then read.csv
Problem: the character column gets trimmed, e.g. 007 becomes 7. I really need the trial number
to be 3 digits.
2) .sas7bdat to .txt, then read.table
Problem: numeric variables are read as factor; for example, 167.400 became 1674 after using
as.numeric.
I also tried colClasses, but unless I set the numeric cols to "factor" or
"character", there is an error message.
Thanks a lot!

D*******a
Posts: 207

From thread: Statistics board - Help: Import .sas7bdat to R

Just use colClasses to read the column as "character", then you will have 007.
1. If you cannot read a column as anything other than "factor" or "character", this is because
you have non-numeric entries in that column. For example, "Missing", "-",
etc. Find them and correct them.
2. You can't use as.numeric(factor) in this case. This function returns
the coded level (very dangerous! You will get burned by this in the future if
you are not careful). For example,
as.numeric(factor(c("a",2)))
[1] 2 1
You got 1674 because
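The pitfall in point 2 can be sketched as follows. The values are made up for illustration, and the two conversions shown are common base R idioms rather than anything specific to the poster's file.

```r
# A factor whose labels happen to look like numbers (hypothetical values).
f <- factor(c("167.400", "12.5", "3"))

# as.numeric() on a factor returns the internal level codes,
# which follow the sorted order of the labels, not the numbers themselves.
codes <- as.numeric(f)

# To recover the numbers, go through the labels first.
values_a <- as.numeric(as.character(f))
values_b <- as.numeric(levels(f))[f]   # same result, converts each level once

print(codes)
print(values_a)
```

Reading the column as "numeric" or "character" in the first place, via colClasses, avoids the round trip through a factor entirely.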

D*********2
Posts: 535

From thread: Statistics board - Help: Import .sas7bdat to R

Yes, it works when I use
> dat.corr <- read.csv(file="xxx.csv", colClasses=rep("factor", 4), header=TRUE)
However, I had not noticed that even plain read.csv gives me "factor". Why?
I simply exported to .csv in SAS.
> dat.corr <- read.csv(file="xxx.csv", header=TRUE)
> sapply(dat.corr, class)
        Y        X1        X2        X3
"integer"  "factor"  "factor"  "factor"
Y is the one that should be factor, e.g. "007". X1:X3 should be numeric.
Thanks again!

D*******a
Posts: 207

From thread: Statistics board - Help: Import .sas7bdat to R

First change the dots to NA in the X1, X2, X3 columns. Then:
dat.corr <- read.csv(file="xxx.csv", colClasses=c("character", rep("numeric", 3)),
                     header=TRUE)
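As a hedged alternative to editing the file by hand, the dots can be declared as missing values at read time with the `na.strings` argument of read.csv. The inline data below is invented to mimic the layout described in the thread (one character column Y, three numeric columns X1 to X3, with "." for missing as SAS exports it).

```r
# Made-up sample in the shape described above; "." marks a missing value.
csv_text <- "Y,X1,X2,X3
007,167.400,.,1.5
012,.,2.25,3.0"

dat.corr <- read.csv(text = csv_text, header = TRUE,
                     colClasses = c("character", rep("numeric", 3)),
                     na.strings = ".")   # treat a lone dot as NA

print(dat.corr)
print(sapply(dat.corr, class))
```

With these settings Y keeps its leading zeros because it is never parsed as a number, and the numeric columns no longer fail on the dots.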

q****o
Posts: 37

From thread: Statistics board - [R] About variable types in R

read.table is the "devil": it yields a data.frame and, by default in R versions
before 4.0.0, treats all string variables as factors.
It is safer to predefine the classes you want to read if possible, such as
dat <- read.table("test.txt", colClasses=c(rep('numeric',2), rep('character',4)),
                  header=TRUE, sep="\t")

D******n
Posts: 2836

From thread: Statistics board - R: How to read a file in which a field contains a comma?

Either call as.character() after read.csv,
or pass colClasses= inside the read.csv call.

c*****m
Posts: 4817

From thread: Statistics board - A small R question

0000123456 = 123456 if this is a number. If you want 0000123456, then set
colClasses = "character" in read.table().

o****o
Posts: 8077

From thread: Statistics board - R shows its power only when used with skill

Thanks, will study it.
I have now found that I can use save() and load() when I need to use the file many
times in the future; this saves more than 50% of the time compared with read.csv(...,
colClasses=colAttr) or the scan function.
Reading a zipped CSV file directly shows no time saving so far. Has anyone had any
luck?
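The save()/load() workflow mentioned above might look like the sketch below: parse the text file once, store the resulting data frame in R's binary format, and reload that on later runs. The data frame and the temporary file name are placeholders, and no timing claim is attached to this toy example.

```r
# Stand-in for a data frame that was expensive to parse from CSV.
df <- data.frame(x = rnorm(1000),
                 g = sample(letters, 1000, replace = TRUE),
                 stringsAsFactors = FALSE)
original <- df   # kept only so the round trip can be checked

# One-time cost: write the object in R's binary format.
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)

# Later sessions: restore the object under its original name.
rm(df)
load(rdata_path)

print(identical(df, original))
```

Note that load() recreates the object under the name it was saved with, so it can silently overwrite an existing `df` in the workspace.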

o****o
Posts: 8077

From thread: Statistics board - R shows its power only when used with skill

It looks like the ff package helps with the problem where the file is TOO
large to fit in memory, like the bigmemory package does, but it doesn't help
with efficiency here because it maps the data to disk.
Am I missing anything here?
> library(ff)
> system.time(
+ dsnff <- read.csv.ffdf(file="c:/_data/MNISTtrain.csv")
+ )
   user  system elapsed
  22.44    9.30   42.17
> system.time(
+ dsn1 <- read.csv(file="c:/_data/MNISTtrain.csv")
+ )
   user  system elapsed
  13.71    0.04   13.77
> t<-Sys.t...

o****o
Posts: 8077

From thread: Statistics board - R shows its power only when used with skill

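The problem with scan is that it cannot read columns of different types, for example a file that mixes string and numeric variables. A purely numeric
matrix is fine, but even then it is not much faster than the table-reading functions with colClasses= preset; in my experience about 5-10%.
For now I process the data in SAS, and port it to R only when I need an algorithm that SAS does not have.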
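As a side note on the scan limitation described above: base R's scan() does accept a list for its `what` argument, with one template element per column, which lets it read columns of different types in one pass. A minimal sketch with a made-up two-column file (the names `name` and `value` are illustrative only); whether this is any faster than read.csv with colClasses on a large file is not tested here.

```r
# Write a tiny mixed-type file so the sketch runs on its own.
path <- tempfile(fileext = ".csv")
writeLines(c("name,value", "a,1.5", "b,2.5"), path)

# One template per column: "" means character, 0 means numeric.
cols <- scan(path,
             what = list(name = "", value = 0),
             sep = ",", skip = 1, quiet = TRUE)

# scan() returns a list of vectors; wrap it to get a data frame.
dat <- data.frame(cols, stringsAsFactors = FALSE)
print(dat)
```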

q*******l
Posts: 36

From thread: Statistics board - Problem reading Excel into R

Try this:
require(xlsx)
read.xlsx {xlsx}
read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
startRow=NULL, endRow=NULL, colIndex=NULL,
as.data.frame=TRUE, header=TRUE, colClasses=NA,
keepFormulas=FALSE, encoding="unknown", ...)
