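c***z posts: 6348 | 1 How can I get the modification date of a file on an FTP server?
For example, this directory: ftp://ftp.bls.gov/pub/special.requests/cew/2011/
The page shows the modification dates, but file.info() returns nothing but NA.
I found a way to scrape an HTTP page, but it does not work for an FTP page.
Does anyone have experience with this? Thanks! | c***z posts: 6348 | 2 Here is my code and the result: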
> sourcefile
[1] "ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip"
> file.info(sourcefile)
                                                                                 size
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip   NA
                                                                                 isdir
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip    NA
                                                                                 mode
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip <NA>
                                                                                 mtime
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  <NA>
                                                                                 ctime
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  <NA>
                                                                                 atime
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  <NA>
                                                                                 uid
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  NA
                                                                                 gid
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  NA
                                                                                 uname
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip  <NA>
                                                                                 grname
ftp://ftp.bls.gov/pub/special.requests/cew/2011/2011.q1-q3.county_high_level.zip   <NA>
> | c***z posts: 6348 | 3 I found that getURL() actually returns the modification times.
Can anyone show me how to parse out the file names and modification times?
> files <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = FALSE)
> files <- strsplit(files, "\n")
> files <- unlist(files)
> files
[1] "-r-xr-xr-x 1 owner group 15780895 Mar 26 16:16 2011.q1-q3.county_high_level.zip"
[2] "-r-xr-xr-x 1 owner group 128178060 Mar 26 17:02 2011.q1-q3.end.zip"
[3] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 county"
[4] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 county_high_level"
[5] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 csa"
[6] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 microsa"
[7] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 msa"
[8] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 national"
[9] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 size"
[10] "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 state"
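A minimal sketch of one way to pull the name, size and timestamp out of lines like these. It splits each line on runs of whitespace and counts fields from the end. It assumes file names contain no spaces and that every line has the same field order as the output above. parse_listing is a hypothetical helper, not an RCurl function, and the sample lines are copied from the listing above.

```r
# Sample listing lines copied from the thread (whitespace as shown there)
listing <- c(
  "-r-xr-xr-x 1 owner group 15780895 Mar 26 16:16 2011.q1-q3.county_high_level.zip",
  "-r-xr-xr-x 1 owner group 128178060 Mar 26 17:02 2011.q1-q3.end.zip",
  "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 county"
)

# parse_listing is a hypothetical helper, not part of RCurl
parse_listing <- function(lines) {
  lines <- sub("[[:space:]]+$", "", lines)       # drop trailing blanks or a carriage return
  fields <- strsplit(lines, "[[:space:]]+")      # runs of blanks count as one separator
  ok <- vapply(fields, length, integer(1)) >= 5  # skip lines too short to hold all fields
  lines <- lines[ok]
  fields <- fields[ok]
  # pick(k) returns the field that sits k positions before the last one
  pick <- function(offset) {
    vapply(fields, function(f) f[length(f) - offset], character(1))
  }
  data.frame(
    is.dir = substr(lines, 1, 1) == "d",         # "d" in the first column marks a directory
    size   = as.numeric(pick(4)),
    mtime  = paste(pick(3), pick(2), pick(1)),   # e.g. "Mar 26 16:16"
    name   = pick(0),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_listing(listing)
print(parsed)
```

Counting from the end keeps the code independent of how wide the size column is. A name containing a space would still be cut short, so this is only a starting point.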
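> | c***z posts: 6348 | 4 My naive idea is to strsplit again.
Is there a better way? Thanks! | c*********t posts: 340 | 5 I can't think of a better way; I'm not very familiar with RCurl.
But here is a clumsy approach for the OP's reference:
since the lines are fixed-width, just find the position of the column you want :)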
> grep("M",unlist(strsplit(files[1],"")))
[1] 47
> substr(files,47,47+11)
[1] "Mar 26 16:16" "Mar 26 17:02" "Mar 28 10:05" "Mar 28 10:05" "Mar 28 10:05"
[6] "Mar 28 10:05" "Mar 28 10:05" "Mar 28 10:05" "Mar 28 10:05" "Mar 28 10:05" | c***z posts: 6348 | 6 substr does not work, because the elements of files differ in length.
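There are three different lengths: 20, 21, and 28.
And one more complaint: R's date classes really are a pain. | c***z posts: 6348 | 7 This is what I did; it does get the FTP times.
But because of the recursion, it still does not run well.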
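When the position of the timestamp shifts from line to line, a pattern match can locate it wherever it falls. The sketch below does that and converts the result to POSIXct without relying on the locale's month names. stamp_to_time is a hypothetical helper, and the year passed to it is an assumption the caller has to supply, because the listing itself carries none. It also assumes the "Mon DD HH:MM" form seen in this thread; lines in any other form come back as NA.

```r
# stamp_to_time is a hypothetical helper; `year` must come from the caller
# because the listing lines in this thread do not include a year
stamp_to_time <- function(lines, year, tz = "UTC") {
  pattern <- "[A-Z][a-z]{2} +[0-9]{1,2} +[0-9]{2}:[0-9]{2}"
  pos <- regexpr(pattern, lines)                 # first match per line, -1 if none
  stamp <- rep(NA_character_, length(lines))
  stamp[pos > 0] <- regmatches(lines, pos)       # fill in only the lines that matched
  parts <- strsplit(stamp, "[ :]+")              # month, day, hour, minute
  part <- function(k) vapply(parts, `[`, character(1), k)
  mon <- match(part(1), month.abb)               # month.abb is built in and always English
  ISOdatetime(year, mon, as.integer(part(2)),
              as.integer(part(3)), as.integer(part(4)), 0, tz = tz)
}

x <- c(
  "-r-xr-xr-x 1 owner group 15780895 Mar 26 16:16 2011.q1-q3.county_high_level.zip",
  "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 county"
)
times <- stamp_to_time(x, year = 2012)           # 2012 is an arbitrary illustration
print(format(times, "%Y-%m-%d %H:%M"))
```

Matching the month against month.abb sidesteps the locale dependence of "%b" in strptime(), and building the whole column in one call keeps its date-time class intact.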
library(RCurl)  # provides getURL()
# sourcelink (the FTP directory URL, ending in "/") and mod.year (the year to
# assume, since the listing omits it) must be defined before this runs
# list of contents
filestubs <- getURL(sourcelink, ftp.use.epsv = FALSE, dirlistonly = FALSE)
filestubs <- unlist(strsplit(filestubs, "\r?\n"))
files <- data.frame(filestubs, stringsAsFactors = FALSE)
# obtain names, sizes and modification dates
# split on runs of spaces so that padded columns do not produce empty fields
fields <- strsplit(filestubs, " +")
files$name <- sapply(fields, function(f) f[length(f)])
files$size <- sapply(fields, function(f) f[length(f) - 4])
temp.date <- sapply(fields, function(f) paste(f[length(f) - 3], f[length(f) - 2], mod.year))
# convert the whole column at once: assigning as.Date() results one element
# at a time drops the Date class and leaves plain numbers
files$ftp.date <- as.Date(temp.date, format = "%b %d %Y")
files$link <- paste(sourcelink, files$name, sep = '')
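On the recursion problem, one possible cause (an assumption, since the failing code is not shown) is that directory links are built without a trailing "/", so the next listing request does not treat them as directories. The sketch below separates directories from files by the leading "d" and descends only into directories. walk_listing and fetch_fake are hypothetical names, and the example.invalid URLs and their contents are made up so that the example runs without a network connection.

```r
# walk_listing is a hypothetical helper. `fetch` is any function that takes a
# directory URL and returns its listing lines. With RCurl it could be, e.g.,
#   function(u) unlist(strsplit(getURL(u, ftp.use.epsv = FALSE), "\n"))
walk_listing <- function(url, fetch, max.depth = 5) {
  lines <- fetch(url)
  lines <- sub("[[:space:]]+$", "", lines)       # drop trailing blanks or a carriage return
  lines <- lines[nzchar(lines)]
  empty <- data.frame(link = character(0), is.dir = logical(0),
                      stringsAsFactors = FALSE)
  if (length(lines) == 0) return(empty)
  is.dir <- substr(lines, 1, 1) == "d"           # "d" in the first column marks a directory
  name <- sub("^.*[[:space:]]", "", lines)       # last field; assumes no spaces in names
  keep <- !(name %in% c(".", ".."))              # never walk back up the tree
  here <- data.frame(
    link   = paste0(url, name, ifelse(is.dir, "/", ""))[keep],
    is.dir = is.dir[keep],
    stringsAsFactors = FALSE
  )
  if (max.depth <= 0) return(here)               # hard stop against runaway recursion
  below <- lapply(here$link[here$is.dir], walk_listing,
                  fetch = fetch, max.depth = max.depth - 1)
  do.call(rbind, c(list(here), below))
}

# Made-up directory tree standing in for a real server
fake <- list(
  "ftp://example.invalid/root/" = c(
    "-r-xr-xr-x 1 owner group 10 Mar 26 16:16 a.zip",
    "dr-xr-xr-x 1 owner group 0 Mar 28 10:05 sub"
  ),
  "ftp://example.invalid/root/sub/" = "-r-xr-xr-x 1 owner group 20 Mar 28 10:05 b.zip"
)
fetch_fake <- function(url) fake[[url]]

all.entries <- walk_listing("ftp://example.invalid/root/", fetch_fake)
print(all.entries$link)
```

Passing the fetch function in as an argument keeps the traversal logic testable offline; the max.depth limit is a safety net, not something the thread prescribes.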