c***z (posts: 6348) | 1 Below are just some of my personal opinions; please don't take them
personally :)
1. Data science is a very broad term. If I dare to put down a definition,
the fundamental question for data science should be:
Are we really doing what we think we are doing?
In more formal terms, data science is the science of drawing inferences from
data and of measuring how much confidence those inferences deserve.
Data scientists are most concerned about what we don't know (e.g. data
quality, panel bias, model validity, etc.), and this is exactly why we are
called scientists.
An analogy is that software engineers are most concerned about what hasn't
happened yet (e.g. site reliability, scalability, etc.).
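To make "inference plus the confidence of that inference" concrete, here is a minimal sketch of a percentile bootstrap. It is only an illustration, not something from the original post: the function name `bootstrap_ci`, its parameters, and the sample numbers are all made up for this example, and the percentile bootstrap is just one of several ways to attach an interval to an estimate.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for `stat` on `sample`.

    Hypothetical helper: resamples the data with replacement and takes the
    alpha/2 and 1 - alpha/2 percentiles of the resampled statistics.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stat(sample), estimates[lo_idx], estimates[hi_idx]


# Made-up observations, purely for illustration.
observations = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0, 2.6, 2.2]
point, lower, upper = bootstrap_ci(observations)
print(f"estimate={point:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

The point estimate alone is the "inference"; the interval is the part that says how far to trust it. Note that an interval like this only reflects sampling noise, so it says nothing about dirty data or a biased panel.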
2. My definition is closer to that of statistics, although statisticians
seldom need to worry about having too much data, or about data that is
dirty, unstructured, or unlabeled.
Under this definition, many data scientist positions are actually for
analysts and engineers, because they only care about inference or
reliability, rather than confidence and validity.
Specifically, by the nature of input data:
Statisticians work on small volumes of clean data, likely with lots of
assumptions, likely from academic literature;
Data analysts work on small volumes of dirty data, not knowing how to clean
data and making assumptions mostly from business knowledge;
Data engineers work on large volumes of clean data, likely structured for
query and display;
Data scientists work on large volumes of dirty data, likely unstructured and
unlabeled.
3. The key questions a data scientist working in business settings should
ask:
Do we have well defined questions?
Do we have truthfully labeled data?
Do we have an unbiased panel?
Features and models are secondary to questions and data. Specifically, the
first steps of research should be to ask the right questions and decide the
level and unit of analysis.
Essentially, a data scientist needs skills from business, science and
engineering, which basically cover three functional roles:
A data architect,
A solution architect,
A software architect.
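This is exactly why many data scientists face unreasonable expectations
and enormous stress. | d****n (posts: 12461) | 2 Impressive, moderator.
Well, can I complain that half the time in data science is spent on data mangling? | c***z (posts: 6348) | 3 Thanks for the support, senior.
Half the time isn't so bad; for me it's 80% :(
The remaining 20% is curve fitting, which is pretty dull.
[Quoting d****n's post] : Impressive, moderator. : Well, can I complain that half the time in data science is spent on data mangling?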