

DataSciences board - Data Cleaning and Data Quality Control: Challenge No. 3 of the Big Data Era
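Post count: 52
Or data cleansing, data quality control, etc.
At the end of last year Gartner published a Data Quality Tools Magic Quadrant
report that summarizes the relevant vendors. I can't say whether their choice
of vendors is reliable, but their summary of data quality control is on point.
Now that data is being collected in huge volumes, emphasizing data cleaning
and data quality control is especially necessary. Remember: "Garbage in,
garbage out."
The report was originally available from http://www.gartner.com/technology/reprints.do?id=1-1LCD5XL&ct=131007&st=sb,
but not any more. I'll excerpt a bit of it here and also attach their current
paid link for everyone's reference, which gives them a little advertising too.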
Post count: 52
Magic Quadrant for Data Quality Tools
gartner.com, October 7
Data quality assurance is a discipline focused on ensuring that data is fit
for use in business processes ranging from core operations to analytics and
decision-making, regulatory compliance, and engagement and interaction with
external entities.
As a discipline, it comprises much more than technology — it also includes
roles and organizational structures, processes for monitoring, measuring,
reporting and remediating data quality issues, and links to broader
information governance activities via data-quality-specific policies.
Given the scale and complexity of the data landscape across organizations of
all sizes and in all industries, tools to help automate key elements of the
discipline continue to attract more interest and to grow in value. As such,
the data quality tools market continues to show substantial growth, while
exhibiting innovation and change.
The data quality tools market includes vendors that offer stand-alone
software products to address the core functional requirements of the
discipline, which are:
Data profiling and data quality measurement: The analysis of data to capture
statistics (metadata) that provide insight into the quality of data and
help to identify data quality issues.
Parsing and standardization: The decomposition of text fields into component
parts and the formatting of values into consistent layouts based on
industry standards, local standards (for example, postal authority standards
for address data), user-defined business rules, and knowledge bases of
values and patterns.
Generalized "cleansing": The modification of data values to meet domain
restrictions, integrity constraints or other business rules that define when
the quality of data is sufficient for an organization.
Matching: Identifying, linking or merging related entries within or across
sets of data.
Monitoring: Deploying controls to ensure that data continues to conform to
business rules that define data quality for the organization.
Enrichment: Enhancing the value of internally-held data by appending related
attributes from external sources (for example, consumer demographic
attributes and geographic descriptors).
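
To make a few of the core functions above more concrete, here is a minimal sketch in plain Python of profiling, parsing/standardization, cleansing, and matching. It is not taken from the Gartner report or from any vendor's tool. The records, field names, the 10-digit phone rule, the 0 to 120 age range, and the 0.9 similarity threshold are all assumptions made up for illustration.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Toy customer records; every field and value here is invented.
records = [
    {"name": "Alice Smith", "phone": "(555) 123-4567", "age": "34"},
    {"name": "alice  smith", "phone": "555.123.4567", "age": "34"},
    {"name": "Bob Jones", "phone": "", "age": "-5"},
]

def profile(rows, field):
    """Data profiling: collect simple statistics about one field."""
    values = [r.get(field, "") for r in rows]
    non_empty = [v for v in values if v.strip()]
    return {
        "count": len(values),
        "missing": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
        "most_common": Counter(non_empty).most_common(1),
    }

def standardize_phone(raw):
    """Parsing and standardization: keep the digits, emit one layout."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None  # cannot be standardized under this toy rule
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def standardize_name(raw):
    """Collapse repeated whitespace and normalize capitalization."""
    return " ".join(raw.split()).title()

def cleanse_age(raw, low=0, high=120):
    """Generalized cleansing: enforce a domain restriction on age."""
    try:
        age = int(raw)
    except ValueError:
        return None
    return age if low <= age <= high else None

def is_match(a, b, threshold=0.9):
    """Matching: fuzzy comparison of two already-standardized names."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

cleaned = [
    {
        "name": standardize_name(r["name"]),
        "phone": standardize_phone(r["phone"]),
        "age": cleanse_age(r["age"]),
    }
    for r in records
]

print(profile(records, "phone"))
print(cleaned)
print(is_match(cleaned[0]["name"], cleaned[1]["name"]))
```

After standardization the first two records carry the same name and phone number, so a matching step can link them. Real tools presumably rely on much richer knowledge bases and matching logic than a single string-similarity ratio.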
In addition, data quality tools provide a range of related functional
abilities that are not unique to this market but that are required to
execute many of the core functions of data quality, or for specific data
quality applications:
Connectivity/adapters: The ability to interact with a range of different
data structure types.
Subject-area-specific support: Standardization capabilities for specific
data subject areas.
International support: The ability to offer relevant data quality operations
on a global basis (such as handling data in multiple languages and writing
systems).
Metadata management: The ability to capture, reconcile and interoperate
metadata related to the data quality process.
Configuration environment: Capabilities for creating, managing and deploying
data quality rules.
Operations and administration: Facilities for supporting, managing and
controlling data quality processes.
Workflow/data quality process support: Processes and user interfaces for
various data quality roles, such as data stewards.
Service enablement: Service-oriented characteristics and support for
service-oriented architecture (SOA) deployments.
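
As a rough illustration of how a "configuration environment" for data quality rules and a monitoring control might fit together, the sketch below declares rules as data and reports the ones whose pass rate drops below a threshold. The `Rule` class, the `monitor` function, the rule names, the country set, and the threshold are hypothetical and not drawn from the report or any product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    """One declarative data quality rule (hypothetical structure)."""
    name: str
    field: str
    check: Callable[[object], bool]

# A tiny rule set; in a real tool these would be managed and deployed
# through some configuration environment rather than hard-coded.
RULES = [
    Rule("email_present", "email", lambda v: bool(v)),
    Rule("email_has_at", "email", lambda v: isinstance(v, str) and "@" in v),
    Rule("country_known", "country", lambda v: v in {"US", "CA", "CN"}),
]

def monitor(rows, rules, min_pass_rate=0.95):
    """Return {rule name: pass rate} for rules below the threshold."""
    failures = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row.get(rule.field)))
        rate = passed / len(rows) if rows else 1.0
        if rate < min_pass_rate:
            failures[rule.name] = rate
    return failures

# Invented batch of incoming records.
batch = [
    {"email": "a@example.com", "country": "US"},
    {"email": "", "country": "CA"},
    {"email": "no-at-sign", "country": "XX"},
    {"email": "d@example.com", "country": "CN"},
]

alerts = monitor(batch, RULES, min_pass_rate=0.75)
print(alerts)
```

Keeping the rules as plain data means the same list could be run on every new batch, which is one simple way to check that data continues to conform to the business rules over time.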
The tools provided by vendors in this market are generally consumed by
end-user organizations for internal deployment in their IT infrastructure, to
directly support transactional processes that require data quality
operations and to enable staff in data-quality-oriented roles (such as data
stewards) to engage in data quality improvement work. Off-premises solutions
in the form of hosted data quality offerings, SaaS delivery models and
cloud services continue to evolve and grow in popularity.
For vendors to be included in the Magic Quadrant, they must meet the
following criteria:
They must offer stand-alone packaged software tools or cloud-based services
(not only embedded in, or dependent on, other products

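[Quoting l******o's post:]
: Or data cleansing, data quality control, etc.
: At the end of last year Gartner published a Data Quality Tools Magic Quadrant
: report that summarizes the relevant vendors. I can't say whether their choice
: of vendors is reliable, but their summary of data quality control is on point.
: Now that data is being collected in huge volumes, emphasizing data cleaning
: and data quality control is especially necessary. Remember: "Garbage in,
: garbage out."
: The report was originally available from http://www.gartner.com/technology/reprints.do?id=1-1LCD5XL&ct=131007&st=sb,
: but not any more. I'll excerpt a bit of it here and also attach their current
: paid link for everyone's reference, which gives them a little advertising too.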

Post count: 52
Paid link: http://gtnr.it/1tdIeVw


[Quoting l******o's post:]
: Magic Quadrant for Data Quality Tools
: gartner.com, October 7
: Data quality assurance is a discipline focused on ensuring that data is fit
: for use in business processes ranging from core operations to analytics and
: decision-making, regulatory compliance, and engagement and interaction with
: external entities.
: As a discipline, it comprises much more than technology — it also includes
: roles and organizational structures, processes for monitoring, measuring,
: reporting and remediating data quality issues, and links to broader
: information governance activities via data-quality-specific policies.

Post count: 3652
The first step in a data warehouse is ETL (extract, transform, load), and that
step is precisely about cleaning the data.
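
For anyone who hasn't seen ETL in practice, here is a minimal sketch using only the Python standard library, with the cleaning done in the transform step. The CSV contents, the `sales` table, and the rule of dropping rows with an unparseable amount are assumptions for illustration. A real warehouse pipeline would more likely quarantine and log bad rows than silently drop them.

```python
import csv
import io
import sqlite3

# Invented raw input with typical problems: stray whitespace,
# inconsistent capitalization, a missing amount, a non-numeric amount.
RAW_CSV = """id,city,amount
1, new york ,10.50
2,BOSTON,
3,Chicago,abc
4,boston,7
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize city names, drop rows with a bad amount."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # toy policy: skip rows that fail the numeric check
        clean.append((int(row["id"]), row["city"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```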