Data Mining

 

What is Data Mining?

Data mining is the process of discovering previously unknown information in the data held in data warehouses.  New correlations, patterns, and trends can be uncovered by applying techniques drawn from artificial intelligence (AI) research, statistics, mathematics, and modeling to the large amounts of data stored in the data warehouse.

Powerful hardware, such as symmetric multiprocessing (SMP) servers and massively parallel processors (MPP), coupled with advances in AI and neural network technology, is making data mining a valuable extension to many data warehousing initiatives.

However, data mining is still not widely used because of the specialized nature of the tools and the degree of sophistication and knowledge required of the data mining user.  Data mining tools use sophisticated, automated algorithms.  Some tools offer one or two data mining algorithms, and others offer a full suite of data mining capabilities.  Data mining algorithms provide a mechanism to study relationships between data sets.  Associative methods attempt to form rules that describe outcomes based on input data.  Classical statistical methods such as regression models are also widely used.
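As a small illustration of the classical statistical methods mentioned above, the following sketch fits a straight line to two variables by ordinary least squares.  The data (advertising spend and sales) and the function name fit_line are invented for this example and do not come from any particular data mining tool.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical warehouse extract: advertising spend and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.9, 5.1, 7.0, 9.2, 10.8]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales is approximately {slope:.2f} * spend + {intercept:.2f}")
```

The fitted slope estimates how much the outcome changes per unit of input, which is the kind of relationship between data sets that a regression model is meant to expose.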

Currently, four general classes of data mining algorithms are used:

Neural networks-Programs that process information in a manner patterned after the complex interconnections among nerve cells in the brain.  They provide speeds unmatched by traditional computers for sorting through large databases to find close data matches.

Decision tree analysis-Models a data set by splitting it into smaller and smaller sets, where each split is represented by an explicit rule describing the differences between the subsets.  New data can be characterized by comparing it to the decisions at each level in the tree.  The explicit nature of each decision makes the operations particularly easy to understand.

Clustering analysis-A partitioning technique that identifies clusters or groups of closely related data.  Members of the same cluster are grouped because they are more similar to each other than to members of other clusters.  The variances between clusters can identify important distinctions hidden in the data.

Association rules analysis-A model that relates inputs to outcomes, using a single level of decision rules.  Unlike decision trees, single rules here are used to relate important combinations of input features to categories of outcome.
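To make the last of these classes more concrete, the sketch below derives single-level association rules of the form "if A then B" from a handful of transactions, scoring each rule by support (how often A and B occur together) and confidence (how often B occurs when A does).  The transactions, thresholds, and function name are assumptions made for illustration; commercial tools use more elaborate algorithms and handle larger item combinations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer's transaction.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def single_level_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(basket)  # sort so each pair has one canonical order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is too rare to be of interest
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; ties are broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


for lhs, rhs, support, confidence in single_level_rules(TRANSACTIONS):
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Each rule stands alone and is read independently of the others, which is the contrast with a decision tree, where a record must pass through a sequence of nested decisions.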

Some Data Mining Tool Vendors

Company: Product(s)
Angoss: KnowledgeStudio allows users to mine large databases using five decision tree algorithms and three neural net algorithms.  KnowledgeStudio can be embedded in other applications.
Business Objects: BusinessMiner discovers trends hidden in data, then displays them for analysis in the form of a decision tree.  BusinessMiner provides modeling, discovery, visualization, what-if, and segmentation functions.
DataMind: DataCruncher is a data mining tool that can access Oracle and Informix databases.  It includes data mining assistants that automate data mining processes such as the selection of classification or clustering techniques, selection of data sources, goal identification, data preparation, model discovery, model evaluation, evaluation view, and case prediction.
IBM: Intelligent Miner for Data enables the mining of data stored in relational databases and flat files.  It can be used to discover associations or patterns, to segment (or cluster) records based on similarity of attributes, to discover similar time sequences, or to create predictive (or classification) models.
Integral Solutions: The Clementine data mining toolkit allows users to extract selected data, manipulate it, and then visualize trends and relationships.  It uses machine learning techniques such as neural networks and rule induction to identify relationships automatically in data and generate rules to apply to future cases.
Magnify: Magnify's Pattern data mining system integrates a data warehouse, toolkits for data mining and predictive modeling, and industry-specific vertical applications to support the entire data mining process, from cleansing and loading the data to decision-making.
Pilot Software: Pilot Discovery Server uses customer metrics such as profitability, lifetime value, or new product return on investment to create a focused market segmentation and analysis of customer behavior.  Working with and residing in a relational data warehouse, Pilot Discovery Server issues standard SQL queries for decision tree analysis.
SAS: Enterprise Miner contains analysis tools to create and compare multiple models.  Statistical tools include clustering, decision trees, linear and logistic regression, and neural networks.  Data preparation tools include outlier detection, variable transformations, random sampling, and tools to partition data sets into training, test, and validation sets.  Advanced visualization tools enable users to examine large amounts of data in multidimensional histograms and to compare modeling results graphically.
Silicon Graphics: Mineset is a visual mining tool that provides graphical representations in the data exploration, mining, and validation processes.  It includes association, decision tree, and Bayesian algorithms in its toolset.  The association algorithms are used for market-basket analysis, and the decision-tree options have been extended to show all possible tree configurations.  The Bayesian algorithm is used with an Evidence visualizer to show evidence for or against specific outcomes.
SPSS: AnswerTree enables users to discover segments, build customer profiles, and predict outcomes.  It creates diagrams that read like a flowchart so users can identify critical segments and relationships.  AnswerTree offers powerful statistical analysis algorithms to handle the segments.
Thinking Machines: Darwin lets users conduct data mining and integrate data mining results into custom applications.  Rather than creating statistical printouts of the results, Darwin offers online visualization.  Features include the ability to display n-dimensional graphics, dynamically rotate graphs so they can be viewed from many different perspectives, and show related 3-D graphs on screens.

Links to Data Mining