funded by   
Class Miner
The Miner class is the central part of the Mining library. After receiving
its input data, the miner calculates many data from them: counting of events,
counting of combination of events, the dependence of events from each other
and their reliabilty, etc. Information on how these calculations are done
internally are provided in the section "Internal calculations".
Overview
| Miner |
| New() |
| New(ByRef MiningInput:
MiningInputData) |
|
| AttributableRisk():
Single(,) |
| BinningThreshold():
Double |
| CaseCountPerEvent(): Integer() |
| CasesEvents():
Boolean(,) {read-only} |
| CaseNames():
String() {read-only} |
| ChiSquare():
Single(,) {read-only} |
| Confidence():
Single(,) {read-only} |
| DependenceThreshold():
Double |
| EventCountPerCase():
Integer() {read-only} |
| EventNames():
String() {read-only} |
| FourFoldCorrelation(): Single(,)
{read-only} |
| FrequencyDecrease():
Double |
| MaximumEventCount():
Integer {read-only} |
| MaximumDepth():
Integer |
| MinimumEventCount():
Integer |
| MinimumFirstEventFrequency():
Double |
| NumberOfCases():
Integer {read-only} |
| NumberOfCasesOriginal():
Integer {read-only} |
| NumberOfEvents():
Integer {read-only} |
| OddsRatio():
Single(,) {read-only} |
| Pearson(): Single()
{read-only} |
| RelativeRisk():
Single() {read-only} |
| Support(): Int16(,)
{read-only} |
|
| calculateEvolutionTree():
void |
| calculateComplexityDistribution():
void |
| getComplexityDistribution():
Int16(,) |
| getComplexityDistributionExpected():
Single(,) |
| getComplexityDistributionAsMap(EventIndex:
Integer, Optional EventMaximum: Integer = 30, Optional Width: Integer =
450, Optional Height: Integer = 150): ImageMapData |
| getComplexityDistributionDifferenceToExpectedAsMap(EventIndex:
Integer, Optional EventMaximum: Integer = 30, Optional Width: Integer =
450, Optional Height: Integer = 150): ImageMapData |
| getEventsOrderedByFrequency(ByRef
Numbers: Integer(), ByRef Names: String(), Optional MinimumEventCount:
Integer = 0, Optional TopN: Integer = 0): void |
| getNodes(): TreeNode() |
| Mine(): void |
Programming Language
-
Microsoft Visual Basic .Net.
Availability
Constructors
Public Sub New()
Instantiates a new Miner object.
This constructor is not intended for general use since the input
data cannot be set later.
Public Sub New(ByRef MiningInput As MiningInputData)
Instantiates a new Miner object with the given MiningInputData.
Parameters
-
MiningInputData: the data on which mining is to be performed.
Remarks
-
The Miner gets its data from the MiningInputData object. It does not automatically
start mining, but allows for several properties to be set before.
Interfaces
This class does not implement an interface.
Enumerations
The class does not provide enumerations.
Properties
Public ReadOnly Property AttributableRisk()
As Single(,)
Returns a matrix with the Attributable Risk values for pairs of events.
The Attributable Risk compares the occurence of event2 under the pre-condition
of event1 to its occurence under the pre-condition of NOT-event1.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated in two steps: nTmp = b / (a + b) * (a * d / b / c -
1), ar = nTmp / (1 + nTmp).
The Attributable Risk matrix is typically non-symmetric, i.e. AR(event1;event2)
<> AR(event2;event1).
Property Value
-
a matrix with the Attributable Risk values for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
The objectives of Attibutable Risk and Relative
Risk are similar, but epidemiologists prefer the Attibutable Risk.
-
This definition and way of calculation of the Attributable Risk was taken
from the PhD thesis "Buchan: The Development of a Statistical Computer
Software Resource for Medical Research" (p. 54; no more available in the
internet as a pdf-file, but you can browse through its chapters at the
help pages of "StatsDirect".
-
For more information on the calculation process, see "Significant
combinations" in the "Internal Calculations"
section.
Public Property BinningThreshold() As Double
Gets or sets the binning threshold.
"Binning" means that two or more events are merged into one pseudo-event
representing the set union of those events. Binning is only useful for
events which almost always occur together. This threshold determines the
value which must be reached for the probality for the occurence of event
A with the pre-condition of the occurence of event B and for the probality
for the occurence of event B with the pre-condition of the occurence of
event A.
Property Value
Remarks
-
Binning is performed for the calculation of the Evolution
Tree and for the Dependence Network.
-
Default Value for the threshold is 1 (i.e. both events must always occur
together).
-
A value greater than 1 will prevent binning.
Public ReadOnly Property CaseCountPerEvent()
As Integer()
Returns an array showing the number of cases the respective event was encountered
in.
Property Value
-
an array showing the number of cases the respective event was encountered
in.
Remarks
-
Before this property can be retrieved, the respective calculations (Mine
or calculateEvolutionTree) have to
be performed. Since this is a calculated value, this property is read-only.
-
The array index is 1-based, and the same with the EventNames()
property.
Public ReadOnly Property CasesEvents() As Boolean(,)
Returns a two-dimensional matrix indicating if a given event has occured
in a given case.
Property Value
-
a two-dimensional matrix indicating if a given event has occured in a given
case.
Remarks
-
The first dimension is the case. Its index is the same with the CaseNames()
property.
-
The second dimension is the event. Its index is the same with the
EventNames()
property.
-
Both indices are 1-based.
-
This property is set with the constructor of the Miner
object.
Public ReadOnly Property CaseNames() As String()
Returns an array with the case names.
Property Value
-
an array with the case names.
Remarks
-
The index is 1-based.
-
This property is set with the constructor of the Miner
object.
Public ReadOnly Property ChiSquare() As Single(,)
Returns a matrix with chi-square values for pairs of events. Each pair
of events is linked to a fourfold table showing how often the events occured
alone or in combination or none of them. A high chi-square value means
that the events are likely (positively or negatively) dependent on each
other.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated as: (a * d - b * c) * (a * d - b * c) * (a + b +c +d)
/ (a + b) / (c + d) / (b + d) / (a + c). The chi-square matrix is symmetric,
i.e. ChiSquare(event1;event2) = ChiSquare(event2;event1).
Property Value
-
a matrix with chi-square values for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (calculateEvolutionTree
or Mine) have to be performed. Since this is a calculated
value, this property is read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
In contrast to the standard definition of chi-square, the value here is
signed in order to indicate the direction of non-equality (positive sign
for positive correlation, negative sign for negative correlation).
-
For more information on the calculation process, see "Significant
combinations" in the "Internal Calculations"
section.
Public ReadOnly Property Confidence() As Single(,)
Returns a matrix with confidence values for pairs of events. Each pair
of events is linked to a fourfold table showing how often the events occured
alone or in combination or none of them. The confidence expresses the proportion
of the occurence of the combination of events relative to all occurences
of the second event. A high confidence value means that the second events
is highly dependent on the first event.
Property Value
-
a matrix with confidence values for pairs of events.
Remarks
Public Property DependenceThreshold()
As Double
Gets or sets the dependence threshold used for classifying nodes in the
evolution tree as dependent sub-nodes or as parallel (same-level) nodes.
Property Value
-
the dependence threshold.
Remarks
-
This property is only used for the evolution tree.
-
During the calculation of an Evolution
Tree, events have to be classified as events on the same level or as
dependent events. If the (signed value) of the ChiSquare
property is greater than this threshold, the new event is regarded as dependent
on the previous event and will go into the next generation of the evolution
tree.
-
For more information on the calculation of the evolution tree, see "Evolution
Tree" in the "Internal Calculations" section.
Public ReadOnly Property EventCountPerCase()
As Integer()
Returns an array showing the number of events for each case.
Property Value
-
an array showing the number of events for each case.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based and corresponds to the CaseNames
property.
Public ReadOnly Property EventNames() As String()
Returns an array with the event names.
Property Value
-
an array with the event names.
Remarks
-
The value of this property is set with the constructor
of the Miner object.
-
The index is 1-based.
Public ReadOnly Property FourFoldCorrelation()
As Single(,)
Returns a matrix showing the fourfold table correlation values for pairs
of events. This value is also referred to as "Phi" in the statistics literature.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated as: (a * d - b * c) / SQRT((a + b) * (c + d) * (b +
d) * (a + c)). This coefficient matrix is symmetric, i.e. FF(event1;event2)
= FF(event2;event1)
Property Value
-
a matrix showing the fourfold table correlation values for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
Sometimes, this correlation coefficient is also called Pearson's correlation
coefficent r; see also "Pearson's
correlation" at "HyperStat".
-
For more information on the calculation process, see "Significant
combinations" in the "Internal Calculations"
section.
Public Property FrequencyDecrease() As
Double
Gets or sets the minimum frequency for an event in a later generation in
the evolution tree. (proportion towards the parent number of occurences)
Property Value
-
the minimum frequency for an event in a later generation in the evolution
tree.
Remarks
-
The minimum frequency for the occurence of an event in a later generation
of the evolution tree is set independent from the minimum frequency of
the start generation (=MinimumFirstEventFrequency).
-
It is measured in relation its individual parent node.
-
For more information on the calculation of the evolution tree, see "Evolution
Tree" in the "Internal Calculations" section.
Public Property MaximumDepth() As Integer
Sets or gets the maximum number of generations displayed in the evolution
tree.
Property Value
-
the maximum number of generations displayed in the evolution tree.
Remarks
-
The calculation of the evolution tree may stop earlier due to other parameters,
e.g. MinimumEventCount.
-
Increasing values of MaximumDepth may cause an exponential increase in
calculation time.
-
For more information on the calculation of the evolution tree, see "Evolution
Tree" in the "Internal Calculations" section.
Public ReadOnly Property MaximumEventCount()
As Integer
Returns the maximum number of events encountered in a case.
Property Value
-
the maximum number of events encountered in a case.
Remarks
Public Property MinimumEventCount() As
Integer
Gets or sets the minimum number of occurences of an event must have in
order to be included in data mining.
Property Value
-
the minimum number of occurences for an event to be included in data mining.
Remarks
-
Early during data mining, events which occured at low frequency are sorted
out. More information on this process can be obtained from the "General"
chapter of the "Internal Calculations" section.
-
If MinimumEventCount is set to a value greater than the product of MinimumFirstEventFrequency
times number of cases, MinimumFirstEventFrequency
is adjusted.
Public Property MinimumFirstEventFrequency()
As Double
Gets or sets the minimum frequency for an event at the base of an evolution
tree.
Property Value
-
the minimum frequency for an event at the base of an evolution tree.
Remarks
Public ReadOnly Property NumberOfCases() As
Integer
Returns the number of cases which still exist in the Miner.
Property Value
-
the number of cases which still exist in the Miner.
Remarks
-
Early during data mining, events which occured at low frequency are sorted
out. Then, also cases which do not have events are sorted out thus reducing
the number of cases relative to the original
number of cases. More information on this process can be obtained from
the "General" chapter of the
"Internal Calculations" section.
Public ReadOnly Property NumberOfCasesOriginal()
As Integer
Returns the number of cases originally found in the Miner.
Property Value
-
the number of cases originally found in the Miner.
Remarks
-
Early during data mining, events which occured at low frequency are sorted
out. Then, also cases which do not have events are sorted out thus reducing
the actual number of cases relative to the
original number of cases. More information on this process can be obtained
from the "General" chapter of
the "Internal Calculations" section.
Public ReadOnly Property NumberOfEvents()
As Integer
Returns the number of events which still exist in the Miner.
Property Value
-
the number of events which still exist in the Miner.
Remarks
-
Early during data mining, events which occured at low frequency are sorted
out thus reducing the actual number of events relative to the original
number of events. More information on this process can be obtained from
the "General" chapter of the
"Internal Calculations" section.
Public ReadOnly Property OddsRatio() As Single(,)
Returns a matrix with the odds ratio for pairs of events. The odds ratio
compares the occurence of both events or none of the events on one side
with the occurence of one event only (without the other event) on the other
side.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated as: (d / b) / (c / a) . The Odds Ratio matrix is symmetric,
i.e. OR(event1;event2) = OR(event2;event1).
Property Value
-
a matrix with the odds ratio for pairs of events.
Remarks
Public ReadOnly Property Pearson() As Single(,)
Returns a matrix with the Pearson coefficients for pairs of events.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated as: square_root(ChiSquare / (ChiSquare + number_of_cases)),
with
ChiSquare = (a * d - b * c) * (a * d - b *
c) * (a + b +c +d) / (a + b) / (c + d) / (b + d) / (a + c). The Pearson
coefficient matrix is symmetric, i.e. Pearson(event1;event2) = Pearson(event2;event1).
Property Value
-
a matrix with the Pearson coefficients for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
For more information on the calculation process, see "Significant
combinations" in the "Internal Calculations"
section.
-
Note that different definitions of the coefficient mean different ways
of calculating them. Often, the term Pearson coefficient is used to describe
the "fourfold correlation coefficient".
-
This definition of the Pearson's coefficients of contingency was taken
from the PhD thesis "Buchan: The Development of a Statistical Computer
Software Resource for Medical Research" (p. 47; no more available in the
internet as a pdf-file, but you can browse through its chapters at the
help pages of "StatsDirect".
Public ReadOnly Property RelativeRisk() As Single(,)
Returns a matrix with the Relative Risk values for pairs of events. The
Relative Risk compares the occurence of event2 under the pre-condition
of event1 to its occurence under the pre-condition of NOT-event1.
From the fourfold table for each pair of events (event1;event2)
|
event1 - |
event1 + |
| event2 - |
a |
b |
| event2 + |
c |
d |
it is calculated as: (d / (b + d)) / (c / (a + c)). The Relative Risk
matrix is typically non-symmetric, i.e. RR(event1;event2) <> RR(event2;event1).
Property Value
-
a matrix with the Relative Risk values for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
The objectives of Attibutable Risk and
Relative Risk are similar, but epidemiologists prefer the Attibutable
Risk.
-
For more information on the calculation process, see "Significant
combinations" in the "Internal Calculations"
section.
Public ReadOnly Property Support() As Int16(,)
Returns a matrix with the Support values for pairs of events. For each
pair (event1; event2) the number of cases containing such a combination
is counted. The Support matrix is symmetric, i.e. Support(event1;event2)
= Support(event2;event1).
Property Value
-
a matrix with the Support values for pairs of events.
Remarks
-
Before this property can be retrieved, the respective calculations (e.g.
Mine)
have to be performed. Since this is a calculated value, this property is
read-only.
-
The index is 1-based for each dimension and corresponds to the EventNames
property.
-
For more information on the calculation process, see the "General"
chapter in the "Internal Calculations" section.
Methods
Public Sub calculateEvolutionTree()
Initiates the calculation of an evolution tree.
Return Value
Parameters
Remarks
Public Sub calculateComplexityDistribution()
Calculates the distribution of events relative to the total number of events
per case.
Return Value
Parameters
Remarks
Public Function getComplexityDistribution()
As Int16(,)
Returns a matrix showing how often a given event occured in cases with
a given number of distinct events.
Return Value
-
a matrix showing how often a given event occured in cases with a given
number of distinct events.
Parameter
Remarks
-
The first dimension stands for the event. Its index is identical with EventNames.
-
The second dimension stands for the number of events per case.
-
The total number of cases with a given number of events nEv is stored at
the index (0; nEv).
-
Before this function can be called, the respective calculations (calculateComplexityDistribution)
have to be done first.
-
For more information on the calculation of the Complexity Distribution,
see "Complexity Distribution"
in the "Internal Calculations" section.
Public Function getComplexityDistributionExpected()
As Single(,)
Returns a matrix showing how often a given event was expected to occur
in cases with a given number of distinct events estimated from the total
number of case with the given event and the total number of cases with
the given number of distinct events.
Return Value
-
a matrix showing how often a given event was expected to occur in cases
with a given number of distinct events.
Parameter
Remarks
-
The first dimension stands for the event. Its index is identical with EventNames.
-
The second dimension stands for the number of events per case.
-
The base probability for a given events nEv is stored at the index (nEv;0).
-
Before this function can be called, the respective calculations (calculateComplexityDistribution)
have to be done first.
-
For more information on the calculation of the Complexity Distribution,
see "Complexity Distribution"
in the "Internal Calculations" section.
Public Function getComplexityDistributionAsMap(ByVal
EventIndex As Integer, Optional ByVal EventMaximum As Integer = 30, Optional
ByVal Width As Integer = 450, Optional ByVal Height As Integer = 150) As
ImageMapData
Returns an ImageMap showing
the Complexity Distribution.
Return Value
-
an ImageMap showing the Complexity Distribution.
Parameter
-
EventIndex: The index of the event. It is the same with the index
of the EventNames property.
-
EventMaximum: The maximum number of events per case to be shown.
-
Width: The width of the image map in pixel. By default 450.
-
Height: The height of the image map in pixel. By default 150.
Remarks
-
The x-axis shows the number of events per case, the y-axis the number of
cases with the respective number of events.
-
Cases with the selected event are shown in red color, all cases in blue
color.
-
Before this function can be called, the respective calculations (calculateComplexityDistribution)
have to be done first.
-
For more information on the calculation of the Complexity Distribution,
see "Complexity Distribution"
in the "Internal Calculations" section.
Public
Function getComplexityDistributionDifferenceToExpectedAsMap(ByVal EventIndex
As Integer, Optional ByVal EventMaximum As Integer = 30, Optional ByVal
Width As Integer = 450, Optional ByVal Height As Integer = 150) As ImageMapData
Returns an ImageMap showing
the difference in occurrence between expected and observed numbers of the
Complexity
Distribution.
Return Value
-
an ImageMap showing the difference in occurrence between expected and observed
numbers of the Complexity Distribution.
Parameter
-
EventIndex: The index of the event. It is the same with the index
of the EventNames property.
-
EventMaximum: The maximum number of events per case to be shown.
-
Width: The width of the image map in pixel. By default 450.
-
Height: The height of the image map in pixel. By default 150.
Remarks
-
The x-axis shows the number of events per case, the y-axis the difference
in the number of cases between the observed and the expected occurence.
-
A negative difference is shown in red color, a positive difference in blue
color.
-
Before this function can be called, the respective calculations (calculateComplexityDistribution)
have to be done first.
-
For more information on the calculation of the Complexity Distribution,
see "Complexity Distribution"
in the "Internal Calculations" section.
Public Sub getEventsOrderedByFrequency(ByRef
Numbers As Integer(), ByRef Names As String(), Optional ByVal MinimumEventCount
As Integer = 0, Optional ByVal TopN As Integer = 0)
Sorts the events by descending order of frequency. The event indices are
returned in the Numbers array, their names in the Names
array.
Return Value
-
None. But relevant arrays are passed to the function by reference
and changed during the analysis.
Parameter
-
Numbers: an array containing the indices of the selected and ordered
events.
-
Names: an array containing the names of the selected and ordered
events.
-
MinimumEventCount: indicates the minimum number of cases an event
has to be found in. By default 0.
-
TopN: indicates how many events have to be selected; 0 means that
all events (fulfilling MinimumEventCount) have to be selected.
Remarks
-
The Numbers and Names arrays will be filled with the
indices and names of the events in descending order.
-
Their indices are 1-based.
-
The search for the next most frequent event stops when TopN events
have been found or when no event occuring at least MinimumEventCount
times was encountered.
Public Function getNodes() As TreeNode()
Returns the nodes of the evolution tree.
Return Value
-
the nodes of the evolution tree.
Parameter
Remarks
Public Sub Mine()
Starts the data mining process.
Return Value
Parameter
Remarks
-
This function performs the first (general) steps of data mining: counting
events per case, cases per event, removing events which occured to rarely,
removing cases which do not have events, calculating support and confidence,
binning, and calculating data on pairs of events.
-
Parameters relevant for this general part of data mining are BinningThreshold,
and
MinimumEventCount. They are set via
their respective properties.
-
For more information on the Mining process, see the "General"
and "Significant combinations" chapters
in the "Internal Calculations" section.
Interaction with other classes
Classes using MiningInputData
The MiningInputData class is the primary entrance point into the Mining
dll. It is passed as a parameter into the constructor
of the Miner object.
A MiningInputData object should be the output of some appropriate functions
of the application into which the Mining class is embedded.
Classes used by MiningInputData
The MiningInputData class uses quite basic data types only to ensure a
wide usability.
Interal Calculations
General
Data mining relies on flat data structures. The principal data matrix is
two-dimensional: one dimension lists all cases (case identifiers), the
other dimension lists all possible events (event identifiers). For each
point (case;event) in the matrix, a boolean value indicates if the given
event was encountered in given case. This structure has to be supplied
from outside to the mining application.
The first steps performed in the Miner objects are counting: the Miner
determines how many events can be found per case (private function countEventsPerCase),
and how many cases an event can be found in (private function countCasesPerEvent).
These general calculations are already enough for the distribution of
events towards events per case.
Events which did not occur are deemed useless and thus removed. Also
events which happened only very rarely may be deemed useless and removed.
The private function used is removeZeroEvents; it uses the MinimumEventCount
property to determine if an event is to be removed or not; set the property's
value to 1 in order to prevent the removal of events.
After the removal of rare events, cases without events are removed
(private function removeZeroCases which is invoked by removeZeroEvents).
Now, combinations of events are counted: for each pair (event1; event2)
the number of cases containing such a combination is determined. This is
called the Support. This matrix is symmetrical.
It is calculated with the private function calculateSupport.
Next, the Support for a pair of events (event1; event2) is set into
relation to the number of cases containing event2. This is called the Confidence.
It can also be interpreted as the probality for the occurence of event1
with the pre-condition of the occurence of event2. The matrix is typically
non-symmetrical It is calculated with the private function calculateConfidence.
Binning is the next step. "Binning" means that two or more events are
merged into one pseudo-event representing the set union of those events.
Binning is only useful for events which almost always occur together. The
BinningThreshold
property sets the minimum value of confidence which must be fulfilled in
both directions for a pair of events (event1; event2).
Binning is performed with the private function removeReciprocallyDependent.
It replaces events for the binned pseudo-events and calculates their presence
in the cases (set union).
If binning was successful, events per case, cases per event, support
and confidence are calculated again.
Significant combinations
Now we can ask if a combination of two events is somehow special: if it
occurs more often than one would expect or less often, and if we can rely
on this difference.
As an example, let us assume that we have 100 cases. Event1 occured
60 times, event2 45 times, and the combination of event1 and event2 was
found 32 times. At the moment, we do not look at event3, event4 and so
on. This leads to the following four field matrix (also called "fourfold
table" or "2 x 2 contingency table"):
|
event1 |
NOT event1 |
sum |
| event2 |
32 |
|
45 |
| NOT event2 |
|
|
|
| sum |
60 |
|
100 |
The remaining fields can easily be filled:
|
event1 |
NOT event1 |
sum |
| event2 |
32 |
13 |
45 |
| NOT event2 |
28 |
27 |
55 |
| sum |
60 |
40 |
100 |
The probalities for each event or combination of events is obtained
by dividing the numbers in the above table by the number of all events.
Two events are called independent from each other, if the probability
of the combination is the product of the probalities of the events, i.e.
p(e1;e2)=p(e1)*p(e2).
In our example, the probality for the combination is 0.32, while the
product is 0.60*0.45=0.27. Hence, the combination of events occured more
often than expected. The table with all expected values calculated from
the marginal distribution looks as follows:
|
event1 |
NOT event1 |
sum |
| event2 |
27 |
18 |
45 |
| NOT event2 |
33 |
22 |
55 |
| sum |
60 |
40 |
100 |
The difference between the observed value and the expected value for
the combination of e1 and e2 is squared:
(32 - 27) ^ 2 = 5 ^ 2 = 25.
This squared value is divided by each expected value and the quotients
are summed up:
25 / 27 + 25 / 18 + 25 / 33 + 25 / 22 = 4.21.
This last value is called the "chi-square-value". Chi-square-values
higher than 3.84 are called significant at the 5% level.
The function calculates a few more values.
The "phi" value and the "relatve risk". The phi value stands for the
correlation (also called contingency) between the two events, without explaining
its reliability, but showing a direction ("negative" vs. "positive" correlation).
The Pearson coefficient is calculated as the square root of chi-square
divided by the sum of chi-square and the total number of cases.
The "relative risk" compares the occurence of event2 under the pre-condition
of event1 to its occurence under the pre-condition of NOT-event1; so does
"attributable risk" using a different formula.
The "odds ratio" compares the occurence of both events or none of the
events on one side with the occurence of one event only (without the other
event) on the other side.
In the above example, the phi-value is calculated from the second table
as (32 * 27 - 28 * 13) / (60 * 40 * 45 * 55) = 0.0000842 which is a very
low positive correlation.
The Pearson coefficient is sqr(4.21 / (4.21 + 100)) = 0.2.
The Relative Risk is rr(ev1,ev2) = (32 / 60) / (13 / 40) = 1.64, and
rr(ev2,ev1) = (32 / 45) / (28 / 55) = 1.40, which means that event2 is
more likely when event1 has been encountered, and that event1 is more likely
when event2 has been encountered.
The Odds Ratio is (32 * 27) / (28 * 13) = 2.37.
The data are available with the properties ChiSquare,
FourFoldCorrelation,
OddsRatio,
Pearson,
and
RelativeRisk.
Note that the chi-square-test is not an exact test. It requires all
expected values to be at least 5 and the total number of cases to be at
least 20. For smaller values, the Yates correction or the Fisher's Exact
Test could be used. None of them is implemented in the Miner; instead the
chi-square-test is used for all numbers. This must be thought of when interpreting
the results of data mining.
Furthermore, the chi-square-test is a two-sided test. That means, it
checks for equality / inequality. But here, a one-sided view is taken:
the combination of the two events occurs more often or less often than
expected. Hence, the chi-square-test is not fully appropriate for this
interpretation. A one-sided test is also not applicable because generally
the direction of the deviation from the expected value cannot be substantiated
by theoretical approaches.
For more background information on statistics and fourfold tables, see
"The
Development of a Statistical Computer Software Resource for Medical Research"
by Iain Edward Buchan.
Complexity Distribution
The complexity distribution looks for the occurence of a given event relative
to the number of all events per case. For example, how often article X
was found in shopping baskets with a total of 1, 2, 3, ... (distinct) items,
or how often a given chromosomal rearrangement was found in karyotypes
having a total of 1, 2, 3, ... (distinct) rearrangements.
The first step is very easy: the complexity distribution matrix can
simply be filled by counting how often an event was found for each total
number of events per case. Also, the total number of cases for each total
number of events per case is counted in the same part.
Very complicated however is the determination of the expected values
for the distribution. Independent from the total number of events found
in a case, a given event Ev can be encountered there only once or not at
all, and never twice or more often. A case cannot contain more events than
there are distinct events available.
Using an iterative algorithm and some approximation, a base probability
for each event is calculated and with that probability the expected values
at each number of all events per case determined.
Evolution Tree
The most complicated feature of the Miner is the evolution tree. It is
based on the assumption that the events can be brought into a specific
order which represents the order in which they occured.
When analysing articles in a shopping basket, the hypothesis of such
a series may seem daring. For chromosomal aberrations found in a cancer
patient, such a hypothesis is adequate. With the last origine of data in
mind, this feature was developed.
For example, in a leucemia patient, the cancer starts with the Philadelphia
translocation, later the gain of the der(22) follows, then gain of chromosome
8, loss of chromosome 7, etc.
In retinoblastoma however, the first step often cannot be detected by
banding technics: the elimination of a gene on chromosome 13 may occur
by a point mutation, by a very small deletion, or by some deletions spanning
this gene, the latter giving raise to quite distinct descriptions of the
event for each patient. When the isochromosome 6q is gained, the typical
path which can be detected cytogenetically, starts off.
Hence, the evolution tree must cope with data where the primary event
need not be accessible, but where pathways starting from a later point
can be detected.
The calculation of the evolution tree starts with the general
calculations and the calculations of significant
combinations as described above. Next, the most frequent event is determined;
its frequency most be above the MinimalFirstEventFrequency
property. Thereafter, alternative start points are calculated with the
calculateEqualNodes function, and the next generation of events below the
most frequent event are calculated with the calculateDependentNodes function.
The calculateEqualNodes functions searches through the CaseCountPerEvent
array for events which occured with a frequency higher than MinimalFirstEventFrequency.
Such events are then checked if they are dependent on the most frequent
event determined above; if the signed value of ChiSquare(this
event; most frequent event) is greater than DependenceThreshold,
this event is regarded positively dependent and thus discarded. Otherwise,
this event is a new node on the same level as the most frequent event;
for this event, dependent events are calculated with the calculateDependentNodes
function.
The calculateDependentNodes function first checks if the maximum
depth of the Evolution Tree has been reached. If not, it establishes
a new Miner object for that set of cases which contains the event of its
parent node, but with that event not being included in the data. Note that
this may lead to a number of "empty" cases, i.e. those which contained
the event of the parent node only; those cases will be discarded in a later
step. Its MinimalFirstEventFrequency
is set to the FrequencyDecrease value,
other relevant properties are simply copied. This new Miner object in turn
starts the calculation of an Evolution Tree of its own which is to be added
to its parent node.