Class Miner
The Miner class is the central part of the Mining library. After receiving its input data, the miner calculates many data from them: counting of events, counting of combination of events, the dependence of events from each other and their reliabilty, etc. Information on how these calculations are done internally are provided in the section "Internal calculations".
Overview
- Constructors
- Interfaces
- Enumerations
- Properties
- Methods
- Interaction with other classes
- Internal calculations
Miner |
New() |
New(ByRef MiningInput: MiningInputData) |
AttributableRisk(): Single(,) |
BinningThreshold(): Double |
CaseCountPerEvent(): Integer() |
CasesEvents(): Boolean(,) {read-only} |
CaseNames(): String() {read-only} |
ChiSquare(): Single(,) {read-only} |
Confidence(): Single(,) {read-only} |
DependenceThreshold(): Double |
EventCountPerCase(): Integer() {read-only} |
EventNames(): String() {read-only} |
FourFoldCorrelation(): Single(,) {read-only} |
FrequencyDecrease(): Double |
MaximumEventCount(): Integer {read-only} |
MaximumDepth(): Integer |
MinimumEventCount(): Integer |
MinimumFirstEventFrequency(): Double |
NumberOfCases(): Integer {read-only} |
NumberOfCasesOriginal(): Integer {read-only} |
NumberOfEvents(): Integer {read-only} |
OddsRatio(): Single(,) {read-only} |
Pearson(): Single() {read-only} |
RelativeRisk(): Single() {read-only} |
Support(): Int16(,) {read-only} |
calculateEvolutionTree(): void |
calculateComplexityDistribution(): void |
getComplexityDistribution(): Int16(,) |
getComplexityDistributionExpected(): Single(,) |
getComplexityDistributionAsMap(EventIndex: Integer, Optional EventMaximum: Integer = 30, Optional Width: Integer = 450, Optional Height: Integer = 150): ImageMapData |
getComplexityDistributionDifferenceToExpectedAsMap(EventIndex: Integer, Optional EventMaximum: Integer = 30, Optional Width: Integer = 450, Optional Height: Integer = 150): ImageMapData |
getEventsOrderedByFrequency(ByRef Numbers: Integer(), ByRef Names: String(), Optional MinimumEventCount: Integer = 0, Optional TopN: Integer = 0): void |
getNodes(): TreeNode() |
Mine(): void |
Programming Language
- Microsoft Visual Basic .Net.
Availability
- Copyrighted but free software under the GNU Public License.
- Binaries and source code can be obtained from the Download section.
Constructors
Public Sub New()
Instantiates a new Miner object.This constructor is not intended for general use since the input data cannot be set later.
Public Sub New(ByRef MiningInput As MiningInputData)
Instantiates a new Miner object with the given MiningInputData.Parameters
- MiningInputData: the data on which mining is to be performed.
Remarks
- The Miner gets its data from the MiningInputData object. It does not automatically start mining, but allows for several properties to be set before.
Interfaces
This class does not implement an interface.Enumerations
The class does not provide enumerations.Properties
Public ReadOnly Property AttributableRisk() As Single(,)
Returns a matrix with the Attributable Risk values for pairs of events. The Attributable Risk compares the occurence of event2 under the pre-condition of event1 to its occurence under the pre-condition of NOT-event1.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated in two steps: nTmp = b / (a + b) * (a * d / b / c -
1), ar = nTmp / (1 + nTmp).
The Attributable Risk matrix is typically non-symmetric, i.e. AR(event1;event2)
<> AR(event2;event1).
Property Value
- a matrix with the Attributable Risk values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- The objectives of Attibutable Risk and Relative Risk are similar, but epidemiologists prefer the Attibutable Risk.
- This definition and way of calculation of the Attributable Risk was taken from the PhD thesis "Buchan: The Development of a Statistical Computer Software Resource for Medical Research" (p. 54; no more available in the internet as a pdf-file, but you can browse through its chapters at the help pages of "StatsDirect".
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
Public Property BinningThreshold() As Double
Gets or sets the binning threshold."Binning" means that two or more events are merged into one pseudo-event representing the set union of those events. Binning is only useful for events which almost always occur together. This threshold determines the value which must be reached for the probality for the occurence of event A with the pre-condition of the occurence of event B and for the probality for the occurence of event B with the pre-condition of the occurence of event A.
Property Value
- the binning threshold.
Remarks
- Binning is performed for the calculation of the Evolution Tree and for the Dependence Network.
- Default Value for the threshold is 1 (i.e. both events must always occur together).
- A value greater than 1 will prevent binning.
Public ReadOnly Property CaseCountPerEvent() As Integer()
Returns an array showing the number of cases the respective event was encountered in.Property Value
- an array showing the number of cases the respective event was encountered in.
Remarks
- Before this property can be retrieved, the respective calculations (Mine or calculateEvolutionTree) have to be performed. Since this is a calculated value, this property is read-only.
- The array index is 1-based, and the same with the EventNames() property.
Public ReadOnly Property CasesEvents() As Boolean(,)
Returns a two-dimensional matrix indicating if a given event has occured in a given case.Property Value
- a two-dimensional matrix indicating if a given event has occured in a given case.
Remarks
- The first dimension is the case. Its index is the same with the CaseNames() property.
- The second dimension is the event. Its index is the same with the EventNames() property.
- Both indices are 1-based.
- This property is set with the constructor of the Miner object.
Public ReadOnly Property CaseNames() As String()
Returns an array with the case names.Property Value
- an array with the case names.
Remarks
- The index is 1-based.
- This property is set with the constructor of the Miner object.
Public ReadOnly Property ChiSquare() As Single(,)
Returns a matrix with chi-square values for pairs of events. Each pair of events is linked to a fourfold table showing how often the events occured alone or in combination or none of them. A high chi-square value means that the events are likely (positively or negatively) dependent on each other.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated as: (a * d - b * c) * (a * d - b * c) * (a + b +c +d) / (a + b) / (c + d) / (b + d) / (a + c). The chi-square matrix is symmetric, i.e. ChiSquare(event1;event2) = ChiSquare(event2;event1).
Property Value
- a matrix with chi-square values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (calculateEvolutionTree or Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- In contrast to the standard definition of chi-square, the value here is signed in order to indicate the direction of non-equality (positive sign for positive correlation, negative sign for negative correlation).
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
Public ReadOnly Property Confidence() As Single(,)
Returns a matrix with confidence values for pairs of events. Each pair of events is linked to a fourfold table showing how often the events occured alone or in combination or none of them. The confidence expresses the proportion of the occurence of the combination of events relative to all occurences of the second event. A high confidence value means that the second events is highly dependent on the first event.Property Value
- a matrix with confidence values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (calculateEvolutionTree or Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- The confidence is calculated by dividing the Support of the combination by the number of occurences of the second event: conf(ev1,ev2) = supp(ev1,ev2)/n(ev2).
- For more information on the calculation process, see the "General" chapter in the "Internal Calculations" section.
Public Property DependenceThreshold() As Double
Gets or sets the dependence threshold used for classifying nodes in the evolution tree as dependent sub-nodes or as parallel (same-level) nodes.Property Value
- the dependence threshold.
Remarks
- This property is only used for the evolution tree.
- During the calculation of an Evolution Tree, events have to be classified as events on the same level or as dependent events. If the (signed value) of the ChiSquare property is greater than this threshold, the new event is regarded as dependent on the previous event and will go into the next generation of the evolution tree.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public ReadOnly Property EventCountPerCase() As Integer()
Returns an array showing the number of events for each case.Property Value
- an array showing the number of events for each case.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based and corresponds to the CaseNames property.
Public ReadOnly Property EventNames() As String()
Returns an array with the event names.Property Value
- an array with the event names.
Remarks
- The value of this property is set with the constructor of the Miner object.
- The index is 1-based.
Public ReadOnly Property FourFoldCorrelation() As Single(,)
Returns a matrix showing the fourfold table correlation values for pairs of events. This value is also referred to as "Phi" in the statistics literature.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated as: (a * d - b * c) / SQRT((a + b) * (c + d) * (b + d) * (a + c)). This coefficient matrix is symmetric, i.e. FF(event1;event2) = FF(event2;event1)
Property Value
- a matrix showing the fourfold table correlation values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- Sometimes, this correlation coefficient is also called Pearson's correlation coefficent r; see also "Pearson's correlation" at "HyperStat".
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
Public Property FrequencyDecrease() As Double
Gets or sets the minimum frequency for an event in a later generation in the evolution tree. (proportion towards the parent number of occurences)Property Value
- the minimum frequency for an event in a later generation in the evolution tree.
Remarks
- The minimum frequency for the occurence of an event in a later generation of the evolution tree is set independent from the minimum frequency of the start generation (=MinimumFirstEventFrequency).
- It is measured in relation its individual parent node.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public Property MaximumDepth() As Integer
Sets or gets the maximum number of generations displayed in the evolution tree.Property Value
- the maximum number of generations displayed in the evolution tree.
Remarks
- The calculation of the evolution tree may stop earlier due to other parameters, e.g. MinimumEventCount.
- Increasing values of MaximumDepth may cause an exponential increase in calculation time.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public ReadOnly Property MaximumEventCount() As Integer
Returns the maximum number of events encountered in a case.Property Value
- the maximum number of events encountered in a case.
Remarks
- Before this property can be retrieved, the respective calculations (calculateComplexityDistribution or Mine or calculateEvolutionTree) have to be performed. Since this is a calculated value, this property is read-only.
Public Property MinimumEventCount() As Integer
Gets or sets the minimum number of occurences of an event must have in order to be included in data mining.Property Value
- the minimum number of occurences for an event to be included in data mining.
Remarks
- Early during data mining, events which occured at low frequency are sorted out. More information on this process can be obtained from the "General" chapter of the "Internal Calculations" section.
- If MinimumEventCount is set to a value greater than the product of MinimumFirstEventFrequency times number of cases, MinimumFirstEventFrequency is adjusted.
Public Property MinimumFirstEventFrequency() As Double
Gets or sets the minimum frequency for an event at the base of an evolution tree.Property Value
- the minimum frequency for an event at the base of an evolution tree.
Remarks
- This value is relevant for the nodes at the base of the evolution tree only. For subsequent nodes, FrequencyDecrease is used instead.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public ReadOnly Property NumberOfCases() As Integer
Returns the number of cases which still exist in the Miner.Property Value
- the number of cases which still exist in the Miner.
Remarks
- Early during data mining, events which occured at low frequency are sorted out. Then, also cases which do not have events are sorted out thus reducing the number of cases relative to the original number of cases. More information on this process can be obtained from the "General" chapter of the "Internal Calculations" section.
Public ReadOnly Property NumberOfCasesOriginal() As Integer
Returns the number of cases originally found in the Miner.Property Value
- the number of cases originally found in the Miner.
Remarks
- Early during data mining, events which occured at low frequency are sorted out. Then, also cases which do not have events are sorted out thus reducing the actual number of cases relative to the original number of cases. More information on this process can be obtained from the "General" chapter of the "Internal Calculations" section.
Public ReadOnly Property NumberOfEvents() As Integer
Returns the number of events which still exist in the Miner.Property Value
- the number of events which still exist in the Miner.
Remarks
- Early during data mining, events which occured at low frequency are sorted out thus reducing the actual number of events relative to the original number of events. More information on this process can be obtained from the "General" chapter of the "Internal Calculations" section.
Public ReadOnly Property OddsRatio() As Single(,)
Returns a matrix with the odds ratio for pairs of events. The odds ratio compares the occurence of both events or none of the events on one side with the occurence of one event only (without the other event) on the other side.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated as: (d / b) / (c / a) . The Odds Ratio matrix is symmetric, i.e. OR(event1;event2) = OR(event2;event1).
Property Value
- a matrix with the odds ratio for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
- See also: "Definition: Odds ratio"
Public ReadOnly Property Pearson() As Single(,)
Returns a matrix with the Pearson coefficients for pairs of events.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated as: square_root(ChiSquare / (ChiSquare + number_of_cases)),
with
ChiSquare = (a * d - b * c) * (a * d - b *
c) * (a + b +c +d) / (a + b) / (c + d) / (b + d) / (a + c). The Pearson
coefficient matrix is symmetric, i.e. Pearson(event1;event2) = Pearson(event2;event1).
Property Value
- a matrix with the Pearson coefficients for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
- Note that different definitions of the coefficient mean different ways of calculating them. Often, the term Pearson coefficient is used to describe the "fourfold correlation coefficient".
- This definition of the Pearson's coefficients of contingency was taken from the PhD thesis "Buchan: The Development of a Statistical Computer Software Resource for Medical Research" (p. 47; no more available in the internet as a pdf-file, but you can browse through its chapters at the help pages of "StatsDirect".
Public ReadOnly Property RelativeRisk() As Single(,)
Returns a matrix with the Relative Risk values for pairs of events. The Relative Risk compares the occurence of event2 under the pre-condition of event1 to its occurence under the pre-condition of NOT-event1.From the fourfold table for each pair of events (event1;event2)
event1 - | event1 + | |
event2 - | a | b |
event2 + | c | d |
it is calculated as: (d / (b + d)) / (c / (a + c)). The Relative Risk matrix is typically non-symmetric, i.e. RR(event1;event2) <> RR(event2;event1).
Property Value
- a matrix with the Relative Risk values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- The objectives of Attibutable Risk and Relative Risk are similar, but epidemiologists prefer the Attibutable Risk.
- For more information on the calculation process, see "Significant combinations" in the "Internal Calculations" section.
Public ReadOnly Property Support() As Int16(,)
Returns a matrix with the Support values for pairs of events. For each pair (event1; event2) the number of cases containing such a combination is counted. The Support matrix is symmetric, i.e. Support(event1;event2) = Support(event2;event1).Property Value
- a matrix with the Support values for pairs of events.
Remarks
- Before this property can be retrieved, the respective calculations (e.g. Mine) have to be performed. Since this is a calculated value, this property is read-only.
- The index is 1-based for each dimension and corresponds to the EventNames property.
- For more information on the calculation process, see the "General" chapter in the "Internal Calculations" section.
Methods
Public Sub calculateEvolutionTree()
Initiates the calculation of an evolution tree.Return Value
- None.
Parameters
- None.
Remarks
- The evolution tree can be retrieved as an array of treenode with the getNodes function.
- Parameters relevant for the evolution tree are BinningThreshold, DependenceThreshold, FrequencyDecrease, MaximumDepth, MinimumEventCount, and MinimumFirstEventFrequency. They are set via their respective properties.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public Sub calculateComplexityDistribution()
Calculates the distribution of events relative to the total number of events per case.Return Value
- None.
Parameters
- None.
Remarks
- The calculated data can be retrieved by the getComplexityDistribution(), getComplexityDistributionExpected(), getComplexityDistributionAsMap(...), and getComplexityDistributionDifferenceToExpectedAsMap(...) functions.
- The result cannot be influenced by setting mining parameters.
- For more information on the calculation of the Complexity Distribution, see "Complexity Distribution" in the "Internal Calculations" section.
Public Function getComplexityDistribution() As Int16(,)
Returns a matrix showing how often a given event occured in cases with a given number of distinct events.Return Value
- a matrix showing how often a given event occured in cases with a given number of distinct events.
Parameter
- None.
Remarks
- The first dimension stands for the event. Its index is identical with EventNames.
- The second dimension stands for the number of events per case.
- The total number of cases with a given number of events nEv is stored at the index (0; nEv).
- Before this function can be called, the respective calculations (calculateComplexityDistribution) have to be done first.
- For more information on the calculation of the Complexity Distribution, see "Complexity Distribution" in the "Internal Calculations" section.
Public Function getComplexityDistributionExpected() As Single(,)
Returns a matrix showing how often a given event was expected to occur in cases with a given number of distinct events estimated from the total number of case with the given event and the total number of cases with the given number of distinct events.Return Value
- a matrix showing how often a given event was expected to occur in cases with a given number of distinct events.
Parameter
- None.
Remarks
- The first dimension stands for the event. Its index is identical with EventNames.
- The second dimension stands for the number of events per case.
- The base probability for a given events nEv is stored at the index (nEv;0).
- Before this function can be called, the respective calculations (calculateComplexityDistribution) have to be done first.
- For more information on the calculation of the Complexity Distribution, see "Complexity Distribution" in the "Internal Calculations" section.
Public Function getComplexityDistributionAsMap(ByVal EventIndex As Integer, Optional ByVal EventMaximum As Integer = 30, Optional ByVal Width As Integer = 450, Optional ByVal Height As Integer = 150) As ImageMapData
Returns an ImageMap showing the Complexity Distribution.Return Value
- an ImageMap showing the Complexity Distribution.
Parameter
- EventIndex: The index of the event. It is the same with the index of the EventNames property.
- EventMaximum: The maximum number of events per case to be shown.
- Width: The width of the image map in pixel. By default 450.
- Height: The height of the image map in pixel. By default 150.
Remarks
- The x-axis shows the number of events per case, the y-axis the number of cases with the respective number of events.
- Cases with the selected event are shown in red color, all cases in blue color.
- Before this function can be called, the respective calculations (calculateComplexityDistribution) have to be done first.
- For more information on the calculation of the Complexity Distribution, see "Complexity Distribution" in the "Internal Calculations" section.
Public Function getComplexityDistributionDifferenceToExpectedAsMap(ByVal EventIndex As Integer, Optional ByVal EventMaximum As Integer = 30, Optional ByVal Width As Integer = 450, Optional ByVal Height As Integer = 150) As ImageMapData
Returns an ImageMap showing the difference in occurrence between expected and observed numbers of the Complexity Distribution.Return Value
- an ImageMap showing the difference in occurrence between expected and observed numbers of the Complexity Distribution.
Parameter
- EventIndex: The index of the event. It is the same with the index of the EventNames property.
- EventMaximum: The maximum number of events per case to be shown.
- Width: The width of the image map in pixel. By default 450.
- Height: The height of the image map in pixel. By default 150.
Remarks
- The x-axis shows the number of events per case, the y-axis the difference in the number of cases between the observed and the expected occurence.
- A negative difference is shown in red color, a positive difference in blue color.
- Before this function can be called, the respective calculations (calculateComplexityDistribution) have to be done first.
- For more information on the calculation of the Complexity Distribution, see "Complexity Distribution" in the "Internal Calculations" section.
Public Sub getEventsOrderedByFrequency(ByRef Numbers As Integer(), ByRef Names As String(), Optional ByVal MinimumEventCount As Integer = 0, Optional ByVal TopN As Integer = 0)
Sorts the events by descending order of frequency. The event indices are returned in the Numbers array, their names in the Names array.Return Value
- None. But relevant arrays are passed to the function by reference and changed during the analysis.
Parameter
- Numbers: an array containing the indices of the selected and ordered events.
- Names: an array containing the names of the selected and ordered events.
- MinimumEventCount: indicates the minimum number of cases an event has to be found in. By default 0.
- TopN: indicates how many events have to be selected; 0 means that all events (fulfilling MinimumEventCount) have to be selected.
Remarks
- The Numbers and Names arrays will be filled with the indices and names of the events in descending order.
- Their indices are 1-based.
- The search for the next most frequent event stops when TopN events have been found or when no event occuring at least MinimumEventCount times was encountered.
Public Function getNodes() As TreeNode()
Returns the nodes of the evolution tree.Return Value
- the nodes of the evolution tree.
Parameter
- None.
Remarks
- Before the nodes can be retrieved, the evolution tree must be calculated with the calculateEvolutionTree function.
- The TreeNodes array returned from this function can be added to the TreeNodeCollection of a TreeView element with the addRange function.
- Parameters relevant for the evolution tree are BinningThreshold, DependenceThreshold, FrequencyDecrease, MaximumDepth, MinimumEventCount, and MinimumFirstEventFrequency. They are set via their respective properties.
- For more information on the calculation of the evolution tree, see "Evolution Tree" in the "Internal Calculations" section.
Public Sub Mine()
Starts the data mining process.Return Value
- None.
Parameter
- None.
Remarks
- This function performs the first (general) steps of data mining: counting events per case, cases per event, removing events which occured to rarely, removing cases which do not have events, calculating support and confidence, binning, and calculating data on pairs of events.
- Parameters relevant for this general part of data mining are BinningThreshold, and MinimumEventCount. They are set via their respective properties.
- For more information on the Mining process, see the "General" and "Significant combinations" chapters in the "Internal Calculations" section.
Interaction with other classes
Classes using MiningInputData
The MiningInputData class is the primary entrance point into the Mining dll. It is passed as a parameter into the constructor of the Miner object.
A MiningInputData object should be the output of some appropriate functions of the application into which the Mining class is embedded.
Classes used by MiningInputData
The MiningInputData class uses quite basic data types only to ensure a wide usability.
Interal Calculations
General
Data mining relies on flat data structures. The principal data matrix is two-dimensional: one dimension lists all cases (case identifiers), the other dimension lists all possible events (event identifiers). For each point (case;event) in the matrix, a boolean value indicates if the given event was encountered in given case. This structure has to be supplied from outside to the mining application.
The first steps performed in the Miner objects are counting: the Miner determines how many events can be found per case (private function countEventsPerCase), and how many cases an event can be found in (private function countCasesPerEvent).
These general calculations are already enough for the distribution of events towards events per case.
Events which did not occur are deemed useless and thus removed. Also
events which happened only very rarely may be deemed useless and removed.
The private function used is removeZeroEvents; it uses the MinimumEventCount
property to determine if an event is to be removed or not; set the property's
value to 1 in order to prevent the removal of events.
After the removal of rare events, cases without events are removed
(private function removeZeroCases which is invoked by removeZeroEvents).
Now, combinations of events are counted: for each pair (event1; event2) the number of cases containing such a combination is determined. This is called the Support. This matrix is symmetrical. It is calculated with the private function calculateSupport.
Next, the Support for a pair of events (event1; event2) is set into relation to the number of cases containing event2. This is called the Confidence. It can also be interpreted as the probality for the occurence of event1 with the pre-condition of the occurence of event2. The matrix is typically non-symmetrical It is calculated with the private function calculateConfidence.
Binning is the next step. "Binning" means that two or more events are
merged into one pseudo-event representing the set union of those events.
Binning is only useful for events which almost always occur together. The
BinningThreshold
property sets the minimum value of confidence which must be fulfilled in
both directions for a pair of events (event1; event2).
Binning is performed with the private function removeReciprocallyDependent.
It replaces events for the binned pseudo-events and calculates their presence
in the cases (set union).
If binning was successful, events per case, cases per event, support
and confidence are calculated again.
Significant combinations
Now we can ask if a combination of two events is somehow special: if it occurs more often than one would expect or less often, and if we can rely on this difference.
As an example, let us assume that we have 100 cases. Event1 occured 60 times, event2 45 times, and the combination of event1 and event2 was found 32 times. At the moment, we do not look at event3, event4 and so on. This leads to the following four field matrix (also called "fourfold table" or "2 x 2 contingency table"):
event1 | NOT event1 | sum | |
event2 | 32 | 45 | |
NOT event2 | |||
sum | 60 | 100 |
The remaining fields can easily be filled:
event1 | NOT event1 | sum | |
event2 | 32 | 13 | 45 |
NOT event2 | 28 | 27 | 55 |
sum | 60 | 40 | 100 |
The probalities for each event or combination of events is obtained by dividing the numbers in the above table by the number of all events.
Two events are called independent from each other, if the probability of the combination is the product of the probalities of the events, i.e. p(e1;e2)=p(e1)*p(e2).
In our example, the probality for the combination is 0.32, while the product is 0.60*0.45=0.27. Hence, the combination of events occured more often than expected. The table with all expected values calculated from the marginal distribution looks as follows:
event1 | NOT event1 | sum | |
event2 | 27 | 18 | 45 |
NOT event2 | 33 | 22 | 55 |
sum | 60 | 40 | 100 |
The difference between the observed value and the expected value for
the combination of e1 and e2 is squared:
(32 - 27) ^ 2 = 5 ^ 2 = 25.
This squared value is divided by each expected value and the quotients
are summed up:
25 / 27 + 25 / 18 + 25 / 33 + 25 / 22 = 4.21.
This last value is called the "chi-square-value". Chi-square-values higher than 3.84 are called significant at the 5% level.
The function calculates a few more values.
The "phi" value and the "relatve risk". The phi value stands for the
correlation (also called contingency) between the two events, without explaining
its reliability, but showing a direction ("negative" vs. "positive" correlation).
The Pearson coefficient is calculated as the square root of chi-square divided by the sum of chi-square and the total number of cases.
The "relative risk" compares the occurence of event2 under the pre-condition of event1 to its occurence under the pre-condition of NOT-event1; so does "attributable risk" using a different formula.
The "odds ratio" compares the occurence of both events or none of the events on one side with the occurence of one event only (without the other event) on the other side.
In the above example, the phi-value is calculated from the second table
as (32 * 27 - 28 * 13) / (60 * 40 * 45 * 55) = 0.0000842 which is a very
low positive correlation.
The Pearson coefficient is sqr(4.21 / (4.21 + 100)) = 0.2.
The Relative Risk is rr(ev1,ev2) = (32 / 60) / (13 / 40) = 1.64, and
rr(ev2,ev1) = (32 / 45) / (28 / 55) = 1.40, which means that event2 is
more likely when event1 has been encountered, and that event1 is more likely
when event2 has been encountered.
The Odds Ratio is (32 * 27) / (28 * 13) = 2.37.
The data are available with the properties ChiSquare, FourFoldCorrelation, OddsRatio, Pearson, and RelativeRisk.
Note that the chi-square-test is not an exact test. It requires all expected values to be at least 5 and the total number of cases to be at least 20. For smaller values, the Yates correction or the Fisher's Exact Test could be used. None of them is implemented in the Miner; instead the chi-square-test is used for all numbers. This must be thought of when interpreting the results of data mining.
Furthermore, the chi-square-test is a two-sided test. That means, it checks for equality / inequality. But here, a one-sided view is taken: the combination of the two events occurs more often or less often than expected. Hence, the chi-square-test is not fully appropriate for this interpretation. A one-sided test is also not applicable because generally the direction of the deviation from the expected value cannot be substantiated by theoretical approaches.
For more background information on statistics and fourfold tables, see "The Development of a Statistical Computer Software Resource for Medical Research" by Iain Edward Buchan.
Complexity Distribution
The complexity distribution looks for the occurence of a given event relative to the number of all events per case. For example, how often article X was found in shopping baskets with a total of 1, 2, 3, ... (distinct) items, or how often a given chromosomal rearrangement was found in karyotypes having a total of 1, 2, 3, ... (distinct) rearrangements.
The first step is very easy: the complexity distribution matrix can simply be filled by counting how often an event was found for each total number of events per case. Also, the total number of cases for each total number of events per case is counted in the same part.
Very complicated however is the determination of the expected values for the distribution. Independent from the total number of events found in a case, a given event Ev can be encountered there only once or not at all, and never twice or more often. A case cannot contain more events than there are distinct events available.
Using an iterative algorithm and some approximation, a base probability for each event is calculated and with that probability the expected values at each number of all events per case determined.
Evolution Tree
The most complicated feature of the Miner is the evolution tree. It is based on the assumption that the events can be brought into a specific order which represents the order in which they occured.
When analysing articles in a shopping basket, the hypothesis of such a series may seem daring. For chromosomal aberrations found in a cancer patient, such a hypothesis is adequate. With the last origine of data in mind, this feature was developed.
For example, in a leucemia patient, the cancer starts with the Philadelphia translocation, later the gain of the der(22) follows, then gain of chromosome 8, loss of chromosome 7, etc.
In retinoblastoma however, the first step often cannot be detected by banding technics: the elimination of a gene on chromosome 13 may occur by a point mutation, by a very small deletion, or by some deletions spanning this gene, the latter giving raise to quite distinct descriptions of the event for each patient. When the isochromosome 6q is gained, the typical path which can be detected cytogenetically, starts off.
Hence, the evolution tree must cope with data where the primary event need not be accessible, but where pathways starting from a later point can be detected.
The calculation of the evolution tree starts with the general calculations and the calculations of significant combinations as described above. Next, the most frequent event is determined; its frequency most be above the MinimalFirstEventFrequency property. Thereafter, alternative start points are calculated with the calculateEqualNodes function, and the next generation of events below the most frequent event are calculated with the calculateDependentNodes function.
The calculateEqualNodes functions searches through the CaseCountPerEvent array for events which occured with a frequency higher than MinimalFirstEventFrequency. Such events are then checked if they are dependent on the most frequent event determined above; if the signed value of ChiSquare(this event; most frequent event) is greater than DependenceThreshold, this event is regarded positively dependent and thus discarded. Otherwise, this event is a new node on the same level as the most frequent event; for this event, dependent events are calculated with the calculateDependentNodes function.
The calculateDependentNodes function first checks if the maximum depth of the Evolution Tree has been reached. If not, it establishes a new Miner object for that set of cases which contains the event of its parent node, but with that event not being included in the data. Note that this may lead to a number of "empty" cases, i.e. those which contained the event of the parent node only; those cases will be discarded in a later step. Its MinimalFirstEventFrequency is set to the FrequencyDecrease value, other relevant properties are simply copied. This new Miner object in turn starts the calculation of an Evolution Tree of its own which is to be added to its parent node.