
The Bump Hunting Method Using the Genetic Algorithm with the Extreme-Value Statistics
Takahiro YUKIZANE Shinya OHI Eiji MIYANO Hideo HIROSE
Publication: IEICE TRANSACTIONS on Information and Systems
Vol.E89-D No.8 pp.2332-2339
Publication Date: 2006/08/01
Online ISSN: 1745-1361
Print ISSN: 0916-8532
DOI: 10.1093/ietisy/e89d.8.2332
Type of Manuscript: INVITED PAPER (Special Section on Invited Papers from New Horizons in Computing)
Keyword: data mining, data science, bump hunting, genetic algorithm, extreme-value statistics, trade-off curve, decision tree, bootstrap
Summary:
In difficult classification problems, where z-dimensional points with 0-1 responses must be divided into two groups but the data structure is messy, we try to find the denser regions of favorable customers (response 1) instead of the boundaries that separate the two groups. Such regions are called bumps, and finding the boundaries of the bumps is called bump hunting. The main objective of this paper is to find the largest bump region subject to a specified ratio of the number of response-1 points to the total. Varying the specified ratio then gives a trade-off curve between the number of response-1 points and the ratio. The decision tree method with the Gini index provides simple-shaped boundaries for the bumps if the marginal density for response 1 has a rather simple or monotonic shape. Because searching for the optimal tree is computationally expensive owing to the NP-hardness of the problem, random search methods such as a genetic algorithm adapted to the tree are useful. Because many local maxima exist, unlike in ordinary genetic algorithm search results, extreme-value statistics are useful for estimating the global optimum number of captured points; this also guarantees the accuracy of the semi-optimal solution with simple descriptive rules. This combination of genetic algorithm search and extreme-value statistics is new. We apply the method to an artificial messy data set that mimics a real customer database and obtain a successful result. The reliability of the solution is also discussed.
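
The objective described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a bump is an axis-aligned box (the kind of region that decision-tree rules describe), and it assumes "the total" means all points inside the box. The names `in_box`, `score_box`, and `best_feasible` are hypothetical, and the candidate boxes are supplied by hand rather than produced by a genetic algorithm.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]
Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def in_box(x: Point, box: Box) -> bool:
    """Return True if every coordinate of x lies inside its interval."""
    return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))


def score_box(points: Sequence[Point], responses: Sequence[int],
              box: Box, min_ratio: float) -> Tuple[int, float, bool]:
    """Score one candidate region.

    Returns (captured, ratio, feasible), where
      captured is the number of response-1 points inside the box,
      ratio    is captured divided by all points inside the box
               (assumed meaning of "the total"; 0.0 for an empty box),
      feasible says whether ratio reaches the specified minimum.
    """
    inside = [r for x, r in zip(points, responses) if in_box(x, box)]
    captured = sum(inside)
    ratio = captured / len(inside) if inside else 0.0
    return captured, ratio, ratio >= min_ratio


def best_feasible(points: Sequence[Point], responses: Sequence[int],
                  boxes: Sequence[Box],
                  min_ratio: float) -> Optional[Tuple[int, float, Box]]:
    """Among candidate boxes, keep the feasible one capturing the most 1s."""
    best = None
    for box in boxes:
        captured, ratio, ok = score_box(points, responses, box, min_ratio)
        if ok and (best is None or captured > best[0]):
            best = (captured, ratio, box)
    return best
```

Calling `best_feasible` for several values of `min_ratio` and recording the captured count each time traces a trade-off curve of the kind the summary mentions. A search method such as a genetic algorithm would replace the hand-supplied list of candidate boxes.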

