Samurai relies on two assumptions:
- A substring composing an identifier is also likely to be
used elsewhere in the program or in other programs, either alone
or as part of other identifiers.
- Given two possible splits of a given identifier, the split
that most likely represents the developer's intent
partitions the identifier into terms occurring more often in
the program. Thus, term frequency is used to determine the
most likely splitting of identifiers.
Samurai also exploits identifier context. It mines term frequency in
the source code and builds two term-frequency tables: a
program-specific table and a global table. The first table is
built by mining terms in the program under analysis. The second
table is made by mining the set of terms in a large corpus of programs.
Samurai ranks alternative splits of a source code identifier
using a scoring function based on the program-specific and
global frequency tables. This scoring function is at the heart
of Samurai. It returns a score for any term based on that term's
frequency in each of the two tables.
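The scoring idea described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the names `score` and `rank_splits` are hypothetical, the damping of the global frequency by the logarithm of the program's total term count is an assumption, and ranking splits by the mean score of their terms is a simplification chosen only for this example.

```python
import math


def score(term, program_freq, global_freq):
    """Score one term from the program-specific and global tables.

    Assumption: the global count is damped by log10 of the total number
    of term occurrences in the program, so it matters less as the
    program-specific evidence grows.
    """
    total = sum(program_freq.values())
    # max(..., 10) keeps the divisor at 1 or more for tiny programs.
    damping = math.log10(max(total, 10))
    return program_freq.get(term, 0) + global_freq.get(term, 0) / damping


def rank_splits(candidates, program_freq, global_freq):
    """Order candidate splits from best to worst.

    Assumption: a split is rated by the mean score of its terms.
    """
    def mean_score(split):
        return sum(score(t, program_freq, global_freq) for t in split) / len(split)

    return sorted(candidates, key=mean_score, reverse=True)


# Hypothetical frequency tables, for illustration only.
program_freq = {"get": 60, "name": 40}
global_freq = {"get": 10, "name": 4, "getn": 1}

candidates = [["getname"], ["get", "name"], ["getn", "ame"]]
best = rank_splits(candidates, program_freq, global_freq)[0]
print(best)
```

With these made-up tables, the split whose terms occur most often in the program is ranked first, which is the behaviour the second assumption calls for.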
Our Samurai input frequency table format is as follows:
- the first line contains an integer: the total number of words
- from the second line on, each line has the format: raw frequency, a space, then the string
Thus, the frequency table:
means that 6 words were processed, of which hello was found 2 times, ..., and zut 1 time.
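A table in this format could be read with a few lines of Python. This is a sketch under stated assumptions: the function name `parse_frequency_table` is hypothetical, frequencies are assumed to be integers, and the sample input is made up for this example (it is not the table referred to above).

```python
def parse_frequency_table(text):
    """Parse a frequency table: total word count, then 'frequency string' lines."""
    # Ignore blank lines so a trailing newline does not break parsing.
    lines = [line for line in text.splitlines() if line.strip()]
    total = int(lines[0])
    freqs = {}
    for line in lines[1:]:
        # Split on the first run of whitespace only: frequency, then the string.
        count, term = line.split(maxsplit=1)
        freqs[term.strip()] = int(count)
    return total, freqs


# Hypothetical sample: 3 words processed, hello seen twice, zut once.
sample = "3\n2 hello\n1 zut\n"
total, freqs = parse_frequency_table(sample)
print(total, freqs)
```

Splitting only once per line keeps the parser tolerant of strings that themselves contain spaces, should the table ever hold such entries.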