
SAMURAI Documentation


Samurai relies on two assumptions:
  • A substring that forms part of an identifier is also likely to be used elsewhere in the program or in other programs, either alone or as part of other identifiers.
  • Given two possible splits of a given identifier, the split that most likely represents the developer's intent partitions the identifier into terms occurring more often in the program. Thus, term frequency is used to determine the most-likely splitting of identifiers.

Samurai also exploits identifier context. It mines term frequency in the source code and builds two term-frequency tables: a program-specific and a global-frequency table. The first table is built by mining terms in the program under analysis. The second table is made by mining the set of terms in a large corpus of programs.
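As a rough illustration of how such a table could be built, the sketch below counts terms in a list of identifiers. The helper names extract_terms and build_frequency_table are hypothetical, and the rule used to extract terms (split on non-letters and on lowercase-to-uppercase boundaries) is an assumption made for this example, not Samurai's actual mining procedure.

```python
import re
from collections import Counter

def extract_terms(identifier):
    """Hypothetical term extraction: split on non-letters and on lower-to-upper case changes."""
    terms = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Put a space between a lowercase letter and the uppercase letter that follows it.
        terms.extend(re.sub(r"([a-z])([A-Z])", r"\1 \2", part).split())
    return [term.lower() for term in terms]

def build_frequency_table(identifiers):
    """Count how often each extracted term occurs across the given identifiers."""
    table = Counter()
    for identifier in identifiers:
        table.update(extract_terms(identifier))
    return table

# Program-specific table: mined from the identifiers of the program under analysis.
program_table = build_frequency_table(["helloWorld", "hello_world", "zorba", "zut"])
```

A global table could be produced with the same function by passing it the identifiers of a large corpus of programs instead of a single program.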

Samurai ranks alternative splits of a source code identifier using a scoring function, which is at the heart of the approach. For any term, this function returns a score based on the term's frequency in the program-specific table and in the global table.
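This page does not spell out the scoring function itself, so the sketch below uses a stand-in to show the general shape of the idea: a term's program-specific frequency plus its global frequency, damped by the size of the program table. The function names, the damping choice, and the rule of rating a split by its lowest-scoring term are all assumptions for illustration only.

```python
import math

def score(term, program_table, global_table):
    """Illustrative score: program frequency plus damped global frequency (assumed formula)."""
    program_total = sum(program_table.values())
    # Keep the divisor at least 1 so that small program tables do not inflate the global part.
    damping = max(1.0, math.log10(max(program_total, 1)))
    return program_table.get(term, 0) + global_table.get(term, 0) / damping

def rank_splits(splits, program_table, global_table):
    """Order candidate splits best-first; a split is rated by its lowest-scoring term (assumption)."""
    def split_score(split):
        return min(score(term, program_table, global_table) for term in split)
    return sorted(splits, key=split_score, reverse=True)

# Toy tables; the global counts are made up for the example.
program_table = {"hello": 2, "world": 2, "zorba": 1, "zut": 1}
global_table = {"hello": 50, "world": 40, "hell": 5, "o": 1}
candidates = [["hell", "oworld"], ["hello", "world"], ["helloworld"]]
best = rank_splits(candidates, program_table, global_table)[0]
```

With these toy tables, the split into hello and world ranks first because both of its terms are frequent, whereas each of the other candidates contains a term that appears in neither table.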

Our Samurai input frequency table format is as follows:
  1. the first line contains an integer: the total number of words
  2. from the second line on, each line has the format: raw frequency, a space, then the string

Thus, the frequency table:

6
2 hello
2 world
1 zorba
1 zut

means that 6 words were processed, of which hello was found 2 times, ..., and zut 1 time.
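A minimal reader for this format might look like the sketch below. The function name parse_frequency_table is hypothetical, and the check that the frequencies add up to the first-line total is an assumption drawn from the example above, not a documented requirement.

```python
def parse_frequency_table(text):
    """Parse a table: the first line is the total word count, then 'frequency string' lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    total = int(lines[0])
    table = {}
    for line in lines[1:]:
        # Split on the first run of whitespace only: the frequency, then the string.
        frequency, term = line.split(None, 1)
        table[term] = int(frequency)
    # Assumed consistency check: the per-term counts should sum to the declared total.
    if sum(table.values()) != total:
        raise ValueError("frequencies do not add up to the declared total")
    return total, table

sample = """6
2 hello
2 world
1 zorba
1 zut
"""
total, table = parse_frequency_table(sample)
```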
Ecole polytechnique de Montreal
  Copyright © 2011. Soccer Lab