
TIDIER Documentation



TIDIER (Term IDentifier RecognizER) is an approach that splits program identifiers using a string-edit distance computed via the Dynamic Time Warping (DTW) algorithm, combined with a greedy search. It uses contextual information in the form of specialized dictionaries (e.g., acronyms, contractions, and domain-specific terms) and mimics the process of transforming words via a set of contraction rules. TIDIER relies on a set of input dictionaries and a distance function to split identifiers and associate the resulting terms with words in the dictionaries. The terms can be simple (e.g., image) or truncated/abbreviated (e.g., objectPtr, cntr, or drawrect). This simplified online release makes TIDIER available for research purposes.
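To make the idea of distance-driven splitting concrete, here is a minimal Python sketch. It is not TIDIER's implementation: a plain Levenshtein distance stands in for the DTW-based distance, no contraction rules are applied, and the function names (edit_distance, greedy_split), the length normalization, and the five-word dictionary are all assumptions made for the illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def greedy_split(identifier, dictionary):
    """Greedily cut the identifier into chunks, left to right.

    At each position, every remaining prefix is compared with every
    dictionary word; the (chunk, word) pair with the smallest
    length-normalized distance is kept and the search continues after
    that chunk. The normalization is an assumption of this sketch.
    """
    if not dictionary:
        raise ValueError("dictionary must not be empty")
    ident = identifier.lower()
    result = []
    pos = 0
    while pos < len(ident):
        best = None  # (score, chunk, word)
        for end in range(pos + 1, len(ident) + 1):
            chunk = ident[pos:end]
            for word in dictionary:
                score = edit_distance(chunk, word) / max(len(chunk), len(word))
                if best is None or score < best[0]:
                    best = (score, chunk, word)
        result.append((best[1], best[2]))
        pos += len(best[1])
    return result


# Hypothetical toy dictionary, far smaller than the real ones.
WORDS = ["image", "pointer", "counter", "draw", "rectangle"]

print(greedy_split("imageptr", WORDS))  # [('image', 'image'), ('ptr', 'pointer')]
print(greedy_split("cntr", WORDS))      # [('cntr', 'counter')]
```

Each pair gives the chunk found in the identifier and the dictionary word it was associated with. With a realistic dictionary this naive search would be slow and ambiguous, which is one reason a sketch like this should not be read as a substitute for the approach described in the papers listed under Further References.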

Standard Dictionaries

TIDIER relies on a set of input dictionaries to perform the splitting. The following pre-defined dictionaries are available:
  • Acronyms dictionary: it contains 105 common acronyms (e.g., ansi, dom, inode, ssl, url);
  • Abbreviations dictionary: it includes 164 common abbreviations (e.g., bool for boolean, buff for buffer, wrd for word);
  • WordNet: a complete English dictionary extracted from the WordNet upper-ontology database and from the GNU i-spell spellchecker. This dictionary includes 168,000 words (only words longer than two characters were kept);
  • Known functions dictionary: it contains 492 functions (e.g., malloc, printf, waitpid, access).
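One way to picture how a term might be looked up across these dictionaries is the Python sketch below. The in-memory layout, the lookup order, the classify_term function, and the tiny ENGLISH_WORDS set are assumptions for illustration only; the entries are limited to the examples quoted in the list above, and the tool's actual data format and precedence are not documented here.

```python
# Hypothetical in-memory stand-ins for the pre-defined dictionaries.
ACRONYMS = {"ansi", "dom", "inode", "ssl", "url"}
ABBREVIATIONS = {"bool": "boolean", "buff": "buffer", "wrd": "word"}
KNOWN_FUNCTIONS = {"malloc", "printf", "waitpid", "access"}
ENGLISH_WORDS = {"image", "boolean", "buffer", "word"}  # placeholder subset


def classify_term(term):
    """Return (dictionary name, associated word) for a term.

    The lookup order below is an assumption of this sketch.
    """
    t = term.lower()
    if t in ABBREVIATIONS:
        return ("abbreviations", ABBREVIATIONS[t])
    if t in ACRONYMS:
        return ("acronyms", t)
    if t in KNOWN_FUNCTIONS:
        return ("known functions", t)
    if t in ENGLISH_WORDS:
        return ("english", t)
    return ("unknown", None)


print(classify_term("buff"))    # ('abbreviations', 'buffer')
print(classify_term("SSL"))     # ('acronyms', 'ssl')
print(classify_term("printf"))  # ('known functions', 'printf')
```

Note that a word such as access could plausibly appear in more than one dictionary, so any real lookup has to settle on a precedence; the one used here is arbitrary.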

Custom Dictionary

Users can upload their own custom dictionary. This dictionary must be a text file that contains a single term per line. It is preferable to sort the dictionary from longer to shorter words, so that splits with longer words are favored.
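A custom dictionary in this format can be prepared with a few lines of Python. The sketch below is only a convenience helper: the function names, the file name my_dictionary.txt, the UTF-8 encoding, the removal of duplicates, and the alphabetical tie-break are assumptions, not requirements stated by the tool.

```python
def sort_dictionary(terms):
    """Strip whitespace, drop blanks and duplicates, sort longest first.

    Terms of equal length are ordered alphabetically (an arbitrary choice).
    """
    unique = {t.strip() for t in terms if t.strip()}
    return sorted(unique, key=lambda t: (-len(t), t))


def write_dictionary(terms, path):
    """Write one term per line, longest terms first."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(sort_dictionary(terms)) + "\n")


terms = ["ptr", "image", "rectangle", "ptr", " draw "]
print(sort_dictionary(terms))  # ['rectangle', 'image', 'draw', 'ptr']

# Example (hypothetical file name):
# write_dictionary(terms, "my_dictionary.txt")
```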

Word to Split

The identifier to split must be a single string (e.g., imageptr).

Further References

More information and details can be found in the following papers:
  • Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques. Journal of Software Maintenance - Research and Practice, pp. 31, 30 Jun 2011. DOI: 10.1002/smr.539.
    This paper is also available to download from here: camera-ready version of the TIDIER paper
  • Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol. Recognizing Words from Source Code Identifiers using Speech Recognition Techniques. In CSMR, March 15-18, 2010, Universidad Rey Juan Carlos, Madrid, Spain, pages 69-78. Best paper award. IEEE Computer Society Press.
Ecole polytechnique de Montreal
Copyright © 2011. Soccer Lab