TIDIER (Term IDentifier RecognizER) is an approach that splits program identifiers using a string-edit distance computed with the Dynamic Time Warping (DTW) algorithm, combined with a greedy search.
It uses contextual information in the form of specialized dictionaries (e.g., acronyms, contractions, and domain-specific terms) and mimics the process of
transforming words through a set of contraction rules.
TIDIER relies on a set of input dictionaries and a distance function to split identifiers and associate the resulting terms with
words in the dictionaries. The terms can be plain words (e.g., image) or truncated or abbreviated forms, as in objectPtr, cntr, or drawrect.
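As a rough illustration of associating a contracted term with a dictionary word, the sketch below uses the plain Levenshtein edit distance as a stand-in. It is not TIDIER's DTW-based distance, and the function names edit_distance and closest_word are hypothetical, chosen only for this example.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def closest_word(term: str, dictionary: Iterable[str]) -> str:
    """Return the dictionary word with the smallest edit distance to term."""
    return min(dictionary, key=lambda word: edit_distance(term, word))


if __name__ == "__main__":
    words = ["buffer", "boolean", "word"]
    print(closest_word("buff", words))  # buffer
    print(closest_word("wrd", words))   # word
```

With a small dictionary, ties between candidates are common (for instance, several words can be equally far from a short abbreviation), which is one reason contextual dictionaries matter in practice.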
This simplified online release makes TIDIER available for research purposes.
TIDIER relies on a set of input dictionaries to perform the splitting. The following predefined dictionaries are available:
- Acronyms dictionary: it contains 105 common acronyms (e.g., ansi, dom, inode, ssl, url);
- Abbreviations dictionary: it includes 164 common abbreviations (e.g., bool for boolean, buff for buffer, wrd for word);
- WordNet: a complete English dictionary extracted from the WordNet upper-ontology database and from the GNU Ispell spell checker. This dictionary includes 168,000 words (only words longer than two characters were kept);
- Known functions dictionary: it contains 492 functions (e.g., malloc, printf, waitpid, access).
Users can also upload a custom dictionary. This dictionary must be a text file that contains a single term per line. It is preferable to sort the dictionary from longer to shorter words, so that splits with longer words are favored.
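A custom dictionary in this format could be prepared with a few lines of Python. The sketch below is only an illustration: the helper name load_dictionary is hypothetical, and skipping blank lines and removing duplicates are assumptions rather than documented requirements of the tool.

```python
import io
from typing import Iterable, List


def load_dictionary(lines: Iterable[str]) -> List[str]:
    """Read one term per line and return the terms sorted longest first."""
    # Strip surrounding whitespace, skip blank lines, and drop duplicates.
    terms = {line.strip() for line in lines if line.strip()}
    # Longer terms come first; ties are broken alphabetically for stability.
    return sorted(terms, key=lambda term: (-len(term), term))


if __name__ == "__main__":
    # An in-memory file stands in for an uploaded text file.
    sample = io.StringIO("ptr\nimage\n\npointer\nimage\n")
    for term in load_dictionary(sample):
        print(term)
```

The same function accepts an open file object, for example `load_dictionary(open("my_terms.txt", encoding="utf-8"))`, where the file name is a placeholder.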
Word to Split
The identifier to split must be a single string (e.g., imageptr).
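To make the idea of splitting a single string concrete, here is a minimal greedy, longest-match-first splitter. It is a simplification under stated assumptions: it only accepts exact dictionary matches and does no backtracking, whereas TIDIER also tolerates contracted terms through its distance function. The name greedy_split is hypothetical.

```python
from typing import List, Optional


def greedy_split(identifier: str, dictionary: List[str]) -> Optional[List[str]]:
    """Split identifier into dictionary terms, trying longer terms first.

    Returns None when some part of the identifier matches no term.
    """
    # Longest terms first, so that longer matches are preferred.
    words = sorted(set(dictionary), key=len, reverse=True)
    ident = identifier.lower()
    parts: List[str] = []
    pos = 0
    while pos < len(ident):
        for word in words:
            # Skip empty entries to guarantee progress.
            if word and ident.startswith(word, pos):
                parts.append(word)
                pos += len(word)
                break
        else:
            return None  # no term matches at this position
    return parts


if __name__ == "__main__":
    print(greedy_split("imageptr", ["ptr", "image", "pointer"]))  # ['image', 'ptr']
```

Because the search never backtracks, an early long match can block a split that a more exhaustive search would find.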
More information and details can be found in the following papers:
Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc.
TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques.
Journal of Software Maintenance - Research and Practice,
pp. 31, 30 Jun 2011, DOI: 10.1002/smr.539.
This paper is also available to download from here:
camera ready of TIDIER paper
Nioosha Madani, Latifa Guerrouj, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, and Giuliano Antoniol.
Recognizing Words from Source Code Identifiers using Speech Recognition Techniques.
March 15-18, 2010 Universidad Rey Juan Carlos Madrid, Spain,
pages 69-78 - Best paper award, IEEE Computer Society Press.