Introduction

This document describes how to run the KD keyphrase extraction tool and how to use its API in your own code. The tool combines statistical measures and linguistic information to produce a weighted list of n-grams representing the most important concepts of a text.

Example and Tutorial

Input files:

There are two possible input formats, depending on the language to process:
  1. RAW TEXT: available only for English, and only with the -us option (use the Stanford tools bundled with the KD runnable jar).
  2. CONLL format (i.e. tab separated): available for both English and Italian and for all tagsets. The format must include at least 3 columns: token, PoS, and lemma. The column positions can be specified through the column_configuration parameter; see the help for more information.
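
As an illustration of the three-column CONLL layout described above, here is a minimal sketch that reads tab-separated lines into token, PoS, and lemma fields. The class ConllSketch and its Row type are hypothetical helpers, not part of the KD library. The sample PoS tags are only placeholders, since the expected tags depend on the chosen tagset, and treating blank lines as sentence breaks is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the KD library: it only illustrates the
// three-column, tab-separated layout (token, PoS, lemma) described above.
final class ConllSketch {

    // One parsed row: token, part-of-speech tag, lemma.
    static final class Row {
        final String token;
        final String pos;
        final String lemma;

        Row(String token, String pos, String lemma) {
            this.token = token;
            this.pos = pos;
            this.lemma = lemma;
        }
    }

    // Parses tab-separated lines; blank lines are skipped (assumed sentence breaks).
    static List<Row> parse(String text) {
        List<Row> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] columns = line.split("\t");
            if (columns.length < 3) {
                throw new IllegalArgumentException("Expected at least 3 columns: " + line);
            }
            rows.add(new Row(columns[0], columns[1], columns[2]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Placeholder tags; the real tags depend on the tagset in use.
        String sample = "The\tDT\tthe\n" + "cats\tNNS\tcat\n" + "sleep\tVBP\tsleep\n";
        for (Row row : parse(sample)) {
            System.out.println(row.token + " | " + row.pos + " | " + row.lemma);
        }
    }
}
```

A file whose columns are in a different order would need the column_configuration parameter mentioned above; this sketch assumes the token, PoS, lemma order.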

How to Run: (applies only to the KD runnable jar)

  1. Open a command-line shell.
  2. Go to the KD folder (the folder containing KD.jar).
  3. Run: java -jar KD.jar -lang ENGLISH -p WEAK -us -v -n 50 -m 6 <Folder or File to be processed>

Hints:

Drag the folder containing the data onto the command-line shell to obtain the correct, "wrapped" path to the files.
Run the tool with the -STDOUT option to check the output directly in the console without writing a new file.

Description of the parameters used in the example:

  1. -lang ENGLISH sets the main language of the file.
  2. -p gives a boost to more specific key-concepts (i.e. multi-token expressions). Change the value of this option to obtain more or fewer multi-token expressions: NO | WEAK | MEDIUM | STRONG
  3. -us uses the Stanford tokenizer, lemmatizer, and PoS tagger (included in the tool); only for English.
  4. -n is the number of concepts/key-phrases to be extracted; in the example above it is set to 50.
  5. -m is the maximum length of the multi-token expressions to be extracted; in the example above it is set to 6.

For more information on the parameters, run:
java -jar KD.jar -h
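
The example command can also be assembled from Java, for instance to launch the runnable jar from another program. The sketch below is only an illustration. KdCommandSketch and buildCommand are hypothetical names, and only flags described in this document are used (-v is left out because it is not described here). Passing each argument as a separate list element avoids quoting problems with paths that contain spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of KD: it only assembles the argument list
// for the runnable jar using the flags shown in the example above.
final class KdCommandSketch {

    static List<String> buildCommand(String jarPath, String language, String preferSpecific,
                                     int numberOfConcepts, int maxLength, String input) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("-lang");
        command.add(language);          // e.g. ENGLISH
        command.add("-p");
        command.add(preferSpecific);    // NO | WEAK | MEDIUM | STRONG
        command.add("-us");             // Stanford tools, English only
        command.add("-n");
        command.add(String.valueOf(numberOfConcepts));
        command.add("-m");
        command.add(String.valueOf(maxLength));
        command.add(input);             // folder or file to be processed
        return command;
    }

    public static void main(String[] args) {
        List<String> command = buildCommand("KD.jar", "ENGLISH", "WEAK", 50, 6, "my documents");
        System.out.println(String.join(" ", command));
        // To actually launch the tool from the KD folder (assumption: java is on the PATH):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```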

Configuration and tuning:

Configuration files are in txt format and are located in the following folder:
{KD_Root_Folder}\languages\{Language}\configuration_files

Please do not change the folder hierarchy!
This folder contains all the files used by the tool to improve performance and to obtain better results.
The file names are self-explanatory, and the file format is easy to understand and to edit.
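
As a small illustration of the folder layout above, the following sketch builds the configuration folder path in a platform-independent way and lists the txt files it contains. ConfigPathSketch and its methods are hypothetical helpers, not part of KD. The root folder name "KD" and the language folder name "ENGLISH" are assumptions, so check the actual folder names in your installation.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of KD: it mirrors the documented layout
// {KD_Root_Folder}/languages/{Language}/configuration_files.
final class ConfigPathSketch {

    // Builds the configuration folder path without hard-coding a path separator.
    static Path configurationFolder(String kdRoot, String language) {
        return Paths.get(kdRoot, "languages", language, "configuration_files");
    }

    // Lists the .txt files found directly inside the given folder.
    static List<Path> listTextFiles(Path folder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : stream) {
                files.add(file);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        String kdRoot = args.length > 0 ? args[0] : "KD"; // assumed root folder name
        Path folder = configurationFolder(kdRoot, "ENGLISH"); // assumed language folder name
        System.out.println("Configuration folder: " + folder);
        if (Files.isDirectory(folder)) {
            for (Path file : listTextFiles(folder)) {
                System.out.println("  " + file.getFileName());
            }
        }
    }
}
```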

If you use the tool in your code, remember to call the KD_loader object so that the serialized data file is updated, e.g.:
KD_loader.run_the_updater(lang, configuration.languagePackPath);

How to use in your code:

Below is an example of code integration:


import java.util.LinkedList;
import eu.fbk.dh.kd.lib.KD_configuration;
import eu.fbk.dh.kd.lib.KD_core;
import eu.fbk.dh.kd.lib.KD_core.Language;
import eu.fbk.dh.kd.lib.KD_keyconcept;
import eu.fbk.dh.kd.lib.KD_loader;

public class Main {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: Main <languagePackPath> <fileToProcess>");
            return;
        }
        String languagePackPath = args[0]; // path to the language pack, taken from the command line
        String pathToFile = args[1]; // file to process, taken from the command line

        Language lang = Language.ITALIAN; // language of the input file
        KD_configuration configuration = new KD_configuration(); // creates a new KD_configuration instance

        // Configuration setup
        configuration.numberOfConcepts = 20;
        configuration.max_keyword_length = 4;
        configuration.local_frequency_threshold = 2;
        configuration.prefer_specific_concept = KD_configuration.Prefer_Specific_Concept.MEDIUM;
        configuration.skip_proper_noun = false;
        configuration.skip_keyword_with_proper_noun = false;
        configuration.rerank_by_position = false;
        configuration.group_by = KD_configuration.Group.NONE;
        configuration.column_configuration = KD_configuration.ColumExtraction.TOKEN_POS_LEMMA;
        configuration.only_multiword = false;
        configuration.tagset = KD_configuration.Tagset.TEXTPRO;

        // Override the default language pack path with the one taken from the command line
        configuration.languagePackPath = languagePackPath;

        // Update the serialized data if the configuration has changed
        KD_loader.run_the_updater(lang, configuration.languagePackPath);

        // Create an instance of the KD core
        KD_core kd_core = new KD_core(KD_core.Threads.TWO);

        LinkedList<KD_keyconcept> concept_list = kd_core.extractExpressions(lang, configuration, pathToFile, null);
        // Loop over the extracted key-phrases and print one tab-separated line per result
        for (KD_keyconcept k : concept_list) {
            System.out.println(k.getString() + "\t" + k.getSysnonyms() + "\t" + k.score + "\t" + k.frequency);
        }
    }
}
    

Support

This software is provided as is. For new versions and updates, please check the project web page: KD Key-Phrases Digger at DH FBK

License

Keyphrase Digger (KD) is released under Apache License 2.0.

For distributors of proprietary software, please contact Rachele Sprugnoli (sprugnoli@fbk.eu).

For attribution, please always cite the following paper:
Moretti, G., Sprugnoli, R., Tonelli, S. "Digging in the Dirt: Extracting Keyphrases from Texts with KD". In Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015), Trento, Italy.

KD lib uses:

  1. Google Guava 16 : released under Apache License 2.0
  2. Apache Commons IO 2.4 : released under Apache License 2.0
  3. Apache Commons Lang3 3.1 : released under Apache License 2.0
  4. Apache Commons CLI 1.2 : released under Apache License 2.0
  5. Apache Lucene 5.2.1 : released under Apache License 2.0
  6. MapDB 1.0.7 : released under Apache License 2.0

KD (runnable) includes:

  1. Stanford POS Tagger 3.4.1 : released under GNU General Public License (v2 or later)

If you want to see the source code of the main class of the KD runnable package (the only part that includes a GPL-licensed library), please click here