There is a useful discussion of NER training here: http://stackoverflow.com/questions/32011615/how-to-create-a-good-ner-training-model-in-opennlp. Unfortunately, the NER extractor cannot simply be handed a lexicon of proper names; it has to be trained on annotated data. Such data takes time to create, but the markup process itself is straightforward.
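To give a sense of what the markup looks like, here is a minimal sketch, assuming the extractor is OpenNLP's name finder as in the linked discussion. Its training files hold one whitespace-tokenized sentence per line, with each entity wrapped in `<START:type> ... <END>` tags. The helper name `annotate` and the `(start, end, label)` span layout are hypothetical, chosen only for illustration, and are not part of OpenNLP.

```python
def annotate(tokens, spans):
    """Return one line of OpenNLP-style name finder training data.

    tokens: the tokens of a single sentence, in order.
    spans:  non-overlapping (start, end, label) triples, with `end` exclusive.
            This layout is an assumption made for this sketch.
    """
    starts = {start: label for start, _end, label in spans}
    ends = {end for _start, end, _label in spans}

    out = []
    for i, token in enumerate(tokens):
        if i in ends:            # close an entity that ended just before this token
            out.append("<END>")
        if i in starts:          # open an entity that begins at this token
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:      # close an entity that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


sentence = ["Pierre", "Vinken", ",", "61", "years", "old"]
print(annotate(sentence, [(0, 2, "person")]))
```

Each line produced this way can be written to a plain text file, one sentence per line, and used as training input. Most of the effort goes into deciding where the spans are, not into the markup itself.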