Friday, February 28, 2014

NER using Stanford NLP

Named Entity Recognition (NER) identifies meaningful pieces of information in text, such as the names of people, places, and organizations.
There are several ways to extract named entities from text; the example below uses the Stanford NLP library.

Make sure the required Stanford NLP jars are on the classpath before running the example. If you are using Maven, the following dependencies can be used.

<!-- Stanford NLP -->
              <dependency>
                     <groupId>edu.stanford.nlp</groupId>
                     <artifactId>stanford-corenlp</artifactId>
                     <version>3.2.0</version>
              </dependency>
              <dependency>
                     <groupId>edu.stanford.nlp</groupId>
                     <artifactId>stanford-corenlp</artifactId>
                     <version>3.2.0</version>
                     <classifier>models</classifier>
              </dependency>
              <dependency>
                     <groupId>com.io7m.xom</groupId>
                     <artifactId>xom</artifactId>
                     <version>1.2.10</version>
              </dependency>
              <dependency>
                     <groupId>joda-time</groupId>
                     <artifactId>joda-time</artifactId>
                     <version>2.1</version>
              </dependency>
              <dependency>
                     <groupId>de.jollyday</groupId>
                     <artifactId>jollyday</artifactId>
                     <version>0.4.7</version>
              </dependency>
              <dependency>
                     <groupId>com.googlecode.efficient-java-matrix-library</groupId>
                     <artifactId>ejml</artifactId>
                     <version>0.23</version>
              </dependency>

The high-level steps are:

1. Set the annotators (and hence the models) that the pipeline should load.
2. Create the StanfordCoreNLP object.
3. Wrap the text in an Annotation and run the pipeline on it.
4. Get the sentences from the annotation.
5. Get the tokens (words) from each sentence.
6. For each token, read its NER tag through the NLP API.

Code: 

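import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class NERClient {

    static String ARTICLE = "A day after resigning as Navy Chief in New Delhi, Admiral D.K. Joshi on Thursday wrote to his colleagues, saying he was “firm” on taking responsibility for the mishaps that have taken place. ";
    StanfordCoreNLP pipeline = null;

    public static void main(String[] args) {
        NERClient sc = new NERClient();
        sc.go();
    }

    private void go() {
        // Annotators run in the listed order. "ner" needs the four before it;
        // "parse" is not required for NER and can be dropped to speed things up.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        pipeline = new StanfordCoreNLP(props);

        // Wrap the text and run every annotator over it.
        Annotation annotation = new Annotation(ARTICLE);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);

        for (CoreMap coreMap : sentences) {
            // Tokens of the current sentence.
            List<CoreLabel> coreLabels = coreMap.get(TokensAnnotation.class);
            System.out.println(coreLabels.toString());
            for (CoreLabel token : coreLabels) {
                String word = token.get(TextAnnotation.class);
                String ner = token.get(NamedEntityTagAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                // "O" means the token is outside any named entity.
                System.out.print(word + "(" + ner + ")" + "  ");
                //System.out.println("pos :" + pos);
            }
        }
    }
}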

Output:
[A, day, after, resigning, as, Navy, Chief, in, New, Delhi, ,, Admiral, D.K., Joshi, on, Thursday, wrote, to, his, colleagues, ,, saying, he, was, ``, firm, '', on, taking, responsibility, for, the, mishaps, that, have, taken, place, .]

A(DURATION)  day(DURATION)  after(O)  resigning(O)  as(O)  Navy(ORGANIZATION)  Chief(O)  in(O)  New(LOCATION)  Delhi(LOCATION)  ,(O)  Admiral(O)  D.K.(PERSON)  Joshi(PERSON)  on(O)  Thursday(DATE)  wrote(O)  to(O)  his(O)  colleagues(O)  ,(O)  saying(O)  he(O)  was(O)  ``(O)  firm(O)  ''(O)  on(O)  taking(O)  responsibility(O)  for(O)  the(O)  mishaps(O)  that(O)  have(O)  taken(O)  place(O)  .(O)  
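
The output above tags each token separately, so "New" and "Delhi" appear as two LOCATION tokens. One way to get whole entities is to merge adjacent tokens that share a tag. The sketch below is plain Java and does not call Stanford NLP at all; the class EntityGrouper and its group method are hypothetical names made up for this illustration, and the words and tags are copied by hand from the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Stanford NLP): merges adjacent tokens
// that carry the same NER tag into a single entity string.
class EntityGrouper {

    // words and tags are parallel lists; tokens tagged "O" are skipped.
    static List<String> group(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: store the entity collected so far, if any.
                if (!currentTag.equals("O")) {
                    entities.add(current + "(" + currentTag + ")");
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Store an entity that runs up to the last token.
        if (!currentTag.equals("O")) {
            entities.add(current + "(" + currentTag + ")");
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("in", "New", "Delhi", ",", "Admiral", "D.K.", "Joshi", "on", "Thursday");
        List<String> tags = Arrays.asList("O", "LOCATION", "LOCATION", "O", "O", "PERSON", "PERSON", "O", "DATE");
        // prints [New Delhi(LOCATION), D.K. Joshi(PERSON), Thursday(DATE)]
        System.out.println(group(words, tags));
    }
}
```

In the real program the two lists would be filled inside the token loop from TextAnnotation and NamedEntityTagAnnotation. Note that this simple approach also merges two different entities of the same type when they sit directly next to each other.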

Hope this helps someone

Tuesday, February 25, 2014

Regular expression using Java

A regular expression (regex or regexp) is a sequence of characters that forms a search pattern. It is used to search text by string pattern matching.

The JDK ships with a built-in API for regular expressions in the java.util.regex package. Regexes are used through the “Pattern” and “Matcher” classes.

Sample program: 

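import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the original sample showed only the main method.
public class RegexClient {

    public static void main(String[] args) {

        boolean gotit = false;
        String line = "Rubesh is a technology (XYZ technology solutions) !. mail - samson@gmail.com mobile - 754-543-5843";
        // A leading ^ negates the class: this matches every character that is
        // not a lowercase letter. Swap in any pattern you want to try.
        String pattern = "[^a-z]";

        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);

        // Now create the Matcher object
        Matcher matcher = r.matcher(line);
        while (matcher.find()) {
            // Print each match back to back on one line
            System.out.print(matcher.group());
            gotit = true;
        }
        if (!gotit) {
            System.out.println("No results !!!");
        }
    }
}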
If you also need the position of each match in the input, use matcher.start() and matcher.end(): start() returns the index of the first matched character and end() returns the index just past the last matched character.
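
As a small sketch of start() and end(), the snippet below records each match together with its offsets. The class MatchOffsets and its offsets method are hypothetical names used only for this illustration, and the input is a shortened version of the sample line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration: lists every match with its position.
class MatchOffsets {

    // Returns one "text[start,end)" entry per match; end is exclusive.
    static List<String> offsets(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Rubesh is a technology (XYZ technology solutions)";
        // prints [R[0,1), XYZ[24,27)]
        System.out.println(offsets("[A-Z]+", line));
    }
}
```

Because end() is exclusive, line.substring(m.start(), m.end()) gives back the same text as m.group().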

Input / output samples (each output is every match found in the sample line above, printed back to back):

Regex pattern        Output
[a-z]                ubeshisatechnologytechnologysolutionsmailsamsongmailcommobile
[A-Z]                RXYZ
[A-Za-z]             RubeshisatechnologyXYZtechnologysolutionsmailsamsongmailcommobile
[a-z0-9_-]           ubeshisatechnologytechnologysolutionsmail-samsongmailcommobile-754-543-5843
[a-z0-9_-]{4,10}     ubeshtechnologytechnologysolutionsmailsamsongmailmobile754-543-58
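
The last row of the table is easier to follow when the matches are listed one by one instead of printed back to back. The sketch below does that for the tail of the sample line; the class QuantifierDemo and its matches method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration: collects each match as its own list entry.
class QuantifierDemo {

    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "mobile - 754-543-5843";
        // {4,10} accepts runs of 4 to 10 characters, so the 12-character phone
        // number is cut after 10 characters and the leftover "43" is too short.
        // prints [mobile, 754-543-58]
        System.out.println(matches("[a-z0-9_-]{4,10}", input));
        // {4,} has no upper limit, so the whole number matches.
        // prints [mobile, 754-543-5843]
        System.out.println(matches("[a-z0-9_-]{4,}", input));
    }
}
```

The lone "-" between the two words is dropped in both cases because a single character is shorter than the minimum of four.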


You can try out regular expressions directly at www.regexpal.com.