Monday, March 17, 2014

Word / Sentence Detector Using OpenNLP

Apache OpenNLP is a machine-learning toolkit for natural language processing (NLP). This article sets up a simple Maven project and runs a small program that uses OpenNLP.

Add the following dependencies to the Maven configuration (pom.xml):
              <!-- Open NLP -->
              <dependency>
                  <groupId>org.apache.opennlp</groupId>
                  <artifactId>opennlp-tools</artifactId>
                  <version>1.5.3</version>
              </dependency>
             
              <dependency>
                  <groupId>commons-io</groupId>
                  <artifactId>commons-io</artifactId>
                  <version>2.4</version>
              </dependency>

Write a Java program to detect sentences:

The high-level steps are:
  • Open an InputStream on the sentence model file en-sent.bin.
  • Create a SentenceModel from the stream and a SentenceDetectorME from the model.
  • Get the array of sentences using the OpenNLP API (sentDetect).


import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectorClient {

       public static void main(String[] args) {
              new SentenceDetectorClient().go();
       }

       private void go() {
              // Open the en-sent.bin sentence model; try-with-resources closes the stream afterwards.
              try (InputStream modelIn = new FileInputStream("src/main/resources/models/en-sent.bin")) {
                     SentenceModel sModel = new SentenceModel(modelIn);

                     // Create a sentence detector backed by the loaded model.
                     SentenceDetectorME sentenceDetector = new SentenceDetectorME(sModel);
                    
                     String articleText = "Chris Gayle on Monday sounded out a warning to the rival teams ahead of the World Twenty20 by declaring that he can score a hundred irrespective of the conditions. “I am capable of scoring a century in any condition and on any wicket in the world. I just want to give the team that kind of a start. It will be nice to get another hundred,” Gayle said. “However it also depends on the conditions as well and how the wicket is playing,” he said. Asked about the tremendous pressure on him to perform every time, when he goes out to bat, the Jamaican dasher said it indeed was a challenge to live up to the expectations. “It creates a lot of pressure as expectations are rising. When you actually set a trend, then people expect you to come good at all times. You have fans worldwide who want me to do well. That’s what they pay for and want to see. But it’s not going to happen all the time but when I do get a chance I try to entertain people as much as possible,” he said. “We are here to retain the title and that’s not going to be easy but we are ready for it and we are ready for the challenges. Our first priority is to make it to the last four, it’s a tough group. Everybody is looking to win the tournament.”";            
                     // Split the article text into sentences.
                     String[] sentences = sentenceDetector.sentDetect(articleText);

                     for (int i = 0; i < sentences.length; i++) {
                            // Print each sentence with its 1-based position.
                            System.out.println("Sentence : " + (i + 1) + " " + sentences[i]);
                     }
                    
              } catch (Exception e) {
                     System.out.println("Exception : " + e);
                    
              }

       }

}

Output:
Sentence : 1 Chris Gayle on Monday sounded out a warning to the rival teams ahead of the World Twenty20 by declaring that he can score a hundred irrespective of the conditions.
Sentence : 2 “I am capable of scoring a century in any condition and on any wicket in the world.
Sentence : 3 I just want to give the team that kind of a start.
Sentence : 4 It will be nice to get another hundred,” Gayle said.
Sentence : 5 “However it also depends on the conditions as well and how the wicket is playing,” he said.
Sentence : 6 Asked about the tremendous pressure on him to perform every time, when he goes out to bat, the Jamaican dasher said it indeed was a challenge to live up to the expectations.
Sentence : 7 “It creates a lot of pressure as expectations are rising.
Sentence : 8 When you actually set a trend, then people expect you to come good at all times.
Sentence : 9 You have fans worldwide who want me to do well.
Sentence : 10 That’s what they pay for and want to see.
Sentence : 11 But it’s not going to happen all the time but when I do get a chance I try to entertain people as much as possible,” he said.
Sentence : 12 “We are here to retain the title and that’s not going to be easy but we are ready for it and we are ready for the challenges.
Sentence : 13 Our first priority is to make it to the last four, it’s a tough group.
Sentence : 14 Everybody is looking to win the tournament.”
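
The program above assumes that en-sent.bin already exists under src/main/resources/models/. If it does not, the FileInputStream constructor throws a FileNotFoundException, which the catch block simply prints. The following is a minimal, JDK-only sketch of checking for the model file up front so the failure is easier to diagnose. The class name ModelFileCheck and the helper requireModel are hypothetical names chosen for illustration; they are not part of OpenNLP.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of OpenNLP): verifies that a model file is present.
class ModelFileCheck {

    // Returns the path if it points to a readable regular file;
    // otherwise throws with a message naming the missing model.
    static Path requireModel(String location) throws FileNotFoundException {
        Path path = Paths.get(location);
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new FileNotFoundException(
                    "Model file not found or unreadable: " + path.toAbsolutePath());
        }
        return path;
    }

    public static void main(String[] args) {
        try {
            // Same relative path as used by the sentence detector program.
            Path model = requireModel("src/main/resources/models/en-sent.bin");
            System.out.println("Found model: " + model);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The returned path could then be passed to the FileInputStream constructor (via toFile()), so the check and the load use the same location.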

-----------------------------------------------------------------------------

Similarly, the following code tokenizes the same article into words:

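import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// The statements below belong inside a method that handles IOException.
// Open the en-token.bin tokenizer model.
InputStream modelIn = new FileInputStream("src/main/resources/models/en-token.bin");
TokenizerModel tModel = new TokenizerModel(modelIn);

// Create a tokenizer backed by the loaded model.
TokenizerME tokenizer = new TokenizerME(tModel);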

String articleText = "Chris Gayle on Monday sounded out a warning to the rival teams ahead of the World Twenty20 by declaring that he can score a hundred irrespective of the conditions. “I am capable of scoring a century in any condition and on any wicket in the world. I just want to give the team that kind of a start. It will be nice to get another hundred,” Gayle said. “However it also depends on the conditions as well and how the wicket is playing,” he said. Asked about the tremendous pressure on him to perform every time, when he goes out to bat, the Jamaican dasher said it indeed was a challenge to live up to the expectations. “It creates a lot of pressure as expectations are rising. When you actually set a trend, then people expect you to come good at all times. You have fans worldwide who want me to do well. That’s what they pay for and want to see. But it’s not going to happen all the time but when I do get a chance I try to entertain people as much as possible,” he said. “We are here to retain the title and that’s not going to be easy but we are ready for it and we are ready for the challenges. Our first priority is to make it to the last four, it’s a tough group. Everybody is looking to win the tournament.”";
String[] tokens = tokenizer.tokenize(articleText);

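// Join the tokens with "|" so that the token boundaries are visible.
StringBuilder tokenString = new StringBuilder();
for (String token : tokens) {
    tokenString.append(token).append("|");
}
// tokens.length is the token count; the joined string's length() would count characters instead.
System.out.println("No. of tokens : " + tokens.length);
System.out.println(tokenString);

Output:
No. of tokens : 267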
Chris|Gayle|on|Monday|sounded|out|a|warning|to|the|rival|teams|ahead|of|the|World|Twenty20|by|declaring|that|he|can|score|a|hundred|irrespective|of|the|conditions|.|“|I|am|capable|of|scoring|a|century|in|any|condition|and|on|any|wicket|in|the|world|.|I|just|want|to|give|the|team|that|kind|of|a|start|.|It|will|be|nice|to|get|another|hundred|,|”|Gayle|said|.|“However|it|also|depends|on|the|conditions|as|well|and|how|the|wicket|is|playing|,|”|he|said|.|Asked|about|the|tremendous|pressure|on|him|to|perform|every|time|,|when|he|goes|out|to|bat|,|the|Jamaican|dasher|said|it|indeed|was|a|challenge|to|live|up|to|the|expectations|.|“It|creates|a|lot|of|pressure|as|expectations|are|rising|.|When|you|actually|set|a|trend|,|then|people|expect|you|to|come|good|at|all|times|.|You|have|fans|worldwide|who|want|me|to|do|well|.|That’s|what|they|pay|for|and|want|to|see|.|But|it|’s|not|going|to|happen|all|the|time|but|when|I|do|get|a|chance|I|try|to|entertain|people|as|much|as|possible|,|”|he|said|.|“We|are|here|to|retain|the|title|and|that|’s|not|going|to|be|easy|but|we|are|ready|for|it|and|we|are|ready|for|the|challenges|.|Our|first|priority|is|to|make|it|to|the|last|four|,|it|’s|a|tough|group|.|Everybody|is|looking|to|win|the|tournament|.|”|
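
When reporting a count for output like this, note that the number of tokens is the length of the token array, whereas the length of the joined string counts characters, including the "|" separators. The following JDK-only sketch shows the difference on a small hard-coded sample, so it runs without the OpenNLP model. The class name TokenJoinDemo and the helper joinWithPipes are hypothetical, and the sample array is an assumption: it simply copies the first six tokens from the output above.

```java
// Hypothetical demo (not part of OpenNLP): token count vs. joined-string length.
class TokenJoinDemo {

    // Appends "|" after every token, matching the output format shown above.
    static String joinWithPipes(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append('|');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hard-coded sample (assumption): the first six tokens from the output above.
        String[] tokens = {"Chris", "Gayle", "on", "Monday", "sounded", "out"};
        String joined = joinWithPipes(tokens);

        // Number of tokens: the array length.
        System.out.println("No. of tokens : " + tokens.length);
        // Number of characters in the joined string, separators included.
        System.out.println("Joined length : " + joined.length());
        System.out.println(joined);
    }
}
```

In the real program, the array returned by tokenizer.tokenize(articleText) would take the place of the hard-coded sample.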

