Monday, July 13, 2015

Apache Storm - Overview

What is Storm
Storm is a distributed real-time stream processing system. Processing work that would normally take a long time is split among different components, each responsible for performing one task.

At a high level, Storm consists of the following stages.
  • Input: The data is received from different sources by a component called a spout. The sources can be files or message queues.
  • Processing: The processing is done by components called bolts. The work can be done on one node or spread across different nodes.
  • Output: Once the data is processed, it can be stored in a database or in files.
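
The three stages above can be sketched in plain Python. This is only an illustration of the data flow, not Storm's API: the names sentence_spout, split_bolt, count_bolt and run_pipeline are made up for this example, and everything runs in a single process rather than on a cluster.

```python
from collections import Counter


def sentence_spout(sentences):
    """Input: plays the role of a spout by emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Processing: a bolt that splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Processing: a bolt that keeps a running count per word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_pipeline(sentences):
    """Output: wires spout -> bolts and returns the result as a plain dict."""
    return dict(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    print(run_pipeline(["storm processes streams", "storm scales"]))
```

In a real topology each of these functions would be a separate component, and the cluster would decide on which node each one runs.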

Advantages: 
  • See results in real time, while the Storm components process the data at high speed by spreading the work across different nodes.

Applications of Storm:
  • Process real-time data from different devices and analyse it as soon as it flows into the system.
  • Live statistics.
  • Build predictive models for real-time data.
  • Build monitoring and alerting systems. 

Why Storm: 
  • Simple to program.
  • Support for multiple programming languages.
  • Fault tolerant: takes care of workers going down and reassigns tasks when necessary.
  • Scalable: multi-node scaling options.

Operation:
  • Local mode: run on a single machine
  • Remote mode:
      • we submit our topology to the Storm cluster, which is composed of different processes, usually running on different machines.

Nodes: 
  • Master node
      • runs a daemon called Nimbus, which:
        • distributes code around the cluster
        • assigns tasks to worker nodes
        • monitors for failures
  • Worker node
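
To make the idea of assigning tasks and reacting to failures concrete, here is a toy model in Python. It is not how Storm's scheduler is implemented: the round-robin policy and the function names assign_tasks and reassign_after_failure are assumptions chosen only to keep the example small.

```python
def assign_tasks(tasks, workers):
    """Spread tasks over workers round-robin; returns {worker: [tasks]}."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return assignment


def reassign_after_failure(assignment, failed_worker):
    """Move the failed worker's tasks onto the surviving workers."""
    survivors = [w for w in assignment if w != failed_worker]
    if not survivors:
        raise ValueError("no surviving workers")
    # Copy the surviving assignments so the input is left unchanged.
    new_assignment = {w: list(assignment[w]) for w in survivors}
    for i, task in enumerate(assignment.get(failed_worker, [])):
        new_assignment[survivors[i % len(survivors)]].append(task)
    return new_assignment


if __name__ == "__main__":
    plan = assign_tasks(["t1", "t2", "t3", "t4"], ["w1", "w2"])
    print(plan)
    print(reassign_after_failure(plan, "w2"))
```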

Types of grouping: 
  • Shuffle grouping. 
      • sends each tuple emitted by the source to a randomly chosen instance of the bolt.
      • useful for performing mathematical operations. 
      • not suitable for operations that cannot be randomly distributed. 
  • Fields grouping
      • controls how tuples are sent to bolts, based on one or more fields of the tuple. Tuples with the same values in those fields always go to the same bolt instance.
  • All grouping
      • sends a single copy of each tuple to all instances of the receiving bolt. 
      • used to send signals to all bolts, e.g., to refresh a cache
  • Direct grouping
  • Global grouping
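
The routing rules for the shuffle, fields and all groupings described above can be sketched as small Python functions. This is a simplified model and not Storm's implementation: each function returns the indexes of the bolt instances that would receive a tuple, and the function names, the num_tasks parameter and the use of a CRC32 hash are assumptions made for the example.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick one bolt instance at random for this tuple."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Pick the instance from a hash of the chosen fields, so tuples
    with equal values in those fields always land on the same instance."""
    key = "|".join(str(tup[i]) for i in field_indexes)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """Send a copy of the tuple to every bolt instance."""
    return list(range(num_tasks))


if __name__ == "__main__":
    word_tuple = ("storm", 1)
    print(shuffle_grouping(word_tuple, 4))
    print(fields_grouping(word_tuple, 4, [0]))
    print(all_grouping(("refresh-cache",), 4))
```

A word count is a typical case for fields grouping: grouping on the word field keeps all counts for one word on a single bolt instance, which a shuffle grouping could not guarantee.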
