Discussion About Apache Sqoop?

0 votes
567 views

What is Apache Sqoop?

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the enterprise data warehouse (EDW) to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases. This article is a brief overview of how to make use of Sqoop in the Hadoop ecosystem.

With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input to the import process is either a database table or a mainframe dataset. For databases, Sqoop will read the table row by row into HDFS.

For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS. The output of this import process is a set of files containing a copy of the imported table or datasets. The import process is performed in parallel, so the output is written to multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
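For example, a minimal import might look like the following sketch (the JDBC URL, credentials, table name, and target directory here are placeholders, not values from any particular setup):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

Each of the four mappers imports a disjoint slice of the table, which is why the target directory ends up containing several part files rather than a single one.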

Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
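For example (again with placeholder connection details), these inspection tools are invoked like any other Sqoop tool:

sqoop list-databases --connect jdbc:mysql://dbhost/ --username dbuser -P
sqoop list-tables --connect jdbc:mysql://dbhost/sales --username dbuser -P
sqoop eval --connect jdbc:mysql://dbhost/sales --username dbuser -P \
  --query "SELECT COUNT(*) FROM orders"

The eval tool simply runs the given SQL statement against the database and prints the result, which is handy for sanity-checking a connection before starting a long import.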

Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.
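As a rough sketch of such a customized import (the column list, row filter, and delimiter are all illustrative):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --table orders \
  --columns "id,customer_id,total" \
  --where "order_date >= '2018-01-01'" \
  --fields-terminated-by '\t' \
  --target-dir /user/hadoop/orders_2018

Replacing the delimiter option with --as-avrodatafile (or --as-sequencefile) would store the same rows in a binary format instead of delimited text.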

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

If Sqoop is compiled from its own source, you can run Sqoop without a formal installation process by running the bin/sqoop program.
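In other words, the general invocation pattern is the same whether you use a packaged sqoop script or bin/sqoop from a source checkout:

sqoop <tool-name> [tool-arguments]
sqoop help
sqoop help import

The first form runs a specific tool, sqoop help lists the available tools, and sqoop help import prints the arguments that a particular tool accepts.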

Video for Sqoop

posted Jan 30, 2018 by anonymous



Related Articles

What is ZooKeeper?

Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.

ZooKeeper™ is an open source Apache project that provides centralized infrastructure and services that enable synchronization across an Apache™ Hadoop® cluster.

ZooKeeper maintains common objects needed in large cluster environments. Examples of these objects include configuration information, hierarchical naming space, and so on. Applications leverage these services to coordinate distributed processing across large clusters. 

For applications, ZooKeeper provides an infrastructure for cross-node synchronization. It does this by maintaining status-type information in memory on ZooKeeper servers. A ZooKeeper server keeps a copy of the state of the entire system and persists this information in local log files. Large Hadoop clusters are supported by multiple ZooKeeper servers, with a master server synchronizing the top-level servers.

Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode). 

Using this znode infrastructure (and there is much more to it than we can do justice to in this section), applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode, which then informs the rest of the cluster of a specific node's status change. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
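As a small illustration using the command-line client that ships with ZooKeeper (the znode path and data are made up, and the -w flag assumes ZooKeeper 3.5 or later):

bin/zkCli.sh -server zkhost:2181
create /worker-status "starting"
get -w /worker-status
set /worker-status "ready"

The get -w command both reads the znode and registers a watch on it; when another client later runs the set command, every watcher receives a NodeDataChanged event for that path and can react to the status change.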

Video for ZooKeeper 

https://www.youtube.com/watch?v=Omh2JwcB0xY


What is Apache SINGA?

SINGA is an Apache Incubating project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.

SINGA was initiated by the DB System Group at National University of Singapore in 2014, in collaboration with the database group of Zhejiang University.

SINGA is a general distributed deep learning platform for training large deep learning models over large datasets. It is designed around an intuitive programming model based on the layer abstraction. A variety of popular deep learning models are supported, namely feed-forward models such as convolutional neural networks (CNNs), energy models such as restricted Boltzmann machines (RBMs), and recurrent neural networks (RNNs). Many built-in layers are provided for users. The SINGA architecture is flexible enough to run synchronous, asynchronous, and hybrid training frameworks. SINGA also supports different neural-net partitioning schemes to parallelize the training of large models, namely partitioning on the batch dimension, on the feature dimension, or hybrid partitioning.

A second design goal is to make SINGA easy to use. It is non-trivial for programmers to develop and train models with deep and complex structures, and distributed training further increases their burden with concerns such as data and model partitioning and network communication. Hence it is essential to provide an easy-to-use programming model so that users can implement their deep learning models and algorithms without much awareness of the underlying distributed platform.

Video for Apache SINGA

https://www.youtube.com/watch?v=tkk5tOiWeEQ


What is Apache Velocity?

Apache Velocity is a Java-based template engine that provides a template language for referencing objects defined in Java code. There has been a long-running debate between the Apache Velocity developers and the Spring team over why Velocity support was dropped from the Spring Framework.

Velocity is a Java-based templating engine.

It is open source and is often used as the view component in an MVC architecture, providing an alternative to existing technologies such as JSP.

Velocity can be used to generate XML files, SQL, PostScript, and most other text-based formats. The core class of Velocity is the VelocityEngine.

It orchestrates the whole process of reading, parsing, and generating content using the data model and the Velocity template.

Here are the steps to follow in any typical Velocity application:

  • Initialize the Velocity engine
  • Read the template
  • Put the data model into a context object
  • Merge the template with the context data and render the view

The Velocity Template Language (VTL) provides a simple and clean way of incorporating dynamic content into a web page by using VTL references.

A VTL reference in a Velocity template starts with $ and is used to get the value associated with that reference.

VTL also provides a set of directives that can be used to manipulate the output of the Java code. These directives start with #.

Example

#set ($message = "Query Home")
#set ($customer.name = "Sandeep Bedi")
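As a slightly longer sketch in the same vein ($customer and $products are hypothetical objects assumed to have been placed in the context by the Java code), references and directives can be combined freely:

#set ($greeting = "Hello")
$greeting, $customer.name!
#if ($products.size() > 0)
#foreach ($product in $products)
- $product.name costs $product.price
#end
#end

When the template is merged, each $ reference is replaced by the corresponding value from the context, and the #if / #foreach directives control which lines are emitted and how often.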

Video for Apache Velocity

https://www.youtube.com/watch?v=DnWe-QJHEzQ

...