Thursday, December 14, 2017

Big Data in Data Science

Tools and plays

Kafka, Elastic MapReduce, Avro, Parquet, Storm, HBase

Node.js or Java
- Either: Kafka, Storm, Neo4j or HBase
- Mongoose
- Solr/Lucene

Cassandra, Spark

Deep working experience applying machine learning and statistics to real world problems
Solid understanding of a wide range of data mining / machine learning software packages (e.g., Spark ML, scikit-learn, H2O, Weka, Keras)
Experience with version control systems (git) and comfortable using command-line tools

Knowledge of semantic web technology (e.g., RDF, OWL, SPARQL)
Knowledge of search technologies (e.g., Solr, Elasticsearch)
A link to a portfolio and/or code samples demonstrating your work experience (GitHub, Kaggle, KDD contributions earn major props)

Data Analyst – BI - Training:

Coding data extraction, transformation and loading (ETL) routines.
APIs and databases to pull data together
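
As a rough illustration of what an ETL routine that pulls API data into a database can look like, here is a minimal sketch using only the Python standard library. The payload, the `customers` table, and every function name are hypothetical and chosen for illustration; a real routine would fetch the payload over HTTP and target a production database rather than in-memory SQLite.

```python
import json
import sqlite3

# Hypothetical API payload; in practice this would be the body of an HTTP response.
RAW_API_RESPONSE = (
    '[{"id": 1, "name": " alice ", "spend": "12.50"},'
    ' {"id": 2, "name": "bob", "spend": "7"},'
    ' {"id": 3, "name": "", "spend": "n/a"}]'
)


def extract(payload):
    """Extract: parse the raw JSON text into a list of dicts."""
    return json.loads(payload)


def transform(records):
    """Transform: tidy names, cast spend to float, drop rows that fail validation."""
    cleaned = []
    for rec in records:
        name = rec["name"].strip().title()
        try:
            spend = float(rec["spend"])
        except ValueError:
            continue  # skip rows whose spend is not numeric
        if not name:
            continue  # skip rows with an empty name
        cleaned.append((rec["id"], name, spend))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a (hypothetical) customers table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, name TEXT, spend REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(payload, conn):
    """Run extract, transform and load in sequence and return what was stored."""
    load(transform(extract(payload)), conn)
    return conn.execute(
        "SELECT id, name, spend FROM customers ORDER BY id"
    ).fetchall()
```

Calling `run_etl(RAW_API_RESPONSE, sqlite3.connect(":memory:"))` stores the two valid rows and skips the third, whose spend is not numeric.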

Experience with Hadoop, SQL and NoSQL technologies is required, as well as basic scripting experience in a dynamic language, such as Python or R.
Tools like Jethro, Kyvos, Dremio, AtScale etc.
BI tools like Tableau, Domo, QlikView etc.
Data visualization
Relational Databases (e.g., Postgres, SQL Server, Oracle, MySQL)
Distributed Databases (e.g., Hive, Redshift, Greenplum)
NoSQL Data Frameworks (e.g., Spark, MongoDB, Cassandra, HBase)
Data Analysis and Transformation (e.g., R, MATLAB, Python, etc.)

Big Data providers: Cloudera CDH, Hortonworks HDP and Amazon EC2/EMR for deploying and developing large scale solutions.
Hadoop/Spark Big Data Environment Clusters using Foreman, Puppet and Vagrant. Deploy Big Data Platforms (including Hadoop & Spark) to multiple clusters using Cloudera CDH, on both CDH4 and CDH5.
Hadoop MapReduce, YARN, HBase, Spark performance for large-scale data analysis.
Spark performance based on Cloudera and Hortonworks HDP cluster setup on production servers.
Machine learning data models on terabytes of data using the Spark ML and MLlib libraries.
ETL systems using Python, Hive and the Apache Spark SQL framework. Storing all the result files in Apache Parquet and mapping them to Hive for enterprise data warehousing.
Real-time data pipelines using Kafka and Python consumers to ingest data through the Adobe Real-time Firehose API into Elasticsearch, with real-time dashboards built in Kibana.
Airbnb's Airflow tool to run the machine learning scripts as a DAG.
Test cases using the Python nose framework.
Converting scikit-learn Python scripts to Spark ML/MLlib scripts, which resulted in a scalable pipeline framework.
Data Pipelines using Spark and Scala on AWS EMR framework and S3.
Real-time Data pipelines using Spark Streaming and Apache Kafka in Python.
Real-time Data pipelines using Apache Storm Java API for processing live streams of data and ingesting to HBase.
Data pipelines on Cloudera/Hortonworks Hadoop Platform using Apache Pig and automating workflow using Apache Oozie.
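
Several of the items above involve running scripts as a DAG (Airflow, Oozie). The sketch below illustrates only the underlying idea, which is running tasks in dependency order. It uses the standard-library `graphlib` module (Python 3.9+) and is not the Airflow or Oozie API; the task names and the `PIPELINE` mapping are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "train_model": {"clean"},
    "score": {"train_model"},
    "report": {"score", "clean"},
}


def execution_order(dag):
    """Return task names ordered so every task comes after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def run_pipeline(dag, tasks):
    """Call each task's function in dependency order and collect the results."""
    results = []
    for name in execution_order(dag):
        results.append((name, tasks[name]()))
    return results
```

A scheduler such as Airflow adds retries, scheduling and distributed execution on top of this ordering step, so the sketch is a conceptual aid rather than a replacement.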

Technology: Hadoop Ecosystem / Spring Boot / Microservices / AWS / J2SE / J2EE / Oracle
DBMS/Databases: DB2, MySQL, SQL, PL/SQL
Big Data Ecosystem: HDFS, MapReduce, Oozie, Hive/Impala, Pig, Sqoop, ZooKeeper, HBase, Spark, Scala
NoSQL Databases: MongoDB, HBase
Version Control Tools: SVN, CVS, VSS, PVCS
