Tuesday, April 2, 2019

Spark Machine Learning Course NYC

Spark PySpark Advanced Python Devops Software Engineering Course By Joshi
 

 


 


Multiple & Logistic Regression in Spark 101

Spark MLlib code
In the code, the line "from pyspark.ml import regression" imports the regression module from the PySpark library; it is not local code.
For practice, implement this on your own; a notebook was created on Databricks Community Edition:
https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a?gi=66d5401a1bb7


Regression Spark parameters

Spark Logistic Regression

https://spark.apache.org/docs/1.4.1/api/java/org/apache/spark/ml/regression/LinearRegression.html

Quant Methods in Regression

Maximum likelihood
Log likelihood
Regression Code

https://dziganto.github.io/classes/data%20science/linear%20regression/machine%20learning/object-oriented%20programming/python/Understanding-Object-Oriented-Programming-Through-Machine-Learning/
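
To make the maximum likelihood and log likelihood items above concrete, here is a minimal sketch in plain Python of the Bernoulli log-likelihood that logistic regression maximizes. The function names and the toy data are illustrative assumptions, not code from the course notebook or from Spark.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, bias, rows, labels):
    """Sum of log P(y | x) over the data under a logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = bias + sum(w * xi for w, xi in zip(weights, x))
        p = sigmoid(z)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Toy data (assumed): one feature, positive values labelled 1.
rows = [[0.5], [1.5], [-1.0], [-2.0]]
labels = [1, 1, 0, 0]

ll_zero = log_likelihood([0.0], 0.0, rows, labels)  # uninformative model
ll_fit = log_likelihood([2.0], 0.0, rows, labels)   # weight that separates the data
```

Maximum likelihood estimation picks the weights with the highest log-likelihood, so here the second model is preferred over the first.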

Git commands 101 Fork, clone, merge branches

Git command primer to merge two branches

Converting Python into Spark code

Running SciPy from Spark Scala code would be a bad idea: Python does not run on the JVM, so calling it would slow things down. The suggestion is to use ScalaNLP Breeze instead.
https://stackoverflow.com/questions/40370759/using-python-scipy-from-spark-scala
Is Python code faster, or should we write small UDFs in Scala?
There is no easy, straightforward answer. Python does not run on the JVM, so we have to use it through an API.
Garbage collection and Java serialization determine the speed.
https://changhsinlee.com/pyspark-udf/
Spark Optimization:
https://blog.cloudera.com/blog/2015/04/how-to-translate-from-mapreduce-to-apache-spark-part-2/
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
https://stackoverflow.com/questions/32435263/dataframe-join-optimization-broadcast-hash-join/39404486
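
The "prefer reduceByKey over groupByKey" advice linked above can be illustrated without a cluster. The following is a plain-Python simulation, not Spark code: the lists stand in for partitions and the helper names are hypothetical. It only shows why combining values inside each partition first means fewer records cross the shuffle.

```python
from collections import defaultdict

# Two simulated partitions of (key, value) records (assumed toy data).
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def shuffle_group_by_key(parts):
    """groupByKey style: every record crosses the shuffle unchanged."""
    return [record for part in parts for record in part]

def shuffle_reduce_by_key(parts):
    """reduceByKey style: sum per key inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled

def final_counts(records):
    """Reduce side: merge whatever arrived from the shuffle."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

grouped = shuffle_group_by_key(partitions)
reduced = shuffle_reduce_by_key(partitions)
```

Both paths give the same totals, but the reduce-style path ships four records instead of six; on real data with many repeated keys the gap is much larger.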

Partitioning in Spark

https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
Executor and driver: how many GB of memory to give each
How to create partitions in Spark
Make sure the number of partitions is at least the number of executors.

https://github.com/vaquarkhan/vaquarkhan/wiki/How-to-calculate-node-and-executors-memory-in-Apache-Spark
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html

Spark Partitioning data & speed

How to partition data optimally using Spark.

PySpark and its Transformations

https://blog.usejournal.com/tutorial-on-pyspark-transformations-and-mlib-7ed289a9e843?gi=192332db80e

MLlib DataFrame-based API

http://spark.apache.org/docs/latest/ml-guide.html

RDD vs DF

https://www.adsquare.com/comparing-performance-of-spark-dataframes-api-to-spark-rdd/

Submitting a PySpark job from the terminal

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
https://spark.apache.org/docs/latest/submitting-applications.html
--driver-memory 5G
--conf spark.driver.maxResultSize
--conf spark.shuffle.service.enabled
--conf spark.dynamicAllocation.enabled
--conf spark.ui.enabled
--conf spark.speculation
--conf spark.port.maxRetries
--queue root.mde.ste_queue.ste_queue3
--conf spark.kryoserializer.buffer.max
--conf spark.executor.pyspark.memory
--conf spark.executor.memoryOverhead
Submit a job to a Spark cluster
https://stackoverflow.com/questions/49081211/submit-python-script-into-spark-cluster
Compilation using Maven

Classes & Functions in Python 103

Meta Classes

A metaclass is the class of a class. A class defines how an instance of the class (i.e. an object) behaves while a metaclass defines how a class behaves. A class is an instance of a metaclass.
Skeleton of a class.
https://stackoverflow.com/questions/100003/what-are-metaclasses-in-python
https://realpython.com/python-metaclasses/
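
As a small sketch of "a class is an instance of a metaclass", the hypothetical metaclass below records every class it creates. RegistryMeta, Model and LinearModel are made-up names for illustration only.

```python
class RegistryMeta(type):
    """Hypothetical metaclass that records every class it creates."""
    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry[name] = cls  # runs once per class definition
        return cls

class Model(metaclass=RegistryMeta):
    pass

class LinearModel(Model):  # subclasses inherit the metaclass
    pass
```

Here type(LinearModel) is RegistryMeta, in the same way that type(LinearModel()) is LinearModel: the metaclass controls how the class is built, the class controls how the instance behaves.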

Understanding closures

https://www.programiz.com/python-programming/closure
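
A minimal closure sketch, with made-up names: the inner function keeps access to the variable "factor" from the enclosing call even after that call has returned.

```python
def make_scaler(factor):
    """Return a function that remembers 'factor' after make_scaler returns."""
    def scale(value):
        return value * factor  # 'factor' is captured from the enclosing scope
    return scale

double = make_scaler(2)
triple = make_scaler(3)
```

Each returned function carries its own captured value, so double and triple behave differently although they share the same code.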

Super

super(GeneralizedLinearRegression, self)
When do we use super()?
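
One common answer: super is used when a subclass overrides a method but still wants the parent's version to run. The sketch below uses hypothetical stand-in classes; they borrow the name from the call above but are not PySpark's implementation.

```python
class Regression:
    """Hypothetical base class, not the PySpark one."""
    def __init__(self, max_iter=100):
        self.max_iter = max_iter

    def describe(self):
        return "Regression(max_iter=%d)" % self.max_iter

class GeneralizedLinearRegression(Regression):
    def __init__(self, family="gaussian", max_iter=100):
        # Explicit two-argument form; in Python 3, super().__init__(...) is equivalent.
        super(GeneralizedLinearRegression, self).__init__(max_iter=max_iter)
        self.family = family

    def describe(self):
        # Reuse the parent's text and extend it instead of duplicating it.
        base = super(GeneralizedLinearRegression, self).describe()
        return base + " with family=" + self.family

model = GeneralizedLinearRegression(family="poisson", max_iter=25)
```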

Staticmethods

@staticmethod
With staticmethods, neither self (the object instance) nor cls (the class) is implicitly passed as the first argument. They behave like plain functions except that you can call them from an instance or the class.
https://stackoverflow.com/questions/136097/what-is-the-difference-between-staticmethod-and-classmethod
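
A short sketch of the difference, using a made-up Temperature class: the static method receives no implicit first argument, while the class method receives the class as cls.

```python
class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @staticmethod
    def c_to_f(celsius):
        # No self or cls: a plain function that lives in the class namespace.
        return celsius * 9 / 5 + 32

    @classmethod
    def from_fahrenheit(cls, fahrenheit):
        # cls is passed implicitly, so this can act as an alternative constructor.
        return cls((fahrenheit - 32) * 5 / 9)

boiling = Temperature.from_fahrenheit(212)
```

The static method can be called as Temperature.c_to_f(100) or from an instance such as boiling.c_to_f(100); both give the same result.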
Double Underscore
How to create private attributes and methods in a Python class

Single Underscore

Names in a class with a leading underscore simply indicate to other programmers that the attribute or method is intended to be private. However, nothing special is done with the name itself.
To quote PEP-8:
_single_leading_underscore: weak "internal use" indicator. E.g. from M import * does not import objects whose name starts with an underscore.
A single leading underscore isn't exactly just a convention: if you use from foobar import *, and module foobar does not define an __all__ list, the names imported from the module do not include those with a leading underscore. Let's say it's mostly a convention, since this case is a pretty obscure corner ;-)
__foo__: this is just a convention, a way for the Python system to use names that won't conflict with user names.
_foo: this is just a convention, a way for the programmer to indicate that the variable is private (whatever that means in Python).
__foo: this has real meaning: the interpreter replaces this name with _classname__foo as a way to ensure that the name will not overlap with a similar name in another class.
No other form of underscores have meaning in the Python world.
There's no difference between class, variable, global, etc in these conventions.
https://stackoverflow.com/questions/1301346/what-is-the-meaning-of-a-single-and-a-double-underscore-before-an-object-name
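
The name mangling described above can be checked directly. This is a minimal sketch with a made-up Account class.

```python
class Account:
    def __init__(self):
        self._note = "internal by convention"  # single underscore: a hint only
        self.__balance = 100                   # stored as _Account__balance

    def balance(self):
        return self.__balance  # inside the class the short name still works

acct = Account()
mangled_names = [name for name in vars(acct) if "balance" in name]
```

Outside the class, acct.__balance raises AttributeError, while acct._Account__balance still works, so the double underscore avoids name clashes rather than enforcing privacy.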

Higher order functions & Decorators

Used @total_ordering, which comes from functools, the standard library module for higher-order functions.
@property
def getConfidenceInterval(self):
Functions are passed as arguments and their behavior is changed.
A decorator, written with the @ symbol, passes a function into another function, which returns a version of it with changed behavior.
https://stackoverflow.com/questions/681953/how-to-decorate-a-class
https://docs.python.org/2/library/functools.html
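
The three decorators mentioned above can be sketched together. Everything here is illustrative: count_calls, square and Estimate are made-up names, and confidence_interval only mirrors the idea of the getConfidenceInterval property, not its actual code.

```python
from functools import total_ordering, wraps

def count_calls(func):
    """Hand-written decorator: wraps func and counts how often it is called."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls  # same as: square = count_calls(square)
def square(x):
    return x * x

@total_ordering  # fills in <=, > and >= from __eq__ and __lt__
class Estimate:
    def __init__(self, value, margin):
        self.value = value
        self.margin = margin

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    @property  # accessed as an attribute, without parentheses
    def confidence_interval(self):
        return (self.value - self.margin, self.value + self.margin)
```

For example, Estimate(2.0, 0.5).confidence_interval gives the pair of bounds without a method call, and Estimate objects can be compared with <= even though only __eq__ and __lt__ were written by hand.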

Unit Testing & Integration Testing in Python 101

Two folders exist.
Python Unit Testing
Unit testing is a level of software testing where individual units or components of a piece of software are tested. The purpose is to validate that each unit performs as designed. A unit is the smallest testable part of any software. It usually has one or a few inputs and a single output.
There are certification exams about automated testing etc.
https://stackoverflow.com/questions/15351546/how-to-import-a-class-from-unittest-in-python
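
A minimal sketch of a unit test with the standard unittest module. The function under test, mean, is a made-up example of a "unit" with a few inputs and one output.

```python
import unittest

def mean(values):
    """Unit under test: arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)

class MeanTest(unittest.TestCase):
    def test_mean_of_integers(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])

# Run the test case programmatically; from a terminal, "python -m unittest"
# would discover and run it instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```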

Magic Methods

https://blog.rmotr.com/python-magic-methods-and-getattr-75cf896b3f88?gi=76e08922d80d
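
A small sketch of magic methods, including __getattr__, using a made-up Config class. Python calls these methods implicitly: __getattr__ when normal attribute lookup fails, __len__ for len(), and __repr__ for repr().

```python
class Config:
    """Hypothetical settings holder driven by magic methods."""
    def __init__(self, **values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)  # avoid recursing on internal names
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        return len(self._values)

    def __repr__(self):
        items = sorted(self._values.items())
        return "Config(%s)" % ", ".join("%s=%r" % kv for kv in items)

cfg = Config(master="local[2]", retries=3)
```

With this in place, cfg.master reads the stored value as if it were a real attribute, while a name that was never set still raises AttributeError.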

Jira / OpenProject (Agile Development) 101

Reporting Agile Project Development & Sprint
It lets you see all the open questions, keep track of meetings, and see what people are working on.
Principles of Agile Development and what are sprints

Jenkins 101

SDLC and Jenkins
Why is DevOps becoming important?
https://www.tutorialspoint.com/jenkins/jenkins_overview.htm

Questions

How do the libraries talk to each other?
SAS running on a VM: 101 course?
Is Python an API for Spark, or does Python run in the container?
How would we write our code to run on a cluster?
set PYTHONPATH=%PYTHONPATH%;C:\Users\xxx\Downloads\python-3.6.0-embed-amd64\

Linux Commands

Secure copy
scp -r username@ip:/project/ C:/Users/xxx/Documents
