Stored Function, Stored Procedure, Trigger, CTE, and Recursive CTE.

Beyond the basic SELECT-FROM-WHERE and the common INSERT, UPDATE, and DELETE statements, SQL offers other useful features, including stored functions, stored procedures, triggers, common table expressions (CTEs), and recursive CTEs.

I will use MySQL with the classicmodels sample database, which you can download here. I will focus on three tables (customers, employees, and orders) to keep things easy to understand. Getting your hands dirty is always the best way to learn.

Figure: ER diagram for the classicmodels database

Note: In case you want to know how this ER diagram was generated: on the menu bar, click on…


What is AutoML and Why AutoML?

  1. AutoML automates model selection, hyperparameter tuning, and model ensembling. It does not help with feature engineering.
  2. AutoML works best for common cases, including tabular data (66% of the data used at work is tabular), time series, and text data. It does not work as well for deep learning, because deep learning requires massive computation and a proper layer architecture, neither of which fits well with the hyperparameter tuning part of AutoML.
  3. AutoML can simplify machine learning coding and thus reduce labor costs. …

In 2014 there was an article, “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?”, showing the practical results of 179 classifiers on 121 datasets. The answer is: there is no best classifier, but there is a most suitable one for each dataset.

Overall, Random Forest shows the best result, but it is the champion on only 9.9% of the datasets. SVM, Neural Network, and boosting techniques follow behind. The pattern conforms to our expectations: the higher the dimensionality of the dataset, the better SVM and Random Forest perform compared to boosting techniques. …



Optimization

1. Approximate Greedy Algorithm

XGBoost always chooses the split point with the best gain. That makes it a greedy algorithm, which does not guarantee the best result in the long run. When there are a lot of features and many candidate thresholds, checking every one of them becomes very slow. To deal with this, we can divide each feature into quantiles and test only the quantile boundaries as thresholds. The more quantiles we set, the more accurate the threshold.
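To make the trade-off concrete, here is a minimal sketch in plain Python, not XGBoost's actual implementation. It assumes a single feature, synthetic data, and a simplified gain built from the score (sum of residuals)² / count with regularization left out. The function names and the data are hypothetical and only for illustration.

```python
import random

def similarity(residuals):
    """Simplified score (an assumption): (sum of residuals)^2 / count, no regularization."""
    if not residuals:
        return 0.0
    return sum(residuals) ** 2 / len(residuals)

def gain(xs, residuals, threshold):
    """Gain from splitting the single feature at `threshold`."""
    left = [r for x, r in zip(xs, residuals) if x < threshold]
    right = [r for x, r in zip(xs, residuals) if x >= threshold]
    return similarity(left) + similarity(right) - similarity(residuals)

def exact_candidates(xs):
    """Exact greedy: midpoints between every pair of adjacent distinct values."""
    values = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(values, values[1:])]

def quantile_candidates(xs, n_quantiles):
    """Approximate greedy: only the boundaries between equal-count quantiles."""
    ordered = sorted(xs)
    n = len(ordered)
    return sorted({ordered[(i * n) // n_quantiles] for i in range(1, n_quantiles)})

def best_split(xs, residuals, candidates):
    """Return (threshold, gain) for the candidate with the highest gain."""
    return max(((t, gain(xs, residuals, t)) for t in candidates), key=lambda p: p[1])

# Synthetic data (hypothetical): the residuals change sign around x = 6.3.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(1000)]
residuals = [(1.0 if x > 6.3 else -1.0) + random.gauss(0, 0.1) for x in xs]

exact = exact_candidates(xs)
exact_t, exact_gain = best_split(xs, residuals, exact)

results = {}
for q in (4, 10, 100):
    approx = quantile_candidates(xs, q)
    results[q] = best_split(xs, residuals, approx)
    t, g = results[q]
    print(f"{q:>3} quantiles: {len(approx):>3} candidates, threshold={t:.2f}, gain={g:.1f}")
print(f"exact greedy : {len(exact)} candidates, threshold={exact_t:.2f}, gain={exact_gain:.1f}")
```

The quantile version evaluates far fewer thresholds, and its best gain can never exceed the exact search, which is the price paid for the speed-up.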

2. Parallel Learning & Weighted Quantile Sketch

The Quantile Sketch Algorithm combines the values from each computer to make an approximate histogram.

With ordinary quantiles, each quantile contains the same number of observations.



XGBoost is closely related to Gradient Boost. It stands for Extreme Gradient Boosting.



In contrast to AdaBoost, Gradient Boost starts with a single leaf instead of a tree or stump. This leaf represents an initial guess for the target value of all the samples.

For regression, this first guess is the average value of the variable we need to predict. The trees built afterwards usually have between 8 and 32 leaves. Gradient Boost scales all trees by the same amount (the learning rate).

1. Initialize the prediction with a single leaf.

2. Build a tree based on the errors (residuals).

When several residuals end up in the same leaf, we replace them with their average. Usually, the tree will overfit (low bias with high variance). …
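The steps described so far can be sketched in plain Python for the regression case. This is a simplified illustration under stated assumptions, not a reference implementation: it uses one feature, hypothetical toy data, and two-leaf stumps instead of the 8 to 32 leaves mentioned above, to keep the code short. The names `fit_stump` and `fit_gradient_boost` and the `learning_rate` parameter are my own.

```python
def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, residuals):
    """Fit a two-leaf tree on one feature; each leaf outputs the average residual."""
    best = None
    values = sorted(set(xs))
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_avg, right_avg = mean(left), mean(right)
        # Sum of squared errors if each leaf predicts its average residual.
        sse = (sum((r - left_avg) ** 2 for r in left)
               + sum((r - right_avg) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_avg, right_avg)
    _, threshold, left_avg, right_avg = best
    return lambda x: left_avg if x < threshold else right_avg

def fit_gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    # Step 1: start from a single leaf, the average of the target.
    initial_leaf = mean(ys)
    predictions = [initial_leaf] * len(ys)
    trees = []
    for _ in range(n_trees):
        # Step 2: build a tree on the current residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Every tree is scaled by the same learning rate.
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return initial_leaf + learning_rate * sum(tree(x) for tree in trees)

    return initial_leaf, predict

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.0, 1.1, 3.0, 3.2, 2.9, 5.1, 5.0]

initial_leaf, predict = fit_gradient_boost(xs, ys)
mse_initial = mean([(y - initial_leaf) ** 2 for y in ys])
mse_boosted = mean([(y - predict(x)) ** 2 for x, y in zip(xs, ys)])
print(f"MSE with the single leaf: {mse_initial:.3f}")
print(f"MSE after boosting:       {mse_boosted:.3f}")
```

Scaling each tree by a small learning rate means every tree takes only a small step toward the residuals, which is the usual way to counter the overfitting noted above.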



I had a difficult time finding materials on boosting techniques in machine learning, so I summarize all of them here. If you like it, please give it a thumbs-up or leave a comment.

Thank you!

There are three boosting techniques, and each of them is quite complex, so I decided to split them into four articles. Please subscribe and keep these notes if you can’t learn all of them in one day.

  1. AdaBoost (Adaptive Boosting)
  2. Gradient Boosting for Regression & Classification
  3. XGBoost for Regression and Classification
  4. XGBoost Optimization and Hyperparameter Tuning

This article is for AdaBoost.

To summarize:

1. AdaBoost combines a lot of stumps to make classifications or regressions.
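As an illustration of how stumps are combined, here is a minimal sketch of the standard AdaBoost procedure for binary classification in plain Python. The weight-update and "amount of say" formulas are the textbook ones rather than something taken from this summary. The function names and the ten-point toy dataset are assumptions chosen for illustration.

```python
import math

def stump_predict(stump, row):
    """A stump is (feature index, threshold, polarity); labels are +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if row[feature] < threshold else -polarity

def fit_stump(rows, labels, weights):
    """Pick the stump with the lowest weighted classification error."""
    best_stump, best_error = None, float("inf")
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for a, b in zip(values, values[1:]):
            threshold = (a + b) / 2
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                error = sum(w for row, y, w in zip(rows, labels, weights)
                            if stump_predict(stump, row) != y)
                if error < best_error:
                    best_stump, best_error = stump, error
    return best_stump, best_error

def fit_adaboost(rows, labels, n_stumps):
    n = len(rows)
    weights = [1.0 / n] * n  # every sample starts with the same weight
    ensemble = []
    for _ in range(n_stumps):
        stump, error = fit_stump(rows, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        amount_of_say = 0.5 * math.log((1 - error) / error)
        ensemble.append((amount_of_say, stump))
        # Increase the weights of misclassified samples, decrease the others.
        weights = [w * math.exp(-amount_of_say * y * stump_predict(stump, row))
                   for row, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Each stump votes, weighted by its amount of say."""
    score = sum(say * stump_predict(stump, row) for say, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single stump can separate these labels.
rows = [[x] for x in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = fit_adaboost(rows, labels, n_stumps=1)
three_stumps = fit_adaboost(rows, labels, n_stumps=3)
errors_one = sum(predict(one_stump, r) != y for r, y in zip(rows, labels))
errors_three = sum(predict(three_stumps, r) != y for r, y in zip(rows, labels))
print("misclassified with 1 stump: ", errors_one)
print("misclassified with 3 stumps:", errors_three)
```

Each stump on its own is a weak classifier, but the weighted vote of several stumps fits a pattern that no single threshold can.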



I am enrolled in a master's program in business analytics at Washington University in St. Louis. My track is customer analytics, which requires knowledge of both marketing and data science. In this article, I want to share my journey in business analytics and the tools & software for Mac users.

Before diving into this article, it helps if you have taken the basic courses required in a bachelor’s degree. Calculus, algebra, and statistics are not strictly necessary for business analytics, but they support the learning process.

Please leave a comment or give a thumbs-up if you find…


Over the past few days I have been dealing with a quite “raw” dataset about vegetarian restaurants. Below are several interesting insights from this dataset, but they are not the “main dish” of this article. Instead, I will explain how I cleaned and analyzed this raw dataset step by step.

Note: I did not cover the text classification and text normalization file here. You can explore it on your own.

Figure: Distribution among Cities

Sean Zhang
