I am a highly motivated data scientist with 4 years of experience and a Master’s degree in progress. I am skilled at
- Collecting, analyzing and interpreting large volumes of data
- Designing and implementing innovative solutions for complex business and research problems
- Extracting patterns from noisy data and communicating findings to a diverse audience
I am a critical thinker and I especially love tackling machine learning, natural language processing and statistical computing problems. Take a look at my past projects below!
Currently, I am looking for summer 2020 internships and full-time positions starting after December 2020.
Projects
Machine Learning and Statistical Modeling on Event Attendance in New York
Authors: Zhiyu Lin, Jiamin Zhong, Zihe Yang
What factors into people's decisions to attend events? We looked into Yelp event attendance patterns using classification models such as Random Forest, Decision Trees, SVMs, and Naive Bayes; regression models including linear and logistic regression; unsupervised learning methods such as K-Means clustering, topic modeling, association rule mining, and PCA dimensionality reduction; and hypothesis testing. Click on the link to see some interesting results and cool visualizations.
[code & data]
Natural Language Processing (NLP) & Machine Learning on Fraud Detection
Authors: Zhiyu Lin, Duo Cao
Online shopping websites are prone to scams. Many products claiming to be from trademarked brands are actually fake, causing confusion and legal issues. Could customer reviews help? Take Vans sneakers, for example: we scraped 17,000 customer reviews of Vans listings from taobao.com and built a text profile for the brand using Python packages (NLTK, Pandas, NumPy, Scikit-Learn). It turns out machine learning models can catch the fake sneakers with an 87% success rate.
[code & data]
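As a rough illustration of how review text can separate fake listings from genuine ones, here is a minimal bag-of-words Naive Bayes sketch in plain Python. It is not the project's pipeline: the blurb above does not say which models were used, the project itself relied on NLTK and Scikit-Learn, and the toy reviews, labels, and function names below are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would use a proper tokenizer.
    return text.lower().split()


def train_naive_bayes(labeled_reviews):
    """Count documents per label and words per label (multinomial Naive Bayes)."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_reviews:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy reviews, invented purely for illustration.
toy_reviews = [
    ("sole fell apart after a week", "fake"),
    ("logo looks wrong and glue smell", "fake"),
    ("comfortable and well made", "genuine"),
    ("great quality fits well", "genuine"),
]
model = train_naive_bayes(toy_reviews)
print(predict(model, "glue smell and logo wrong"))
print(predict(model, "great quality well made"))
```

With thousands of real reviews, the same idea scales by swapping the hand-rolled counting for a library vectorizer and classifier and by evaluating on held-out listings.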
Survival Analysis on Predicting the U.S. Supreme Court’s Decisions
Author: Zhiyu Lin
Advisors: Vittorio Addona, Patrick Schmidt
The U.S. Supreme Court follows the doctrine of “stare decisis”, meaning that it decides cases based on its prior decisions. This doctrine supports the Court’s accountability and legitimacy. Yet in the past 70 years, the Court has overruled almost 200 decisions (out of nearly 9,000 in total). When and why did this happen? Survival analysis models (which study the time until an event occurs) provided some statistical explanations, including how the justices balance the constraints of legal norms against their policy preferences.
[code & data]
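For readers new to survival analysis, the sketch below computes a Kaplan-Meier survival curve in plain Python. It is an assumption-laden illustration, not the project's model: the blurb above does not say which survival models were fitted, and the durations (years a precedent stood) and the overruled/still-standing flags are invented.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time an event was observed.

    durations: time each subject was followed.
    observed:  True if the event happened at that time, False if censored
               (for example, a precedent that is still standing).
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events that happen exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: years until a precedent was overruled (True)
# or until the end of the observation window (False, censored).
durations = [5, 10, 10, 20, 30]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for years, prob in curve:
    print(f"after {years} years: estimated survival {prob:.2f}")
```

Censored cases still count toward the at-risk group until they drop out, which is what lets this kind of model use decisions that have not been overruled yet.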
Model Comparison for Discrimination on National Origins
Authors: Zhiyu Lin, Ning Hu, Tianxing Jiang, Kuiyu Zhu
Algorithmic fairness has a profound impact on society as algorithms start to dictate decision-making in almost every industry. Biased crime prediction algorithms could discriminate against black and Latino people, and poorly designed employment algorithms could discriminate against women in the job market. What about national origin? Do machine learning algorithms discriminate socioeconomically against minority groups on the basis of national origin? It turns out they do! We looked into the mathematical reasons why.
[code & data]
Technology Patent Trends & Economic Prosperity
Authors: Zhiyu Lin, Jingjie Ma, Thomas P. Malejko, Douglas Post, Nuo Tian, Nuoya Wu
The study of patent trends was one of the first attempts to measure the amount of inventive effort in an industry. However, during the information age, patents have become less prominent for popular technology-based inventions. We looked at technology patent trends during the Dot-Com Bubble (1994-2000), the Dot-Com Bust (2001-2004), the Housing Bubble (mid-2000s) and the Great Recession (2008). Here are some interesting findings!
[code & data]
Homeless Service Facilities in Washington DC
Author: Zhiyu Lin
Washington DC has the highest rate of family homelessness, almost triple that of the second-highest state (New York). What are the service facilities, what do they provide, and where are they located? Check out this map I made!
[code & data]