~/yash-saini

I build analytics systems that turn messy data into decisions.

Data Analyst | BI Engineer | Building pipelines, analytics systems, and dashboards that drive decisions

I'm Yash Saini. I don't collect tools. I ship outcomes. At CUNY, I built pipelines across 26 institutions that turned weeks of reporting into minutes. At i3 Infosoft, I segmented 1M+ CRM records and the marketing team rewrote their entire quarterly strategy based on what the data showed. I'm finishing my MS in Business Analytics at Baruch (Zicklin) and looking for a team where the data actually changes something.

55% faster reporting
26 institutions served
37% accuracy gain
15+ KPI dashboards
Yash Saini, Data Analyst and BI Engineer based in New York City
// live_demonstration
What I'd Build At Your Company
Type any company name. I'll detect your industry, map data challenges, design an architecture, and draft a 90-day implementation plan. Every output is different. Try it twice.
company_intelligence.py
// the_point

Everything you just experienced was designed and built by me. The industry detection, the architecture generation, the tailored proposals. Each company gets a different output because the logic adapts, not because I memorized answers.

The platform is the proof. You just experienced my skills.

SQL & BigQuery · Python · ETL Pipelines · Data Modeling · BI Dashboards · A/B Testing · Forecasting · Web Development
// featured_case_study
MTA Turnstile Data Engineering
The full story of how I turned millions of messy transit records into a queryable, automated analytics system on Google Cloud.
From Raw MTA Files to Real-Time Transit Intelligence
END-TO-END PROJECT
1
The Problem
NYC's subway system has 472 stations and thousands of turnstiles, each recording cumulative entry/exit counts every 4 hours. The MTA publishes this data weekly as flat text files. But the raw data is riddled with problems: counter resets that produce negative values, duplicate timestamps, inconsistent station naming, and no documentation on edge cases. Analysts who want to answer basic questions ("Which stations are busiest on Friday evenings?") end up spending hours in Excel cleaning the same mess every time.
2
Data Scale & Challenges
Millions of rows per quarter. Each row represents one turnstile's reading at one timestamp. The cumulative counters mean you need to calculate deltas between readings, but counters reset unpredictably, creating massive negative spikes. Station names vary across files ("42 ST-TIMES SQ" vs "TIMES SQ-42 ST"). Some turnstiles report at non-standard intervals. All of this had to be handled before any analysis could begin.
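The station-name mismatches can be collapsed with a canonicalization step before aggregation. A minimal sketch of one way to do it, assuming token order is the only variation (the token-sort approach here is illustrative, not the production logic):

```python
def canonical_station(name: str) -> str:
    """Map reordered station spellings to one canonical key by
    sorting the hyphen-separated tokens of the uppercased name."""
    tokens = [t.strip() for t in name.upper().split("-")]
    return "-".join(sorted(tokens))

# Both MTA spellings collapse to the same key:
print(canonical_station("42 ST-TIMES SQ"))   # 42 ST-TIMES SQ
print(canonical_station("TIMES SQ-42 ST"))   # 42 ST-TIMES SQ
```

Grouping on the canonical key instead of the raw name keeps a station's ridership from being split across its spellings.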
3
Architecture & Pipeline
I designed a 5-layer cloud pipeline on Google Cloud Platform:
📥 MTA Raw Files -> 🔄 Kafka Stream -> ☁️ BigQuery -> ⚙️ dbt Models -> 📊 Analytics
Kafka handles streaming ingestion. BigQuery stores the warehouse. dbt manages transformations as version-controlled SQL models. Prefect orchestrates the whole thing on a schedule, so new weekly data flows through without manual intervention.
4
Key Transformation Logic
The hardest part was computing reliable ridership from cumulative counters. Here's the core logic:
-- Calculate per-turnstile ridership deltas
-- Filter out counter resets and anomalies
WITH deltas AS (
    SELECT
        station,
        unit,
        scp,
        datetime,
        entries - LAG(entries) OVER (
            PARTITION BY station, unit, scp
            ORDER BY datetime
        ) AS entry_delta
    FROM raw_turnstile
)
SELECT
    station,
    datetime,
    SUM(entry_delta) AS ridership
FROM deltas
WHERE entry_delta BETWEEN 0 AND 10000
GROUP BY station, datetime
The BETWEEN 0 AND 10000 filter catches counter resets (negative deltas) and anomalous spikes (more than 10K entries in a single 4-hour window is implausible for one turnstile).
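The same delta-and-filter logic is easy to spot-check in plain Python before trusting it in SQL. A small sketch with made-up readings, where the 10,000 cap mirrors the SQL BETWEEN bounds:

```python
def ridership_deltas(readings, cap=10_000):
    """Compute entry deltas between consecutive cumulative counter
    readings, dropping counter resets (negative deltas) and
    implausible spikes (deltas above the cap)."""
    deltas = []
    for prev, curr in zip(readings, readings[1:]):
        d = curr - prev
        if 0 <= d <= cap:  # same bounds as the SQL BETWEEN filter
            deltas.append(d)
    return deltas

# Cumulative counter with a mid-series reset (drops back to 12):
readings = [1000, 1150, 1420, 12, 180]
print(ridership_deltas(readings))  # [150, 270, 168] — the reset delta is discarded
```

The -1408 delta at the reset never reaches the sum, which is exactly what the WHERE clause does at warehouse scale.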
5
Impact & Insights
What used to take hours of manual Excel work now runs automatically in under 5 seconds. The pipeline processes millions of records end-to-end without intervention. Ridership patterns became instantly queryable: peak hours, seasonal trends, station comparisons, all accessible through simple SQL against a clean warehouse. The project demonstrates the full data engineering lifecycle: ingestion, streaming, warehousing, transformation, orchestration, and analytics.
// interactive_data_explorer
NYC Subway Ridership Explorer
Pick a station and day of week. Explore ridership patterns across all 24 hours. This is live data visualization built from the MTA pipeline above.
Ridership Heatmap
INTERACTIVE DEMO
// selected_work
Projects That Prove It
Click any project to see findings, interactive charts, and results explained for anyone to follow.
πŸ“Š
Customer Churn Prediction
Built a model that flags customers about to cancel with 84% accuracy. Connected it to a live dashboard so retention teams can act the same day, not the same quarter.
Python · Scikit-learn · SQL · Power BI
Click for findings & charts
// in plain english
Imagine running a subscription business where 1,000 customers are about to leave but you don't know which ones. This model identifies them before they cancel, so retention teams can reach out with the right offer at the right moment.
// key findings
🎯
84% accuracy identifying at-risk customers before they cancel
πŸ“‰
Month-to-month contracts churn at 3.5x the rate of annual subscribers
⏱️
62% of all churn happens within the first 6 months
πŸ’°
Estimated $140K annual savings if retention team acts on predictions
// feature importance
View code on GitHub
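The actual model is a scikit-learn classifier; as a flavor of how a fitted model turns customer features into an at-risk flag, here is a hand-rolled logistic scorer. The feature names, weights, and threshold are made up for illustration, not the trained coefficients:

```python
import math

# Illustrative coefficients — a real model learns these from data.
WEIGHTS = {"month_to_month": 1.3, "tenure_months": -0.08, "support_tickets": 0.4}
BIAS = -0.5

def churn_probability(customer: dict) -> float:
    """Logistic score: sigmoid of a weighted sum of features."""
    z = BIAS + sum(WEIGHTS[k] * customer.get(k, 0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def flag_at_risk(customers, threshold=0.5):
    """Return the IDs the retention team should call today."""
    return [c["id"] for c in customers if churn_probability(c) >= threshold]

customers = [
    {"id": "A", "month_to_month": 1, "tenure_months": 2,  "support_tickets": 3},
    {"id": "B", "month_to_month": 0, "tenure_months": 36, "support_tickets": 0},
]
print(flag_at_risk(customers))  # ['A']
```

The new month-to-month customer scores high; the long-tenured annual subscriber scores low, matching the churn patterns in the findings above.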
πŸ™οΈ
NYC Housing Equity Analysis
Used data from 4 city agencies to test whether low-income neighborhoods get slower responses to housing complaints. The answer: yes, measurably and consistently.
R · Quarto · Random Forest · Spatial Joins
Click for findings & charts
// in plain english
When you call 311 about broken heat or mold in NYC, how fast the city responds should not depend on your zip code's income level. This analysis found it does. Lower-income neighborhoods wait significantly longer, and the pattern holds even after controlling for complaint volume.
// key findings
πŸ“Š
R-squared = 0.54: income alone explains over half the variation in resolution time
⏱️
Low-income tracts wait 40% longer on average for complaint resolution
πŸ—ΊοΈ
4 public APIs combined with spatial joins across 100K+ complaints
πŸ”¬
Pattern is statistically significant even after adjusting for complaint type and volume
// income vs resolution time
View code on GitHub
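The analysis itself was done in R; to make the R-squared = 0.54 finding concrete, here is what that statistic computes, sketched in Python on toy numbers (the data below is invented for illustration):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    The share of variation in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1 - ss_res / ss_tot

# Toy example: complaint resolution times (days) vs a model's predictions.
actual    = [10, 14, 20, 25, 31]
predicted = [12, 13, 19, 27, 29]
print(round(r_squared(actual, predicted), 3))  # 0.95
```

An R-squared of 0.54 on income alone means a single neighborhood-level variable accounts for over half the spread in resolution times, which is why the finding is hard to dismiss.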
πŸ›’
Retail Analytics Pipeline
Took chaotic Excel sales data and turned it into a clean, queryable database with interactive dashboards. The full journey from raw mess to business insight.
Python · SQL · SQLite · Power BI
Click for findings & charts
// in plain english
Most small businesses have sales data trapped in messy spreadsheets nobody can read. I built a system that cleans, organizes, and visualizes this data, turning chaotic files into clear dashboards where you can instantly see what sells, what doesn't, and where to focus.
// key findings
🧹
Raw Excel to structured database with automated cleaning for nulls, duplicates, formatting
πŸ—οΈ
Normalized relational model with proper fact and dimension tables in SQLite
πŸ“ˆ
Interactive Power BI dashboard tracking sales trends, top products, and customer segments
// monthly revenue trend
View code on GitHub
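The fact/dimension shape from the findings above can be sketched end-to-end with Python's built-in sqlite3 module. Table and column names here are illustrative, not the project's actual schema:

```python
import sqlite3

# Minimal star schema: one dimension, one fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY, product_id INTEGER,
        sale_date TEXT, revenue REAL,
        FOREIGN KEY (product_id) REFERENCES dim_product (product_id));
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Desk Lamp", "Lighting"), (2, "Bookshelf", "Furniture")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-05", 40.0), (2, 1, "2024-02-11", 45.0),
                 (3, 2, "2024-02-12", 120.0)])

# Revenue by category — the kind of query a dashboard sits on top of.
for row in con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
"""):
    print(row)  # ('Furniture', 120.0) then ('Lighting', 85.0)
```

Once sales live in a fact table keyed to clean dimensions, "what sells and where to focus" becomes a GROUP BY instead of a spreadsheet hunt.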
// methodology
How I Approach Data Problems
Every project follows the same discipline. The tools change, the thinking doesn't.
🎯
01
Understand the Decision
What action will this analysis inform? Who needs it, and when? I start with the business question, not the data.
πŸ”
02
Investigate the Sources
Map every data source end-to-end. Where does it break? What's missing? What's inconsistent? This step saves weeks later.
πŸ”§
03
Build Reproducible Pipelines
No one-off scripts. Automated, validated, version-controlled pipelines that run without me babysitting them.
πŸ“Š
04
Analyze & Test Hypotheses
Segmentation, modeling, experimentation. Let the data answer the question, not confirm a hunch.
πŸš€
05
Deliver Decision-Ready Insights
Dashboards that stakeholders can use without analyst support. If the output needs a tour guide, it's not done.
// where_ive_built
Work Experience
Data Specialist Intern
Research Foundation of CUNY, New York
Aug 2025 - Present
55% faster reporting
37% accuracy gain
26 institutions
7+ sources unified
Built the SQL and BigQuery pipelines that consolidated fragmented reporting across all 26 CUNY institutions. Designed validation frameworks that caught data integrity issues two cycles before a major deadline. Automated ETL that gave operations teams self-serve dashboards for the first time.
Founder & Growth Analyst
Gharstuff, E-commerce
Feb 2020 - Present
65% reach growth
55% signups up
45% conversion lift
Built a category-based e-commerce platform from zero and ran it as a one-person data operation. Designed customer acquisition cohorts to track which referral channels produced repeat buyers vs. one-time visitors. A/B tested every paid and organic campaign, iterating weekly on messaging based on behavioral data. Built referral funnel analytics that identified drop-off points, then restructured the flow to increase sign-ups 55%. Improved conversion rates 45% through audience segmentation. Every decision was tied to a number because there was no budget for guessing.
Data Analyst
i3 Infosoft Pvt. Ltd., Noida, India
Sep 2022 - Jul 2024
1M+ CRM records
18% engagement lift
20% forecast accuracy
15+ KPI dashboards
Ran A/B experiments and built Python segmentation models across the entire CRM. The 18% engagement lift became the basis for the next quarter's channel strategy. Replaced manual reporting with automated SQL pipelines. Built time-series forecasting that improved inventory planning by 20%.
// how_i_work
Skills, With Receipts
No percentages. Here's what I've actually shipped with each tool.
SQL & BigQuery
Built pipelines serving 26 institutions. Migrated 7+ fragmented sources into normalized structures. Cut query runtimes 25%.
Python
Segmented 1M+ CRM records. Built time-series forecasting. Automated ETL replacing hours of manual work.
Power BI & Tableau
Designed dashboards tracking 15+ KPIs that moved ops from weekly reports to same-day visibility.
GCP & Cloud
Deployed unified reporting layer on Google Cloud. Built BigQuery warehouse for MTA analytics.
R Programming
Spatial analysis across 100K+ NYC records from 4 public APIs. Random Forest with R-squared 0.54.
A/B Testing
Designed experiments producing 18% engagement lift. Ran campaign optimization across organic and paid.
ETL & dbt
Orchestrated pipelines with Prefect and dbt. Kafka streaming for real-time ingestion.
Data Governance
Built validation frameworks catching integrity issues before deadlines. Standardized metrics across 26 institutions.
// education
Where I Studied
PLINER BLOOM SCHOLAR
Baruch College, Zicklin School of Business
M.S. in Business Analytics - Expected May 2026
GPA 3.6/4.0. The Pliner Bloom Scholarship is awarded to select graduate students for academic merit. Coursework includes Data Mining, Database Management, Applied NLP, Statistical Modeling, and R Programming.
Amity University
M.Com in Marketing, 2020-2022
Kurukshetra University
B.Com, 2016-2019
// currently_exploring
What I'm Building & Learning Right Now
dbt testing frameworks for data quality at scale
Airflow orchestration for production ML pipelines
Streaming analytics with Kafka + Flink
Generative AI applied to data analysis workflows
Cloud cost optimization for analytics workloads
This portfolio (it's a project too)
// let's_talk

I'd rather show you than tell you.

Open to Data Analyst, BI Analyst, and Analytics Engineer roles beginning May 2026. If you've scrolled this far, you already know how I think, what I build, and what I care about. Let's talk about what I can do for your team.

Graduating Baruch College MSBA, May 2026 | Based in New York City
Email GitHub LinkedIn Resume PDF