~/yash-saini

I build analytics systems that turn messy data into decisions.

Data Analyst | BI Engineer | Building pipelines, analytics systems, and dashboards that drive decisions

I'm Yash Saini. I don't collect tools. I ship outcomes. At CUNY, I built pipelines across 26 institutions that turned weeks of reporting into minutes. At i3 Infosoft, I segmented 1M+ CRM records and the marketing team rewrote their entire quarterly strategy based on what the data showed. I'm finishing my MS in Business Analytics at Baruch (Zicklin) and looking for a team where the data actually changes something.

55% faster reporting
26 institutions served
37% accuracy gain
15+ KPI dashboards
Yash Saini, Data Analyst and BI Engineer based in New York City
// live_demonstration
What I'd Build At Your Company
Type any company name. I'll detect your industry, map data challenges, design an architecture, and draft a 90-day implementation plan. Every output is different. Try it twice.
company_intelligence.py
// the_point

Everything you just experienced was designed and built by me. The industry detection, the architecture generation, the tailored proposals. Each company gets a different output because the logic adapts, not because I memorized answers.

The platform is the proof. You just experienced my skills.

SQL & BigQuery · Python · ETL Pipelines · Data Modeling · BI Dashboards · A/B Testing · Forecasting · Web Development
// featured_case_study
MTA Turnstile Data Engineering
The full story of how I turned millions of messy transit records into a queryable, automated analytics system on Google Cloud.
From Raw MTA Files to Real-Time Transit Intelligence
END-TO-END PROJECT
1
The Problem
NYC's subway system has 472 stations and thousands of turnstiles, each recording cumulative entry/exit counts every 4 hours. The MTA publishes this data weekly as flat text files. But the raw data is riddled with problems: counter resets that produce negative values, duplicate timestamps, inconsistent station naming, and no documentation on edge cases. Analysts who want to answer basic questions ("Which stations are busiest on Friday evenings?") end up spending hours in Excel cleaning the same mess every time.
2
Data Scale & Challenges
Millions of rows per quarter. Each row represents one turnstile's reading at one timestamp. The cumulative counters mean you need to calculate deltas between readings, but counters reset unpredictably, creating massive negative spikes. Station names vary across files ("42 ST-TIMES SQ" vs "TIMES SQ-42 ST"). Some turnstiles report at non-standard intervals. All of this had to be handled before any analysis could begin.
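The station-name mismatches can be collapsed with a canonicalization step before aggregation. A minimal sketch of one way to do it, assuming token order is the only variation (the token-sort approach here is illustrative, not the production logic):

```python
def canonical_station(name: str) -> str:
    """Map reordered station spellings to one canonical key by
    sorting the hyphen-separated tokens of the uppercased name."""
    tokens = [t.strip() for t in name.upper().split("-")]
    return "-".join(sorted(tokens))

# Both MTA spellings collapse to the same key:
print(canonical_station("42 ST-TIMES SQ"))   # 42 ST-TIMES SQ
print(canonical_station("TIMES SQ-42 ST"))   # 42 ST-TIMES SQ
```

Grouping on the canonical key instead of the raw name keeps a station's ridership from being split across its spellings.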
3
Architecture & Pipeline
I designed a 5-layer cloud pipeline on Google Cloud Platform:
📥 MTA Raw Files -> 🔄 Kafka Stream -> ☁️ BigQuery -> ⚙️ dbt Models -> 📊 Analytics
Kafka handles streaming ingestion. BigQuery stores the warehouse. dbt manages transformations as version-controlled SQL models. Prefect orchestrates the whole thing on a schedule, so new weekly data flows through without manual intervention.
4
Key Transformation Logic
The hardest part was computing reliable ridership from cumulative counters. Here's the core logic:
-- Calculate per-turnstile ridership deltas
-- Filter out counter resets and anomalies
WITH deltas AS (
    SELECT
        station,
        unit,
        scp,
        datetime,
        entries - LAG(entries) OVER (
            PARTITION BY station, unit, scp
            ORDER BY datetime
        ) AS entry_delta
    FROM raw_turnstile
)
SELECT
    station,
    datetime,
    SUM(entry_delta) AS ridership
FROM deltas
WHERE entry_delta BETWEEN 0 AND 10000
GROUP BY station, datetime
The BETWEEN 0 AND 10000 filter catches counter resets (negative deltas) and anomalous spikes (more than 10K entries in a single 4-hour window is implausible for one turnstile).
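The same delta-and-filter logic is easy to spot-check in plain Python before trusting it in SQL. A small sketch with made-up readings, where the 10,000 cap mirrors the SQL BETWEEN bounds:

```python
def ridership_deltas(readings, cap=10_000):
    """Compute entry deltas between consecutive cumulative counter
    readings, dropping counter resets (negative deltas) and
    implausible spikes (deltas above the cap)."""
    deltas = []
    for prev, curr in zip(readings, readings[1:]):
        d = curr - prev
        if 0 <= d <= cap:  # same bounds as the SQL BETWEEN filter
            deltas.append(d)
    return deltas

# Cumulative counter with a mid-series reset (drops back to 12):
readings = [1000, 1150, 1420, 12, 180]
print(ridership_deltas(readings))  # [150, 270, 168] — the reset delta is discarded
```

The -1408 delta at the reset never reaches the sum, which is exactly what the WHERE clause does at warehouse scale.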
5
Impact & Insights
What used to take hours of manual Excel work now runs automatically in under 5 seconds. The pipeline processes millions of records end-to-end without intervention. Ridership patterns became instantly queryable: peak hours, seasonal trends, station comparisons, all accessible through simple SQL against a clean warehouse. The project demonstrates the full data engineering lifecycle: ingestion, streaming, warehousing, transformation, orchestration, and analytics.
// interactive_data_explorer
NYC Subway Ridership Explorer
Pick a station and day of week. Explore ridership patterns across all 24 hours. This is live data visualization built from the MTA pipeline above.
Ridership Heatmap
INTERACTIVE DEMO
// selected_work
Projects That Prove It
Click any project to see findings, interactive charts, and results explained for anyone to follow.
πŸ“Š
Customer Churn Prediction
Built a model that flags customers about to cancel with 84% accuracy. Connected it to a live dashboard so retention teams can act the same day, not the same quarter.
Python · Scikit-learn · SQL · Power BI
Click for findings & charts
// in plain english
Imagine running a subscription business where 1,000 customers are about to leave but you don't know which ones. This model identifies them before they cancel, so retention teams can reach out with the right offer at the right moment.
// key findings
🎯
84% accuracy identifying at-risk customers before they cancel
πŸ“‰
Month-to-month contracts churn at 3.5x the rate of annual subscribers
⏱️
62% of all churn happens within the first 6 months
πŸ’°
Estimated $140K annual savings if retention team acts on predictions
// feature importance
View code on GitHub
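The actual model is a scikit-learn classifier; as a flavor of how a fitted model turns customer features into an at-risk flag, here is a hand-rolled logistic scorer. The feature names, weights, and threshold are made up for illustration, not the trained coefficients:

```python
import math

# Illustrative coefficients — a real model learns these from data.
WEIGHTS = {"month_to_month": 1.3, "tenure_months": -0.08, "support_tickets": 0.4}
BIAS = -0.5

def churn_probability(customer: dict) -> float:
    """Logistic score: sigmoid of a weighted sum of features."""
    z = BIAS + sum(WEIGHTS[k] * customer.get(k, 0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def flag_at_risk(customers, threshold=0.5):
    """Return the IDs the retention team should call today."""
    return [c["id"] for c in customers if churn_probability(c) >= threshold]

customers = [
    {"id": "A", "month_to_month": 1, "tenure_months": 2,  "support_tickets": 3},
    {"id": "B", "month_to_month": 0, "tenure_months": 36, "support_tickets": 0},
]
print(flag_at_risk(customers))  # ['A']
```

The new month-to-month customer scores high; the long-tenured annual subscriber scores low, matching the churn patterns in the findings above.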
πŸ™οΈ
NYC Housing Equity Analysis
Used data from 4 city agencies to test whether low-income neighborhoods get slower responses to housing complaints. The answer: yes, measurably and consistently.
R · Quarto · Random Forest · Spatial Joins
Click for findings & charts
// in plain english
When you call 311 about broken heat or mold in NYC, how fast the city responds should not depend on your zip code's income level. This analysis found it does. Lower-income neighborhoods wait significantly longer, and the pattern holds even after controlling for complaint volume.
// key findings
πŸ“Š
R-squared = 0.54: income alone explains over half the variation in resolution time
⏱️
Low-income tracts wait 40% longer on average for complaint resolution
πŸ—ΊοΈ
4 public APIs combined with spatial joins across 100K+ complaints
πŸ”¬
Pattern is statistically significant even after adjusting for complaint type and volume
// income vs resolution time
View code on GitHub
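The analysis itself was done in R; to make the R-squared = 0.54 finding concrete, here is what that statistic computes, sketched in Python on toy numbers (the data below is invented for illustration):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    The share of variation in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))
    return 1 - ss_res / ss_tot

# Toy example: complaint resolution times (days) vs a model's predictions.
actual    = [10, 14, 20, 25, 31]
predicted = [12, 13, 19, 27, 29]
print(round(r_squared(actual, predicted), 3))  # 0.95
```

An R-squared of 0.54 on income alone means a single neighborhood-level variable accounts for over half the spread in resolution times, which is why the finding is hard to dismiss.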
πŸ›’
Retail Analytics Pipeline
Took chaotic Excel sales data and turned it into a clean, queryable database with interactive dashboards. The full journey from raw mess to business insight.
Python · SQL · SQLite · Power BI
Click for findings & charts
// in plain english
Most small businesses have sales data trapped in messy spreadsheets nobody can read. I built a system that cleans, organizes, and visualizes this data, turning chaotic files into clear dashboards where you can instantly see what sells, what doesn't, and where to focus.
// key findings
🧹
Raw Excel to structured database with automated cleaning for nulls, duplicates, formatting
πŸ—οΈ
Normalized relational model with proper fact and dimension tables in SQLite
πŸ“ˆ
Interactive Power BI dashboard tracking sales trends, top products, and customer segments
// monthly revenue trend
View code on GitHub
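The fact/dimension shape from the findings above can be sketched end-to-end with Python's built-in sqlite3 module. Table and column names here are illustrative, not the project's actual schema:

```python
import sqlite3

# Minimal star schema: one dimension, one fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY, product_id INTEGER,
        sale_date TEXT, revenue REAL,
        FOREIGN KEY (product_id) REFERENCES dim_product (product_id));
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Desk Lamp", "Lighting"), (2, "Bookshelf", "Furniture")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-05", 40.0), (2, 1, "2024-02-11", 45.0),
                 (3, 2, "2024-02-12", 120.0)])

# Revenue by category — the kind of query a dashboard sits on top of.
for row in con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
"""):
    print(row)  # ('Furniture', 120.0) then ('Lighting', 85.0)
```

Once sales live in a fact table keyed to clean dimensions, "what sells and where to focus" becomes a GROUP BY instead of a spreadsheet hunt.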
// methodology
How I Approach Data Problems
Every project follows the same discipline. The tools change, the thinking doesn't.
🎯
01
Understand the Decision
What action will this analysis inform? Who needs it, and when? I start with the business question, not the data.
πŸ”
02
Investigate the Sources
Map every data source end-to-end. Where does it break? What's missing? What's inconsistent? This step saves weeks later.
πŸ”§
03
Build Reproducible Pipelines
No one-off scripts. Automated, validated, version-controlled pipelines that run without me babysitting them.
πŸ“Š
04
Analyze & Test Hypotheses
Segmentation, modeling, experimentation. Let the data answer the question, not confirm a hunch.
πŸš€
05
Deliver Decision-Ready Insights
Dashboards that stakeholders can use without analyst support. If the output needs a tour guide, it's not done.
// where_ive_built
Work Experience
Data Specialist Intern
Research Foundation of CUNY, New York
Aug 2025 - Present
55% faster reporting
37% accuracy gain
26 institutions
7+ sources unified
Built the SQL and BigQuery pipelines that consolidated fragmented reporting across all 26 CUNY institutions. Designed validation frameworks that caught data integrity issues two cycles before a major deadline. Automated ETL that gave operations teams self-serve dashboards for the first time.
Founder & Growth Analyst
Gharstuff, E-commerce
Feb 2020 - Present
65% reach growth
55% signups up
45% conversion lift
Built a category-based e-commerce platform from zero and ran it as a one-person data operation. Designed customer acquisition cohorts to track which referral channels produced repeat buyers vs. one-time visitors. A/B tested every paid and organic campaign, iterating weekly on messaging based on behavioral data. Built referral funnel analytics that identified drop-off points, then restructured the flow to increase sign-ups 55%. Improved conversion rates 45% through audience segmentation. Every decision was tied to a number because there was no budget for guessing.
Data Analyst
i3 Infosoft Pvt. Ltd., Noida, India
Sep 2022 - Jul 2024
1M+ CRM records
18% engagement lift
20% forecast accuracy
15+ KPI dashboards
Ran A/B experiments and built Python segmentation models across the entire CRM. The 18% engagement lift became the basis for the next quarter's channel strategy. Replaced manual reporting with automated SQL pipelines. Built time-series forecasting that improved inventory planning by 20%.
// how_i_work
Skills, With Receipts
No percentages. Here's what I've actually shipped with each tool.
SQL & BigQuery
Built pipelines serving 26 institutions. Migrated 7+ fragmented sources into normalized structures. Cut query runtimes 25%.
Python
Segmented 1M+ CRM records. Built time-series forecasting. Automated ETL replacing hours of manual work.
Power BI & Tableau
Designed dashboards tracking 15+ KPIs that moved ops from weekly reports to same-day visibility.
GCP & Cloud
Deployed unified reporting layer on Google Cloud. Built BigQuery warehouse for MTA analytics.
R Programming
Spatial analysis across 100K+ NYC records from 4 public APIs. Random Forest with R-squared 0.54.
A/B Testing
Designed experiments producing 18% engagement lift. Ran campaign optimization across organic and paid.
ETL & dbt
Orchestrated pipelines with Prefect and dbt. Kafka streaming for real-time ingestion.
Data Governance
Built validation frameworks catching integrity issues before deadlines. Standardized metrics across 26 institutions.
// education
Where I Studied
PLINER BLOOM SCHOLAR
Baruch College, Zicklin School of Business
M.S. in Business Analytics - Expected May 2026
GPA 3.6/4.0. The Pliner Bloom Scholarship is awarded to select graduate students for academic merit. Coursework includes Data Mining, Database Management, Applied NLP, Statistical Modeling, and R Programming.
Amity University
M.Com in Marketing, 2020-2022
Kurukshetra University
B.Com, 2016-2019
// currently_exploring
What I'm Building & Learning Right Now
dbt testing frameworks for data quality at scale
Airflow orchestration for production ML pipelines
Streaming analytics with Kafka + Flink
Generative AI applied to data analysis workflows
Cloud cost optimization for analytics workloads
This portfolio (it's a project too)
// let's_talk

I'd rather show you than tell you.

Open to Data Analyst, BI Analyst, and Analytics Engineer roles beginning May 2026. If you've scrolled this far, you already know how I think, what I build, and what I care about. Let's talk about what I can do for your team.

Graduating Baruch College MSBA, May 2026 | Based in New York City
Email GitHub LinkedIn Resume PDF