Basketball AI: Player Tracking, Team Detection, and Number Recognition with Python

How to Build a Basketball Play Analysis System

Overview of the System

  • The video presents a step-by-step guide on building an AI system for analyzing basketball plays, using RF-DETR, SAM 2, and SmolVLM2.
  • The challenge lies in the fast-paced nature of basketball, where players often overlap and jersey numbers can be difficult to read due to blurriness and occlusion.

Components of the System

  • The system integrates several advanced open-source models into one pipeline for robust player identification, movement tracking, and shot classification.
  • Key components include RF-DETR for object detection, SAM 2 for pixel-level tracking, and an unsupervised clustering pipeline using SigLIP, UMAP, and K-means for team identification.

Player Tracking and Jersey Recognition

  • After tracking players, the next step is recognizing jersey numbers by fine-tuning SmolVLM2 for Optical Character Recognition (OCR).
  • A custom model is trained to detect 33 characteristic points on the court to accurately map player positions from video frames onto a standardized court layout.

Data Preparation and Model Training

  • The build process took approximately 1,000 hours. Object detection is performed using RF-DETR, which balances speed and accuracy effectively.
  • The dataset consists of 10 classes including player actions like dribbling or shooting as well as ball-related classes that help determine play outcomes.

Fine-Tuning Process

  • Images are sourced from 1080p video clips resized to fit model input requirements without aggressive augmentations typical in YOLO training.
  • Fine-tuning occurs directly on the Roboflow platform; all models used are pre-trained and available for public use, with links provided in the description.

Inference Setup

  • After training completes, models are deployed via Google Colab, which simplifies hardware setup during inference.
  • To run inference on video frames, users import the necessary packages from Roboflow Universe while managing API keys securely within the Colab environment.

Visualizing Results

  • Frame extraction utilizes supervision's frame generator; predictions are made by calling the infer method with specified confidence thresholds.
  • Visualization involves creating annotators that apply color-coded boxes around detected objects based on class IDs; filtering options allow focusing on specific detections like jersey numbers or player actions.
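
A minimal sketch of this loop, assuming the model is served through the inference package and drawn with supervision; the model ID, video path, and confidence threshold are placeholders:

```python
import supervision as sv
from inference import get_model

# Placeholder model ID -- substitute your fine-tuned RF-DETR checkpoint.
model = get_model(model_id="basketball-player-detection/1")

box_annotator = sv.BoxAnnotator()
label_annotator = sv.LabelAnnotator()

for frame in sv.get_video_frames_generator("basketball_clip.mp4"):
    # Run detection at a fixed confidence threshold.
    result = model.infer(frame, confidence=0.3)[0]
    detections = sv.Detections.from_inference(result)

    # Draw color-coded boxes and class labels for each detection.
    annotated = box_annotator.annotate(scene=frame.copy(), detections=detections)
    annotated = label_annotator.annotate(scene=annotated, detections=detections)
```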

Tracking Objects in Sports Videos with SAM 2

Overview of Object Tracking Techniques

  • The process begins by using NumPy's isin function to filter detections based on specific class IDs, which will be used as input for the tracking step with the Segment Anything Model 2 (SAM 2); see the sketch after this list.
  • Traditional trackers like SORT utilize bounding boxes across frames and employ a Kalman filter to predict object motion, matching new detections through Intersection over Union (IoU) overlap.
  • Advanced trackers such as Deep SORT or ByteTrack incorporate visual embeddings from object crops to enhance tracking accuracy during challenging conditions like occlusion or rapid movement.
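
A sketch of the filtering step, assuming a supervision Detections object and hypothetical class IDs:

```python
import numpy as np

PLAYER_CLASS_IDS = [3, 4]  # hypothetical IDs for the player-related classes

# Boolean mask per detection; supervision Detections support mask indexing.
keep = np.isin(detections.class_id, PLAYER_CLASS_IDS)
player_detections = detections[keep]
```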

Advantages of SAM 2 in Sports Tracking

  • SAM 2 excels in sports environments by capturing pixel-level details of objects, including shape, color, and texture, stored as high-dimensional visual embeddings. This allows it to track players even amidst fast movements or overlaps.
  • An initial prompt (point or bounding box) is required for SAM 2 to start tracking; this can be automated using detections from the RF-DETR model without manual intervention.

Implementation Details

  • SAM 2 offers four sizes ranging from a lightweight version with approximately 39 million parameters to a large version with over 220 million parameters. The large version is preferred for better tracking quality despite slower performance.
  • To load the model, import the build_sam2_camera_predictor function and provide paths for the checkpoint and configuration file. The inference code follows a straightforward three-step structure.
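
A loading sketch; the function comes from the real-time SAM 2 camera-predictor variant, and the config and checkpoint paths for the large model are assumptions:

```python
from sam2.build_sam import build_sam2_camera_predictor

# Config and checkpoint paths for the large variant (placeholders).
predictor = build_sam2_camera_predictor(
    "sam2_hiera_l.yaml",
    "checkpoints/sam2_hiera_large.pt",
)
```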

Inference Process Breakdown

  • The first step involves running inference on an initial video frame using RF-DETR to detect players and assign tracker IDs for linking with SAM 2.
  • Each detection prompts initialization via the predictor's add_new_prompt method, where frame index, tracker ID, and bounding box coordinates are passed.
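
A sketch of that initialization, assuming the real-time predictor's load_first_frame and add_new_prompt methods; the exact signatures may differ:

```python
# Prime the predictor with the first frame, then register one prompt
# per RF-DETR detection so SAM 2 tracks each player separately.
predictor.load_first_frame(first_frame)

for tracker_id, xyxy in zip(detections.tracker_id, detections.xyxy):
    predictor.add_new_prompt(
        frame_idx=0,
        obj_id=int(tracker_id),
        bbox=xyxy,  # (x_min, y_min, x_max, y_max)
    )
```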

Handling Segmentation Errors

  • During high-action basketball videos, segmentation errors may occur where masks consist of disconnected regions. These artifacts can negatively impact subsequent pipeline stages like team clustering.
  • A cleanup function is applied post-segmentation to retain only significant mask segments while removing smaller ones that are too far apart. This enhances stability and reliability for downstream tasks.
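
The exact cleanup logic is not spelled out here; a simplified sketch based on connected components, where the area ratio is an assumption:

```python
import cv2
import numpy as np

def clean_mask(mask: np.ndarray, min_area_ratio: float = 0.1) -> np.ndarray:
    """Keep the largest connected component plus any component whose area is
    at least min_area_ratio of it; drop small, detached fragments."""
    num, labels, stats, _ = cv2.connectedComponentsWithStats(
        mask.astype(np.uint8), connectivity=8
    )
    if num <= 1:  # background only, nothing to clean
        return mask
    areas = stats[1:, cv2.CC_STAT_AREA]  # skip background label 0
    keep = np.flatnonzero(areas >= min_area_ratio * areas.max()) + 1
    return np.isin(labels, keep)
```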

Unsupervised Learning Approach for Team Separation

  • Given variability in teams' appearances across games (e.g., different uniforms), training a supervised model requires extensive data labeling which may not generalize well.
  • An unsupervised learning approach is adopted: extracting one frame per second from sample videos and detecting players using RF-DETR before cropping their images for further processing.

Dimensionality Reduction and Clustering

  • High-dimensional SigLIP embeddings obtained from player crops are reduced to three dimensions using UMAP, which eases visualization while preserving the neighborhood relationships that clustering depends on.
  • K-means clustering is then applied on these three-dimensional embeddings as an effective method for grouping similar data points together.
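
A minimal sketch of the reduction and clustering, assuming the SigLIP embeddings are already computed as an (N, D) array:

```python
import umap
from sklearn.cluster import KMeans

# embeddings: SigLIP features for all player crops, shape (N, D).
reducer = umap.UMAP(n_components=3)
projected = reducer.fit_transform(embeddings)   # (N, 3)

kmeans = KMeans(n_clusters=2, n_init=10)
team_ids = kmeans.fit_predict(projected)        # cluster 0 or 1 per crop
```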

Team Classification in Sports Analytics

Overview of Team Classification Algorithm

  • The algorithm partitions data into two groups, anticipating two teams per game based on player images. If successful, cropped player images will separate into distinct clusters corresponding to each team based on uniform color and visual features.

Implementation of the Team Classifier

  • A class named TeamClassifier encapsulates the entire SigLIP, UMAP, and K-means pipeline, facilitating easy reuse within an open-source sports repository. To execute the pipeline, instantiate a TeamClassifier object and call its fit method with a list of player crops.
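
Usage roughly as described, assuming the import path from the roboflow/sports repository:

```python
from sports.common.team import TeamClassifier

# crops: list of player crops (numpy images) collected from sample frames.
team_classifier = TeamClassifier(device="cuda")
team_classifier.fit(crops)

team_ids = team_classifier.predict(crops)  # 0 or 1 for each crop
```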

Testing the Team Classifier

  • Training is quick, taking only seconds. After training, predictions are made to assign each crop to one of two clusters (zero or one). This allows for identification of players belonging to the same team without knowing their actual names initially.

Mapping Clusters to Real Teams

  • Clusters are mapped to real team names through external input; for instance, zero can represent Boston Celtics and one can represent New York Knicks. This mapping enables visualization of players in colors that correspond with their uniforms.
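
The mapping itself can be a plain dictionary; the names and colors below are illustrative:

```python
# Cluster IDs are anonymous; external knowledge supplies the real labels.
TEAM_NAMES = {0: "Boston Celtics", 1: "New York Knicks"}
TEAM_COLORS = {0: "#007A33", 1: "#F58426"}  # illustrative uniform colors
```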

Advanced Model Integration

  • A model called Qwen3-VL, which can identify NBA teams from groups of players, is mentioned but was excluded for simplicity's sake. The problem remains solvable using open-source models, as referenced in a linked notebook about open-vocabulary object detection with Qwen3-VL.

Tracking Players and Visualizing Results

Player Tracking Methodology

  • The tracking process utilizes SAM 2 alongside assigning tracker IDs and team IDs to detections. Since tracks from SAM 2 remain stable over time, team clustering is performed once while referencing stored assignments across video frames thereafter.

Visualization Techniques

  • Box annotator and mask annotator are employed to color-code players according to their assigned teams by splitting masks by team ID within a callback function—this results in clear visualizations indicating which players belong to which teams across all frames.

Jersey Number Recognition Challenges

Complexity in Reading Jersey Numbers

  • Despite seeming straightforward, reading jersey numbers poses significant challenges due to factors like size, occlusion, lighting variations, motion blur, and fabric deformation during movement—all complicating recognition efforts beyond typical document OCR capabilities.

Utilizing SmolVLM2 for OCR Tasks

  • SmolVLM2 is introduced as a compact vision-language model suitable for OCR tasks; despite being pre-trained mainly on document-style text, it performs surprisingly well on basketball jersey numbers, achieving an initial accuracy of 56%. However, this was insufficient for reliable recognition due to inaccuracies such as impossible number outputs like "011" or "3000."

Fine-Tuning SmolVLM2

Data Preparation for Fine-Tuning

  • To improve performance, SmolVLM2 was fine-tuned on a custom dataset of cropped jersey-number images, auto-annotated by the base model itself from an object detection dataset filtered to the number classes. The result is a multimodal OCR dataset of approximately 3,600 images, resized appropriately for training.

Training Process Overview

  • The training setup involves selecting SmolVLM2 from the available models, confirming the choice, and initiating training from a public checkpoint. Progress is monitored via live accuracy and loss charts directly within Google Colab; on completion, test accuracy improves to 86%.

Understanding Intersection over Smaller Area (IoS) in Sports Analytics

Introduction to IoS and IoU

  • The concept of intersection over smaller area (IoS) is introduced; it is similar to intersection over union (IoU), but normalizes the overlap by the area of the smaller mask rather than by the union.
  • An IoS value of one indicates that a number mask lies entirely within a player mask, suggesting that the number belongs to that player.
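
A direct translation of that definition into code:

```python
import numpy as np

def intersection_over_smaller(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoS = |A intersect B| / min(|A|, |B|). A value of 1.0 means the smaller
    mask (the jersey number) lies entirely inside the larger one (the player)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    smaller = min(mask_a.sum(), mask_b.sum())
    return float(intersection) / smaller if smaller > 0 else 0.0
```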

Processing Player and Number Masks

  • The process involves converting number boxes into masks using their xyxy coordinates and then computing batched mask overlaps between player masks and number masks.
  • A players-by-numbers matrix is generated, containing overlap scores based on the IoS metric.
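
Building on the helper above, a sketch of the box-to-mask conversion and the resulting matrix; player_masks (from SAM 2) and number_detections are assumed variables:

```python
def boxes_to_masks(xyxy: np.ndarray, shape: tuple) -> np.ndarray:
    """Rasterize (N, 4) xyxy boxes into (N, H, W) boolean masks."""
    masks = np.zeros((len(xyxy), *shape), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(xyxy.astype(int)):
        masks[i, y1:y2, x1:x2] = True
    return masks

number_masks = boxes_to_masks(number_detections.xyxy, frame.shape[:2])

# (players x numbers) matrix of IoS scores; a column's argmax assigns
# that jersey number to the player whose mask contains it most completely.
ios_matrix = np.array([
    [intersection_over_smaller(p, n) for n in number_masks]
    for p in player_masks
])
```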

Challenges in Frame Matching

  • It’s noted that single frames often do not provide matches for all players; for instance, only five out of ten players might be matched due to visibility issues.
  • Even when crops appear good, small variations can lead to incorrect values being returned from SmolVLM2.

Multi-frame Association Strategy

  • To improve accuracy, number association relies on multiple frames rather than just one.
  • A Python dictionary containing player names and numbers for both teams is created to facilitate lookup during tracking.
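
A sketch of the multi-frame voting and roster lookup; the roster entries and variable names are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Accumulate OCR reads per tracker ID across frames, then take the mode.
reads = defaultdict(Counter)
reads[tracker_id][predicted_number] += 1  # inside the per-frame loop

ROSTER = {  # illustrative entries only
    "Boston Celtics": {"0": "Jayson Tatum", "7": "Jaylen Brown"},
    "New York Knicks": {"11": "Jalen Brunson"},
}

number, _ = reads[tracker_id].most_common(1)[0]
player_name = ROSTER[team_name].get(number, "unknown")
```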

Mapping Players' Positions Using Homography

Understanding Homography

  • Homography is explained as a method for mapping points between two flat surfaces—specifically from camera views of the court to a top-down layout.
  • This transformation utilizes a 3x3 matrix known as the homography matrix, which requires at least four pairs of corresponding points for calculation.
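
The idea in OpenCV, with illustrative coordinates mapping four frame pixels onto a 28 by 15 meter FIBA court:

```python
import cv2
import numpy as np

# Four corresponding point pairs: frame pixels -> court coordinates (meters).
frame_points = np.array(
    [[212, 80], [1024, 76], [998, 512], [240, 540]], dtype=np.float32
)
court_points = np.array(
    [[0, 0], [28, 0], [28, 15], [0, 15]], dtype=np.float32
)

homography, _ = cv2.findHomography(frame_points, court_points)

# Project an arbitrary frame point (e.g., a player's feet) onto the court.
player = np.array([[[640, 400]]], dtype=np.float32)
court_xy = cv2.perspectiveTransform(player, homography)
```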

Dynamic Camera Challenges

  • The challenge arises from constantly moving cameras that change visible reference points on the court.
  • Keypoint detection models are introduced as solutions for automatically identifying landmarks across frames without manual marking.

Keypoint Detection Model Implementation

Defining Key Points

  • A total of 33 key points are defined covering essential features such as corners, baselines, center circle, paint area, arcs, and basket locations.
  • Each point serves as a homography reference with its own index; labeling these points was identified as time-consuming yet crucial.

Data Preparation and Augmentation

  • The dataset preparation involved annotating 850 images in Roboflow while defining flip indexes for symmetry during horizontal flip augmentation.

Training the Keypoint Detection Model

Fine-tuning Process

  • The training process begins by selecting a YOLO architecture from the available options on Roboflow's custom training page.
  • Recommendations suggest using medium or large model sizes; extra-large may be excessive while anything below medium lacks reliability.

Monitoring Training Progress

  • Once training starts in the cloud environment, progress can be monitored directly through the browser interface until completion.

Evaluating Model Performance

Inference Package Utilization

  • After training completion, inference packages are used to evaluate model performance by calling functions with specific model IDs against video samples.

Visualizing Keypoints

  • Initial visualizations show scattered keypoints due to occlusions or offscreen elements leading to noisy predictions. Confidence scores help filter reliable landmarks from uncertain ones.

Mapping Player Positions on the Basketball Court

Integrating Player Detections with Court Coordinates

  • The process begins by aligning player detections with visible court features to map each player's position from video frames to real-world coordinates on the basketball court.
  • A homography matrix is calculated, which defines the relationship between points in the video frame and actual court points; at least four pairs of corresponding reference points are required.
  • Reference points are derived from a keypoint detection model and manually defined based on court geometry; maintaining consistent indexing between models is crucial for accurate mapping.

Transforming Video Frame Data

  • After running inference with the keypoint model, a mask identifies anchors with confidence above a threshold, allowing extraction of aligned arrays for further processing.
  • The view transformer computes a 3x3 homography matrix that facilitates mapping player positions from image space to court space using detected coordinates.
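
A sketch assuming the ViewTransformer class from the roboflow/sports repository and supervision-style keypoints; the confidence threshold and the COURT_REFERENCE_POINTS array (the manually defined court geometry) are assumptions:

```python
import numpy as np
from sports.common.view import ViewTransformer

# Keep only landmarks the keypoint model is confident about, and pair
# them with the manually defined court points via their shared indexes.
confident = keypoints.confidence[0] > 0.5
transformer = ViewTransformer(
    source=keypoints.xy[0][confident].astype(np.float32),
    target=COURT_REFERENCE_POINTS[confident].astype(np.float32),
)

# Project player ground positions (e.g., bottom-center of their boxes).
court_xy = transformer.transform_points(points=player_feet_xy)
```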

Addressing Stability Issues in Player Tracking

  • When projecting player positions frame by frame, instability occurs as players' bounding boxes shift during jumps or overlaps, leading to inaccurate ground position representation.
  • To clean trajectories affected by these shifts, two-dimensional player coordinates across consecutive frames are analyzed for unnatural jumps.

Cleaning Up Trajectories for Accurate Analysis

Detecting and Correcting Unrealistic Movements

  • Movement between frames is measured as frame-to-frame speed; median speeds and median absolute deviation help identify unrealistic movements exceeding typical variations.
  • Outliers are flagged based on significant deviations from median speeds and minimum distance thresholds; short runs of flagged frames are grouped for effective cleanup.
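
A sketch of the flagging rule; the deviation multiplier and minimum distance are assumptions:

```python
import numpy as np

def flag_outliers(xy: np.ndarray, k: float = 3.0, min_dist: float = 0.5) -> np.ndarray:
    """Flag frames whose frame-to-frame speed deviates from the median by
    more than k median absolute deviations and a minimum distance."""
    speed = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # per-frame step size
    median = np.median(speed)
    mad = np.median(np.abs(speed - median)) + 1e-6       # avoid zero division
    outlier = (np.abs(speed - median) > k * mad) & (speed > min_dist)
    return np.concatenate([[False], outlier])            # align with xy length
```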

Rebuilding Trajectories

  • Missing trajectory sections are restored through linear interpolation connecting valid points before and after corrupted segments.
  • A smoothing process using a sliding window helps create reliable movement data suitable for tactical analysis post-game.
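
A sketch of the rebuild, assuming 2-D court coordinates and the outlier flags from the previous step:

```python
def rebuild_trajectory(xy: np.ndarray, bad: np.ndarray, window: int = 5) -> np.ndarray:
    """Linearly interpolate flagged frames, then smooth with a sliding mean."""
    frames = np.arange(len(xy))
    out = xy.astype(float).copy()
    for axis in range(2):
        # Bridge corrupted segments using the valid points on either side.
        out[bad, axis] = np.interp(frames[bad], frames[~bad], xy[~bad, axis])
        # Sliding-window average to suppress residual jitter.
        out[:, axis] = np.convolve(out[:, axis], np.ones(window) / window, mode="same")
    return out
```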

Shot Detection and Classification Techniques

Utilizing Object Detection Models

  • The same object detection model trained earlier is employed to classify shooting actions such as jump shots and layups within video frames.
  • Detected shots can be filtered based on class IDs (e.g., jump shot labeled as ID five), isolating relevant actions for further analysis.

Challenges in Shot Position Mapping

  • Mapping shot positions onto the court follows similar procedures as player positioning but faces challenges when players are midair or occluded, affecting projection accuracy.

Shot Event Tracker in Sports Analysis

Overview of Shot Event Tracking

  • The shot event tracker utilizes a sequence of frames to identify consistent visual patterns, marking the beginning of a shot attempt when it detects player actions like jump shots or layups.
  • It monitors a brief time window to check for the appearance of the ball/basket class; if detected, the shot is classified as made; otherwise, it's marked as missed.

Implementation in Google Colab

  • The process begins by loading a sample video and initializing the tracker with parameters that define shot duration, cooldown periods, and spacing between attempts.
  • As frames are processed, object detection checks for three classes: jumpshot, layup, and ball/basket. The tracker updates with these detections to log events.
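
A hypothetical interface matching this description; the actual tracker class and parameter names in the notebook may differ:

```python
# Hypothetical tracker matching the description above; ShotEventTracker,
# its parameters, and the update signature are assumptions.
tracker = ShotEventTracker(
    shot_window_frames=60,  # how long to wait for a made/missed signal
    cooldown_frames=30,     # minimum spacing between logged attempts
)

for frame_idx, frame in enumerate(sv.get_video_frames_generator("clip.mp4")):
    result = model.infer(frame)[0]
    detections = sv.Detections.from_inference(result)
    # The tracker inspects jump shot, layup, and ball/basket detections
    # and emits start / made / missed events with frame indexes.
    events = tracker.update(frame_idx, detections)
```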

Example Sequence Analysis

  • In an example clip:
  • At frame 64, a jump shot start event is logged.
  • Frame 115 marks this attempt as missed.
  • A layup start event is detected at frame 120 and recorded as made by frame 135.

Data Visualization and Insights

  • The structured shot events can be combined with homography mapping to create a comprehensive shot chart. This chart visually represents each attempt's location on the court along with its outcome (made or missed).

Conclusion and Further Exploration

  • This project showcases how various tools (RF-DETR, SAM 2, SmolVLM2, SigLIP) work together for player detection and movement tracking while classifying shots effectively.
  • All related code and resources are available in the description below for further exploration. Viewers are encouraged to experiment with their footage and provide feedback or questions in the comments section.
Video description

Watch a full basketball possession analyzed frame by frame using a state-of-the-art AI vision pipeline built from scratch. You'll see how we combine powerful open-source models like RF-DETR for robust object detection, SAM2 for stable, pixel-level player tracking, and a specialized SmolVLM2 for reading blurred jersey numbers. Learn the exact process for identifying every player, classifying shot outcomes, and performing homography to map player movement onto a precise top-down court visualization for advanced tactical analysis. This video breaks down months of work into a complete, reusable computer vision system for sports analytics.

- 00:00 Project Overview
- 02:04 Detect Players and Numbers with RF-DETR
- 07:12 Track Players with SAM2
- 11:59 Team Clustering with SigLIP, UMAP and K-means
- 16:31 Fine-Tuning SmolVLM2
- 21:56 Map Player Positions to Court Coordinates
- 31:51 Detect Shot Event and Classify Result
- 35:54 Conclusions

Resources:

- “How to Detect, Track, and Identify Basketball Players with Computer Vision” Blogpost: https://blog.roboflow.com/identify-basketball-players
- Basketball Player Object Detection Dataset: https://universe.roboflow.com/roboflow-jvuqo/basketball-player-detection-3-ycjdo
- Basketball Court Keypoint Detection Dataset: https://universe.roboflow.com/roboflow-jvuqo/basketball-court-detection-2
- Basketball Jersey Number OCR Dataset: https://universe.roboflow.com/roboflow-jvuqo/basketball-jersey-numbers-ocr
- How to Detect, Track, and Identify Basketball Players Notebook: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/basketball-ai-how-to-detect-track-and-identify-basketball-players.ipynb
- Make or Miss - Jumpshot Detection: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/basketball-ai-make-or-miss-jumpshot-detection.ipynb
- Fine-tune RF-DETR on Custom Dataset Notebook: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-finetune-rf-detr-on-detection-dataset.ipynb
- Segment Video with SAM 2 Notebook: https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-videos-with-sam-2.ipynb
- RF-DETR GitHub: https://github.com/roboflow/rf-detr
- Supervision GitHub: https://github.com/roboflow/supervision
- Sports GitHub: https://github.com/roboflow/sports

Stay updated with the projects I'm working on at https://github.com/roboflow and https://github.com/SkalskiP! ⭐