PeopleDetector on RTSP Stream
This project uses the SiMa PePPi API to run a real-time people detection pipeline on a live RTSP video stream. It leverages a CenterNet-based model optimized for detecting human figures and streams the annotated results via UDP.
Purpose
This pipeline showcases how to:
- Ingest live RTSP video using SiMa’s PePPi API.
- Run people detection using a CenterNet-based model on SiMa’s MLSoC.
- Annotate frames with bounding boxes and class labels.
- Stream the output to a specified host and port via UDP.
All inference is hardware-accelerated through SiMa’s MLSoC for efficient edge deployment.
Configuration Overview
Settings are defined in `project.yaml`. The following tables outline the input/output configuration and model-specific parameters.
Input/Output Configuration
| Parameter | Description | Example |
|---|---|---|
| | Type of input source | |
| | RTSP video stream URL | |
| | Host IP for UDP output | |
| | Port number for UDP stream | |
| | Inference pipeline to be used | |
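As a rough illustration of the input/output settings above, the sketch below sanity-checks an RTSP URL and a UDP target before a pipeline is started. The key names `rtsp_url`, `udp_host`, and `udp_port`, and the example addresses, are hypothetical placeholders chosen for this sketch; the actual keys in `project.yaml` may differ.

```python
from urllib.parse import urlparse


def check_io_settings(settings):
    """Return a list of problems found in the I/O settings (empty if none).

    The key names used here are hypothetical, not the real project.yaml keys.
    """
    problems = []

    # The input must be an RTSP URL with at least a host part.
    url = urlparse(settings.get("rtsp_url", ""))
    if url.scheme not in ("rtsp", "rtsps") or not url.hostname:
        problems.append("rtsp_url must look like rtsp://host[:port]/path")

    # The UDP output needs a destination host and a valid port number.
    if not settings.get("udp_host"):
        problems.append("udp_host is missing")
    port = settings.get("udp_port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("udp_port must be an integer in 1..65535")

    return problems


# Example values are made up for illustration.
example = {
    "rtsp_url": "rtsp://192.168.1.10:8554/stream",
    "udp_host": "192.168.1.20",
    "udp_port": 5000,
}
print(check_io_settings(example))
```

Checking these values up front tends to give clearer errors than a stream that silently fails to open on the device.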
Model Configuration (`Models[0]`)
| Parameter | Description | Value |
|---|---|---|
| | Model identifier | |
| | Compressed model archive path | |
| | Path to class label file | |
| | Enable input normalization | |
| | Per-channel mean for input normalization | `[0.408, 0.447, 0.470]` |
| | Per-channel stddev for input normalization | `[0.289, 0.274, 0.278]` |
| | Padding strategy for input preprocessing | |
| | Whether to maintain original image aspect ratio | |
| | Maximum number of detections returned per frame | `10` |
| | Minimum confidence to qualify as a valid detection | `0.7` |
| | Postprocessing decode method used | |
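To make the last rows of the table concrete, here is a minimal sketch of what a confidence threshold and a per-frame detection cap mean in practice. On the device, the decode step presumably applies these settings itself; this pure-Python version only illustrates the semantics. The detection dictionary layout (`score`, `box`) and the function name are assumptions made for the example, not the PePPi output format.

```python
def select_detections(detections, score_threshold=0.7, max_detections=10):
    """Keep detections at or above the threshold, best first, capped in number.

    Each detection is assumed to be a dict with a "score" key (hypothetical layout).
    """
    # Drop everything below the minimum confidence.
    kept = [d for d in detections if d["score"] >= score_threshold]
    # Highest-confidence detections come first.
    kept.sort(key=lambda d: d["score"], reverse=True)
    # Return at most max_detections entries.
    return kept[:max_detections]


# Made-up candidate detections for illustration.
candidates = [
    {"score": 0.95, "box": (10, 20, 50, 120)},
    {"score": 0.40, "box": (200, 30, 240, 130)},
    {"score": 0.72, "box": (300, 40, 350, 150)},
]
for det in select_detections(candidates):
    print(det["score"], det["box"])
```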
Main Python Script
The script performs the following operations:
- Loads configuration from `project.yaml`.
- Initializes a `VideoReader` for the RTSP stream and a `VideoWriter` for UDP output.
- Loads the detection model with a SiMa `MLSoCSession`.
- Continuously:
  - Reads a frame
  - Runs inference
  - Annotates detected people
  - Streams the annotated frame over UDP
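The control flow of the steps above can be sketched as follows. This is not the project's script and does not call the PePPi API: `FakeReader`, `FakeSession`, and `FakeWriter` are stand-ins written for this example, and the method names `read`, `run`, and `write` are assumptions. The real `VideoReader`, `MLSoCSession`, and `VideoWriter` classes may expose different signatures.

```python
class FakeReader:
    """Stand-in frame source that hands out a fixed list of frames."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        # Mirrors the common (ok, frame) convention; ok is False when exhausted.
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


class FakeSession:
    """Stand-in model session that returns one canned detection per frame."""

    def run(self, frame):
        return [{"label": "person", "score": 0.9, "box": (0, 0, 10, 20)}]


class FakeWriter:
    """Stand-in output sink that records what would be streamed over UDP."""

    def __init__(self):
        self.sent = []

    def write(self, frame):
        self.sent.append(frame)


def annotate(frame, detections):
    """Attach detection boxes to a frame (placeholder for drawing overlays)."""
    return {"frame": frame, "boxes": [d["box"] for d in detections]}


def run_pipeline(reader, session, writer, annotate_fn):
    """Read, infer, annotate, and write until the reader runs out of frames."""
    processed = 0
    while True:
        ok, frame = reader.read()
        if not ok:
            break
        detections = session.run(frame)
        writer.write(annotate_fn(frame, detections))
        processed += 1
    return processed


writer = FakeWriter()
count = run_pipeline(FakeReader(["f0", "f1", "f2"]), FakeSession(), writer, annotate)
print(count, len(writer.sent))
```

Keeping the loop independent of the concrete reader, session, and writer makes it possible to exercise the control flow on a development machine before deploying to the target device.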
The application is packaged using `mpk create` and deployed to the target device using SiMa’s standard flow.
Model Details
Download from here.
- Model Type: CenterNet-based
- Target: People detection
- Normalization:
  - Mean: `[0.408, 0.447, 0.470]`
  - Stddev: `[0.289, 0.274, 0.278]`
- Detection Confidence Threshold: 0.7
- Output: Top 10 people detections per frame
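For reference, the arithmetic behind per-channel normalization with the values above can be sketched as below. When normalization is enabled, the pipeline presumably performs this step itself, so the snippet is purely illustrative. It assumes 8-bit pixel values that are scaled to the range 0 to 1 before the mean is subtracted, and it makes no claim about channel order; both points should be checked against the model's documentation.

```python
# Normalization constants taken from the model details above.
MEAN = (0.408, 0.447, 0.470)
STD = (0.289, 0.274, 0.278)


def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Normalize one 3-channel pixel: (value / 255 - mean) / std per channel.

    Assumes 8-bit input values; the scaling to [0, 1] is an assumption of this sketch.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(pixel, mean, std))


# A mid-gray pixel, chosen arbitrarily for illustration.
print(normalize_pixel((128, 128, 128)))
```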