# VisionFit

The VisionFit generators create synthetic data for applications at the intersection of computer **vision** and **fit**ness.

## Parameter Inputs

The generator pages in the API User Portal provide detailed parameter information for each generator, including parameter names, descriptions, and constraints. Select the VisionFit generator you are using to explore the available parameters.

## API Outputs

Our output data and pixel-perfect labels include everything from semantic segmentation masks and rep counts to 2D and 3D keypoints (with camera matrices!). For each job, a zipped archive can be downloaded that includes the following files:

- `job.json`: A JSON file describing the original parameters that generated the job.
- `labels.json`: A COCO-formatted JSON file of video-, frame-, and instance-level annotations and labels. See [below](visionfit.md#annotations).
- `segmentation.zip`: A zipped file containing frame-level semantic and instance segmentation masks (with and without occlusion).
- `video_preview.png`: A single image file corresponding to the first frame of the rendered video.
- `video.mp4`: The resulting RGB video.

## Annotations

The `labels.json` file is provided in standard [COCO](https://cocodataset.org/#format-data) format. The third-party [pycocotools](https://pypi.org/project/pycocotools/) package provides useful utility functions for working with the data structure.

### Scene-level annotations

Scene-level annotations are provided for each video. They are accessible via the top-level `info` field of the `labels.json` data structure. Scene-level annotations include:

* `camera_pitch`: Pitch rotation of the camera in the global coordinate system, in degrees. A value of 90 indicates the camera's line of sight is parallel to the ground plane.
* `camera_yaw`: Yaw rotation of the camera in the global coordinate system, in degrees. A value of 0 indicates the camera's line of sight is aligned with the +Y axis.
* `camera_location`: Location of the camera in the global coordinate system, in meters.
* `camera_height`: Height of the camera in the global coordinate system, in meters.
* `avatar_yaw`: Yaw rotation of the avatar in the global coordinate system, in degrees.
* `avatar_presenting_gender`: Gender of the underlying SMPL-X body model.
* `avatar_attire_top`/`avatar_attire_bottom`: Clothing type used in the applied UV texture.
* `avatar_betas`: 10 shape coefficients for the underlying SMPL-X body model.
* `avatar_waist_circumference`: Circumference of the SMPL-X body model's waist, in meters.
* `avatar_location`: Location of the avatar in the global coordinate system, in meters.
* `avatar_identity`: Integer-based unique identifier that controls the chosen avatar appearance.
* `camera_P_matrix`: P matrix of the synthetic camera. This can be used to project any 3D position in the global coordinate system onto the image plane. Note that `P = K @ RT`.
* `camera_K_matrix`: Intrinsic K matrix of the synthetic camera.
* `camera_RT_matrix`: Extrinsic RT matrix of the synthetic camera (rotation + translation).
* `has_self_penetrations`: Flag indicating the presence of self-penetrations anywhere in the generated data.
* `rep_truncation_amounts`: A list of length `num_reps`. Each element is another list specifying the amount of truncation (expressed as a fraction) applied at various intervals throughout the base animation. Empty lists indicate that no truncation was applied for a given rep.
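As a concrete illustration of the `P = K @ RT` relationship, the sketch below projects a hypothetical 3D point from the global coordinate system onto the image plane. It assumes `camera_P_matrix` is stored as a nested list of a 3x4 matrix under the `info` field; check your `labels.json` for the exact layout:

```python
import json

import numpy as np

with open("labels.json") as f:
    labels = json.load(f)

# Assumed layout: a 3x4 projection matrix stored as nested lists.
P = np.array(labels["info"]["camera_P_matrix"])

# Hypothetical 3D point in the global coordinate system, in meters.
point_global = np.array([0.0, 2.0, 1.0])

# Project via homogeneous coordinates: [u, v, w] = P @ [X, Y, Z, 1],
# then divide by w to obtain pixel coordinates.
u, v, w = P @ np.append(point_global, 1.0)
x_pixel, y_pixel = u / w, v / w

print(f"Image coordinates: ({x_pixel:.1f}, {y_pixel:.1f})")
```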
### Frame-level annotations

Frame-level annotations are provided for each frame of a video. They are accessible via the top-level `images` field of the `labels.json` data structure. Frame-level annotations include:

- `frame_number`: The frame number of the corresponding image.
- `rep_count_from_start`: The number of repetitions completed since the beginning of the video, plus a float in the range [0, 1] indicating the current frame's relative position in the exercise sequence. For example, a value of 4.23 indicates that 4 full repetitions have been completed since the beginning of the video, and that the current frame corresponds to 23% completion of the next one.
- `rep_count_from_intermediate`: Conceptually similar to `rep_count_from_start`, but indexed to the midpoint of the rep. We provide both since users may wish to define (for example) the point of most flexion OR the point of most extension as the rep inflection point.
- `has_self_penetrations`: Flag indicating the presence of self-penetrations.
- `metrics`: Collection of image processing-relevant metrics.
    - `mean_color_contrast`: Mean color contrast between the avatar and background.
    - `mean_brightness_contrast`: Mean brightness contrast between the avatar and background.
    - `mean_image_brightness`: Mean brightness of the image.
    - `mean_avatar_brightness`: Mean brightness of the avatar.
    - `mean_avatar_halo_brightness`: Mean brightness of the pixels immediately surrounding the avatar.
    - `snr`: Mean signal-to-noise ratio of the image.

### Instance-level annotations

Instance-level annotations are provided for every unique object segmented in an image. This includes the avatar, as well as any other objects of interest that are present, such as dumbbells. Instance-level annotations are accessible via the top-level `annotations` field of the `labels.json` data structure.

* `color`: Normalized RGB value used in the corresponding instance segmentation masks.
* `percent_in_fov`: Percentage of the vertices of the underlying mesh that are within the camera's field of view, regardless of occlusion status. This value can be used to disambiguate whether sparse instance segmentation masks reflect a high degree of environmental occlusion or the instance being out of frame.
* `percent_occlusion`: Percentage of the instance that is not visible due to environmental occlusion (i.e., objects in the foreground). It is quantified as the relative difference between the occluded and unoccluded instance segmentation masks, which are also provided.
* `bbox`: Bounding box in standard COCO format.
* `segmentation`: Polygon segmentation in standard COCO format.
* `area`: Area enclosed by the polygon segmentation.
* `cuboid_coordinates`: Image coordinates of the surrounding 3D cuboid, with axes parallel to the global coordinate system. The order of the cuboid points is shown below.

```
   3-------2
  /|      /|
 / |     / |
0-------1  |
|  7----|--6
| /     | /
4-------5
```
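Since `labels.json` is standard COCO, pycocotools can index these instance-level annotations directly. The sketch below is a minimal example of filtering instances by visibility; the thresholds are illustrative assumptions, not part of the API:

```python
from pycocotools.coco import COCO

# Load the COCO-formatted annotation file for a single job.
coco = COCO("labels.json")

# Gather all instance-level annotations across frames.
annotation_ids = coco.getAnnIds()
annotations = coco.loadAnns(annotation_ids)

# Keep instances that are mostly unoccluded and mostly in frame.
# The cutoff values here are arbitrary examples.
visible = [
    ann for ann in annotations
    if ann["percent_occlusion"] < 10 and ann["percent_in_fov"] > 90
]
print(f"{len(visible)} of {len(annotations)} instances are largely visible")
```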
We also provide the following annotations for each `person` instance:

* `armature_keypoints`: A data structure including image coordinates (x, y), visibility (v), depth from camera (z, in meters), and 3D position in the global coordinate system (x_global, y_global, z_global; in meters) for each degree of freedom in the underlying SMPL-X model. Visibility values indicate whether keypoints are not in the image frame (0), in the image frame but occluded (1), or visible (2).
* `vertex_keypoints`: A data structure including image coordinates (x, y), visibility (v), depth from camera (z, in meters), and 3D position in the global coordinate system (x_global, y_global, z_global; in meters) for various anatomical points of interest on the SMPL-X body mesh. Points of interest include the ears and nose. Visibility labels are defined as in `armature_keypoints`.
* `keypoints`: Image coordinates and visibility in standard COCO format for each keypoint in the 17-point COCO skeleton. Visibility labels are defined as in `armature_keypoints`. Note that the hip keypoints in this data structure correspond to different locations than those in `armature_keypoints`. Specifically, they correspond to a more lateral location designed to better reflect where human annotators typically place the hips (e.g., in the COCO dataset).
* `num_keypoints`: Number of keypoints in the COCO skeleton with non-zero visibility.
* `quaternions`: 3D rotations for each degree of freedom in the SMPL-X model, relative to its parent in the kinematic tree, in wxyz order.

### Segmentation annotations

For each frame of a video, the following segmentation masks are provided in `segmentation.zip`:

* `image.{frame_number}.cseg.png`: Semantic segmentation.
* `image.{frame_number}.cseg.CLOTHING_{clothing_piece_id}.png`: Semantic segmentation for each clothing piece.
* `image.{frame_number}.iseg.png`: Instance segmentation.
* `image.{frame_number}.iseg.{annotation_id}.png`: Instance segmentation without occlusion.
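The `color` field of each instance annotation can be used to isolate that instance in the `iseg` masks. Below is a minimal sketch, assuming 8-bit RGB PNG masks and taking the filename pattern from the list above (the zero-padding of `{frame_number}`, if any, is an assumption to verify against the actual archive contents):

```python
import numpy as np
from PIL import Image

frame_number = 0  # whichever frame you want to inspect

# Load the instance segmentation mask for this frame.
mask = np.asarray(Image.open(f"image.{frame_number}.iseg.png").convert("RGB"))

# `color` is a normalized RGB triple from an instance annotation;
# this example value is hypothetical. Assuming 8-bit masks, scale to [0, 255].
instance_color = np.round(np.array([0.5, 0.25, 0.75]) * 255)

# Boolean mask of the pixels belonging to this instance.
instance_pixels = np.all(mask == instance_color, axis=-1)
print(f"Instance covers {instance_pixels.sum()} pixels")
```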