I am working on a prototype (PoolWatch) of a system that tracks swimmers in a swimming pool. The system should track each swimmer and report their speed and distance traveled. The current goal is to understand what it would take to implement such a system.
Currently the system tracks only clearly detected humans. The segmentation step is based on skin color, and its output is a set of disconnected patches of arms and legs. These blobs are merged to form the shape of a swimmer. Detection runs on every frame of the video. Shapes of the same swimmer in two consecutive frames are associated by a distance criterion: the shape in the current frame is mapped to the closest shape in the previous frame. In this way the program composes the tracking path.
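The association step can be sketched as a greedy nearest-neighbour match on blob centroids. A minimal sketch, assuming shapes are represented by their `(x, y)` centroids; the `max_dist` gating threshold is my own addition, not necessarily part of the actual code:

```python
import math

def associate(prev_shapes, cur_shapes, max_dist=50.0):
    """Map each shape in the current frame to the closest shape
    in the previous frame (greedy nearest-neighbour association).
    Shapes are represented by their (x, y) centroids."""
    matches = {}
    for i, (cx, cy) in enumerate(cur_shapes):
        best_j, best_d = None, max_dist
        for j, (px, py) in enumerate(prev_shapes):
            d = math.hypot(cx - px, cy - py)
            if d < best_d:
                best_j, best_d = j, d
        matches[i] = best_j  # None means no match: a new track starts here
    return matches
```

A proper implementation would solve the assignment globally (e.g. Hungarian algorithm) so two current shapes cannot claim the same previous shape.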
The current challenge arises when a swimmer moves far away from the camera and becomes indistinguishable from the surrounding water. When a swimmer is far away, the area of visible skin pixels is small.
When a swimmer dives, the visible skin vanishes entirely. An accompanying obstacle is the reflection of the pink walls in the water; for some reason the reflected color morphs into a skin-like color. This environmental peculiarity impedes skin classification. The image below shows a skin mask with a swimmer in the bottom-left and a light reflection in the top-right corner.
I thought about checking the swimming suit to guess the swimmer's shape, but the particular swimmer I have in mind wears a blue swimming cap (which is not rare), and it is a challenge to distinguish it from water. So the swimmer practically becomes invisible to the vision system. If such a period is short, we can try to track the body with a Kalman predictor. But unbounded periods of lost detection are probable, because the 'far away' position described above is near the pool's edge, where people tend to rest and sometimes even exit the pool and walk away. So it is a good idea to track people far away too.
When people swim breaststroke, they dive for about 3 seconds, which means a human detector that relies on skin color gets into trouble. To resolve the issue, we may use a Kalman predictor to estimate the position of the shape in the next frame, assuming constant-velocity movement. If we feed the Kalman predictor with image coordinates in pixels, the tracking may become volatile: due to projection and the camera's position, the shape's movement in the image may be nonlinear even though the swimmer moves at a constant speed. So there should be a transformation from camera to world coordinates. How reliably we can determine the swimmer's world position given the shape's position in the image is something I still wonder about. At the moment the transformation mapping is hard-coded.
Also, the swimmer shape detector is shaky: for the same swimmer it may detect the legs in one frame and the upper body in the next.
This means the Kalman predictor is fed chaotic detections, and the estimated path lacks smoothness.
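A constant-velocity Kalman filter over (world) coordinates is the standard way to smooth such jittery detections and to coast through frames with no detection at all. A minimal NumPy sketch; the frame rate and the noise magnitudes are placeholder assumptions to be tuned:

```python
import numpy as np

dt = 1.0 / 30.0  # frame interval, assuming 30 fps video
# State: [x, y, vx, vy]; constant-velocity transition model.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # we observe position only
Q = np.eye(4) * 1e-3   # process noise (placeholder magnitude)
R = np.eye(2) * 0.05   # measurement noise (placeholder magnitude)

def kalman_step(x, P, z=None):
    """One predict/update cycle; pass z=None when detection is lost
    (e.g. the swimmer dives) to coast on the prediction alone."""
    x = F @ x                      # predict
    P = F @ P @ F.T + Q
    if z is not None:              # update only when we have a detection
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
    return x, P
```

Passing `z=None` during a dive keeps the track alive on prediction alone; the growing covariance `P` then indicates how stale the estimate is.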
I marked skin patches in selected images and obtained a set of skin-like pixels. To classify them I tried various approaches.
The neural network oscillated wildly: each training session could end up with a visually different classification network. I tried different network configurations but didn't tame the beast. The hyperplanes defined by the network's weights do not constrain the RGB cube carefully (at least when there are few of them); instead of closely wrapping the 3D color volume, the network defines a non-closed region and therefore admits unexpected colors that do not resemble skin in any way.
I expected an SVM (Support Vector Machine) with an RBF (Radial Basis Function) kernel to fit this problem well, but in practice I had problems with SVM convergence.
So I ended up taking the convex hull wrapped around the training points and using it as the classifier: if a pixel lies inside the convex hull, it is classified as skin.
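An inside-the-hull test can be implemented with SciPy by triangulating the hull vertices; `find_simplex` then answers point-in-hull queries quickly. A sketch with made-up RGB training pixels (the real ones come from the manually marked skin patches):

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

# Made-up skin-like RGB training pixels; in practice these come from
# the manually marked skin patches.
skin_rgb = np.array([[220, 180, 160], [200, 150, 130],
                     [240, 200, 180], [180, 120, 100],
                     [210, 170, 120], [230, 160, 150]], dtype=float)

hull = ConvexHull(skin_rgb)
# A Delaunay triangulation of the hull vertices gives a fast inside test:
# find_simplex returns -1 for points outside every simplex.
tri = Delaunay(skin_rgb[hull.vertices])

def is_skin(pixel):
    """Classify an RGB pixel as skin iff it lies inside the convex hull
    of the training pixels."""
    return tri.find_simplex(np.asarray(pixel, dtype=float)) >= 0
```

Note that the convex hull is the tightest convex over-approximation of the sample set; if the true skin-color volume is non-convex, it will still admit some false positives, but far fewer than an unbounded half-space classifier.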
This component should guess the outline of the swimmer in the image. In the left image below we removed the water from the image. The blue cap and the upper part of the swimming suit remain, but the swimming trunks are classified as water and lost. In the right image we keep only skin: the entire swimming suit is lost and we get a bunch of disconnected, floating legs, arms, etc.
By combining these approaches we may correctly outline the swimmer together with the clothes, which are required by the shape appearance model (a pixel histogram) to associate shapes from multiple frames with the target swimmer. This seems very promising, because the swimming suit has the most distinctive colors between different swimmers; skin color is more common and not likely to discriminate human bodies (at least for now, I enforce such a constraint).