Zapr has been granted its first patent for ‘Method and System for Context Aware Video Compression’! The implications of this disruptive patent are vast. Kudos to the inventors for putting another feather in Zapr’s cap. Let's learn more about this very exciting patent from Jasmeen Patel, one of the inventors!
First of all, congratulations on our first patent, this is really exciting! For the readers, can you break down what this patent is in layman's terms?
Jasmeen: The patent is about compressing videos and images in a more natural way. The algorithm analyses the scene and the different segments of a video and identifies the objects or areas of interest: the areas where a human would pay attention, or the objects that are relevant to the scene. Once these areas are identified, the information can be used to compress the video or image so that the important areas are kept in good quality while the less important ones are coded at a somewhat lower quality. The result is a smaller video overall, with the user experience unaltered.
In hindsight, this win might seem obvious, but could you go back and describe in detail the whole process, the work that went into it - from conception to execution?
Jasmeen: So, we were working on one of our projects which required us to find interesting areas in an image that could later be used for further processing. The idea for video compression stemmed from that initial work; we thought of extending the idea a bit further and using the information we get from a scene to compress the video optimally. We designed a neural network which can understand the scene and identify important areas in an image or video. The network can be understood as an aggregator of information from three different modules.
- Saliency module: This module identifies regions which are more likely to be watched by a human. It works independently on each image and has no information about the scene environment. To give you an example, in footage of a cricket match, the players, umpires etc. are important and thus should be shown in full detail. Saliency, however, only predicts the areas where most people would be looking, i.e. the players batting on the pitch in most cases.
- Object Segmentation: We have some predefined classes of objects which we feel occur most frequently, such as person, table, chair etc. The number of classes this module can predict can be customized as needed. This module, as the name suggests, finds and segments out the regions of the predefined objects present in the scene. Here too, it does not analyse the full scene environment and makes its predictions on the basis of a single image. In the same cricket example, it can predict and segment out all the players, umpires etc., but it will also segment out people in the background (audience, staff etc.), which is not required for the scene.
- Context aware module: This module looks at some of the past as well as future frames and gathers information about the scene so that we can decide the nature of the scene. For example, a bat, a ball etc. are important in a match scenario but might not be so important if the same objects appear in a kitchen or in a TV show. It also captures whether someone is interacting with an object, which helps decide the importance of that region.
We finally combine the information from all three modules and make our final prediction of which regions are important and which are less so. The frames are then re-encoded so that more bits are assigned to the important regions and fewer to the rest.
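To make the combination step concrete, here is a minimal sketch in Python. The interview does not describe how the patented network actually merges the three modules, so the fusion rule, the function names (`fuse_importance`, `important_regions`), the `saliency_share` weight and the threshold are all assumptions made for illustration. The sketch scales the object mask by a context score, so that an object which is irrelevant in this scene contributes little, and then blends the result with saliency.

```python
from typing import List

# Per-pixel (or per-block) scores in [0, 1], stored row by row.
ScoreMap = List[List[float]]


def fuse_importance(saliency: ScoreMap,
                    object_mask: ScoreMap,
                    context_weight: ScoreMap,
                    saliency_share: float = 0.5) -> ScoreMap:
    """Blend three score maps into one importance map in [0, 1].

    Hypothetical rule, not the patented method: the object score is
    multiplied by the context score, then averaged with saliency using
    the assumed weight `saliency_share`.
    """
    fused = []
    for s_row, o_row, c_row in zip(saliency, object_mask, context_weight):
        fused.append([
            saliency_share * s + (1.0 - saliency_share) * o * c
            for s, o, c in zip(s_row, o_row, c_row)
        ])
    return fused


def important_regions(importance: ScoreMap,
                      threshold: float = 0.5) -> List[List[bool]]:
    """Mark each position as important when its score reaches the threshold."""
    return [[value >= threshold for value in row] for row in importance]


# Toy 1x3 "frame": a batsman, a spectator, and a patch of grass.
saliency = [[0.9, 0.1, 0.2]]
object_mask = [[1.0, 1.0, 0.0]]      # both people are detected as "person"
context_weight = [[1.0, 0.1, 0.0]]   # only the batsman matters in this scene

importance = fuse_importance(saliency, object_mask, context_weight)
print(important_regions(importance))
```

In this toy frame only the batsman is marked as important: the spectator is detected as a person, but the low context score keeps the fused value under the threshold, which mirrors the cricket example above.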
Are there similar inventions that your team drew inspiration from? What makes this patent stand out from the rest of them, what makes it unique?
Jasmeen: As I mentioned earlier, we drew our inspiration from one of our projects which required us to find regions of interest, and that is when we thought of extending the idea to image and video compression. Most of the available codecs use temporal similarity among frames and spatial similarity within a frame to compress the video. Although these algorithms are very efficient at compressing the video, they do not consider the content of the video or the frame before compressing. This means that for such an algorithm all the pixels have the same importance, and that is where we felt we could contribute our bit to this tech. We designed a network which can find the regions that are more important, so the same compression algorithms can now use that information and compress the video more efficiently.
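As a rough illustration of how an existing codec could consume such region information, the sketch below turns a per-block importance score into a quantization parameter (QP). In H.264 and HEVC the QP ranges from 0 to 51 and a higher QP means coarser quantization, so fewer bits. Everything else here is an assumption for illustration: the linear mapping, the names `block_qp` and `qp_map`, and the defaults `base_qp=30` and `max_offset=6` do not come from the patent or from any particular encoder.

```python
from typing import List

QP_MIN, QP_MAX = 0, 51  # valid QP range in H.264 / HEVC


def block_qp(importance: float, base_qp: int = 30, max_offset: int = 6) -> int:
    """Map an importance score in [0, 1] to a QP for one block.

    Assumed linear rule: importance 1.0 lowers the QP by `max_offset`
    (finer quantization, more bits), importance 0.0 raises it by the
    same amount, and 0.5 leaves the base QP unchanged.
    """
    offset = round((0.5 - importance) * 2 * max_offset)
    return max(QP_MIN, min(QP_MAX, base_qp + offset))


def qp_map(importance_map: List[List[float]],
           base_qp: int = 30,
           max_offset: int = 6) -> List[List[int]]:
    """Apply block_qp to every block of a frame, row by row."""
    return [[block_qp(v, base_qp, max_offset) for v in row]
            for row in importance_map]


# One row of three blocks: important, neutral, unimportant.
print(qp_map([[1.0, 0.5, 0.0]]))
```

An encoder that supports per-block or per-region QP offsets could take a map like this as a hint, spending more bits on the players and fewer on the crowd while leaving its temporal and spatial prediction untouched.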
Finally, this is no small feat, so what does it mean for the media industry?
Jasmeen: Compression techniques have a lot to offer to the industry. Some of the applications where it can be used are:
- Streaming video content
- Video calls
- Storing video data
These are some of the applications, but the invention is not limited to these; many more can be derived from the same technology, and we might explore them in the future.