MICO is a research and development project part-funded under the European Union's Seventh Framework Programme (FP7). The project brings together top-notch organisations from both research and industry to build a multipurpose cross-media analysis platform.

The Story so far…

The Internet has gone through a dramatic change in recent years. Whereas the Web of 2003 still consisted mainly of information represented in textual documents (mostly HTML), the Web of 2013 consists largely of multimedia documents in the form of videos, photos, and audio objects. According to a recent study by Cisco, “by 2015, one million minutes of video – the equivalent of 674 days – will cross the Internet every second.” In addition, “information units” are increasingly complex and often combine text, photos, and video (e.g. a blog post about a concert with a collection of photos and a video). The main reasons are the rise of digital photography, the anytime, anywhere Internet that lets people quickly and easily upload multimedia documents from their smartphones, and the rise of social media platforms, where even non-technical people can easily share different forms of content. Research shows that 71% of online adults use video-sharing sites to satisfy their information needs on a regular basis, and 46% of Internet users post original photos and videos they have created themselves.

While this change enables fantastic new ways of accessing information from all around the world, at any time and in any place, the dramatic increase in multimedia content makes it more and more difficult to find relevant information. The reason is that information extraction from multimedia documents is considerably harder than from textual documents.

To illustrate the problem (and the objectives of the project), we use the following example scenario throughout the proposal:

Sarah and Peter are true Internet citizens. Not only do they use the Web as their most important source of information, they are also active participants on different social media sites. Their true love is their dog Balou, who plays an important part in their life. When Balou was a small puppy, they started a blog to tell their family and friends the funny stories of his life. They frequently take videos and pictures of Balou and upload them to YouTube and Flickr. Every month they write a report in their blog, embedding the best videos and pictures for their friends to see. Over the last year, the blog has formed a real community: many other dog lovers now link their own videos and pictures to Sarah and Peter’s blog. So Peter had the idea to create a collage or remix of the best dog stories, videos, and pictures as a ‘Christmas Blog Post’ at the end of the year. Unfortunately, he has a hard time finding the most interesting content, since it is hidden in different media formats, distributed over many web sites, and contained in a large collection of media objects.

While this is a simplified and artificial “toy” example (the actual MICO use cases deal with crowdsourcing and video streaming), it is still representative of a whole class of problems found in today’s multimedia Internet.

With textual content, searching and discovering information is still relatively easy, because the mode of user interaction is also textual, so comparatively simple algorithms like full-text search can be applied. Moreover, many techniques from natural language processing (such as NER, named entity recognition) are well established and deliver reliable quality. Multimedia analysis techniques, on the other hand, are nowhere near as mature as text analysis techniques. Even well-known approaches like face recognition (as opposed to mere face detection) or sentiment analysis often do not perform well enough for efficient information discovery. Fragmentation of media content is also more complex, because structural separators are not as uniform as in textual content: whereas textual content has natural structural elements like word or sentence boundaries, even merely identifying the structural elements in images (regional) and videos (temporal and regional) can be very challenging.
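
To make the contrast concrete, the following is a minimal sketch of how established text analysis works out of the box. It uses the open-source spaCy library and its pre-trained English model, neither of which is part of MICO, and an invented example sentence; no comparably simple, off-the-shelf call exists for recognising a particular dog in a video.

    # Illustrative only: off-the-shelf named entity recognition with spaCy.
    # Assumes the small English model has been installed, e.g. via
    #   python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")  # load a pre-trained English pipeline
    doc = nlp("Sarah and Peter uploaded a video of their dog Balou to YouTube in December.")

    # Print every named entity the model finds, together with its type
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. "Sarah" PERSON, "YouTube" ORG, "December" DATE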

A natural approach to this problem would be to combine the results of different analysis and retrieval techniques to achieve better overall result quality. Unfortunately, a number of barriers currently need to be overcome before this approach can work:

  • isolated media types: different media types are usually treated in isolation (Google “Web”, “Images”, “Videos”, etc.), even though on the Web they are usually embedded in a shared context and together form an “information unit”, so a lot of useful information goes unused in the analysis (e.g. a moving picture showing a dog and an audio track with a dog barking)
  • isolated analysis techniques: different analysis and retrieval techniques are usually treated in isolation, and result rankings are only relevant within the scope of a single technique, so there is no common model for combining results in meaningful ways
  • exclusive use of automatic analysis or user annotation: current approaches typically rely either on automatic analysis or on user input, but rarely on both, leaving much potential for accuracy and efficiency improvements unexploited
  • heterogeneous metadata formats: metadata is not represented in uniform and exchangeable formats, so there is no straightforward way for different analysis techniques to interact and exchange information. This makes it impossible to build complex applications on top of different analysis techniques (e.g. recommendation)
  • no adequate query language: there is no way to query across related media objects, so it is impossible to execute complex analyses that combine text, multimedia, and metadata within a single information unit. Neither query languages for media resources like MPQF, nor text-based query languages like Apache Lucene’s, nor structured query languages like SPARQL can cover this on their own. A uniform integration language that allows complex cross-media queries would bring together the most important media types, including image, audio, video, and text (a small illustrative sketch of what uniform metadata and a combined query could look like follows this list)
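
To illustrate the last two barriers, the sketch below shows how results from two different analysers (an image-based dog detector and an audio-based bark detector) could in principle be published as uniform, linked metadata and then combined in a single query. It uses the open-source rdflib library and RDF/SPARQL as one possible representation; the vocabulary, identifiers, and confidence values are invented for this example and do not represent MICO's actual data model.

    # Illustrative sketch: publish analysis results from different media types
    # as uniform RDF metadata and query them together with rdflib.
    # All URIs, properties, and values below are hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/vocab#")                # invented vocabulary
    ITEM = URIRef("http://example.org/blog/christmas-post")    # the "information unit"

    g = Graph()
    g.bind("ex", EX)

    # Visual analyser: a dog detected in a spatial region of the embedded video
    video_region = URIRef("http://example.org/media/balou.mp4#xywh=120,80,200,150")
    g.add((video_region, RDF.type, EX.MediaFragment))
    g.add((video_region, EX.partOf, ITEM))
    g.add((video_region, EX.depicts, Literal("dog")))
    g.add((video_region, EX.confidence, Literal(0.87)))

    # Audio analyser: barking detected in a temporal fragment of the same video
    audio_segment = URIRef("http://example.org/media/balou.mp4#t=12,17")
    g.add((audio_segment, RDF.type, EX.MediaFragment))
    g.add((audio_segment, EX.partOf, ITEM))
    g.add((audio_segment, EX.detectedSound, Literal("barking")))
    g.add((audio_segment, EX.confidence, Literal(0.73)))

    # Because both results share identifiers in one graph, a single query can
    # combine evidence across media types for the same information unit.
    results = g.query("""
        PREFIX ex: <http://example.org/vocab#>
        SELECT ?item ?visual ?audio WHERE {
            ?visual ex:partOf ?item ; ex:depicts "dog" .
            ?audio  ex:partOf ?item ; ex:detectedSound "barking" .
        }
    """)
    for row in results:
        print(f"{row.item}: dog seen in {row.visual}, barking heard in {row.audio}")

The point of the sketch is not the specific vocabulary but the principle: once results from different analysers share a common, exchangeable representation, cross-media questions become ordinary queries.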

The MICO project will solve these problems on several levels: on the level of cross-media extraction, it will provide a common model for combining different analysis techniques, thereby considering users as one type of extractor; on the level of cross-media metadata publishing, it will provide a uniform way for representing and exchanging (extracted) metadata about multimedia documents; on the level of cross-media querying, it will develop a query language that is capable of querying different kinds of (related) multimedia objects in a similar way; and on the level of cross-media recommendations, it will investigate common patterns to recommend related, possibly complex, information units, independent of the media format used.

In the context of this project, “cross-media analysis” refers to the whole analysis value chain, ranging from extracting relevant features from media objects (cross-media extraction), through representing and publishing extraction results (cross-media metadata publishing), to querying and combining these results (cross-media querying) and, finally, cross-media recommendations as the user-facing results of analysis.