As social networking sites became the primary channel for communicating ideas and sharing media, new social search engines emerged. However, these search engines crawl the social networks and index the available content based only on text. Examples of such keyword-based social search engines are Spy, SamePoint, SocialMention, WhosTalkin and Wikio (wikio.com).
Spy is a web application that is updated in real time and lets the user watch what is being said about a certain topic on specific social networking sites and blogs. SamePoint provides an easy interface for the user to select which of the social networks to search for a keyword or topic. SocialMention works like Google Alerts, but for social media. WhosTalkin is a social media search tool that allows users to search for conversations surrounding the topics that they care about the most. Wikio is a personalisable news page featuring a news search engine that searches media sites, blogs and the contributions of Wikio members.
Content management at large scale
At the scale at which the major social networks operate, even the most common operations are not trivial. The most striking example is Facebook, which has to handle almost 500 million active users who share more than 3 billion photos per month, while its servers have to serve about 1.2 million photos per second.
For such volumes of content, management becomes a crucial issue. Here we refer to some of the technologies and tools that most social networks use in order to survive the torrents of queries.
Memcached is a distributed caching system that caches the results of database queries in order to minimise relatively slow database access. Memcached originated at the LiveJournal blogging and social networking site and was released as open source. At the time of writing, Facebook runs thousands of Memcached servers with tens of terabytes of cached data.
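The caching idea described above can be illustrated with a minimal sketch of the read path: look in the cache first and query the database only on a miss. This is an illustration under stated assumptions, not the way any particular site deploys Memcached. The `DictCache` class is an in-process stand-in for a Memcached client, and `slow_db_query`, `get_user` and `DB_CALLS` are hypothetical names introduced here.

```python
import time


class DictCache:
    """In-process stand-in for a Memcached client (hypothetical, illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry time or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else time.monotonic() + ttl
        self._data[key] = (value, expires_at)


# Counts database hits so the effect of the cache is visible.
DB_CALLS = {"count": 0}


def slow_db_query(user_id):
    """Hypothetical database lookup standing in for a slow SQL query."""
    DB_CALLS["count"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id, ttl=60):
    """Return the user row, reading from the cache when possible."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached               # cache hit: no database access
    row = slow_db_query(user_id)    # cache miss: fall back to the database
    cache.set(key, row, ttl)        # store the result for later requests
    return row
```

Calling `get_user(cache, 7)` twice with the same `DictCache` touches the hypothetical database only once; the second call is answered from the cache. The time-to-live bounds how long a stale row can be served.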
Custom compilers or special programming languages
In the race to optimise their code base to the limit, most social networking sites use custom compilers tailored to the special needs of their hardware, or adopt new programming languages that fulfil their custom needs.
One example is “HipHop for PHP”, which converts PHP to C++ so that it can be compiled and run natively on the servers for better performance. Another example is Twitter, which dropped Ruby for its back-end servers in favour of Scala, a new programming language that handles the vast number of parallel requests well.
Haystack presents a generic HTTP-based object store containing needles (object representations) that map to stored opaque objects. Storing photos as needles in the haystack reduces the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows each needle’s location in the store file to be kept in an in-memory index. As a result, an image’s data can be retrieved in a minimal number of I/O operations, without any unnecessary metadata lookups.
Cassandra is a distributed storage system with no single point of failure. It is one of the poster children of the NoSQL movement (others are MongoDB, Redis, etc.) and has been made open source (it became an Apache project). Cassandra is in use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX and other companies that have large, active data sets. The largest production cluster holds over 100 TB of data on more than 150 machines.
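The needle-and-index idea can be sketched in a few lines: many objects are appended to one store file, and an in-memory dictionary records where each one starts and how long it is. This is a simplified illustration, not the actual Haystack format; the class name `HaystackSketch` and its methods are hypothetical, and a real needle would carry additional header fields that are omitted here.

```python
import io


class HaystackSketch:
    """Toy append-only object store with an in-memory index (illustration only)."""

    def __init__(self, store):
        self._store = store   # seekable binary file-like object (the "haystack")
        self._index = {}      # photo_id -> (offset, size) of the needle

    def put(self, photo_id, data):
        """Append the object to the store file and remember its location."""
        self._store.seek(0, io.SEEK_END)
        offset = self._store.tell()
        self._store.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        """Read the object with one seek and one read, using the index."""
        offset, size = self._index[photo_id]
        self._store.seek(offset)
        return self._store.read(size)


# Usage: an in-memory buffer stands in for the store file on disk.
haystack = HaystackSketch(io.BytesIO())
haystack.put("photo-1", b"JPEG-1")
haystack.put("photo-2", b"PNG-22")
```

Because the index lives in memory, fetching a photo needs no file-system metadata lookup per image: the store file is already open, so a read is one seek followed by one read of the known size.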
Human powered and community question answering
Human-powered systems emerged from the social networks, which gave users the ability to contribute web content. Since problems in artificial intelligence and computer vision persisted, researchers envisioned that the solution to these unsolved problems was to harness human intelligence. However, to engage users to answer questions, annotate images or proofread OCR-extracted text for free, there had to be some kind of reward. To this end, “games with a purpose” (GWAP) appeared. In a GWAP the user solves problems that are difficult for a computer but easy for a human while playing an online game.
In the same vein, online community question answering sites provide a place where everyone can contribute by answering questions from other members. The answers are validated by a “star-based” system in which the end user gives feedback on whether the answer was helpful or not. Some of the well-known community question answering systems are Yahoo! Answers for general questions, Stack Overflow for questions on programming, Server Fault for server administrators and IT professionals, and “Seasoned Advice” for cooking professionals, among many others.
A very interesting service is the Aardvark search engine, which finds the most relevant person, from the user’s contact list and from the entire user community, to answer a question. Aardvark accepts questions in natural language (not just keywords) and uses a novel algorithm in order to map the question to the most relevant recipient.
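The general idea of routing a question to a person rather than to a document can be illustrated with a toy sketch. It is not Aardvark's algorithm, which is not described here. The sketch assumes that each candidate has a list of topics, scores candidates by word overlap with the question, and gives direct contacts a boost; the names `tokenize`, `route_question` and `contact_boost` are hypothetical.

```python
def tokenize(text):
    """Lower-case the text and split it into a set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()} - {""}


def route_question(question, candidates, contacts, contact_boost=1.5):
    """Return the candidate whose topics best match the question, or None.

    candidates: dict mapping a person to a list of topics they know about.
    contacts:   set of people in the asker's own contact list (scored higher).
    """
    words = tokenize(question)
    best, best_score = None, 0.0
    for person, topics in candidates.items():
        overlap = len(words & {topic.lower() for topic in topics})
        score = overlap * (contact_boost if person in contacts else 1.0)
        if score > best_score:
            best, best_score = person, score
    return best


# Usage with made-up people and topics.
candidates = {
    "ana": ["python", "memcached"],
    "ben": ["cooking", "pasta"],
    "cy": ["python"],
}
recipient = route_question("Best Python book?", candidates, contacts={"cy"})
```

With these inputs, "ana" and "cy" both match one word, and the contact boost favours "cy". A production system would need far richer signals than word overlap, such as responsiveness and past answer quality.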
© European Union