CSCE 436: Thoughts on readings...: Sikuli: Using GUI Screenshots for Search and Automation

Paper written by: Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller

Comment: Jesus, Zachary

Summary:

Sikuli is a "visual approach to search and automation of graphical user interfaces using screenshots" (183). Basically, this application is a way to query with screenshots in order to get better answers to your questions. The paper starts off by comparing this to pointing at an object and asking what it does: on the internet you are often looking at something and are confused about what it does, why it did something, or how to navigate the application. Sikuli allows a user to draw a box around something on the screen and query a search engine using that picture. There are three components that make up Sikuli: the screenshot search engine, the interface for using the search engine to query, and another interface for adding screenshots with custom annotations to the index.

The screenshot search engine has indexed screenshots taken from various online tutorials, books, and official documents. Each screenshot is indexed in three ways. First, the engine looks at the text that surrounds the image. Second, it uses any visual features that it can detect; these features are represented as "visual words" and stored in an inverted index that has multiple entries for a particular word. Third, the engine can index screenshots according to their embedded text. To improve results the engine breaks this text into 3-grams, and each 3-gram is treated as a visual word. In order to get a better idea of how all of this would work, they created a prototype that had a database of about 102 documents from various sources that would help with giving explanations on different applications. After creating this they did a user study to test two hypotheses, "(1) screenshot queries are faster to specify than keywords queries, and (2) results of screenshot and keyword search have roughly the same relevance as judged by users" (185). Each participant was given a random dialog box and was asked to do different queries depending on the dialog box they received. After querying, each user was asked to mark each of the top 5 results as relevant or irrelevant. After all of these tasks were completed, the participants were asked to fill out a questionnaire.
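To make the inverted index idea more concrete for myself, here is a minimal sketch in Python. It is not the authors' code: I'm assuming character 3-grams over the embedded text, and the function names (three_grams, build_index, search) and the sample dialog text are made up for illustration. Documents are ranked simply by how many 3-grams they share with the query.

```python
from collections import defaultdict


def three_grams(text):
    """Return the set of overlapping 3-character grams in lowercased text."""
    cleaned = text.lower()
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}


def build_index(docs):
    """Map each 3-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in three_grams(text):
            index[gram].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many of the query's 3-grams they share."""
    scores = defaultdict(int)
    for gram in three_grams(query):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken alphabetically by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical embedded text, standing in for text read out of screenshots.
docs = {
    "network_dialog": "Network Connections Properties",
    "print_dialog": "Print Setup Options",
}
index = build_index(docs)

# A misspelled query still shares most of its 3-grams with the right dialog.
print(search(index, "Netwrk Connection"))
```

The nice thing about 3-grams is that a query with a typo (or badly recognized text) still overlaps heavily with the right document, which is presumably why they help the results.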

They used this study to compare keyword queries against screenshot queries and found that the average time for screenshots was less than half as long as it was for keywords. The number of relevant results, on the other hand, was very close for the two and the difference wasn't significant (which can be good). They noticed that throughout this study some users learned how to do screenshot queries very quickly. In order to evaluate their application they used precision and recall and examined the top 10 matches for both the screenshot and keyword queries.

Now that they had an understanding of how Sikuli would work, they developed an editor that lets users write visual scripts to do certain tasks on their computer. One task was minimizing all active windows that someone has open on their computer. Another one was deleting documents of multiple types: the script could search for items that share the same icon and then delete them all if wanted. There were also scripts for tracking bus movement and for navigating a map application, which search the map for images that match a similar pattern. One of the more interesting, and possibly more convenient, uses would be to respond to message boxes automatically. Vista gives you pop-ups all the time asking if you really want to do something, and this script would let you make automatic responses to all of them. The last one was image recognition to see whether a baby has rolled over or not. I wasn't sure if they actually marked the baby's forehead with a marker, or if it was digitally enhanced, but the script would monitor the image of the baby to see if it could detect the special marker placed on the baby's head. If it could not, it would alert the user to check whether the baby had rolled over. Below are pictures that were provided with each example.
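Since precision and recall over the top matches come up in the evaluation, here is a small sketch of how those two numbers can be computed. The function name precision_recall_at_k and the document ids are my own invention, not anything from the paper.

```python
def precision_recall_at_k(ranked, relevant, k=10):
    """Compute precision and recall over the top k ranked results.

    ranked   -- list of document ids, best match first
    relevant -- set of document ids judged relevant for the query
    """
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}

p, r = precision_recall_at_k(ranked, relevant, k=5)
print(p, r)
```

Here two of the five returned results are relevant (precision 2/5), and two of the three relevant documents were found (recall 2/3).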

There were a few problems with the application that were discussed. One problem is the different themes that can be used on an operating system and the different backgrounds that users may have on their computer. A screenshot query would need some way to ignore the unnecessary portions of the picture so that the search can be better optimized. The second problem they mentioned was the visible screen. Sikuli can only see what is visible to a person on the screen and not any of the windows that are open behind other windows. They thought that this could be overcome by creating a platform-specific or application-specific technique. They didn't really go into any details on how to fix these problems yet, so I'm assuming this will be some of their future work.
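To convince myself why themes are a problem, I wrote a toy version of pixel matching. This is only a sketch under my own assumptions: it compares raw grayscale values in tiny grids, which is almost certainly much cruder than what Sikuli really does, and find_pattern, the pixel values, and the "darker theme" shift of 40 are all hypothetical.

```python
def find_pattern(screen, pattern, tolerance=0):
    """Return (row, col) of the first region of `screen` whose pixels are all
    within `tolerance` of `pattern`, or None if no region matches."""
    ph, pw = len(pattern), len(pattern[0])
    sh, sw = len(screen), len(screen[0])
    for top in range(sh - ph + 1):
        for left in range(sw - pw + 1):
            if all(
                abs(screen[top + r][left + c] - pattern[r][c]) <= tolerance
                for r in range(ph)
                for c in range(pw)
            ):
                return (top, left)
    return None


# Hypothetical 2x2 grayscale "button" captured under one theme.
button = [[200, 50],
          [50, 200]]

# The same button drawn on a light background.
screen = [[255, 255, 255, 255],
          [255, 200, 50, 255],
          [255, 50, 200, 255]]

# Pretend a darker theme lowers every pixel value by 40.
dark_screen = [[max(p - 40, 0) for p in row] for row in screen]

print(find_pattern(screen, button))           # exact match on the original theme
print(find_pattern(dark_screen, button))      # exact match fails on the dark theme
print(find_pattern(dark_screen, button, 40))  # a looser tolerance recovers it
```

An exact comparison breaks as soon as the theme changes the colors, and loosening the tolerance fixes it here but would also let in false matches on a real screen, so I can see why this is listed as an open problem.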

Discussion:

I thought this paper was very interesting because it could allow searching to become very simple and possibly much more accurate than having to come up with queries on your own. I think that this work could be expanded in its possible uses, such as creating scripts for tasks that someone does on a computer every day. One example would be a script that opens a browser and immediately opens tabs for the email accounts a person checks every day or a website that they always go to, like Facebook, so that all of the tabs are conveniently open at once. That's just a small idea; I'm sure there are much better uses for this. It would be very interesting to see if we can make searching faster and more accurate by using screenshots instead of keyword queries. I don't think there are really faults in this, but one way to possibly better optimize the indexing portion of the search would be to use k-nearest neighbors or possibly a larger number of k-grams.

1 comment:

Patrick Webster, January 26, 2010 at 1:56 AM
This sounds pretty interesting. Searching for topics about your screenshots would be useful, though I would imagine they will need to gain a lot of support from software companies to make the database as useful as their paper documentation. The desktop automation type of actions could be handy for those who aren't technically inclined. Though I do wonder how much processing is involved while searching your desktop for these images, especially since it looks like it can do partial matches.

Sunday, January 24, 2010
