Observatorio de Tecnologías de la Lengua, de Voz y Multimedia

http://thenoisychannel.com/2010/08/15/exploring-nuggetize/
13/09/2010 - 10:50

I’ve been exchanging emails with Dhiti co-founder Bharath Mohan about Nuggetize, an intriguing interface that surfaces “nuggets” from a site to reduce the user’s cost of exploring a document collection. Specifically Nuggetize targets research scenarios where users are likely to assemble a substantial reading list before diving into it. You can try Nuggetize on the general web or on a particular site that has been “nuggetized”, e.g., a blog like this one or Chris Dixon’s.

I’m always happy to see people building systems that explicitly support exploratory search (and am looking forward to seeing the HCIR Challenge entries in a week!). Regular readers may recall my coverage of Cuil, Kosmix, and Duck Duck Go. And of course I helped build a few of my own at Endeca. So what’s special about Nuggetize?

Mohan describes it as a faceted search interface for the web. I’ll quibble here–the interface offers grouped refinement options, but the groups don’t really strike me as facets. Moreover, the interface isn’t really designed to explore intersections of the refinement options–rather, at any given time, you see the intersection of the initial search and a currently selected refinement. But it is certainly an interface that supports query refinement and exploration.

The more interesting features are the nuggets and the support for relevance feedback.

The nuggets are full sentences, and thus feel quite different from conventional search-engine snippets. Conventional snippets serve primarily to provide information scent, helping users quickly determine the utility of a search result without the cost of clicking through to it and reading it. In contrast the nuggets are document fragments that are sufficiently self-contained to communicate a coherent thought. The experience suggests passage retrieval rather than document retrieval.

The relevance feedback is explicit: users can thumbs-up or thumbs-down results. After supplying feedback, users can refresh their results (which re-ranks them) and are also presented with suggested categories to use for feedback (both positive and negative). Unfortunately, the research on relevance feedback tells us that, helpful as it could be to improving user experience, users don’t bite. But perhaps users in research scenarios will give it a chance–especially with the added expressiveness and transparency of combining document and category feedback.
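
To make the re-ranking step concrete, here is a minimal sketch of Rocchio-style relevance feedback over bag-of-words vectors. This is a textbook technique, not a description of how Nuggetize works (I don't know its internals); the function names, the alpha/beta/gamma weights, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rocchio(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward liked docs and away from disliked ones."""
    updated = Counter()
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in liked:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(liked)
    for doc in disliked:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(disliked)
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in updated.items() if w > 0}


def rerank(query_vec, docs):
    """Return document ids sorted by descending similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)


# Toy collection: an ambiguous query with one thumbs-up and one thumbs-down.
docs = {
    "a": vectorize("jaguar car dealership prices"),
    "b": vectorize("jaguar habitat rainforest cat"),
    "c": vectorize("big cat conservation rainforest"),
}
query = vectorize("jaguar")
feedback_query = rocchio(query, liked=[docs["b"]], disliked=[docs["a"]])
ranking = rerank(feedback_query, docs)
```

After the feedback, the thumbs-up document moves to the top of the ranking, and terms that appear only in the thumbs-down document drop out of the query entirely.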

Overall it is a slick interface, and it’s nice seeing the various ideas Mohan and his colleagues put together. There’s certainly room for improvement–particularly in the quality of the categories, which sometimes feel like victims of polysemy. Open-domain information extraction is hard! Some would even call it a grand challenge.

Mohan reads this blog (he reached out to me a few months ago via a comment), and I’m sure he’d be happy to answer questions here.

http://thenoisychannel.com/2010/08/27/hcir-2010-bigger-and-better-than-ever/
13/09/2010 - 10:50

Last Sunday was HCIR 2010, the Fourth Annual Workshop on Human-Computer Interaction and Information Retrieval, held at Rutgers University in New Brunswick, collocated with the Information Interaction in Context Symposium (IIiX 2010).

With 70 registered attendees, it was the biggest HCIR workshop we have held. Rutgers was a gracious host, providing space not only for the all-day workshop but also for a welcome reception the night before.

And, based on an informal survey of participants, I can say with some semblance of objectivity that this was the best HCIR workshop to date.

The opening “poster boaster” session was particularly energetic. There was no award for best boaster, but Cathal Hoare won an ovation by delivering his boaster as a poem:

If a picture is worth a thousand words
Surely to query formulation a photo affords
The ability to ask ‘what is that’ in ways that are many
But for years we have asked how can-we
Narrow the search space so that in reasonable time
We can use images to answer questions that are yours and mine
In my humble poster I will describe
How recent technology and users prescribe
A solution that allows me to point and click
And get answers so that I don’t feel so thick
About my location and my environment
And to my touristic explorations bring some enjoyment
Now if after all that you feel rather dazed
Please come by my poster and see if you are amazed….

As in past years, we enlisted a rock-star keynote speaker–this time, Google UX researcher Dan Russell. His slides hardly do justice to his talk–especially without the audio and video–but I’ve embedded them here so that you can get a flavor for his presentation on how we need to do more to improve the searcher.

We accepted six papers for the presentation sessions; sadly, one of the presenters could not make it because of visa issues. The five presentations covered a variety of topics relating to tools, models, and evaluation for HCIR. The most intriguing of these (to me, at least) was a presentation by Max Wilson about “casual-leisure searching”, which he argues breaks our current models of exploratory search. Check out the slides below, as well as Erica Naone’s article in Technology Review on “Searching for Fun”.

As always, the poster session was the most interactive. Part of the energy came from HCIR Challenge participants showing off their systems in advance of the final session that would decide which of them would win. In any case, I felt like a heel having to walk through the hall of posters three times in order to herd people back to their seats.

Which brings us to the Challenge. When I first suggested the idea of a competition or challenge to my co-organizers back in February, I wasn’t sure we could pull it off. Indeed, even after we managed to obtain the use of the New York Times Annotated Corpus (thank you, LDC!) and a volunteer to set up a baseline system in Solr (thank you, Tommy!), I still worried that we’d have a party and no one would come. So I was delighted to see six very credible entries competing for the “people’s choice” award.

All of the participants offered interesting ideas: custom facets, visualization of the associations between relevant terms, multi-document summarization to catch up on a topic, and combining topic modeling with sentiment analysis to analyze competing perspectives on a controversial issue. The winning entry, presented by Michael Matthews of Yahoo! Labs Barcelona, was the Time Explorer. As its name suggests, it allows users to see the evolution of a topic over time. A cool feature is that it parses absolute and relative dates from article text, in some cases references to past or future times outside the publication span of the collection. Moreover, the temporal visualization of topics allows users to discover unexpected relationships between entities at particular points in time, e.g., between Slobodan Milosevic and Saddam Hussein. You can read more about it in Tom Simonite’s Technology Review article, “A Search Service that Can Peer into the Future”.
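
For a rough idea of what resolving absolute and relative dates against a publication date involves, here is a toy sketch. It is not the Time Explorer's implementation (which I haven't seen); the phrase table, the regular expressions, the function name, and the sample sentence are all assumptions, and a real system would handle far more expressions and finer granularity than years.

```python
import re
from datetime import date

# Hypothetical phrase table; a real system would cover far more expressions.
RELATIVE_YEARS = {"last year": -1, "this year": 0, "next year": 1}


def referenced_years(text, published):
    """Return the sorted years a text refers to, given its publication date."""
    years = set()
    # Absolute references: four-digit years from 1800 to 2099.
    for match in re.finditer(r"\b(1[89]\d{2}|20\d{2})\b", text):
        years.add(int(match.group(1)))
    lowered = text.lower()
    # Relative references, resolved against the publication year.
    for phrase, offset in RELATIVE_YEARS.items():
        if phrase in lowered:
            years.add(published.year + offset)
    for match in re.finditer(r"\b(\d+) years (ago|from now)\b", lowered):
        delta = int(match.group(1))
        sign = -1 if match.group(2) == "ago" else 1
        years.add(published.year + sign * delta)
    return sorted(years)


sentence = "Signed in 1999, the treaty lapses next year; a review is due 10 years from now."
years = referenced_years(sentence, published=date(2007, 6, 1))
```

Note how an article published in 2007 can yield a reference to a year well beyond the date of publication, which is the kind of forward-looking mention the post describes.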

In short, HCIR 2010 will be a tough act to follow. But we’re already working on it. Watch this space…

http://thenoisychannel.com/2010/09/09/david-petrou-presents-google-goggles-at-ny-tech-meetup/
13/09/2010 - 10:50

Image recognition is one of those problems that has presented long-standing challenges to computer scientists, despite being taken for granted by science fiction writers. Google Goggles represents one of the most audacious efforts to implement image recognition on a massive scale.

Tonight, I had the pleasure of watching my colleague, David Petrou, present a live demo of Goggles to about 800 people who filled the NYU Skirball Center to attend the NY Tech Meetup. Many thanks to Nate Westheimer and Brandon Diamond for giving Google the opportunity to present this cool technology to a very engaged audience and in particular to show off some of the technology that Googlers are building here in New York City.

You can’t see the live demo in the slides, so I encourage you to view a recording of the presentation here.

Also, if you’re in the New York area and interested in hearing about upcoming Google NYC events, please sign up at http://bit.ly/googlenycevents.

http://thenoisychannel.com/2010/09/11/new-web-site-for-hcir-workshop/
13/09/2010 - 10:50

In 2007, I persuaded MIT graduate students Michael Bernstein and Robin Stewart (who was interning at Endeca that summer) to help organize the first Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2007), which we held at MIT and Endeca. Its success convinced us to keep going, and we enjoyed record attendance at this year’s HCIR 2010, held at Rutgers University.

As the workshop has grown, we as organizers have realized that we need to invest a little in its online presence. A first step in that direction is a new site for the workshop: http://hcir.info/. The site contains all of the proceedings from the four annual workshops in one place. It is powered by Google Sites, which will make it easy for a bunch of us (and perhaps some of you) to collaboratively maintain it.

I hope everyone here finds the new site useful. Please feel free to come forward with ideas for improving it! But be warned–if you have a great idea, I might ask you to implement it yourself.

http://khassanali-nlp-research.blogspot.com/2009/08/writing-for-computer-science-by-justin.html
13/09/2010 - 10:50

I have absolutely fallen in love with the book Writing for Computer Science by Justin Zobel. All these years I have been going about my research learning through experience the very things that are spelled out in the book. Had I read it earlier, my work would have been much easier.

This book is good for newcomers to research as well as experienced researchers. It has many checklists that I am now using and that make my research easier. Although I have yet to read the entire book, it gives clear guidance on how to write a computer science paper. In my opinion, it's a must-have for anyone who needs to write about computer science and conduct research.

http://khassanali-nlp-research.blogspot.com/2008/01/nltk.html
13/09/2010 - 10:15

I used this toolkit for my NLP project, and although many features did not work as I expected, I found it really useful. The toolkit is written in Python, which is a very easy and user-friendly language to learn.

Although I knew a bit of Python and used it extensively in the first semester for all the NLP assignments, I realised the actual utility and convenience of Python for NLP tasks only when I read the guide provided with NLTK.

Although I have yet to use all the features NLTK provides, I have used the stemmers and the different types of probability distributions. The learning curve for me was around a week, including learning Python. Initially, I wondered if it really was worth all the effort, as I could easily have implemented the algorithms in Python or any other language.

The plus point was that once I learnt how to use the toolkit, making enhancements took no longer than 5 minutes, and in the end I could get quite a lot done.

The clean_html API of NLTK did not work for me. The output either still contained the HTML tags or the text had disappeared! Further, since it relies on the underlying HTML parser, it's not resilient to the malformed pages found on the internet.

I found it easier to write my own code for the Naive Bayes method, though NLTK provides many methods of its own. I would say it has definitely been worth trying out the Natural Language Toolkit, and I recommend it!
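
For readers curious what hand-rolling Naive Bayes involves, a minimal multinomial version with add-one smoothing might look like the sketch below. It is not the code from my project; the class name, the whitespace tokenizer, and the toy training data are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()             # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Work in log space to avoid floating-point underflow.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
classifier.train([
    ("great book easy to read", "pos"),
    ("really useful toolkit", "pos"),
    ("the parser did not work", "neg"),
    ("dismal results not useful", "neg"),
])
label = classifier.predict("useful and easy to read")
```

The whole classifier is just counting plus a sum of logarithms, which is why writing it by hand is a reasonable option for a small project.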

You can download NLTK at the following site:
http://nltk.sourceforge.net/

http://khassanali-nlp-research.blogspot.com/2008/01/extracting-text-from-html.html
13/09/2010 - 10:15

For many months I have been trying to find the perfect solution for extracting text from an HTML webpage. I have tried many options, of which Emsa HTMLRem for Windows is definitely good. However, since most of my work is on Linux, I was not too thrilled with the idea of extracting data on Windows and then FTPing it to Linux.

Yesterday was therefore spent looking at other options. NLTK's clean_html API works for a few websites. I also tried running HTML Tidy before calling clean_html; this approach worked for some websites and not for others.

I now have to try some other technique, probably regular expressions... As they say, data collection and cleaning is the most difficult part of any task.
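
As one alternative to regular expressions, Python's standard-library html.parser module can pull out the visible text while skipping script and style blocks. The sketch below only illustrates that idea and is not the solution I settled on; the class and function names are invented, and it makes no claim to cope with every malformed page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = "<b>not text</b>";</script></head>
<body><h1>Hello</h1><p>Tom &amp; Jerry<br>unclosed paragraph</body></html>"""
text = extract_text(page)
```

An event-based parser like this does not need the tags to be balanced, so an unclosed paragraph does not make the text disappear.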

http://khassanali-nlp-research.blogspot.com/2008/02/named-entity-recognition-tools.html
13/09/2010 - 10:15

For the past few weeks I have been experimenting with Named Entity Recognition tools. In particular, I tried out the OpenNLP tool suite, and its name recognizer was pretty dismal. It really didn't recognise everything well, and I wasn't sure if I should use it for my research. I guess I will either have to develop my own tools or use something else.

It's a shame, since I did spend some effort figuring out how to use these tools, only to see that they don't perform as well as I expected. Perhaps I expected too much, for I do know that named entity recognition is not easy and that there will always be ambiguity in recognizing names.

Let's see how it works out. My adviser has asked me to read a few papers on named entity recognition, and I need to see if they give me an overall picture of the field and of how easy or difficult it is to get the kind of results I am expecting.

http://khassanali-nlp-research.blogspot.com/2009/08/opinion-mining-and-sentiment-analysis.html
13/09/2010 - 10:15

For those of you interested in opinion mining and sentiment analysis, the book Opinion Mining and Sentiment Analysis by Bo Pang and Lillian Lee seems to be interesting and explains the concepts very simply.

There's an author-formatted version available at the following link:

http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf

Google Books also offers access to parts of the book. Worth reading.

http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=4161&copyownerid=2
13/09/2010 - 10:15

Machine Translation Summit XII [Chateau Laurier, Ottawa, Ontario, Canada] [Aug 26, 2009 - Aug 30, 2009]
