Observatorio de Tecnologías de la Lengua, de Voz y Multimedia

http://prospero.bluescarf.net/stuart/2009/07/cyberling_2009.html
13/09/2010 - 10:50

Last week I attended Cyberling 2009, "a workshop exploring how computational methods can enhance traditional linguistic inquiry". The workshop was organized into panels on different topics, ranging from tools to funding models. I co-chaired (along with Mary Beckman) the panel on "Annotation Standards". Overall, it was fun, if for no other reason than that I had the opportunity to meet a few people whom I knew by reputation but not by personal acquaintance (e.g., Mark Lieberman). But I fear that it might have been a case of preaching to the converted. I know from personal experience that a lot will have to change within the culture of academic linguistics before we can expect computational tools to be fully integrated into working practice as a matter of course. (Among linguists who do fieldwork, for example, a misguided neo-luddite machismo is depressingly prevalent.) But at least there are linguists pushing the field in that direction.

http://prospero.bluescarf.net/stuart/2010/01/nlp_and_the_semantic_web_1.html
13/09/2010 - 10:50

Last night Powerset hosted the SF Semantic Web Meetup, organized by Marco Neumann, where we gave a talk about NLP and the Semantic Web. I presented some work in progress on instant answers (slides) and my colleague Scott Waterman presented ongoing work by our group (Text Processing for Semantic Applications) on triples extraction using our natural language pipeline. In addition, Bill Flitter from Dlvr.it talked about the publishing industry in the emerging world of the realtime web.

We enjoyed giving our presentations and appreciated the feedback that we received. One of the questions that came up after my talk was one that I frequently encounter: why aren't finite state transducers just regular expressions? What I normally say, but perhaps failed to say clearly enough on this occasion, is that finite state transducers are regular expressions, but they do a lot more. As my former colleague Brendan O'Connor used to say, "Finite state transducers are regular expressions on steroids!" I think that's right, and at some point I should write up a more detailed and technical explanation of what that means.
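As a rough illustration of the distinction (a toy sketch, not the pipeline described in the talk; the state names and the rewrite rule are invented for the example): a regular expression only says whether a string is accepted, while a transducer walks the same kind of state machine and also emits output along the way.

```python
import re

# A regular expression acts as an acceptor: it only answers yes or no.
def accepts(text):
    return re.fullmatch(r"(ab)*", text) is not None

# A toy deterministic finite state transducer over the same language.
# Each transition maps (state, input symbol) -> (next state, output string).
# Hypothetical rule: emit one "X" for every "ab" pair that is read.
TRANSITIONS = {
    ("q0", "a"): ("q1", "X"),
    ("q1", "b"): ("q0", ""),
}
START_STATE = "q0"
FINAL_STATES = {"q0"}

def transduce(text):
    """Return the emitted output if `text` is accepted, otherwise None."""
    state = START_STATE
    output = []
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return None  # no transition defined: the input is rejected
        state, emitted = TRANSITIONS[key]
        output.append(emitted)
    # Accept only if the machine stops in a final state.
    return "".join(output) if state in FINAL_STATES else None
```

Both functions agree on which strings belong to the language; only the transducer also produces a translation of the input.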

http://feedproxy.google.com/~r/StreamHacker/~3/HcVf5lH-Kwc/
13/09/2010 - 10:50

If you liked the NLTK demos, then you'll love the text processing APIs. They provide all the functionality of the demos, plus a little bit more, and return results in JSON. Requests can contain up to 10,000 characters, instead of the 1,000 character limit on the demos, and you can do up to 100 calls per day. These limits may change in the future depending on usage & demand. If you'd like to do more, please fill out this survey to let me know what your needs are.

http://feedproxy.google.com/~r/StreamHacker/~3/CEoX5CRsuzg/
13/09/2010 - 10:50

When your classification model has hundreds or thousands of features, as is the case for text categorization, it's a good bet that many (if not most) of the features are low information. These are features that are common across all classes, and therefore contribute little information to the classification process. Individually they are harmless, but in aggregate, low information features can decrease performance.

Eliminating low information features gives your model clarity by removing noisy data. It can save you from overfitting and the curse of dimensionality. When you use only the higher information features, you can increase performance while also decreasing the size of the model, which results in less memory usage along with faster training and classification. Removing features may seem intuitively wrong, but wait till you see the results.

High Information Feature Selection

Using the same evaluate_classifier method as in the previous post on classifying with bigrams, I got the following results using the 10000 most informative words:

evaluating best word features
accuracy: 0.93
pos precision: 0.890909090909
pos recall: 0.98
neg precision: 0.977777777778
neg recall: 0.88
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0

Contrast this with the results from the first article on classification for sentiment analysis, where we use all the words as features:

evaluating single word features
accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

The accuracy is over 20 percentage points higher when using only the best 10000 words; pos precision has increased by almost 24 points, and neg recall has improved by over 40 points. These are huge increases with no reduction in pos recall and even a slight increase in neg precision. Here's the full code I used to get these results, with an explanation below.

# Requires Python 3 and NLTK 3 with the movie_reviews corpus downloaded.
import collections, itertools
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def evaluate_classifier(featx):
    # featx maps a list of words to a feature dict
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # first 3/4 of each class is used for training, the rest for testing
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    # refsets: gold label -> test indices; testsets: predicted label -> test indices
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()

def word_feats(words):
    # baseline: every word in the review is a feature
    return dict([(word, True) for word in words])

print('evaluating single word features')
evaluate_classifier(word_feats)

# overall word frequencies, and word frequencies within each label
word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for word in movie_reviews.words(categories=['pos']):
    word_fd[word.lower()] += 1
    label_word_fd['pos'][word.lower()] += 1

for word in movie_reviews.words(categories=['neg']):
    word_fd[word.lower()] += 1
    label_word_fd['neg'][word.lower()] += 1

# Arguments expected by BigramAssocMeasures.chi_sq:
# n_ii = label_word_fd[label][word]   (count of word within label)
# n_ix = word_fd[word]                (count of word overall)
# n_xi = label_word_fd[label].N()     (count of all words within label)
# n_xx = label_word_fd.N()            (count of all words)

pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

word_scores = {}

# score each word by summing its chi-square score against both labels
for word, freq in word_fd.items():
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
        (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
        (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score

# keep the 10000 highest scoring words
best = sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:10000]
bestwords = set([w for w, s in best])

def best_word_feats(words):
    # only words in the high information set become features
    return dict([(word, True) for word in words if word in bestwords])

print('evaluating best word features')
evaluate_classifier(best_word_feats)

def best_bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    # add the n most significant bigrams to the high information words
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(best_word_feats(words))
    return d

print('evaluating best words + bigram chi_sq word features')
evaluate_classifier(best_bigram_word_feats)

Calculating Information Gain

To find the highest information features, we need to calculate information gain for each word. Information gain for classification is a measure of how common a feature is in a particular class compared to how common it is in all other classes. A word that occurs primarily in positive movie reviews and rarely in negative reviews is high information. For example, the presence of the word "magnificent" in a movie review is a strong indicator that the review is positive. That makes "magnificent" a high information word. Notice that the most informative features above did not change. That makes sense because the point is to use only the most informative features and ignore the rest.

One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
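To make the scoring step concrete, here is a small self-contained sketch that needs no NLTK. The chi_sq function below is written on the assumption that it matches the arithmetic behind BigramAssocMeasures.chi_sq (the phi-square coefficient scaled by the total count); the word counts are invented toy numbers, not figures from the movie review corpus.

```python
def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-square score from the marginals of a 2x2 contingency table.

    n_ii: count of the word within the label
    n_ix: count of the word overall
    n_xi: count of all words within the label
    n_xx: count of all words
    """
    n_io = n_ix - n_ii                 # the word, outside the label
    n_oi = n_xi - n_ii                 # other words, inside the label
    n_oo = n_xx - n_ii - n_io - n_oi   # other words, outside the label
    numerator = (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    if denominator == 0:
        return 0.0
    return n_xx * numerator / denominator

# Toy counts (hypothetical): occurrences of each word per label.
word_counts = {
    "magnificent": {"pos": 18, "neg": 2},
    "idiotic": {"pos": 1, "neg": 9},
    "the": {"pos": 50, "neg": 50},
}
label_totals = {"pos": 500, "neg": 500}
total = sum(label_totals.values())

# Sum the score against each label, as the post does for pos and neg.
word_scores = {}
for word, per_label in word_counts.items():
    freq = sum(per_label.values())
    word_scores[word] = sum(
        chi_sq(per_label[label], freq, label_totals[label], total)
        for label in label_totals
    )

# Keep the top 2 words and select features by set membership.
best = sorted(word_scores.items(), key=lambda ws: ws[1], reverse=True)[:2]
bestwords = {w for w, s in best}

def best_word_feats(words):
    return {word: True for word in words if word in bestwords}
```

A word that is spread evenly across both labels, like "the" here, scores zero and is dropped, while the skewed words survive the cut.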

Significant Bigrams

The code above also evaluates the inclusion of 200 significant bigram collocations. Here are the results:

evaluating best words + bigram chi_sq word features
accuracy: 0.92
pos precision: 0.913385826772
pos recall: 0.928
neg precision: 0.926829268293
neg recall: 0.912
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
       ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
          ('give', 'us') = True              neg : pos    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
    ('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

This shows that bigrams don't matter much when using only high information words. In this case, the best way to evaluate the difference between including bigrams or not is to look at precision and recall. With the bigrams, we get more uniform performance in each class. Without bigrams, precision and recall are less balanced. But the differences may depend on your particular data, so don't assume these observations are always true.
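For readers who want to see what the per-class numbers mean, here is a minimal sketch of precision and recall computed from sets of test indices, in the same shape as the refsets and testsets built in the code above. The gold and predicted labels are made up for illustration, and the two helper functions are plain-Python stand-ins rather than the NLTK implementations.

```python
import collections

def precision(reference, test):
    """Fraction of predicted items that are correct."""
    if not test:
        return None
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of correct items that were predicted."""
    if not reference:
        return None
    return len(reference & test) / len(reference)

# Hypothetical gold labels and classifier output for six test reviews.
gold = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "pos", "neg", "neg"]

refsets = collections.defaultdict(set)   # gold label -> test indices
testsets = collections.defaultdict(set)  # predicted label -> test indices
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)
    testsets[observed].add(i)

pos_precision = precision(refsets["pos"], testsets["pos"])
pos_recall = recall(refsets["pos"], testsets["pos"])
neg_precision = precision(refsets["neg"], testsets["neg"])
neg_recall = recall(refsets["neg"], testsets["neg"])
```

In this toy case the classifier leans towards pos, so pos recall is perfect while pos precision and neg recall suffer, which is the same kind of imbalance seen in the single word results.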

Improving Feature Selection

The big lesson here is that improving feature selection will improve your classifier. Reducing dimensionality is one of the single best things you can do to improve classifier performance. It's ok to throw away data if that data is not adding value. And it's especially recommended when that data is actually making your model worse.

http://ramslifeofalinguist.blogspot.com/2009/01/google-and-nlu.html
13/09/2010 - 10:50

Page has said the following:

The ultimate search engine would understand exactly what you mean and give back exactly what you want.

 
Thank God he also admits that we're not there yet (although Google no doubt works hard toward this goal).
Natural language understanding (NLU) is so much more than a word-for-word "decoding" of the linguistic meaning. Understanding "exactly what [one] mean[s]" requires full-blown NLU (rather than simply NLP) techniques and approaches. Linguistic and pragmatic context, for instance, figure prominently in NLU. So do some "usability" aspects of the query, such as the intentions of the querent, their assumptions, and the underlying inferences.
The search engines of the future will allow a query to actually organize matching knowledge they mine from the internet instead of simply matching against some web text. So when you plug in a query like "what is the cost of buying a house in Costa Rica in 2009?", you will expect something more specific and on-point than a list of "relevant" documents.

http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/
13/09/2010 - 10:50

It is hard enough to explain what we are doing to our professors; explaining it in plain English to our friends and family is nearly impossible.

So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or Jurafsky.

Markus Dickinson has managed exactly such an explanation in his non-linguistic primer to a serious research paper on Detecting Errors in Part-of-Speech Annotation. The writing is quite old (2003), but it reads well and still feels relevant. Of course, his research page contains more recent papers on the same topic too.

(via Hacklog)

http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/
13/09/2010 - 10:50

I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in the legal domain.

I have just gone through all of Jurix's proceedings as well as all of Artificial Intelligence and Law, and all I got is between 2 and 4 articles worth following up.

There must be somebody actually trying to parse real legal texts and figuring out how to deal with complex organisation, people and group names. But all I can see is articles dealing with levels from ontology and up.

There might even be money in it!

One of the crazy business ideas I had was to parse all the web-based terms of use and privacy notices and annotate/crowd-vote them for how bad they are. So, before creating a web-based account, I could check it against the database/parser and it would highlight and rate for me the passages that I really should pay attention to (e.g. we sell your contact details to every spammer we know). Since the language of those notices is often ritualistically formulaic, extracting an interesting and useful summary would actually be simpler than it looks.

And the business model would center on providing an automatic notification option if a notice from a subscribed website sneakily changed and became much worse. That way one would pay money for peace of mind that there were no unexpected service rule changes.
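The change-detection half of that idea is the easy part. As a rough sketch using only Python's standard difflib module (the function name and the sample notice text are invented for illustration, and judging whether a change is "much worse" is not attempted here):

```python
import difflib

def changed_passages(old_text, new_text):
    """Return (tag, old_lines, new_lines) for every passage that differs."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, old_lines[i1:i2], new_lines[j1:j2]))
    return changes

# Hypothetical before and after versions of a privacy notice.
old_notice = "We never share your email.\nYou may cancel at any time."
new_notice = "We may share your email with partners.\nYou may cancel at any time."

changes = changed_passages(old_notice, new_notice)
```

Each returned passage could then be shown to subscribers, or passed on to whatever rating step decides whether the change deserves an alert.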

http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/
13/09/2010 - 10:50

Dr. René Witte has just created a new mailing list (SENLP) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.

I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques to written notes in support cases could have improved the quality of technical support. I did not get to do any of that, but some interest remains.

The second topic is even more interesting and important to me. It can build on discussions currently held on blogs (see ‘The USES Issue’ at Niels Ott’s blog) and in journals (see ‘Empiricism Is Not a Matter of Faith’ by Ted Pedersen). While some of the issues are discussed on mailing lists for individual pieces of software, a place to discuss cross-cutting concerns is very welcome.

I have joined the list and hope to see at least some of my readers there as well.

http://thenoisychannel.com/2010/08/06/taking-blekko-out-for-a-spin/
13/09/2010 - 10:50

If you’re a search engine junkie like me, you’ve probably heard about Blekko, a search engine that has been percolating for over two years and recently launched a private beta. If not, I encourage you to watch the TechCrunch video I’ve embedded above. You can join the beta by following them on Twitter. I did that earlier this week, and my invitation arrived via a direct message the next day.

Blekko’s main differentiating feature is that it supports “slashtags”. These aren’t the same as the Twitter microsyntax proposed by Chris Messina and named by Chris Blow. Rather, they are a way for users to “spin” their search results using a variety of filters. For example, [climate /liberal] and [climate /conservative] return very different results, because they are restricted to different sets of sites.

In addition to providing a set of curated slashtags, Blekko allows users to define their own slashtags by specifying the sets of sites to be included. There’s a social aspect here too: you can use (and follow) other users’ slashtags. Blekko also has some special slashtags that don’t act as site filters, e.g., /date shows recent results and /seo offers indexing information about web sites.
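To illustrate the site-filter model, here is a toy sketch of how a slashtag query might be parsed and applied. This is a guess at the general idea, not Blekko's implementation: the slashtag definitions, the example domains, and the choice to intersect multiple slashtags are all assumptions, and special slashtags such as /date are not handled.

```python
from urllib.parse import urlparse

# Hypothetical slashtag definitions: tag name -> set of allowed sites.
SLASHTAGS = {
    "liberal": {"example-left.org"},
    "conservative": {"example-right.org"},
}

def parse_query(query):
    """Split a query into plain search terms and slashtag names."""
    terms, tags = [], []
    for token in query.split():
        if token.startswith("/") and len(token) > 1:
            tags.append(token[1:])
        else:
            terms.append(token)
    return terms, tags

def apply_slashtags(result_urls, tags, slashtags=SLASHTAGS):
    """Keep only results hosted on sites allowed by every given slashtag."""
    allowed = None
    for tag in tags:
        sites = slashtags.get(tag, set())
        allowed = sites if allowed is None else allowed & sites
    if allowed is None:  # no slashtags given: leave the results unfiltered
        return list(result_urls)
    return [url for url in result_urls if urlparse(url).hostname in allowed]

results = [
    "https://example-left.org/climate",
    "https://example-right.org/climate",
]
terms, tags = parse_query("climate /liberal")
filtered = apply_slashtags(results, tags)
```

The same result list yields different output for /liberal and /conservative, which is the "spin" described above.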

Blekko emphasizes two characteristics that I find very appealing: transparency and user control. While they do not disclose their relevance ranking algorithm, they do expose some of the information they use to compute it. More significantly, their emphasis on slashtags de-emphasizes default ranking and instead encourages users to take more responsibility in the information seeking process. Very HCIR!

I like the concept. But I’m not sure how I feel about the execution. I have three main concerns.

First, the set of slashtags is somewhat haphazard–to be expected in a beta, but I’m not sure how it will evolve. I’d love to see a vocabulary collectively (and transparently) curated like Wikipedia, but I fear it will look more like the social tagging site Delicious, which is a case study in the “vocabulary problem”. As any information scientist can tell you, managing vocabularies is hard!

Second, I’m not sure if site filters are the right model. What happens to sites with heterogeneous content? Or to sites that have one-hit wonders and therefore are unlikely to show up in any slashtags? I’d prefer to see the sites used as seeds to train classifiers that could then be applied to the entire index. Something a bit more like what Miles Efron implemented in this research–only on a much larger scale and applied at a page rather than site level.

Third, I think there’s a third ingredient that is essential to complement transparency and user control: guidance. As a user, I need to know what slashtags would lead me to interesting results, and ideally I’d want some kind of preview to make exploration as low-cost as possible.

I know I’m asking for a lot–especially from an ambitious startup that has just launched its private beta. But I think the stakes are high in this space, and going easy on a newcomer is no favor. I offer the tough love of a critic who would really like to see this kind of vision succeed.
