2009-01-27

Are citations reliable?

In a paper, Sergei Maslov and colleagues describe a PageRank-based method for evaluating papers; what interests me more, though, is their summary of the problems with citation analysis:
  1. A citation does not indicate how important the cited paper was; in practice, different citations within the same paper carry very different weight;
  2. Citation counts are not comparable across disciplines;
  3. A breakthrough paper tends to collect few citations early on, because its initial research community is small;
  4. Once a paper's results are written into textbooks, citations to it usually stop or drop off sharply.
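
For concreteness, here is a minimal Python sketch of plain PageRank on a toy citation graph. This is only my own illustration of the general idea, not Maslov et al.'s actual method, and the four-paper graph is invented:

    import numpy as np

    # cites[i] = papers that paper i cites (an invented toy graph)
    cites = {0: [1, 2], 1: [2], 2: [], 3: [0, 2]}
    n = len(cites)

    # Column-stochastic matrix: each paper splits its score evenly
    # among the papers it cites; a paper citing nothing spreads uniformly.
    M = np.zeros((n, n))
    for i, out in cites.items():
        if out:
            for j in out:
                M[j, i] = 1.0 / len(out)
        else:
            M[:, i] = 1.0 / n

    d = 0.85                      # the usual damping factor
    r = np.ones(n) / n
    for _ in range(100):          # power iteration to the fixed point
        r = (1 - d) / n + d * M @ r

    print({paper: round(score, 3) for paper, score in enumerate(r)})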

2009-01-24

People believe what they want to believe

People believe what they want to believe; everyone does.
A classic line from Lie to Me, S01E01.

2009-01-21

"Anonymous editing" versus "signed authorship"

To compete with Wikipedia, Google launched Knol, which began operating in July 2008 and has now run for six months. Google recently announced that Knol has reached 100,000 articles in total; that is a long way from the English Wikipedia's 2.71 million, but quite an achievement for six months. Notably, the problems people once worried about in Knol's operating model are beginning to surface: the signed-authorship and review mechanisms have not played the role expected of them, and the issues seem set to keep dogging Knol: low readership, low content quality, and no small amount of duplicate content. As Nate Anderson's analysis puts it:
Take "Barack Obama," for instance. A search for his name brings up 809 entries; since most Knol users appear to write their own entries rather than add to others (for which no compensation is forthcoming), the proliferation of entries is inevitable. And it's not at all clear that the best ones are rising to the top.

When they disagree, most Knol users prefer to write their own article rather than add their views to someone else's, so content proliferation is hard to avoid. Interestingly, academic papers share this trait. The contest between the "anonymous editing" that Wikipedia champions and the "signed authorship" that Knol champions will go on, and it will spread to academic writing as well. One thing will not change, though: readers always hope to read content that is "stable" and "trustworthy".

A brain model knitted from yarn

It is reported that Karen Norberg, a psychiatrist at the National Bureau of Economic Research in Cambridge, Massachusetts, spent a year knitting a model of the brain out of yarn.

[Photos of the knitted brain model]

Who knows, these pictures may yet serve as raw material for "yarn theory" someday :).

2009-01-19

Softness: life without end

Evil is the mirror of good

Well-meaning coercion is more frightening than naked evil;
the latter is easy to recognize, while the former hides behind morality.

No evil, no good

2009-01-18

How it turns something simple into something complex

This is a patent application filed by IBM (US 2007/0282849 A1). It concerns trim(), the technique of removing leading and trailing whitespace characters. Most programmers presumably know trim() or strip(), and precisely because everyone understands the "technology", this filing makes an extremely valuable "patent-application template": how to take a technique that a few sentences explain completely and write it up as a 14-page patent application with 20 claims.
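
For contrast, here is roughly everything the claimed technique amounts to, in a Python sketch of my own (the built-in str.strip() already does the same job):

    def trim(s: str) -> str:
        """Remove leading and trailing whitespace characters."""
        start, end = 0, len(s)
        while start < end and s[start].isspace():
            start += 1
        while end > start and s[end - 1].isspace():
            end -= 1
        return s[start:end]

    assert trim("  hello world \t\n") == "hello world"
    assert trim("   ") == ""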

2009-01-11

The mysterious continued fraction

Let me recommend an article introducing continued fractions, "An Introduction to the Continued Fraction". It is easy to read and also quite rich in content.
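
As a small taste of the topic, here is a Python sketch of my own that computes the continued-fraction terms of a rational number and folds them back into a fraction; the 415/93 test value is a classic textbook example, not something taken from the recommended article:

    from fractions import Fraction

    def cf_terms(x: Fraction) -> list:
        """Continued-fraction terms [a0; a1, a2, ...] of a rational x."""
        terms = []
        while True:
            a = x.numerator // x.denominator   # integer part (floor)
            terms.append(a)
            x -= a
            if x == 0:
                return terms
            x = 1 / x                          # recurse on the reciprocal

    def cf_value(terms: list) -> Fraction:
        """Fold the terms back into a single fraction."""
        value = Fraction(terms[-1])
        for a in reversed(terms[:-1]):
            value = a + 1 / value
        return value

    assert cf_terms(Fraction(415, 93)) == [4, 2, 6, 7]   # 415/93 = [4; 2, 6, 7]
    assert cf_value([4, 2, 6, 7]) == Fraction(415, 93)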

2009-01-09

A self-criticism: so I am vulgar after all

Barely ten days into 2009, under the call to "stop flip-flopping" (bu zheteng) and amid an atmosphere of reckless, blind flip-flopping, I discovered that NetEase and Sina, sites I often wander through, and Google, the search engine I use daily, have all been listed as vulgar websites, and that blogs I occasionally followed have been permanently shut down and deleted for vulgarity.

In this massive, great, and correct rectification campaign of "no flip-flopping, anti-vulgarity, grasp production", I have come to the profound realization that I am vulgar. Honestly, this depresses me. I have no great merits, and while small mistakes are constant, big ones rarely happen. Yet to discover at the start of a new year that I am vulgar has dealt a heavy blow to my self-confidence and enthusiasm.

I am depressed, but the organization may rest assured: my mood is stable. I just need to take a walk, relax, stretch my limbs, and adjust my state. Fortunately the problem was discovered early, so the error can still be corrected in time. I promise that from now on, while throwing myself into revolutionary production, I will not forget to read more study materials, raise my theoretical understanding, and let theory guide practice.

2009-01-08

NetEase's comment threads

NetEase News has put together a 2008 year-end feature, "No comments, no news", a rather original and sentimental retrospective on the history of "comment-tower building". What interests me more, though, are the figures it lays out:

(NetEase) In 2008:
2,397,339 news articles
41,658,635 comments
550 editors
290,000,000 netizens
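
Some back-of-the-envelope ratios from those figures, a quick Python check of my own rather than anything from the feature:

    # Rough ratios derived from the published 2008 figures above.
    articles, comments, editors = 2_397_339, 41_658_635, 550
    print(f"comments per article: {comments / articles:.1f}")              # ~17.4
    print(f"articles per editor per day: {articles / editors / 365:.1f}")  # ~11.9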


It would be wonderful if NetEase, Sina, and the like released open datasets for the academics to study.

Compressed Sensing

Wikipedia defines compressed sensing like this:
Compressed sensing is a technique for acquiring and reconstructing a signal utilizing the prior knowledge that it is sparse or compressible.

The main idea behind compressed sensing is to exploit that there is some structure and redundancy in most interesting signals -- they are not pure noise. In particular, most signals are sparse, that is, they contain many coefficients close to or equal to zero, when represented in some domain.

Something new to learn! If you are interested, start with the introduction on Terence Tao's blog.

The thing is that while the space of all images has 2MB worth of “degrees of freedom” or “entropy”, the space of all interesting images is much smaller, and can be stored using much less space, especially if one is willing to throw away some of the quality of the image.

...

What if the camera selects a completely different set of 100,000 (or 300,000) wavelets, and thus loses all the interesting information in the image?

The solution to this problem is both simple and unintuitive. It is to make 300,000 measurements which are totally unrelated to the wavelet basis - despite all that I have said above regarding how this is the best basis in which to view and compress images. In fact, the best types of measurements to make are (pseudo-)random measurements - generating, say, 300,000 random “mask” images and measuring the extent to which the actual image resembles each of the masks. Now, these measurements (or “correlations”) between the image and the masks are likely to be all very small, and very random. But - and this is the key point - each one of the 2 million possible wavelets which comprise the image will generate their own distinctive “signature” inside these random measurements, as they will correlate positively against some of the masks, negatively against others, and be uncorrelated with yet more masks.

But (with overwhelming probability) each of the 2 million signatures will be distinct; furthermore, it turns out that arbitrary linear combinations of up to a hundred thousand of these signatures will still be distinct from each other (from a linear algebra perspective, this is because two randomly chosen 100,000-dimensional subspaces of a 300,000 dimensional ambient space will be almost certainly disjoint from each other). Because of this, it is possible in principle to recover the image (or at least the 100,000 most important components of the image) from these 300,000 random measurements. In short, we are constructing a linear algebra analogue of a hash function.
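
To make this concrete, here is a toy Python sketch: take random measurements of a sparse signal and recover it with orthogonal matching pursuit. OMP is my greedy stand-in for the L1-minimization recovery in the actual theory, and the sizes are toy values, nowhere near the 2-million-wavelet example in the quote:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 400, 100, 8          # signal length, measurements, sparsity

    x = np.zeros(n)                # a k-sparse signal in some basis
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.normal(size=k)

    A = rng.normal(size=(m, n)) / np.sqrt(m)   # random "masks" as rows
    y = A @ x                                  # m << n measurements

    # Orthogonal matching pursuit: greedily pick the column most
    # correlated with the residual, then re-fit on the chosen support.
    sel, resid = [], y.copy()
    for _ in range(k):
        sel.append(int(np.argmax(np.abs(A.T @ resid))))
        coef, *_ = np.linalg.lstsq(A[:, sel], y, rcond=None)
        resid = y - A[:, sel] @ coef

    x_hat = np.zeros(n)
    x_hat[sel] = coef
    print("support recovered:", set(sel) == set(support.tolist()))
    print("max coefficient error:", float(np.max(np.abs(x_hat - x))))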


Tao's lectures on compressed sensing are also available here, in 7 parts.