２，现有一数据文件内含有10 billion 整数，请设计一个高效的排序方法。 -eagle_no1(瞎起哄) @拉斯维加斯华人论坛:拉斯维加斯枫下论坛 The Rolia Forum of Las Vegas

２，现有一数据文件内含有10 billion 整数，请设计一个高效的排序方法。

eagle_no1(瞎起哄)

where are those " 10 billion 整数" stored ?
let assuming a file there are 10 billion 整数, represented as array:

int Source[10,000,000,000]; where source[i] is the integer value in i location,

the destination file is represented as int Destination[10,000,000,000] initialized as 0 ;

here is the sort, you might call hash algorithm with a counter to address the conflict

for( i=0;i<10,000,000,000;i++)
Destination[Source[i]] ++;

the key : the maximum value of a integer(32 bits) is less than 10,000,000,000;

am i right?

(#6044984@0)
Last Updated: 2010-5-1
This post has been archived. It cannot be replied.

就是看知不知道“链表”和“递归”吧。 -ao-ao(ao-ao); 2010-4-30 {273} (#6043795@0)

１，设计一个数据结构，该数据结构必须具有Stack 的 push and pop 功能。另外该数据结构必须有constant time 的search功能。
这个应当就是让你做个“链表”而已。

２，现有一数据文件内含有10 billion 整数，请设计一个高效的排序方法。
这个应当就是做个递归，我想应当分段Parallel处理。

多句嘴，你要是programmer的话得找本数据结构看看了，基本概念都不对啊 -yossarian(running); 2010-4-30 (#6044544@0)

唉，你呢其一，没有瞄准，其二呢，还没有弹药。让我觉得好笑。再给你一次机会，瞄准（指出对方的错误）然后上弹药（提出自己的见解）。至少我以后看见你不会笑话你，还有可能尊重你。fair? -ao-ao(ao-ao); 2010-5-1 (#6045101@0)

呵呵，算我没说吧 -yossarian(running); 2010-5-1 (#6045138@0)
人家话说得很实在，信言不美：） -frankwoo(柳五随风); 2010-5-1 (#6045696@0)

the best sorting is always O(nlogn), can't be better, even with mult threads -sowen(昂居居); 2010-4-30 (#6043831@0)

1. A hash map + a stack of reference. 2. Presumably, you cannot load all data and sort. The key is to reduce IO and memory usage. One common way is to construct a BTree, like database building an index. -majorhomedepot(马甲后的炮); 2010-4-30 (#6044396@0)

惨了，我连题目都没看懂，什么是整数？billion是多少？ -uwofri(0x2B||!0x2B); 2010-4-30 (#6044634@0)

那你可以应聘manager -michaeljackson(You Rock My World); 2010-5-1 (#6044969@0)

２，现有一数据文件内含有10 billion 整数，请设计一个高效的排序方法。 -eagle_no1(瞎起哄); 2010-5-1 {562} (#6044984@0)

好像错了。 “the maximum value of a integer(32 bits) is less than 10,000,000,000“。负数也是整数，所以Destination[-1]很有可能被system kick out。 -c1xwy(洪兴罩俺去战斗); 2010-5-1 (#6045010@0)

算法没错吧？当无符号正整数也一样满足条件嘛；这个coding/debugging 的事随便找车哥来干干就成；对了，车哥那去了？ -eagle_no1(瞎起哄); 2010-5-1 (#6045021@0)

1。memory 问题。 2。我觉得应该考虑堆排序 -frankwoo(柳五随风); 2010-5-1 (#6045118@0)

谈到实现方法，没memory 问题，可以直接打开目标文件，将值写到相应的位子；其实再看一眼原题，要”高效“，就得用空间换时间了； -eagle_no1(瞎起哄); 2010-5-1 (#6045174@0)

你生成了一个排序文件，但是原始数据并没有排序。 -frankwoo(柳五随风); 2010-5-1 (#6045701@0)

又是一个严谨的，目标文件的排序信息已经完全了，要排好序的文件顺次写回去不得了吗？车哥呢？ -eagle_no1(瞎起哄); 2010-5-1 (#6045713@0)

那还能高效？ -frankwoo(柳五随风); 2010-5-1 (#6045719@0)

一次遍历完成排序，还能更少吗？原题要求生成排序文件吗？是“排序方法”吧； -eagle_no1(瞎起哄); 2010-5-1 (#6045735@0)

所以原体未必强调的是算法，也有可能是偏重数据文件扩展使之可以索引，类似于PDF之类。真要大规模写，即使O(n)也谈不上高效，更何况是10billion次的写。 -frankwoo(柳五随风); 2010-5-1 (#6045743@0)

我也是路过凑热闹，想看看别人的具体解决方案和他们期待的答案； -eagle_no1(瞎起哄); 2010-5-1 (#6045747@0)

:),我倒是觉得这问题人家比较看重思路，而不是具体的做法，对不同的需求，做法也会大相径庭。 -frankwoo(柳五随风); 2010-5-1 (#6045752@0)

瞎猜啊，我觉得出题者的本意是是想看查面试者对Hadoop/MapReduce之类分布式处理系统/环境的理解，尤其是考虑到Amazon近些年在云计算方面投如巨资这个背景。如果是我回答的话，我想我会说用先将这个大的数据文件分割为若干个小的数据集分头排序然后再进行合并排序。 -renjl0810(Virtual Void); 2010-5-1 (#6045459@0)

有可能哈，又看了一眼，我这个算法正好支持分段并行处理，：-）；不过，是一个人做十天，还是十个人做一天，并不改变算法复杂度，是吗？ -eagle_no1(瞎起哄); 2010-5-1 (#6045542@0)

算法本身的复杂度肯定就这样了，关键就像是你所说的，是一个人做十天还是十个人做一天了。当然，现在及今后的趋势肯定是十个人做一天，也就是分布处理，如Hadoop/MapReduce, etc. -renjl0810(Virtual Void); 2010-5-1 {140} (#6045663@0)
所以我又点好奇，“请设计一个高效的排序方法“，其中的方法不知道英文原文是什么，从中文的角度既可以指算法这一级也可是分布式处理的体系结构这一级

好吧，就用这算法，加你的分布式处理的体系结构，这题就这么解了；大家还有意见吗？（当然我的算法描述不严谨，什么目标文件开大了，什么10billion 全是1会让计数器溢出的意见就不要提了） -eagle_no1(瞎起哄); 2010-5-1 (#6045686@0)

you assume there is no duplicated integer, so wrong ... -newstranger(wonderland); 2010-5-2 (#6045860@0)

The counter is to address the duplicated integer, say, there are n "1" in source file, after calculation, the destination[1]=n; i just said, the duplication time not larger than 0xffffffff; otherwise the destination[i] should be 64 bit wide. -eagle_no1(瞎起哄); 2010-5-2 (#6045922@0)

计时分布处理，也需要多次同步，效率上未必好 -- p.s我倒觉得这个问题提出来是要决绝类似于海量索引问题。当然本身产生这样的文件就是一个很不怎么样的设计。如果需要算法上分布，还不如趁早在数据上分布。 -frankwoo(柳五随风); 2010-5-1 (#6045714@0)

这道题考的是海量数据排序，内存有限不能全读入，内存排序算法根本不重要。看一看数据库排序的实现。分割成内存能放下的小块排序，在合并排序。其间要不断地使用硬盘的临时空间。 -informix(); 2010-5-1 (#6045731@0)

是整数，不是记录,用您的方法产生的B/B+树纳的多大呀 -frankwoo(柳五随风); 2010-5-1 (#6045746@0)

两码事，B+树是用于快速查找不是用于排序。 -informix(); 2010-5-1 (#6045750@0)

详细点，内存空间100，排序900个数，读入第一个100，排序，写入临时文件，结下来读第二个100，排序，依次类推，共写入9个临时文件。读入九个文件的第一个数 -informix(); 2010-5-1 {136} (#6045748@0)
选出最小的一个数，写入结果文件，读入最小的那个数的组的第二个数，代替那个最小的数，再选出最小的数，写入结果结果文件，依次类推，得到结果。

数据库就是这样同归分段归并排序的？假设无需B/B++ Tree，那么归并后的datafile还是10G?这还是数据库吗？ -frankwoo(柳五随风); 2010-5-1 (#6045759@0)

数据库的排序实现分为两种，查询结果小于内存容量的直接调入内存hash or 快速排序 or ....等等，(实际上内存排序算法根本不重要）。结果集大于内存的，基本上使用的这种算法。至少db2/oracle 都是这样。 -informix(); 2010-5-2 (#6045765@0)

这种算法的优势是读硬盘的次数少。读一块硬盘的时间，通常至少是读内存时间1000倍，所以在数据库系统的设计上大家都不介意内存的算法。为什么要用Btree作为index，因为一个上百万的table 只需要读两三个硬盘的block就可以allocate到数据。 -informix(); 2010-5-2 (#6045771@0)

那么归并后的datafile还是10G?这还是数据库吗？-------不明白你的意思。 -informix(); 2010-5-2 (#6045789@0)

数据相关性通过存储一级表现/体现，这恐怕不是数据库的本意吧 -frankwoo(柳五随风); 2010-5-4 (#6050838@0)

I think this is the correct answer. I initially thought caching non-leaf B+ tree node in memory can help, but it is definitely not as good as this algorithm. -majorhomedepot(马甲后的炮); 2010-5-2 (#6046595@0)

It is wrong. Why 10,000,000,001 is not an integer? It is just limitation of a certain machine. -majorhomedepot(马甲后的炮); 2010-5-2 (#6046600@0)

i was not quit sure actually; but why they say " integer" explicitely? what the information are they trying to deliver? -eagle_no1(瞎起哄); 2010-5-2 (#6046623@0)

You solution is creative, but I doubt "integer" indicates only 32-bit integers. Even 64-bit machines are so common today. In a real interview, you can always verify. -majorhomedepot(马甲后的炮); 2010-5-2 (#6046813@0)

@Las Vegas

２，现有一数据文件内含有10 billion 整数，请设计一个高效的排序方法。

Replies, comments and Discussions:

More Topics