l******9 posts: 579 | 1 Can I do this with boost/thread?
Thanks. |
|
|
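M********u posts: 42 | 3 If it is a simple for loop, just wrap the loop with OpenMP. If the logic is more complex, you need to create threads manually. There is no simple way to turn a single-threaded program into a multithreaded one. |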
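A minimal sketch of the simple-loop case, assuming the loop iterations are independent. The function name `sum_of_squares` is made up for illustration. The program needs an OpenMP-enabled build (for example `-fopenmp` with GCC) to actually run in parallel; without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical example: sum the squares of a vector.
// reduction(+:total) gives each thread a private partial sum and
// combines the partial sums at the end, so `total` is never raced on.
std::int64_t sum_of_squares(const std::vector<int>& values) {
    std::int64_t total = 0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += static_cast<std::int64_t>(values[i]) * values[i];
    }
    return total;
}
```

Calling `sum_of_squares` on a vector holding 1, 2, 3, 4 adds 1 + 4 + 9 + 16 regardless of how many threads OpenMP decides to use.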
|
C**5 posts: 202 | 4 I'm very interested in your question too. If you find a solution, could you send me a copy? |
|
l******9 posts: 579 | 5 I am also thinking about OpenMP.
But how can I make sure that OpenMP makes full use of the available cores?
Suppose that I have 24 CPUs, each with 6 cores (each core supports hyperthreading).
I have 10,000 computing tasks, each of which takes 0.001 second.
Some of the tasks need to exchange data, which is very small.
Which task sends data to or receives data from which other task is predefined; it is known before the program runs.
But the exchange frequency may be very high.
I want to schedule task... |
|
|
S*A posts: 7142 | 7 You just need to make sure your program has at least 144 threads.
The kernel will try to schedule them across the cores as needed. |
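A rough sketch of "give the OS enough threads and let it place them", under a few assumptions of mine: rather than hard-coding 144, it asks `std::thread::hardware_concurrency()` for the number of logical CPUs, and because the tasks in the question are very short it reuses a fixed set of workers instead of creating one thread per task. The name `run_tasks` and the fallback of 4 workers are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Runs `task_count` trivial tasks on one worker thread per logical CPU and
// returns how many tasks were executed. Which core each thread runs on is
// left entirely to the OS scheduler.
int run_tasks(int task_count) {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;  // the call may return 0 when unknown

    std::atomic<int> next{0};  // index of the next unclaimed task
    std::atomic<int> done{0};  // number of completed tasks

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker keeps claiming task indices until none are left.
            while (next.fetch_add(1) < task_count) {
                done.fetch_add(1);  // stand-in for the real short task
            }
        });
    }
    for (auto& t : pool) t.join();
    return done.load();
}
```

With this shape, `run_tasks(10000)` executes each of the 10,000 tasks exactly once; no thread is tied to a particular core.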
|
l*****o posts: 473 | 8 pthread_setaffinity_np seems to be the only way.
I think that in this case it would be hard to ask the OS to assign a different core to each thread.
On Linux, a thread is initially assigned to the same core, but it will be migrated to other cores if it is long-lived. If a thread is too short-lived, it will probably stay on the same core. In my experience, using processes can be faster than using threads for such short-lived work. |
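For reference, a small sketch of how `pthread_setaffinity_np` can pin the calling thread to one core. It is Linux/glibc-specific and relies on `_GNU_SOURCE`, which g++ defines by default. The helper name `pin_self_to_first_allowed_cpu` is hypothetical; picking the first CPU in the current affinity mask (instead of a fixed core number) is my assumption, made so the sketch also works inside containers with a restricted CPU set.

```cpp
#include <pthread.h>
#include <sched.h>

// Pins the calling thread to the first CPU in its current affinity mask.
// Returns that CPU number on success, or -1 on failure.
int pin_self_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (pthread_getaffinity_np(pthread_self(), sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);  // mask containing only this CPU
        if (pthread_setaffinity_np(pthread_self(), sizeof(target), &target) != 0)
            return -1;
        return cpu;
    }
    return -1;  // no allowed CPU found
}
```

To place worker `i` on core `i`, the same two calls can be made from inside each worker thread with `CPU_SET(i, &target)`, provided core `i` is in the allowed mask.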
|
l*****o posts: 473 | 9 The OP's machine is so cool. Is it NUMA? Could I borrow it sometime to try a program? Our boss won't buy us a machine this good. |
|
l******9 posts: 579 | 10 Hi,
I am trying to parallelize a compute-intensive problem.
I am working on a Linux cluster where each node has multicore processors, e.g. 2 or 4 quad-core processors per node.
I want to reduce latency and improve performance as much as possible.
I plan to use multiprocessing and multithreading at the same time.
Each process runs on a distinct node, and each process spawns many threads on its node. This is two-level parallelism.
For multiprocessing, I would like to choose MPI.
For multithre... |
|
w*s posts: 7227 | 11 I am concerned about performance for this HW/SW interrupt-driven embedded system.
Should I use multiple processes rather than multiple threads, assuming memory is not a problem?
Also, this is a multicore i.MX6 chip. |
|
b*******s posts: 5216 | 12 It varies. Some real-time Unix systems use processes as the basic unit of scheduling, but Linux has long been optimized for threads.
If you want stability, processes are better. For performance, though, context switching can be more expensive with processes. |
|
w*s posts: 7227 | 13 Expert,
regarding performance, assume all processes have the same priority.
Say the system already has 10 processes running.
1. Process approach:
I start 2 processes to handle 2 pieces of HW;
each HW process constantly polls info from all devices on the 485 bus.
2. Thread approach:
I start 1 process with 2 threads for the 2 pieces of HW.
Don't you think the process approach gives the HW handling more chances to run? |
|
k****n posts: 1334 | 14 It is mainstream now.
~~~~~~~~~~~~ Countless companies are working on this now: multicore, manycore. |
|
r*********r posts: 3195 | 15 I heard that JDT can now do parallel compilation on multicore machines... not sure if it's true.
I only use CDT, so I don't know whether there is anything new for it. |
|
a****l posts: 8211 | 16 Using MPI on a multicore machine is like writing to your roommate through the post office. |
|
p***o posts: 1252 | 17 You need to declare the global boolean 'volatile' for this to work with modern compilers on most modern multicore processors. Roughly speaking, you need to tell the compiler and the processor NOT to reorder the writes in the background and the reads/writes in the main thread. Search for the keyword 'memory barrier'. That's why it's better for him to learn from some decent books ... |
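A hedged sketch of such a flag in standard C++. Note that in ISO C++ (C++11 and later) `volatile` by itself gives no atomicity or inter-thread ordering guarantees; `std::atomic<bool>` is the standard tool that does, so the sketch uses it. The names `quit`, `worker_loop`, and `request_quit` are hypothetical.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> quit{false};  // hypothetical global stop flag

// Background loop: counts iterations until another thread sets `quit`.
// The acquire load pairs with the release store in request_quit().
long worker_loop() {
    long iterations = 0;
    while (!quit.load(std::memory_order_acquire)) {
        ++iterations;
        std::this_thread::yield();
    }
    return iterations;
}

// Called from the main (or UI) thread to ask the worker to stop.
void request_quit() {
    quit.store(true, std::memory_order_release);
}
```

Typical use is to start `worker_loop` in a `std::thread`, call `request_quit()` from the main thread, and then `join()` the worker; the default `memory_order_seq_cst` would also work and is the simpler choice when in doubt.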
|
f******k posts: 297 | 18 Is there a way in Windows to know which logical processor a thread is running on? Does it happen often that the OS scheduler moves a thread between processors?
Also, in a hyperthreaded multicore environment, does the OS prefer to schedule threads on different physical processors first, or does it treat every logical processor equally?
Thx. |
|
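r****t posts: 10904 | 20 Python's documentation is pretty good. I recently heard someone else complain about the Python docs too; is it really that bad?
The good thing about python/numpy is that it uses less memory than matlab when you write things the default, intuitive way. matlab can do the same, but only with special-purpose functions, and then the syntax is no longer intuitive at all. On the other hand, matlab is impractical on a multicore batch system (one license per CPU), where a cluster typically has hundreds of them; with python there is no extra cost no matter how many cores you use. Its interactive parallel support is also better than matlab's. In short, once you go beyond a single machine, matlab is no longer practical. |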
|
k**********g posts: 989 | 22
Grand Central Dispatch (GCD)
True. This is due to increased processor power consumption in multicore processing: overheating, the battery running out faster, etc.
Good lol-logic, lol. |
|
k**********g posts: 989 | 23
The correct approach is to use the actor model.
It may not perform well, but any other multicore approach will be 10x more difficult than using the actor model.
The fundamental constructs inside an actor model framework:
1. thread-safe queue (preferably lock-free or obstruction-free)
2. thread pool
3. worker pool (each worker runs on one thread of the thread pool)
4. task queue / task pool (a collection of ready-to-run tasks)
(remark: it is called task pool because it is not necessarily a ... |
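As an illustration of the first construct only, here is a minimal thread-safe queue. It is a sketch under a stated simplification: it is lock-based (a mutex plus a condition variable), not the lock-free or obstruction-free variant the post prefers, and the class name `BlockingQueue` is hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// A minimal blocking FIFO queue guarded by a mutex. It only illustrates
// the interface a worker thread would use to receive messages or tasks.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        ready_.notify_one();  // wake one waiting consumer
    }

    // Blocks until an item is available, then removes and returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};
```

A worker in the pool would loop on `pop()` and run whatever it receives; producers on any thread call `push()`.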
|
k**********g posts: 989 | 24
I think the biggest contributions are actually immutability and data-flow thinking (value-based thinking).
These have contributed changes to database architectures, as well as to application programming. They also make multicore programming easier.
These ideas can be brought back into imperative (procedural) or OO languages.
To bring them back into OO languages, there needs to be a kind of immutable class (similar to the C++ const keyword):
(1) all class fields are marked with the "immutable" keyword
(2) if class fields contain references to anot... |
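A small sketch of point (1) in C++, where `const` data members stand in for the proposed "immutable" keyword. The `Account` class and its fields are hypothetical, and the sketch does not attempt the truncated point (2).

```cpp
#include <string>
#include <utility>

// Hypothetical immutable value type: every field is const, so an object
// can be shared between threads for reading without any locking.
class Account {
public:
    Account(std::string owner, long balance_cents)
        : owner_(std::move(owner)), balance_cents_(balance_cents) {}

    const std::string& owner() const { return owner_; }
    long balance_cents() const { return balance_cents_; }

    // "Changing" a value means building a new object;
    // the original is left untouched.
    Account deposit(long cents) const {
        return Account(owner_, balance_cents_ + cents);
    }

private:
    const std::string owner_;
    const long balance_cents_;
};
```

Because `deposit` returns a new `Account`, a thread holding the old object keeps seeing the old balance, which is the value-based style the post describes.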
|
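T********i posts: 2416 | 25 Those of you who post all day long: it's not your job and nobody pays you to post, so why get so worked up attacking everyone who disagrees?
If you have that much time, why not work on your fundamentals?
On a single dual-CPU machine, 40G full duplex is no problem; the Solarflare 7122F claims 20 million messages per second. I use a 6122F at 5 million 72-byte messages per second and have not dropped a single packet in two years. I use UDP packets. Of course, TCP performs almost the same and is more reliable.
Multicore concurrency: there is essentially no public material, and those who know won't tell.
Here I can reveal some common knowledge:
1. The NIC's socket API is proprietary and can bypass the kernel completely; existing programs that use socket I/O don't even need recompiling. See OpenOnLoad.
2. More NICs in a system is not always better. On the latest Sandy Bridge and later architectures, one NIC attached to each CPU is optimal.
3. Socket I/O is best handled by a single thread; one thread gives the highest socket I/O throughput. Work out the reason yourself. But I can say for certain that the Java experts jumping around most eagerly on this board have not grasped the basics of this problem.
4. Two CPUs with 16 cores in total (8X... |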
|
g*****g posts: 34805 | 26 The performance gain of C++ over Java comes from startup time, JIT warm-up, and JIT binary code compilation, but not from memory reclamation. As a matter of fact, Java memory reclamation would be faster than C++ unless you work hard to optimize memory reclamation on the C++ side. The reasons are:
1. Java runs garbage collection on a separate thread or threads, while C++ code typically runs in the main thread. In a multicore environment, which is commonplace today, Java has the advantage.
2. When CPUs are loade... |
|
g*****g posts: 34805 | 27 http://electronicdesign.com/analog/memory-wall-ending-multicore
No, I am talking about supercomputers that really should be CPU-bound but can't be because of the relatively slow speed of data transfer. I don't think this situation will change any time soon. Network infrastructure evolves even more slowly than memory size/speed. |
|
w*s posts: 7227 | 28 [The following text is reposted from the Linux board]
From: wds (中原一点红:心开运就通,运通福就来), Board: Linux
Subject: when should use multiprocess not multithread: embedded multicore linux
Posted at: BBS 未名空间站 (Mon Feb 3 23:42:30 2014, US Eastern)
I am concerned about performance for this HW/SW interrupt-driven embedded system.
Should I use multiple processes rather than multiple threads, assuming memory is not a problem?
Also, this is a multicore i.MX6 chip. |
|
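m*******8 posts: 183 | 29 Is the OP showing off? I'm a C++ expert too: I developed an emulator, the QEMU / Android Emulator kind, from scratch, entirely in C++, and it runs multicore Linux beautifully; one person, done in four months.
And I don't make 300K, not even 150K. |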
|
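f******2 posts: 2455 | 30 Faint, I've accidentally gone against the flow again...
As for what counts as scale-out: the multicore people say theirs counts and that what you ILP people do doesn't;
the internet people who pile up machines say their MPP clusters count and that what you SMP people do doesn't.
Reading more and judging others less is a virtue of our ancestors; many people forget it after coming to the US. |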
|
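k**********g posts: 989 | 31
The OP specified HPC Architecture, which is not the same as plain HPC. Probably fewer than ten big players worldwide (I, A, Nv, Qc, A(uk), Xx) plus Wall Street would use HPCA (note the Architecture).
That one letter A should make a world of difference, right? My guess is CPU architecture, cache coherence, multicore interconnect, memory interconnect, and the like? Or could the OP list the topics so I can learn something? Much appreciated.
I admit I am ill-informed: I have never worked at any of those big companies, nor do I have ties to academia. But my impression is that big companies want the top people at universities to hand over new ideas for free through papers and conferences, and then redo the work in-house in imitation; that does not mean the door is open to students (especially non-citizens). (I am not saying big companies don't innovate internally, only that the ideas they develop in-house are never opened to the outside.) |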
|
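n*******7 posts: 181 | 32 There are also ways to weaken data dependencies. In some respects this is similar to the design of cache coherency. In a multicore system, the memory values each core sees can be changed by all the other cores; this is also a strongly coupled relationship, and logically it is a distributed system. Hardware implements cache coherency with a mature set of protocols, and distributed software can do the same. |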
|
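b***i posts: 3043 | 33 I don't think this is a memory barrier problem. Many people have brought up memory barriers before: they prevent reordering. But in my code there is no other code before or after the statement quit=true;. Surely the condition and the result of an if can't be reordered?
UI callback:
void UIcallback(...){
quit=true;// no other statements here
}
or the TCP handling callback:
void TCPcallback(...){
...A
if (str=="QUIT")
quit=true;// no other statements here
else
...B
...C
}
With the if there, surely ...A and quit=true; can't be swapped? Even with statements A/C here, they are mutually exclusive with quit=true; that is, I can't possibly keep doing other work once the thread needs to exit. If someone asks what to do when other work must be done, such as releasing resources, the answer is simple: do it after the thread's while loop ends. So I say the authorities many people cite are giving advice for the general case. For my specific case, with just one bool, no critical section is needed, just a... |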
|
d******c posts: 2407 | 34 pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset
There are additional, hidden memory killers in the project, like the way that we use Python objects (like strings) for many internal details, so it's not unusual to see a dataset that is 5GB on disk take up 20GB or more in memory. It's an overall bad situation for large datasets.
The 10 (really 11) things are (paraphrasing my own words):
Internals too far from "the metal"
No support for memory-mapped datasets
Poor p... |
|
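posts: 1 | 36 Let me explain whether autonomous driving actually needs an OS:
Current vehicle platforms:
Waymo/Google/Intel => Linux
Tesla/Nvidia => Linux
Denso/Toyota => QNX
GM/Delphi => QNX
Baidu => QNX
An autonomous driving system cannot be written by a single vendor, a single team, or a single person.
Without an OS, how can you possibly keep others from shitting on your code?
Future standard: Multicore + Hypervisor
Use a hypervisor to isolate tasks; can a few interrupt handlers you write keep other people's code out?
Currently only linux/qnx/vxworks support both a hypervisor and a GPU; do you think nvidia will open up its architecture and let you write the SoC drivers yourself?
Do you think one person can write an autonomous driving system and understand all the shit from nvidia, arm, and mobile eye at the same time? |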
|
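posts: 1 | 37 When you bought your car, did you ever hear the term Variable Valve Timing? It is controlled by the ECU. If the ECU crashes, your engine stalls, and the steering wheel of course loses power assist. Today's car ECU is a computer controlling ignition; future car computers will also control the driving. Systems will only get more complex.
Hypervisors are the future direction for MCUs; go look at the Cortex-M33.
Cores by themselves isolate nothing. What is a program? A program is memory state. The memory in your SoC is all shared (I won't even go into multicore coherency; you probably wouldn't understand it), so what can cores alone isolate?
To those who say memory isolation is simple: go read up on how virtual memory, physical addresses, the MMU, and paging are implemented, and see whether you can write one yourself.
Reading sensor data is simple? Write me a program that composites MIPI CSI data from six 1080p cameras in real time and then analyzes the obstacles in it.
Have you only done small projects like temperature sensors or GPS?
Conclusion: you can brag, but I don't believe you can build an autonomous driving system without an OS... |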
|
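T********i posts: 2416 | 38 These engine computers are all highly isolated. How could they share a CPU with other systems?
The MPU has been there since the Cortex M3. The FreeRTOS I just told you about has an implementation.
Young people should be humble and careful. In multicore work, synchronization relies entirely on communication between cores, entirely on coherency. Occasionally you also have to guess how the cache agent works, because these are trade secrets; you can only form hypotheses from white papers and then verify them. I wore these things out 10 years ago.
Think of it this way: all this sensor data is on one critical path. If any stage runs away, the whole system has to go failsafe. Memory isolation really doesn't help much.
In this world, a problem is not hard just because the data volume is large. MIPI CSI: what's the difference between one channel and 100? Ordinary CPUs can't handle this kind of video anyway; it takes extra processors. None of this is hard. It's like one core managing one state machine or 100 state machines, each with a different priority; it's really about the same. |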
|
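g****t posts: 31659 | 39 What is hard in this world is creating, not reciting from books.
The multi-digit multiplication we learned in primary school is simple.
Yet in thousands of years nobody in China invented it.
: These engine computers are all highly isolated. How could they share a CPU with other systems?
: The MPU has been there since the Cortex M3. The FreeRTOS I just told you about has an implementation.
: Young people should be humble and careful. In multicore work, synchronization relies entirely on communication between cores, entirely on coherency. Occasionally you also have to guess how the cache agent works, because these are trade secrets; you can only form hypotheses from white papers and then verify them. I wore these things out 10 years ago.
: Think of it this way: all this sensor data is on one critical path. If any stage runs away, the whole system has to go failsafe. Memory isolation really doesn't help much.
: In this world, a problem is not hard just because the data volume is large. MIPI CSI: what's the difference between one channel and 100? Ordinary CPUs can't handle this kind of video anyway; it takes extra processors. None of this is hard. It's like one core managing one state machine, or... |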
|
posts: 1 | 44 MacOS sucks; my pinyin input has been dying since the recent upgrade. Many server applications need high single-core performance, because there is always a global scheduler, manager, collector, etc. somewhere. It may be the real bottleneck even though it is hidden from normal application developers. For Java, it is the GC. For Go, it is runtime.sched. Only a few senior engineers dare to change it. If single-core performance is bad, its detrimental effect on the overall multicore system's performance will be amplified ... |
|
r*******t posts: 8550 | 45 A) Create 144 threads and let the OS arrange them.
B) Create threads pinned to specific cores 0-143 (you decide which core executes which thread). |
|