c++ - Why would this search method not be scalable? -
i want parallel search algorithm using openmp, vtree binary search tree, , want apply search algorithm each of point set. below snippet of code. search procedure 2 points totally irrelevant , can parallel. though need read same tree, once constructed, tree wouldn't modified more. read-only.
however, code below shows terrible scalability, on 32-core platform, 2x speed achieved. because vtree read threads? if so, how can further optimize code?
auto results = vector<vector<point>>(particlenum); auto t3 = high_resolution_clock::now(); double radius = 1.6; #pragma omp parallel (decltype(points.size()) = 0; < points.size(); i++) { vtree.search(points[i], radius, results[i]); } auto t4 = high_resolution_clock::now(); double searchtime = duration_cast<duration<double>>(t4 - t3).count(); the type signature search is
void vptree::search(const point& p, double radius, vector<point>& result) const search result put result.
my best guess cache ping-pong'ing on result vectors. assume "search" function uses passed-in result vector place put points , use throughout algorithm insert neighbors encounter them in search tree. whenever add point result vector, internal data of vector object modified. , because of result vectors packed in contiguous memory, different result vectors occupy same cache lines. so, when cpu maintains cache coherence, lock relevant cache lines.
the way solve use internal, temporary vector assign results vector once @ end (which can done cheaply if use move semantics). this:
void vptree::search(const point& p, double radius, vector<point>& result) const { vector<point> tmp_result; // ... add results "tmp_result" result = std::move(tmp_result); return; } or, return vector value (which implicitly using move):
vector<point> vptree::search(const point& p, double radius) const { vector<point> result; // ... add results "result" return result; } welcome joyful world of move-semantics , how awesome can @ solving these types of concurrency / cache-coherence issues.
it conceivable you're experiencing problems related accessing same tree threads, since it's read-only operations, i'm pretty sure on conservative architecture x86 (and other intel / amd cpus) should not pose significant issue, might wrong (maybe kind of "over-subscription" problem might @ play, it's dubious). , other problems might include fact openmp incur quite bit of overhead (spawning threads, synchronizing, etc.) has weighted against computational cost of actual operations doing within parallel loops (and it's not favorable trade-off). , also, if vptree (i imagine stands "vantage-point tree") not have locality of references (e.g., implemented linked-tree), performance going terrible whichever way use (as explain here).
Comments
Post a Comment