Pareto-smoothed importance sampling (PSIS) is a way to approximate the (in your case) leave-one-out posterior. You fit your model once, and then for each data point the loo package re-weights the posterior draws you already have to approximate the corresponding leave-one-out posterior.
This is really awesome because it lets you check your model on each left-out data point without fitting your model N times; it almost feels like cheating. But like every approximation, it can fail. This happens for “influential” data points: the importance ratios that loo internally calculates to re-weight your posterior can have infinite variance, and then even smoothing these weights does not help. In those cases it is best to bite the bullet and re-fit the model without that data point, because then you get samples from the “true” leave-one-out posterior rather than from an approximation of it.
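To make this concrete, here is a minimal sketch of that workflow. It assumes a fitted brms model called `fit` (a hypothetical name; `reloo` is the brms interface for the exact re-fits):

```r
library(brms)  # fitting interface; provides the loo() method for brms fits
library(loo)   # for the pareto_k_* diagnostic helpers

# PSIS-LOO from the single posterior fit; no re-fitting yet
loo_fit <- loo(fit)
print(loo_fit)  # the diagnostic table reports the Pareto k values

# Indices of observations whose smoothed importance weights are
# unreliable (a common rule of thumb is k > 0.7)
bad <- pareto_k_ids(loo_fit, threshold = 0.7)

# Bite the bullet: re-fit the model once per problematic observation,
# so those points get the exact leave-one-out posterior
loo_exact <- loo(fit, reloo = TRUE)
```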
Coming to your actual question: elpd (expected log pointwise predictive density) is a way to compare models. For each data point, you use the leave-one-out posterior to calculate how surprised the model is to see the left-out observation, measured by the (log) predictive density at the observed value. This is repeated for every data point, and the elpd that loo reports is just the sum of these individual contributions. So what reloo does is use the “true” leave-one-out posterior instead of the approximated one for exactly those points where the approximation is unreliable.
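You can see this directly in code: the loo object stores the pointwise contributions, and summing them reproduces the reported elpd. A sketch using the hypothetical `loo_fit` from above (and two hypothetical fits `model1`, `model2` for the comparison):

```r
# One log predictive density per left-out observation
head(loo_fit$pointwise[, "elpd_loo"])

# Their sum is exactly the elpd_loo that print(loo_fit) reports
sum(loo_fit$pointwise[, "elpd_loo"])

# Model comparison is then a difference of these sums
loo_compare(loo(model1), loo(model2))
```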
> But even if loo_compare says it is the preferred model, how does this change the fact that the original model2 has more influential observations?
Observations with high k are only problematic in the sense that the leave-one-out posterior is hard to approximate for them; there is nothing inherently bad about the observations themselves. Such points usually influence the posterior more strongly than others, which in turn means they often also have a lower elpd contribution than other points, but it still holds true that model2 predicts your observed data better.
edit: I should probably add that many data points with high k can be an indicator of model misspecification (where the definition of “many” is of course context-dependent). In your case I wouldn’t be worried at all, but you could check whether the points with high k were labelled correctly; sometimes you can catch data-entry errors this way (see the sketch below).
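A quick, hypothetical version of that check, assuming your data is in a data frame `df` whose rows are in the same order as the observations loo saw:

```r
# Rows of the data corresponding to high Pareto k values;
# eyeball these for mislabelled entries or data-entry errors
high_k <- pareto_k_ids(loo_fit, threshold = 0.7)
df[high_k, ]
```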