The tasks resemble those that lawyers, doctors, financial analysts, and management consultants encounter in real life. One asks for a diagnosis of a six-year-old patient based on nine pieces of multimedia evidence; another asks for legal advice for a musician's estate; a third calls for a valuation of a stake in a healthcare technology company.
Mercor, which claims to supply “expert data” to every leading AI company, says it spent more than $500,000 developing the 200 tasks, which test whether AIs can perform “economically valuable work” in law, medicine, finance, and management consulting. The resulting AI Productivity Index (APEX), published Wednesday, counts among its co-authors a former global managing director of McKinsey, a former dean of Harvard Business School, and a Harvard Law School professor, each of whom informed the development and grading of tasks in their respective domains, according to Mercor. APEX is “focused on depth,” says Brendan Foody, Mercor's 22-year-old CEO. “How can we be really comprehensive in terms of what it means to be a consultant, a banker, a doctor, or a lawyer?”
To create the tasks, Mercor contracted white-collar professionals whose résumés include top banks (Goldman Sachs, JPMorgan), consulting firms (McKinsey, Boston Consulting Group), law firms (Latham & Watkins), and hospitals (Mount Sinai). They have an average of 7.25 years of professional experience, and Mercor's pay is competitive with that of their former, highly prestigious employers. Mercor's website advertises an average hourly rate of $81, rising to more than $200 per hour (equivalent to an annual salary of roughly $400,000) for “senior domain experts,” a tier that requires at least four years of professional experience.
“It's hard to imagine better-paid hourly work,” says Matt O, a former investment banking analyst at Bank of America who contracted with Mercor to write finance tasks like those included in the paper.
Benchmarks have long been used to evaluate AI capabilities, but ones that directly measure AI models' ability to perform economically useful work represent a “paradigm shift,” says Osvald Nitski, one of the paper's authors. On Mercor's benchmark, “getting 100% would mean that you essentially have an analyst or an associate in a box, one you could hand these tasks to and have them completed to the standards of a partner, or an M.D., or whoever would evaluate that person's work,” Nitski says.
No models ace the benchmark yet, but they are improving fast. OpenAI's GPT-4o, released in May 2024, scored 35.9%. GPT-5, released a little over a year later, achieved the top score of 64.2%. That does not mean GPT-5 delivers 64.2% of a human worker's value; work that falls short of 100% “may be effectively useless,” the paper's authors write. GPT-5 received full marks on only two of the 200 tasks, one in law and one in investment banking, both of which “primarily involve core reasoning, simple calculations, and many basic information lookups,” according to Mercor.
Even a model that scored 100% on Mercor's benchmark would probably make a poor replacement for human professionals. The benchmark's tasks focus on “well-scoped deliverables,” such as making a diagnosis or building a financial model, rather than more open-ended tasks that may admit many correct answers. That requires the task descriptions to spell out the many assumptions needed to make the desired output well specified. The AIs' outputs are entirely text-based, which means the benchmark does not test AIs' ability to use a computer the way a person would. (Mercor says future versions of APEX will address these limitations.) And preparing the lengthy prompts the models need to complete the tasks “would be more exhausting than just doing it yourself,” says O.
Still, there are signs that AI models are becoming competitive with humans. Another benchmark, published Thursday, September 25 by OpenAI, found that expert human graders preferred AI output to human professionals' work in 47.6% of cases across 220 tasks, including drafting real estate brochures and assessing skin lesions. OpenAI also found that its models' performance had improved markedly in a short time, with their “win rate” against humans more than doubling between June 2024 and September 2025.
As model capabilities grow, so do the complexity of the tasks they are tested on and the human skill required to create sufficiently challenging tasks. Earlier tests measured relatively abstract capabilities through puzzle-style reasoning and exam-style questions. Tests built before ChatGPT's 2022 release often sourced data from crowdworker services that paid workers a few dollars an hour. By 2023, PhD students were being asked to create difficult multiple-choice questions in biology, physics, and chemistry. In September, xAI reportedly laid off 500 of its “generalist” data workers as part of an expansion and prioritization of the company's “specialist” data workers. To be sure, low-paid data workers still contribute to the development of AI models, but the upper bound of the skill, and the compensation, required to develop benchmarks is rising fast.
Directly measuring the usefulness of AI models on economically valuable tasks is very hard to get right, according to Nitski. Success criteria in fields such as finance and consulting are harder to define than in, say, software development. And even with perfect criteria in hand, grading AI outputs at scale is harder than in software development, where automated tests can check whether a piece of code works correctly. That partly explains why benchmarks aimed at measuring AI models' real-world usefulness have existed for software development since at least 2023 but have lagged in other white-collar domains. As AIs have improved, though, they have helped solve the problem of grading complex tasks. The success criteria for Mercor's tasks are written by human experts, but the grading is carried out by AIs, which Mercor says agreed with human graders in 89% of cases, helping evaluations scale.
Benchmark development isn't just about knowing how good models are. In AI, as in business, “what gets measured gets done”: good benchmarks often accelerate progress on the things they test. “Ultimately, it's the same type of data for evals as for training,” says Foody. Assessing performance in games like Go is straightforward; AI beat Go masters by 2016. In 2023, benchmarks began evaluating AIs on real-world software development tasks. Two years later, the employment statistics for junior programmers look shaky.
“AI just got its PhD,” says Foody. “Now it's starting to enter the labor market.”